2018-05-04

Templating Monitors with the Terraform Datadog Provider

Datadog

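Datadog Monitor definitions can be managed with Terraform.

Provider: Datadog

However, the Datadog side is defined in JSON, which is somewhat awkward to write, and I wondered whether the parts repeated in every Monitor (notification body and recipients) could be templated. These are my notes on what I came up with.

This is implemented with Terraform's Template Provider. It does not work on some versions, and there are probably better approaches and ways to write it.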

.
├── datadog_key.auto.tfvars      # API key definitions
├── datadog_monitor.auto.tfvars  # notification recipient definitions
├── datadog_monitor.tf           # provider definition
├── datadog_monitor_template.tf  # template definitions
├── ec2.tf
├── templates                    # template files
│   ├── message.tmpl             # for the notification body
│   └── notify.tmpl              # for the notification recipients
├── terraform.tfstate
└── terraform.tfstate.backup

The contents of each file are described below, from top to bottom.

datadog_key.auto.tfvars

Set the Datadog API Key and Application Key values here. If you manage the configuration in git, this file is a candidate for .gitignore.

datadog_api_key=""
datadog_app_key=""

datadog_monitor.auto.tfvars

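Set the lists of notification recipients. The @slack-... entries are the recipients. (The example uses only Slack, but email addresses or other integrations work as well.)
Separate lists allow the recipients to vary by alert level.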

# all
notify_all = [
  "@slack-alert0",
  "@slack-alert-all",
]
# only alert
notify_is_alert = [
  "@slack-alert1",
  "@slack-alert-only",
]
notify_is_alert_recovery = []
・・・

datadog_monitor.tf

Define the Datadog Provider. This follows the official documentation as-is.

Provider: Datadog - Terraform by HashiCorp

# Variables
variable "datadog_api_key" {}
variable "datadog_app_key" {}

# Configure the Datadog provider
provider "datadog" {
  version = "~> 1.0"
  api_key = "${var.datadog_api_key}"
  app_key = "${var.datadog_app_key}"
}

datadog_monitor_template.tf

This is the core of the approach: the template definitions. The notification recipient list variables defined earlier are joined and passed to the templates.

# Notification recipient list variables
variable "notify_all" { type = "list" }
variable "notify_is_alert" { type = "list" }
variable "notify_is_alert_recovery" { type = "list" }
・・・
# Join each recipient list into a single string
locals {
  notify_all_join = "${ length(var.notify_all) == 0 ? "" : join(" ", var.notify_all) }"
  notify_all = " ${local.notify_all_join} "
  # is
  notify_is_alert_join = "${ length(var.notify_is_alert) == 0 ? "" : join(" ", var.notify_is_alert) }"
  notify_is_alert = " {{#is_alert}} ${local.notify_is_alert_join} {{/is_alert}} "
  notify_is_alert_recovery_join = "${ length(var.notify_is_alert_recovery) == 0 ? "" : join(" ", var.notify_is_alert_recovery) }"
  notify_is_alert_recovery = " {{#is_alert_recovery}} ${local.notify_is_alert_recovery_join} {{/is_alert_recovery}} "
・・・
}
# Template file definitions
## Without variables
data "template_file" "message" {
  template = "${file("./templates/message.tmpl")}"
}
## With variables
data "template_file" "notify" {
  template = "${file("./templates/notify.tmpl")}"
  vars {
    notify_all = "${ local.notify_all_join == "" ? "" : local.notify_all }"
    # is
    notify_is_alert = "${ local.notify_is_alert_join == "" ? "" : local.notify_is_alert }"
    notify_is_alert_recovery = "${ local.notify_is_alert_recovery_join == "" ? "" : local.notify_is_alert_recovery }"
・・・
  }
}
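
To make the string handling in the locals block easier to follow, here is a minimal Python sketch of the same join-and-wrap logic. It is not Terraform; the function name build_notify and its arguments are hypothetical, and only the behavior (empty list gives an empty string, otherwise the recipients are space-joined and optionally wrapped in a Datadog conditional block) mirrors the configuration above.

```python
def build_notify(recipients, condition=None):
    """Join recipients with spaces and optionally wrap them in a Datadog
    conditional block such as {{#is_alert}} ... {{/is_alert}}.

    An empty list yields an empty string, like the
    `local.*_join == "" ? "" : ...` checks in the template_file vars.
    """
    joined = " ".join(recipients)
    if joined == "":
        return ""
    if condition is None:
        return " " + joined + " "
    return " {{#" + condition + "}} " + joined + " {{/" + condition + "}} "


notify_all = build_notify(["@slack-alert0", "@slack-alert-all"])
notify_is_alert = build_notify(["@slack-alert1", "@slack-alert-only"], "is_alert")
notify_is_alert_recovery = build_notify([], "is_alert_recovery")

print(repr(notify_all))
print(repr(notify_is_alert))
print(repr(notify_is_alert_recovery))
```

Returning an empty string for an empty list keeps unused conditional blocks out of the rendered message entirely.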

templates/

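The example contains only two, but you can add more templates by adding definitions.

message.tmpl

A template for the notification body. This is an example of the version with no variables passed in.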

Metric Value: {{value}} {{comparator}} threshold: {{#is_warning}}{{warn_threshold}}{{/is_warning}}{{#is_warning_recovery}}{{warn_threshold}}{{/is_warning_recovery}}{{#is_alert}}{{threshold}}{{/is_alert}}{{#is_alert_recovery}}{{threshold}}{{/is_alert_recovery}}

- Host: {{host.name}}

notify.tmpl

An example template for notification recipients. Variables are passed in, and the recipient lists are set according to the alert level.

More information: [Ops Guide](http://example.com)

Notify:${notify_all}${notify_is_alert}${notify_is_alert_recovery}${notify_is_warning}${notify_is_warning_recovery}${notify_is_recovery}${notify_is_no_data}${notify_is_not_alert}${notify_is_not_alert_recovery}${notify_is_not_warning}${notify_is_not_warning_recovery}${notify_is_not_recovery}${notify_is_not_no_data}
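
The ${...} placeholders in notify.tmpl are filled in by Terraform's template_file data source. Purely as an illustration, Python's string.Template happens to use the same ${name} placeholder syntax, so the substitution can be sketched as follows. The shortened template and the values are made up for the example.

```python
from string import Template

# Shortened, hypothetical version of notify.tmpl with only three placeholders.
template = Template("Notify:${notify_all}${notify_is_alert}${notify_is_alert_recovery}")

rendered = template.substitute(
    notify_all=" @slack-alert0 @slack-alert-all ",
    notify_is_alert=" {{#is_alert}} @slack-alert1 @slack-alert-only {{/is_alert}} ",
    notify_is_alert_recovery="",  # empty recipient list: nothing is emitted
)
print(rendered)
```

The {{#is_alert}} markers pass through untouched, so Datadog evaluates them when the notification is sent.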

Example

Define the datadog_monitor resources in ec2.tf.
Set the templates (the data.template_file.*.rendered parts) in message and escalation_message.

The example defines two resources, each setting message to a monitor-specific description followed by the templated text.

resource "datadog_monitor" "ec2_cpuutilization" {
  type = "query alert"
  name = "[TEST] EC2 CPU Utilization"
  query = "max(last_5m):max:aws.ec2.cpuutilization{name:test-instance-al2-0} by {host,name,region} > 99"
  message = "# [TEST] EC2 CPU Utilization\n${data.template_file.message.rendered}${data.template_file.notify.rendered}"
  escalation_message = "**Renotify**\n# [TEST] EC2 CPU Utilization\n${data.template_file.message.rendered}${data.template_file.notify.rendered}"
・・・
}

resource "datadog_monitor" "ec2_status_check_failed" {
  type = "query alert"
  name = "[TEST] EC2 StatusCheckFailed"
  query = "min(last_15m):max:aws.ec2.status_check_failed{name:test-instance-al2-0} by {host,name,region} > 0"
  message = "# [TEST] EC2 StatusCheckFailed\n${data.template_file.message.rendered}${data.template_file.notify.rendered}"
  escalation_message = "**Renotify**\n# [TEST] EC2 StatusCheckFailed\n${data.template_file.message.rendered}${data.template_file.notify.rendered}"
・・・
}

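Result

f:id:htnosm:20180503180952p:plain f:id:htnosm:20180503180953p:plain

Part of the notification body is now shared, so it can be updated in one place.
One concern is that Terraform is updated frequently, so a specification change may break this setup.
I would like Datadog to implement templating and notification recipient groups as standard features.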

2018-05-01

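Nagios NTP checks

Nagios

When monitoring time synchronization with Nagios there are several plugins to choose from, and the help output alone did not make the differences clear to me, so here is a brief summary.

f:id:htnosm:20180430175450j:plain

The following is based on the official plugin collection.

Nagios Plugins | The home of the official Nagios® Plugins

check_ntp_peer = checks the health of an NTP server
check_ntp_time = checks time synchronization using the NTP protocol

There is also a plugin called check_time, but it appears to check time synchronization using the time protocol. (I have never come across an environment that synchronizes time via the time service, so I skip it here.)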

check_ntp_peer

nagios-plugins/check_ntp_peer.c at master · nagios-plugins/nagios-plugins
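- Checks the health of the NTP server
- Does not check the time difference between localhost (the monitored host) and the NTP server

/usr/lib64/nagios/plugins/check_ntp_peer -H localhost -w 1 -c 2

-H specifies the host on which the monitored ntpd is running. In the example above, the plugin checks the synchronization state of the ntpd running on localhost.
This assumes that ntpd is running. The comparison is against the NTP server that ntpd is synchronizing with.

Even if the referenced NTP server returns an incorrect value (off from the actual time), for example because a private NTP server is in use, the state is judged healthy as long as the host is in sync with the NTP server it references.

chrony is not supported

chrony is not supported. check_ntp_peer is implemented using mode 6, and chronyd does not support mode 6.
At present the official plugins do not appear to include a health check plugin for chronyd.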

          +-------------------+-------------------+------------------+
          |  Association Mode | Assoc. Mode Value | Packet Mode Value|
          +-------------------+-------------------+------------------+
          | Symmetric Active  |         1         | 1 or 2           |
          | Symmetric Passive |         2         | 1                |
          | Client            |         3         | 4                |
          | Server            |         4         | 3                |
          | Broadcast Server  |         5         | 5                |
          | Broadcast Client  |         6         | N/A              |
          +-------------------+-------------------+------------------+
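
The "mode" in question is the 3-bit mode field in the first byte of an NTP packet, which follows the 2-bit leap indicator and the 3-bit version number. In that packet field, 3 is a client request and 6 is an NTP control message, the kind of query check_ntp_peer relies on. A small Python sketch of packing and unpacking the byte; the helper names are hypothetical.

```python
def pack_first_byte(leap, version, mode):
    """Pack LI (2 bits), VN (3 bits) and Mode (3 bits) into one byte."""
    return ((leap & 0x3) << 6) | ((version & 0x7) << 3) | (mode & 0x7)


def unpack_mode(first_byte):
    """Return the 3-bit mode field from the first byte of an NTP packet."""
    return first_byte & 0x7


client_request = pack_first_byte(leap=0, version=4, mode=3)   # ordinary time query
control_message = pack_first_byte(leap=0, version=2, mode=6)  # control (mode 6) query

print(hex(client_request), unpack_mode(client_request))
print(hex(control_message), unpack_mode(control_message))
```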

check_ntp_time (check_ntp)

nagios-plugins/check_ntp_time.c at master · nagios-plugins/nagios-plugins
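- Checks the time difference between localhost (the monitored host) and the NTP server

/usr/lib64/nagios/plugins/check_ntp_time -H ntp.nict.jp -w 1 -c 2

-H specifies the NTP server to compare the monitored host against. Specifying localhost compares the host against its own NTP server, which is almost meaningless.

It does not matter whether a time synchronization service (ntpd, chronyd, etc.) is running on the monitored host.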
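
Conceptually, a check of this kind estimates the clock offset from the four timestamps of an NTP exchange and compares its absolute value with the -w and -c thresholds. The sketch below uses the standard NTP offset formula; the timestamps are made up, the strict "greater than" comparison is an assumption, and this is not the plugin's actual code.

```python
def ntp_offset(t1, t2, t3, t4):
    """Estimate the clock offset in seconds.

    t1: client send time, t2: server receive time,
    t3: server send time, t4: client receive time.
    """
    return ((t2 - t1) + (t3 - t4)) / 2


def classify(offset, warn=1.0, crit=2.0):
    """Map an offset to a Nagios-style state using -w / -c style thresholds."""
    if abs(offset) > crit:
        return "CRITICAL"
    if abs(offset) > warn:
        return "WARNING"
    return "OK"


# Hypothetical exchange: the server clock is 1.5 s ahead, round trip is 0.5 s.
offset = ntp_offset(t1=100.0, t2=101.5, t3=101.75, t4=100.25)
print(offset, classify(offset))
```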

Reference URLs

2018-04-16

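CloudFormation template for the Datadog AWS integration

Datadog AWS

It seemed like something that should already exist, but I could not find one, so I made it. (Perhaps I just failed to find it.)

github.com

Notes

This is a rework of "Datadog AWS Integration 設定(IAM Role) - Qiita". I wanted to be able to create the role by copying and pasting the permissions section, and to keep track of what gets updated, so I put it in a repository. It is mostly for my own use.

The permissions to grant seem to change for reasons on both the AWS and Datadog sides, so the permissions section is updated fairly often. It would be good if a template or policy document were distributed officially, rather than only the documentation being updated.

CloudFormation YAML allows a short-form syntax for function names, but I deliberately avoid it because some third-party tools do not support it. With the short form things did not behave as expected, and I spent the better part of an hour puzzling over it.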

2018-04-14

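Receiving AWS SNS in Datadog (RDS/ElastiCache events)

Datadog AWS

Datadog can subscribe directly to AWS SNS topics. It works just as documented, but I want to record what the notifications look like.

Official: AWS SNS

Alternatively, you can issue a receiving email address and receive the notifications that way.

Previous post: AWS RDSイベント通知を受け取る - vague memory

Setup
- SNS
Event configuration examples
- RDS Events
- ElastiCache Events
Received examples
- Example notification from an Event Monitor to Slack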

Setup

Prerequisites
- The AWS integration is already configured in Datadog
  - https://app.datadoghq.com/account/settings#integrations/amazon_web_services

SNS

Create a Topic and a Subscription in AWS SNS.
Specify the Datadog webhook URL as the Endpoint.

https://app.datadoghq.com/intake/webhook/sns?api_key=<API KEY>

Obtain the API Key from [Integrations]->[APIs] in Datadog
- https://app.datadoghq.com/account/settings#api
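
The endpoint is simply the webhook URL with the API key as a query parameter. A tiny Python sketch for assembling it; the key value is a dummy and the function name is hypothetical.

```python
from urllib.parse import urlencode

WEBHOOK_BASE = "https://app.datadoghq.com/intake/webhook/sns"


def sns_endpoint(api_key):
    """Build the HTTPS endpoint to register as the SNS subscription."""
    return WEBHOOK_BASE + "?" + urlencode({"api_key": api_key})


endpoint = sns_endpoint("0123456789abcdef")  # dummy key for illustration
print(endpoint)
```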

f:id:htnosm:20180413151001p:plain

Event configuration examples

Let's send RDS/ElastiCache events. (The Datadog AWS integration already delivers these to Events, but here we explicitly notify via SNS -> Datadog.)

f:id:htnosm:20180413151002p:plain

RDS Events

Set the Topic in [Event subscriptions].

f:id:htnosm:20180413151003p:plain

ElastiCache Events

Unlike RDS, there is no global notification setting; set the Topic on each Cluster individually.

f:id:htnosm:20180413151004p:plain

Received examples

f:id:htnosm:20180413152131p:plain

Example notification from an Event Monitor to Slack

Monitor

Variables are embedded for verification purposes, but it is easier to read without them. f:id:htnosm:20180413151006p:plain

f:id:htnosm:20180413151007p:plain

ElastiCache

f:id:htnosm:20180413151008p:plain

2018-03-30

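Datadog Logs (public beta): log parsing and notifications

Datadog

This is a record of my testing of Datadog's log management feature (public beta). This time I look at [Logs]->[Explorer] and [Logs]->[Pipelines].

Log Monitor (the notification feature), which did not exist at the initial release, has since been implemented, so I try it as well.

Explorer

This is the search screen. To filter here, you create display columns and filter conditions (facets) in the Pipeline described later.

Official: Search & Graph

Pipelines

By default, Pipelines for Apache, Nginx, and Java were already configured. Pipelines for other standard logs may be added in the future.

You can clone an existing Pipeline. The clone source ends up disabled. (Creating a new one is of course also possible.)

f:id:htnosm:20180329015647p:plain

Pipeline filters

Use facets and tags to narrow down the logs that the Pipeline applies to.

Processors

Define the transformations applied to the logs selected by the Pipeline filters.

Official: Parsing

The basic flow is to parse and extract with Grok, then pass the result to a remapper that assigns it to an attribute. See the official documentation for details; the following are my notes.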

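Grok Parser
- Parses logs using the Grok filter known from Logstash.
Log Date Remapper
- Defines the timestamp of the log
- If undefined, the time Datadog received the log is used
Log Status Remapper
- Defines the level
- Reflected in Status in the Explorer
- Integers correspond to the syslog severity level
- For strings, see the Status Remapper mapping table below
Attribute Remapper
- Reassigns any attribute to another attribute
URL Parser
- Parses a URL and extracts its parameters
- Also available inside the Grok Parser as the url filter
Useragent parser
- Parses a User Agent and extracts its parameters
- Also available inside the Grok Parser as the useragent filter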

Status Remapper mapping table

Case-insensitive
Anything that does not match is mapped to info (6)

Starts with	Assigned keyword	Value	Severity
emerg or f	emerg	0	Emergency
a	alert	1	Alert
c	critical	2	Critical
err	error	3	Error
w	warning	4	Warning
n	notice	5	Notice
i	info	6	Informational
d or trace or verbose	debug	7	Debug
matches ok or success	OK	-	-
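
The table can be read as a prefix lookup with a default. The following Python sketch encodes only the string rules listed above (integer values are not handled); the function name and the order in which rules are tried are my own assumptions, not Datadog's implementation.

```python
# (prefixes, assigned keyword, syslog severity), taken from the table above.
PREFIX_RULES = [
    (("emerg", "f"), "emerg", 0),
    (("a",), "alert", 1),
    (("c",), "critical", 2),
    (("err",), "error", 3),
    (("w",), "warning", 4),
    (("n",), "notice", 5),
    (("i",), "info", 6),
    (("d", "trace", "verbose"), "debug", 7),
]


def remap_status(value):
    """Return (keyword, severity) for a status string, case-insensitively."""
    text = value.lower()
    if text in ("ok", "success"):
        return ("OK", None)
    for prefixes, keyword, severity in PREFIX_RULES:
        if text.startswith(prefixes):
            return (keyword, severity)
    return ("info", 6)  # anything unmatched falls back to info


for sample in ["ERROR", "Fatal", "WARN", "success", "unknown"]:
    print(sample, remap_status(sample))
```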

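Limits

There are size limits and similar restrictions; see the latest information at the link below.

Pipelines Technical limits

Parsing rule examples

Let's try parsing a few existing logs.

Note that when a Pipeline's settings are updated, the change is not applied to logs that have already been ingested. It applies only to logs ingested after the update.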

jenkins.log

JenkinsParsingRule %{date("MMM dd, yyyy H:mm:ss"):datestr} %{regex("[AP]M"):meridian} %{notSpace:class} %{word:method}[\n| ]%{word:level}: %{data:message}

f:id:htnosm:20180329015649p:plain

messages

messages %{date("MMM d HH:mm:ss"):datestr} %{notSpace:host} %{word:program}(\[%{integer:processid}\])?:%{data:message}

f:id:htnosm:20180329015648p:plain

maillog (only lines containing stat=)

maillogstat %{date("MMM d HH:mm:ss"):datestr} %{notSpace:host} %{data:program}(\[%{integer:processid}\])?: %{word:messageid}: %{data::keyvalue("="," /()\\[\\]:")}stat=%{word:stat}: %{data:message}

f:id:htnosm:20180329015650p:plain

Monitor

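Official: Log monitor

Alerts can be sent when a threshold is exceeded.

You can alert when n log entries containing "ERROR" have been output, but there is currently no way to include the content of those log entries in the notification.

[Monitors]->[New Monitor]->[Logs]

f:id:htnosm:20180329015651p:plain

The configuration itself is the same as for other Monitors.

f:id:htnosm:20180329015652p:plain

Note that, whether this is a bug or by design, the threshold can currently only be set to 1 or higher (0 is not accepted), so to detect a single occurrence you need to configure it as above or equal to 1.

Example Slack notification

The query is being misinterpreted as a notification recipient, but this gives an idea of what the notification looks like.

f:id:htnosm:20180329015653p:plain

Log Monitor still feels like a trial implementation.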

2018-03-29

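Datadog Logs (public beta): log collection settings

Datadog

A continuation of "Datadog logs(パブリックベータ) を試してみる", recording my testing of Datadog's log management feature (public beta).

htnosm.hatenablog.com

This time I look at exclusion, replacement, and multi-line handling on the log-sending side.

Official: Log Management

Log collection settings

Basic settings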

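  # Required
  - type: input type (tcp/udp/file)
    # either port or path, depending on the input type
    #port: for tcp/udp, the port number
    path: for file, the full path of the target log file
    service: name of the owning service
    source: integration name; for custom logs, any string (the documentation recommends matching the custom metric name)
  # Optional
    sourcecategory: optional value used for filtering
    tags: tags (comma-separated)

Tags attached to the host being collected from are also attached to its logs automatically.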

f:id:htnosm:20180328220426p:plain

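Log collection settings for integrations

The Datadog console shows the setup procedure and example configuration files.

[Logs]->[Docs]

f:id:htnosm:20180328220421p:plain

Alternatively, they are also described in the example files created when the Datadog Agent is installed.

/etc/datadog-agent/conf.d/apache.d/conf.yaml.example, etc.

Log collection settings for unsupported sources

The Datadog console lists integrations planned for future implementation.

[Logs]->[Docs]->[Server]->[Other], etc.

It also appears to be possible to send a request for an integration to be added.

f:id:htnosm:20180328220422p:plain

Advanced log collection functions (collection rules)

The log_processing_rules directive provides detailed control over the logs forwarded to Datadog.

Usage examples for each rule follow. The log files have been made readable by the dd-agent user.

exclude_at_match

An exclusion setting. Logs matching the pattern are not sent. This can be used, for example, to avoid sending debug and info level logs.

Example: excluding lines containing ansible from /var/log/messages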

logs:
  - type: file
    path: /var/log/messages
    service: syslog
    source: os
    sourcecategory: system
    tags: log_type:file,rule_type:exclude_at_match
    log_processing_rules:
      - type: exclude_at_match
        name: exclude_ansible
        ## Regexp can be anything
        pattern: \sansible.*:\s

Send/receive example

Log example

Mar 28 hh:mm:ss ip-xxx-xxx-xxx-xxx ansible-setup: Invoked with filter=* gather_subset=['all'] fact_path=/etc/ansible/facts.d gather_timeout=10

Datadog side
- Excluded, so nothing is displayed
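
To check a pattern before deploying it, the exclusion can be approximated locally. The sketch below applies the same regular expression with Python's re module; the Agent uses its own regex engine, so treat this as an approximation, and note that the second log line is made up.

```python
import re

# Same pattern as in the exclude_at_match rule above.
EXCLUDE_PATTERN = re.compile(r"\sansible.*:\s")


def should_send(line):
    """Return False when the line matches the exclusion pattern."""
    return EXCLUDE_PATTERN.search(line) is None


ansible_line = (
    "Mar 28 hh:mm:ss ip-xxx-xxx-xxx-xxx ansible-setup: Invoked with filter=* "
    "gather_subset=['all'] fact_path=/etc/ansible/facts.d gather_timeout=10"
)
other_line = "Mar 28 hh:mm:ss ip-xxx-xxx-xxx-xxx sshd[1234]: hypothetical message"

print(should_send(ansible_line))
print(should_send(other_line))
```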

include_at_match

A log extraction setting. It is the opposite of exclude_at_match: only logs matching the pattern are sent.
An example is given in the mask_sequences section.

mask_sequences

Masks strings that match the pattern.

Example: extracting only lines containing stat= from /var/log/maillog and masking the email address

logs:
  - type: file
    path: /var/log/maillog
    service: maillog
    source: os
    sourcecategory: system
    tags: log_type:file,rule_type:include_at_match,rule_type:mask_sequences
    log_processing_rules:
      - type: include_at_match
        name: include_maillog_stat
        ## Regexp can be anything
        pattern: \sstat=.*?\s
      - type: mask_sequences
        name: mask_mailaddress
        replace_placeholder: " to=[mask_mailaddress], "
        ##One pattern that contains capture groups
        pattern: \sto=.*?,\s

Send/receive example

Log example

Mar 28 hh:mm:ss ip-xxx-xxx-xxx-xxx sendmail[24023]: xxxxxxxx024023: to=root, ctladdr=root (0/0), delay=00:00:00, xdelay=00:00:00, mailer=relay, pri=31183, relay=[127.0.0.1] [127.0.0.1], dsn=4.0.0, stat=Deferred: Connection refused by [127.0.0.1]

Datadog side

f:id:htnosm:20180329024755p:plain
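
The combined effect of the two rules (keep only lines with stat=, then mask the recipient) can be approximated as follows. This is a sketch using Python's re module with the same patterns and placeholder as the configuration above; the process function is hypothetical and the Agent's own regex engine may differ in details.

```python
import re

INCLUDE_PATTERN = re.compile(r"\sstat=.*?\s")    # include_at_match
MASK_PATTERN = re.compile(r"\sto=.*?,\s")        # mask_sequences
PLACEHOLDER = " to=[mask_mailaddress], "         # replace_placeholder


def process(line):
    """Return the masked line, or None when the line would not be sent."""
    if INCLUDE_PATTERN.search(line) is None:
        return None
    return MASK_PATTERN.sub(PLACEHOLDER, line)


maillog_line = (
    "Mar 28 hh:mm:ss ip-xxx-xxx-xxx-xxx sendmail[24023]: xxxxxxxx024023: to=root, "
    "ctladdr=root (0/0), delay=00:00:00, xdelay=00:00:00, mailer=relay, pri=31183, "
    "relay=[127.0.0.1] [127.0.0.1], dsn=4.0.0, stat=Deferred: Connection refused by [127.0.0.1]"
)

masked = process(maillog_line)
print(masked)
```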

multi_line

Aggregates multiple lines into a single log entry.

Example: sending jenkins.log
- Support appears to be planned, but Jenkins log collection is not supported at this time.

logs:
  - type: file
    path: /var/log/jenkins/jenkins.log
    service: jenkins
    source: java
    sourcecategory: sourcecode
    tags: log_type:file,rule_type:multi_line
    #For multiline logs, if they start with a timestamp with format yyyy-mm-dd uncomment the below processing rule
    log_processing_rules:
      - type: multi_line
        pattern: \w{3}\s(0?[1-9]|[1-3][0-9]),\s\d{4}
        name: new_log_start_with_date

Send/receive example

Log example

Mar 28, 2018 4:53:51 PM hudson.model.AsyncPeriodicWork$1 run
INFO: Started Fingerprint cleanup

Datadog side

f:id:htnosm:20180328220424p:plain

Stack traces are handled the same way


f:id:htnosm:20180328220425p:plain
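
The grouping idea behind multi_line can be sketched as: a line that matches the pattern starts a new entry, and any other line is appended to the current entry. The Python below uses the same pattern as the rule above; anchoring at the start of the line and joining with a newline are assumptions, and the second log entry is made up.

```python
import re

# Same pattern as in the multi_line rule above.
NEW_ENTRY_PATTERN = re.compile(r"\w{3}\s(0?[1-9]|[1-3][0-9]),\s\d{4}")


def aggregate(lines):
    """Group raw lines into entries; a matching line starts a new entry."""
    entries = []
    for line in lines:
        if NEW_ENTRY_PATTERN.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += "\n" + line
    return entries


raw_lines = [
    "Mar 28, 2018 4:53:51 PM hudson.model.AsyncPeriodicWork$1 run",
    "INFO: Started Fingerprint cleanup",
    "Mar 28, 2018 4:53:52 PM hudson.model.AsyncPeriodicWork$1 run",  # made-up entry
    "INFO: hypothetical follow-up message",
]

entries = aggregate(raw_lines)
for entry in entries:
    print(repr(entry))
```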

I will skip collection with wildcards. I think I now have a feel for how sending works.

2018-03-01

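AWS Athena: error with S3 in a different region

AWS

This is an error I ran into when trying to use QuickSight. QuickSight is currently not offered in the Tokyo region, so I referenced S3 in the Tokyo region from the Oregon region.

The following error message is output when running a query in Athena. (There is no problem when both are in the same region.)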

HIVE_CURSOR_ERROR: The bucket is in this region: null. 
Please use this region to retry the request
(Service: Amazon S3; Status Code: 301; Error Code: PermanentRedirect;

f:id:htnosm:20180301073027p:plain

When the data is encrypted, S3 and Athena must be in the same region.

Tables and Databases Creation Process in Athena

If the data is encrypted in Amazon S3, it must be stored in the same region, and the user or principal who creates the table in Athena must have the appropriate permissions to decrypt the data.

This is by design.
Still, I found the error message hard to understand.
