ElasticSearch

This document provides detailed instructions on configuring alert rules for ElasticSearch data sources in Monitors. Monitors uses ElasticSearch SQL functionality to monitor log and metric data, supporting flexible aggregation queries and alert evaluation.

Core concepts#

Version requirements: Due to SQL feature dependencies, only ElasticSearch 6.3 and above are supported.
Query language: Currently only SQL syntax is supported.
Field handling: The alert engine automatically converts all field names in query results to lowercase. When configuring value fields and label fields, always use lowercase letters.

1. Threshold mode#

This mode is suitable for scenarios requiring threshold comparisons on aggregated values, such as monitoring "error log count in the last 5 minutes".

Configuration#

1. Query: Write a SQL aggregation query that returns numeric columns and (optional) grouping columns.
   Example: Count error logs per service in the last 5 minutes (see the sketch after this list).
2. Field mapping:
   Label fields: Fields used to distinguish different alert objects. In the example above, this is service_name. This field can be left empty, and Monitors will automatically treat all fields except value fields as label fields.
   Value fields: Numeric fields used for threshold evaluation. In the example above, this is error_cnt.
3. Threshold conditions:
   Use $A.field_name to reference values.
   Example: Critical: $A.error_cnt > 50, Warning: $A.error_cnt > 10.
   Shorthand: If only one value field is configured, you can use $A directly, e.g., $A > 50.
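
A minimal sketch of such a query, assuming an index pattern app-log-* with level, service_name, and @timestamp fields (all names are illustrative, not part of the product):

```sql
-- Count ERROR logs per service over the last 5 minutes (illustrative index/field names)
SELECT service_name, COUNT(*) AS error_cnt
FROM "app-log-*"
WHERE level = 'ERROR'
  AND "@timestamp" > NOW() - INTERVAL 5 MINUTES
GROUP BY service_name
```

With this query, service_name is the label field, error_cnt is the value field, and the threshold conditions above ($A.error_cnt > 50 / $A.error_cnt > 10) are evaluated once per service.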

How it works#

The engine executes the SQL query and retrieves tabular data. It groups data by label fields, then extracts value field values for comparison against threshold expressions.
Note: The combination of label fields uniquely identifies an alert object, so the query result must not contain multiple rows with the same combination of label values; make sure each alert object corresponds to exactly one row of data. In the example above, service_name values must be unique: if two rows share the same service_name, the alert engine cannot properly distinguish between the alert objects.
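
For illustration (the row values below are made up), suppose the query returns:

```text
service_name | error_cnt
order        | 72
payment      | 18
checkout     | 3
```

With Critical: $A.error_cnt > 50 and Warning: $A.error_cnt > 10, the engine raises a Critical alert for order, a Warning alert for payment, and no alert for checkout; each service_name is one alert object.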

Recovery logic#

Similar to Prometheus data sources, ElasticSearch threshold mode also supports flexible recovery strategies:
Auto recovery: When the latest SQL query result shows that a data group's value no longer meets any alert threshold (Critical/Warning), a recovery event is automatically generated.
Specific recovery conditions: Configure additional recovery expressions (e.g., $A.error_cnt < 5). Recovery is only confirmed when the value falls below this threshold, preventing alert flapping.
Recovery query:
Scenario: Sometimes alert queries and recovery queries have different logic. For example, the alert checks for "error log count > 10", while recovery might check for "success log count > 100" or query a different status index.
Configuration: Write an independent SQL statement for recovery evaluation.
Variable support: Recovery SQL supports using ${label_name} to reference alert event label values.
Example: The alert SQL found that the network interface with network_host="a", interface="b" is down, and the recovery SQL can be a query like the sketch below. The engine replaces ${network_host} and ${interface} with the actual label values before executing the query; if the query returns data, recovery is confirmed.
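
A minimal sketch of such a recovery query, assuming a status index net-status-* with a status field (index and field names are assumptions):

```sql
-- Recovery check: is the interface reporting 'up' again in the last 5 minutes?
-- ${network_host} and ${interface} are replaced with the alert's label values.
SELECT network_host, interface
FROM "net-status-*"
WHERE network_host = '${network_host}'
  AND interface = '${interface}'
  AND status = 'up'
  AND "@timestamp" > NOW() - INTERVAL 5 MINUTES
```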

2. Data exists mode#

This mode is suitable for scenarios where filtering logic is written directly in SQL, or when you only need to check "whether any data is returned".

Configuration#

1. Query: Use a HAVING clause in SQL to directly filter out anomalous data.
   Example: Directly query services with more than 50 errors (see the sketch after this list).
2. Field mapping:
   In this mode, label fields and value fields are optional. If both are left empty, the engine treats all fields in the query result as label fields, which can be referenced in rule descriptions.
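
A minimal sketch of such a query, reusing the illustrative app-log-* index from the threshold mode example:

```sql
-- Only services with more than 50 ERROR logs in the last 5 minutes are returned;
-- in data exists mode, any returned row triggers an alert.
SELECT service_name, COUNT(*) AS error_cnt
FROM "app-log-*"
WHERE level = 'ERROR'
  AND "@timestamp" > NOW() - INTERVAL 5 MINUTES
GROUP BY service_name
HAVING COUNT(*) > 50
```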

Recovery logic#

Data disappearance means recovery: When the SQL query result is empty (i.e., the HAVING condition is no longer met), the engine determines the incident has recovered. This is the most common recovery method.
Recovery query:
Scenario: Sometimes "no data found" doesn't mean recovery (it could be that log collection failed), or stricter recovery conditions are needed (e.g., no errors for N consecutive minutes).
Configuration: Write an independent SQL statement for recovery evaluation. If this query finds data, the incident is considered recovered.
Variable support: Recovery SQL supports using ${label_name} to reference alert event label values for precise recovery detection.
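
As a hedged sketch of such a recovery query, assuming the alert event carries a service_name label and that a healthy service keeps writing non-error logs (the index, fields, window, and threshold are all illustrative):

```sql
-- Recovery check: a row is returned (recovery confirmed) only when the service
-- has logged more than 100 non-ERROR lines in the last 10 minutes.
SELECT service_name, COUNT(*) AS ok_cnt
FROM "app-log-*"
WHERE service_name = '${service_name}'
  AND level <> 'ERROR'
  AND "@timestamp" > NOW() - INTERVAL 10 MINUTES
GROUP BY service_name
HAVING COUNT(*) > 100
```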

Pros and cons#

Pros: Leverages the ES cluster's computing power for filtering, reducing network transmission and improving performance.
Cons: Cannot distinguish between multiple severity levels (e.g., Info/Warning), because SQL can only return data meeting specific conditions.

3. No data mode#

This mode monitors scenarios where data is expected but actually missing, commonly used to monitor log collection pipeline interruptions or scheduled tasks not executing.

Configuration#

1. Query: Write a SQL query that should continuously return data.
   Example: Query heartbeat logs from all hosts (see the sketch after this list).
2. Evaluation rule:
   The engine periodically executes this SQL.
   If a host_name appeared in previous cycles but no longer appears in the current cycle (and for N consecutive cycles), a "no data" alert is triggered.
   Note: This is the opposite of data exists mode: data exists triggers an alert when data is found; no data triggers an alert when data is not found.
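
A minimal sketch of such a query, assuming a heartbeat-* index in which every host periodically writes a heartbeat log (index and field names are assumptions):

```sql
-- Every healthy host should appear in this result set each cycle;
-- a host_name that drops out of the results triggers a "no data" alert.
SELECT host_name, COUNT(*) AS beat_cnt
FROM "heartbeat-*"
WHERE "@timestamp" > NOW() - INTERVAL 5 MINUTES
GROUP BY host_name
```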

Recovery logic#

Data appearance means recovery: Once the host_name reappears in query results, the alert automatically recovers.
Auto recovery time: Configure an auto recovery time (e.g., 24 hours). If not recovered after this time, the engine automatically closes the alert. This is typically used for handling decommissioned machines that no longer need monitoring.

4. Use case example#

Log alerting often requires: counting ERROR logs in the last 5 minutes, triggering an alert if the count exceeds a threshold, and displaying the most recent ERROR log as a sample in the alert message. Here's the configuration:
Main alert condition: Use threshold mode with a SQL statement counting ERROR logs in the last 5 minutes and configure threshold conditions.
Enrichment query: Configure an enrichment query with a SQL statement that retrieves the most recent ERROR log, using variables like ${service_name} to limit to specific services.
Rule description: Reference enrichment query results in the alert rule's description field using the $relates variable to render the original log content.
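
As a hedged sketch of the two SQL statements involved (the index name, fields, and 5-minute window are assumptions carried over from the earlier examples):

```sql
-- Main alert condition (threshold mode): ERROR count per service in the last 5 minutes
SELECT service_name, COUNT(*) AS error_cnt
FROM "app-log-*"
WHERE level = 'ERROR'
  AND "@timestamp" > NOW() - INTERVAL 5 MINUTES
GROUP BY service_name
```

```sql
-- Enrichment query: the most recent ERROR log line for the alerting service;
-- ${service_name} is replaced with the alert's label value.
SELECT "@timestamp", message
FROM "app-log-*"
WHERE level = 'ERROR'
  AND service_name = '${service_name}'
ORDER BY "@timestamp" DESC
LIMIT 1
```

The enrichment query's result can then be rendered in the rule description via the $relates variable, as described above.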
