Query Statement: Write a SQL aggregate query that returns value columns and, optionally, grouping columns.
Example: Count error logs per service over the last 5 minutes.
SELECT service_name, count(*) AS error_cnt
FROM "app-logs-*"
WHERE "@timestamp" > now() - INTERVAL 5 MINUTES
  AND log_level = 'ERROR'
GROUP BY service_name
Field Mapping:
Label Fields: Fields that distinguish one alert object from another; in the example above, service_name. This can be left empty, in which case Monitors treats every field other than the value fields as a label field.
Value Fields: Numeric fields evaluated against the threshold; in the example above, error_cnt.
The engine executes the SQL query and receives the result as a two-dimensional table. It groups the rows by the label fields, then extracts the value field of each group and compares it against the threshold expressions.
Each combination of label field values uniquely identifies an alert object, so the query result must not contain more than one row with the same combination.
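The evaluation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the engine's actual implementation: the function name evaluate_threshold, the plain "greater than" comparison, and the sample rows are all hypothetical.

```python
def evaluate_threshold(rows, label_fields, value_field, threshold):
    """Return the label combinations whose value exceeds the threshold.

    Assumption: the threshold expression is a simple "value > threshold";
    real threshold expressions may be richer than this.
    """
    seen = set()
    firing = []
    for row in rows:
        # With no label fields configured, every non-value column is a label.
        fields = label_fields or [k for k in row if k != value_field]
        key = tuple((f, row[f]) for f in sorted(fields))
        # A label combination must identify exactly one row.
        if key in seen:
            raise ValueError(f"duplicate label combination: {dict(key)}")
        seen.add(key)
        if row[value_field] > threshold:
            firing.append(dict(key))
    return firing


# Hypothetical query result for the "errors per service" example.
rows = [
    {"service_name": "checkout", "error_cnt": 72},
    {"service_name": "search", "error_cnt": 3},
]
firing = evaluate_threshold(rows, ["service_name"], "error_cnt", 50)
```

With these sample rows, only the checkout service exceeds the threshold of 50, so it is the only alert object returned.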
Write an independent SQL statement for recovery evaluation. The statement can use ${label_name} variables to reference the label values of the alert.
Recovery Query Example
If the alert SQL finds that the network card with network_host="a" and interface="b" is down, the recovery SQL can be written as:
SELECT network_host, interface, status
FROM "network-status-*"
WHERE "@timestamp" > now() - INTERVAL 5 MINUTES
  AND network_host = '${network_host}'
  AND interface = '${interface}'
  AND status = 'UP'
The engine replaces the variables with the alert's actual label values before executing the query. If the query returns data, the alert is considered recovered.
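The substitution step can be illustrated with Python's standard string.Template, which happens to use the same ${name} placeholder syntax. This is a minimal sketch, assuming plain textual replacement; the helper names render_recovery_sql and is_recovered are hypothetical, and how the real engine escapes or validates label values is not covered by this document.

```python
from string import Template

# Recovery SQL from the example above, with ${...} placeholders left in place.
RECOVERY_SQL = Template(
    'SELECT network_host, interface, status FROM "network-status-*" '
    'WHERE "@timestamp" > now() - INTERVAL 5 MINUTES '
    "AND network_host = '${network_host}' "
    "AND interface = '${interface}' "
    "AND status = 'UP'"
)


def render_recovery_sql(labels):
    """Fill each ${label_name} placeholder with the alert's label value."""
    # substitute() raises KeyError if a referenced label is missing.
    return RECOVERY_SQL.substitute(labels)


def is_recovered(rows):
    """Recovery is determined as soon as the recovery query returns any row."""
    return len(rows) > 0


sql = render_recovery_sql({"network_host": "a", "interface": "b"})
```

For the alert in the example, the rendered statement filters on network_host = 'a' and interface = 'b', so only data for that specific network card can trigger recovery.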
Query Statement: Use a HAVING clause in the SQL to filter out the anomalous data directly.
Example: Query the services whose error count exceeds 50.
SELECT service_name, count(*) AS error_cnt
FROM "app-logs-*"
WHERE "@timestamp" > now() - INTERVAL 5 MINUTES
  AND log_level = 'ERROR'
GROUP BY service_name
HAVING count(*) > 50
Field Mapping:
In this mode, label fields and value fields are optional. If both are left empty, the engine will treat all fields in query results as label fields, which can be referenced in rule notes.
Recovery When Data Disappears: When the SQL query returns an empty result (that is, nothing satisfies the HAVING condition any longer), the engine considers the incident recovered. This is the most commonly used recovery method.
Recovery Query:
Scenario: Sometimes "no data found" does not mean recovery (log collection might be down), or a stricter recovery condition is needed (for example, no errors for N consecutive minutes).
Configuration: Write an independent SQL statement for recovery evaluation. As soon as that query returns data, the incident is considered recovered.
Variable Support: The recovery SQL can use ${label_name} to reference the label values of the alert event, so recovery is detected for that specific alert object.
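Recovery when data disappears can be pictured as a set comparison between the alert objects that were firing before and the rows returned by the latest query. The sketch below assumes the engine tracks firing objects by their label combination; the function names label_key and diff_alerts, and the sample data, are hypothetical.

```python
def label_key(row, label_fields):
    """Build a hashable identity for an alert object from its label values."""
    return tuple((f, row[f]) for f in sorted(label_fields))


def diff_alerts(previously_firing, rows, label_fields):
    """Compare the latest query result with the set of firing alert objects.

    Returns (current, new, recovered), each a set of label keys.
    """
    current = {label_key(r, label_fields) for r in rows}
    new = current - previously_firing        # rows that just appeared
    recovered = previously_firing - current  # rows that disappeared
    return current, new, recovered


# Hypothetical state: "checkout" was firing on the previous evaluation.
previous = {(("service_name", "checkout"),)}
# Latest HAVING query result: only "search" still exceeds 50 errors.
rows = [{"service_name": "search", "error_cnt": 61}]

current, new, recovered = diff_alerts(previous, rows, ["service_name"])
```

Here checkout no longer appears in the result and is treated as recovered, while search becomes a new alert object. An empty result would mark every previously firing object as recovered, which is why a separate recovery query is preferable when missing data may simply mean that collection is down.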
This mode monitors scenarios where data is expected but missing. It is commonly used to detect an interrupted log collection pipeline or a periodic task that did not run.
A common requirement in log alerting is to count the ERROR logs from the last 5 minutes, alert when the count exceeds a threshold, and show the most recent ERROR log as a sample in the alert message. Configuration approach:
Main Alert Condition: Use Threshold mode. The SQL statement counts the ERROR logs from the last 5 minutes, and the threshold conditions are configured on that count.
Related Query: Configure a related query whose SQL statement fetches the most recent ERROR log, using ${service_name} and other variables to restrict it to the specific service.
Rule Notes Description: Reference the related query result in the notes description of the alert rule, using the $relates variable to render the original log.