Monitors implements log and metric data monitoring through the Elasticsearch SQL feature, supporting flexible aggregate queries and alert evaluation.
Core Concepts
| Config Item | Description |
|---|---|
| Query Language | Currently only supports SQL syntax |
| Field Processing | All field names are automatically converted to lowercase; please use lowercase letters when configuring |
1. Threshold Evaluation Mode
This mode is suitable for scenarios requiring threshold comparison on aggregated values, such as monitoring “error log count in the last 5 minutes”.
Configuration
- Query Statement: Write a SQL aggregate query that returns value columns and (optionally) grouping columns.
  - Example: Count error logs by service in the last 5 minutes.
- Field Mapping:
  - Label Fields: Fields used to distinguish different alert objects. In the above example, it’s `service_name`. Can be left empty; Monitors will automatically treat all fields except value fields as label fields.
  - Value Fields: Numeric fields used for threshold evaluation. In the above example, it’s `error_cnt`.
- Threshold Conditions:
  - Use `$A.field_name` to reference values.
  - Example: `Critical: $A.error_cnt > 50`, `Warning: $A.error_cnt > 10`.
  - Shorthand: If only one value field is configured, you can use `$A` directly, like `$A > 50`.
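
The configuration above can be sketched in Python as a rough illustration. This is not the Monitors engine itself: the index pattern `app-logs-*`, the `level` and `@timestamp` fields, the `evaluate` helper, and the "most severe match wins" ordering are all assumptions made for the example. Only `service_name`, `error_cnt`, and the two thresholds come from the text.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical Elasticsearch SQL for "error logs by service in the last 5 minutes".
# Index pattern and the level/@timestamp fields are assumptions.
QUERY = """
SELECT service_name, COUNT(*) AS error_cnt
FROM "app-logs-*"
WHERE level = 'ERROR' AND "@timestamp" > NOW() - INTERVAL 5 MINUTES
GROUP BY service_name
"""

LABEL_FIELDS: List[str] = ["service_name"]
VALUE_FIELDS: List[str] = ["error_cnt"]

# Stand-ins for "Critical: $A.error_cnt > 50" and "Warning: $A.error_cnt > 10".
# Checked in order, most severe first (an assumption about precedence).
THRESHOLDS = [
    ("Critical", lambda a: a["error_cnt"] > 50),
    ("Warning", lambda a: a["error_cnt"] > 10),
]


def evaluate(rows: List[Dict[str, Any]]) -> Dict[Tuple, Optional[str]]:
    """Group result rows by label fields and compare value fields to thresholds."""
    results: Dict[Tuple, Optional[str]] = {}
    for row in rows:
        # Empty label mapping: every non-value column is treated as a label.
        label_names = LABEL_FIELDS or [f for f in row if f not in VALUE_FIELDS]
        labels = tuple((name, row[name]) for name in label_names)
        values = {name: row[name] for name in VALUE_FIELDS}
        severity = next((lvl for lvl, cond in THRESHOLDS if cond(values)), None)
        results[labels] = severity
    return results


# Rows shaped like the two-dimensional table the query would return.
rows = [
    {"service_name": "api", "error_cnt": 72},
    {"service_name": "web", "error_cnt": 12},
    {"service_name": "db", "error_cnt": 3},
]
result = evaluate(rows)
```

Here `result` maps each label set to `"Critical"`, `"Warning"`, or `None` (no threshold matched), so each service is tracked as its own alert object.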
How It Works
The engine executes the SQL query and gets two-dimensional table data. It groups the data by “label fields”, then extracts the “value fields” and compares them against the threshold expressions.
Recovery Logic
| Strategy | Description |
|---|---|
| Auto Recovery | When values no longer satisfy any alert threshold, automatically generates recovery event |
| Specific Recovery Condition | Configure recovery expression (e.g., $A.error_cnt < 5) to prevent alert flapping |
| Recovery Query | Write independent SQL for recovery evaluation, supports ${label_name} variables |
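
The difference between Auto Recovery and a Specific Recovery Condition can be illustrated with a small state function. This is a sketch under assumptions, not the engine's actual logic: `next_state` and `run` are hypothetical helpers, and the sample series is invented. The alert condition `> 10` and the recovery condition `< 5` mirror the examples in this section.

```python
from typing import Callable, List, Optional

Condition = Callable[[float], bool]


def next_state(firing: bool, value: float, alert: Condition,
               recovery: Optional[Condition] = None) -> bool:
    """Return True if the alert is firing after seeing `value`."""
    if alert(value):
        return True
    if not firing:
        return False
    if recovery is None:
        # Auto Recovery: no threshold matches any more, so recover immediately.
        return False
    # Specific Recovery Condition: stay firing until the expression holds.
    return not recovery(value)


def run(series: List[float], alert: Condition,
        recovery: Optional[Condition] = None) -> List[bool]:
    firing, states = False, []
    for value in series:
        firing = next_state(firing, value, alert, recovery)
        states.append(firing)
    return states


samples = [12, 8, 12, 8, 3]          # error_cnt hovering around the threshold
alert = lambda v: v > 10             # Warning: $A.error_cnt > 10

flapping = run(samples, alert)                          # [True, False, True, False, False]
steady = run(samples, alert, recovery=lambda v: v < 5)  # [True, True, True, True, False]
```

With auto recovery the alert opens and closes twice; with the explicit recovery expression it stays open until the value drops below 5.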
Recovery Query Example
If the alert SQL found that the network card with `network_host="a"`, `interface="b"` is down, the recovery SQL can filter on those label values through `${network_host}` and `${interface}`. The engine will replace variables with actual values before executing the query. If data is found, recovery is determined.
2. Data Exists Mode
This mode is suitable for scenarios where filter logic is written directly in SQL, or when you only care about “whether data is returned”.
Configuration
- Query Statement: Use a `HAVING` clause in SQL so that the query returns only anomalous data.
  - Example: Directly query services with error count exceeding 50.
- Field Mapping:
  - In this mode, label fields and value fields are optional. If both are left empty, the engine will treat all fields in query results as label fields, which can be referenced in rule notes.
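
As a minimal sketch of this mode, the snippet below pairs a hypothetical `HAVING` query with a helper that turns every returned row into an alert. The index pattern, the `level` and `@timestamp` fields, and the `alerts_from_rows` helper are assumptions for illustration. How the engine treats a mapping where only value fields are set is not stated in the text; the sketch assumes the remaining columns become labels.

```python
from typing import Any, Dict, List, Optional

# Hypothetical Elasticsearch SQL: only services above the limit are returned.
QUERY = """
SELECT service_name, COUNT(*) AS error_cnt
FROM "app-logs-*"
WHERE level = 'ERROR' AND "@timestamp" > NOW() - INTERVAL 5 MINUTES
GROUP BY service_name
HAVING COUNT(*) > 50
"""


def alerts_from_rows(rows: List[Dict[str, Any]],
                     label_fields: Optional[List[str]] = None,
                     value_fields: Optional[List[str]] = None) -> List[Dict[str, Dict]]:
    """In Data Exists mode, each returned row is an alert; no threshold is applied."""
    value_fields = value_fields or []
    alerts = []
    for row in rows:
        # With no mapping at all, every column becomes a label.
        names = label_fields or [f for f in row if f not in value_fields]
        alerts.append({
            "labels": {f: row[f] for f in names},
            "values": {f: row[f] for f in value_fields},
        })
    return alerts


rows = [{"service_name": "api", "error_cnt": 72}]
unmapped = alerts_from_rows(rows)
mapped = alerts_from_rows(rows, ["service_name"], ["error_cnt"])
```

In the unmapped case both `service_name` and `error_cnt` end up as labels, which is what makes them available to rule notes.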
Recovery Logic
- Recovery When Data Disappears: When the SQL query result is empty (i.e., nothing satisfies the `HAVING` condition any longer), the engine determines that the incident has recovered. This is the most commonly used recovery method.
- Recovery Query:
  - Scenario: Sometimes “no data found” doesn’t mean recovery (log collection might be down), or stricter recovery conditions are needed (like no errors for N consecutive minutes).
  - Configuration: Write an independent SQL statement for recovery evaluation. As long as that query finds data, the incident is considered recovered.
  - Variable Support: Use `${label_name}` in the recovery SQL to reference the alert event’s label values for precise recovery detection.
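
The variable substitution step can be approximated with Python's `string.Template`, which happens to use the same `${name}` placeholder syntax. This is only a sketch: the `service-status-*` index, the `status` field, and the `render` and `is_recovered` helpers are hypothetical, and the quote escaping is an assumption about what a careful implementation would do rather than documented engine behavior.

```python
from string import Template
from typing import Any, Dict, List

# Hypothetical recovery SQL: data is found only when the service reports healthy.
RECOVERY_SQL = Template("""
SELECT service_name
FROM "service-status-*"
WHERE service_name = '${service_name}'
  AND status = 'healthy'
  AND "@timestamp" > NOW() - INTERVAL 5 MINUTES
LIMIT 1
""")


def render(template: Template, labels: Dict[str, Any]) -> str:
    """Replace ${label_name} placeholders with the alert event's label values."""
    # Double single quotes so a label value cannot end the SQL string literal.
    safe = {name: str(value).replace("'", "''") for name, value in labels.items()}
    return template.substitute(safe)


def is_recovered(recovery_rows: List[Dict[str, Any]]) -> bool:
    """The incident counts as recovered as soon as the recovery query finds data."""
    return len(recovery_rows) > 0


sql = render(RECOVERY_SQL, {"service_name": "api"})
quoted = render(RECOVERY_SQL, {"service_name": "o'brien"})
```

Because the query is rendered per alert event, each service is checked for recovery independently of the others.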
Pros and Cons Analysis
| Type | Description |
|---|---|
| Pros | Leverages ES cluster’s computing power for filtering, reducing network transmission with better performance |
| Cons | Cannot differentiate multi-level alerts (like Info/Warning); SQL can only return data satisfying specific conditions |
3. No Data Mode
This mode monitors scenarios where “data is expected but actually missing”, and is commonly used to detect log collection pipeline interruptions or periodic tasks that fail to run.
Configuration
- Query Statement: Write a SQL query that is expected to continuously return data.
  - Example: Query log reporting heartbeat from all hosts.
- Evaluation Rules:
  - The engine periodically executes this SQL.
  - If a `host_name` appeared in previous cycles but no longer appears in the query results of the current cycle (and of N consecutive cycles), a “No Data” alert is triggered.
  - Note: This is the opposite of Data Exists mode. Data Exists means “alert when data found”; No Data means “alert when data not found”.
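
The evaluation rules above can be sketched as a small tracker that remembers which label values have been seen and counts consecutive cycles in which each one is absent. The `NoDataDetector` class, the `heartbeat-*` index, and the `heartbeat_cnt` column are hypothetical; the timeout-based auto recovery is not modeled here.

```python
from typing import Any, Dict, List

# Hypothetical Elasticsearch SQL expected to return one row per reporting host.
QUERY = """
SELECT host_name, COUNT(*) AS heartbeat_cnt
FROM "heartbeat-*"
WHERE "@timestamp" > NOW() - INTERVAL 5 MINUTES
GROUP BY host_name
"""


class NoDataDetector:
    """Flags label values that were seen before but are now missing."""

    def __init__(self, label_field: str, missing_cycles: int) -> None:
        self.label_field = label_field
        self.missing_cycles = missing_cycles      # the "N consecutive cycles"
        self.missing: Dict[Any, int] = {}         # label value -> cycles absent

    def observe(self, rows: List[Dict[str, Any]]) -> List[Any]:
        """Feed one cycle's query result; return label values in No Data state."""
        seen = {row[self.label_field] for row in rows}
        for label in seen:
            self.missing[label] = 0               # present (or reappeared): reset
        alerts = []
        for label in self.missing:
            if label not in seen:
                self.missing[label] += 1
                if self.missing[label] >= self.missing_cycles:
                    alerts.append(label)
        return sorted(alerts)


detector = NoDataDetector("host_name", missing_cycles=2)
cycle1 = detector.observe([{"host_name": "a"}, {"host_name": "b"}])  # []
cycle2 = detector.observe([{"host_name": "a"}])                      # []
cycle3 = detector.observe([{"host_name": "a"}])                      # ['b']
cycle4 = detector.observe([{"host_name": "a"}, {"host_name": "b"}])  # []
```

Host `b` alerts only after it has been absent for two cycles in a row, and drops out of the alert list as soon as it reappears.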
Recovery Logic
| Strategy | Description |
|---|---|
| Recovery When Data Appears | Once that host_name reappears in query results, alert automatically recovers |
| Auto Recovery Timeout | Configurable timeout (like 24 hours), automatically closes alert after timeout |
4. Use Case
Log alerting often encounters this requirement: count ERROR logs in the last 5 minutes, alert if the count exceeds a threshold, and display the most recent ERROR log as a sample in the alert message. Configuration approach:
- Main Alert Condition: Use Threshold mode. The SQL statement counts ERROR logs in the last 5 minutes; configure threshold conditions on the count.
- Related Query: Configure a related query whose SQL statement fetches the most recent ERROR log, using `${service_name}` and other variables to limit it to the specific service.
- Rule Notes Description: Reference the related query results in the alert rule’s notes description, using the `$relates` variable to render the original log.
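
The pairing of a main query and a related query might look like the sketch below. The index pattern and the `level`, `message`, and `@timestamp` fields are assumptions, as is the `related_sql_for` helper. The sketch stops at producing the related SQL for one alert, since the notes syntax for `$relates` is not described here.

```python
from string import Template
from typing import Any, Dict

# Hypothetical main alert query (Threshold mode): ERROR count per service.
MAIN_SQL = """
SELECT service_name, COUNT(*) AS error_cnt
FROM "app-logs-*"
WHERE level = 'ERROR' AND "@timestamp" > NOW() - INTERVAL 5 MINUTES
GROUP BY service_name
"""

# Hypothetical related query: newest ERROR log of the alerting service only.
RELATED_SQL = Template("""
SELECT "@timestamp", service_name, message
FROM "app-logs-*"
WHERE level = 'ERROR' AND service_name = '${service_name}'
ORDER BY "@timestamp" DESC
LIMIT 1
""")


def related_sql_for(alert_labels: Dict[str, Any]) -> str:
    """Fill the related query with the labels of one alert event."""
    # Double single quotes so a label value cannot end the SQL string literal.
    safe = {k: str(v).replace("'", "''") for k, v in alert_labels.items()}
    return RELATED_SQL.substitute(safe)


# Labels taken from one row of the main query's result.
sample_sql = related_sql_for({"service_name": "api"})
```

Keeping the sample lookup in a separate query lets the main query stay a pure aggregate, while the raw log line is fetched only for services that actually alert.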