ElasticSearch

This document provides detailed instructions on configuring alert rules for ElasticSearch data sources in Monitors. Monitors uses ElasticSearch SQL functionality to monitor log and metric data, supporting flexible aggregation queries and alert evaluation.

Core concepts#

Version requirements: Due to SQL feature dependencies, only ElasticSearch 6.3 and above are supported.
Query language: Currently only SQL syntax is supported.
Field handling: The alert engine automatically converts all field names in query results to lowercase. When configuring value fields and label fields, always use lowercase letters.

1. Threshold mode#

This mode is suitable for scenarios requiring threshold comparisons on aggregated values, such as monitoring "error log count in the last 5 minutes".

Configuration#

1. Query: Write a SQL aggregation query that returns numeric columns and (optional) grouping columns.
   Example: Count error logs per service in the last 5 minutes (see the sketch after this list).
2. Field mapping:
   Label fields: Fields used to distinguish different alert objects. In the example above, this is service_name. This field can be left empty, and Monitors will automatically treat all fields except value fields as label fields.
   Value fields: Numeric fields used for threshold evaluation. In the example above, this is error_cnt.
3. Threshold conditions:
   Use $A.field_name to reference values.
   Example: Critical: $A.error_cnt > 50, Warning: $A.error_cnt > 10.
   Shorthand: If only one value field is configured, you can use $A directly, e.g., $A > 50.
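
A minimal sketch of such a query, assuming an index pattern app-log-* with level, service_name, and @timestamp fields (all names are illustrative, not part of the product):

```sql
-- Count ERROR logs per service over the last 5 minutes (illustrative index/field names)
SELECT service_name, COUNT(*) AS error_cnt
FROM "app-log-*"
WHERE level = 'ERROR'
  AND "@timestamp" > NOW() - INTERVAL 5 MINUTES
GROUP BY service_name
```

With this query, service_name is the label field, error_cnt is the value field, and the threshold conditions above ($A.error_cnt > 50 / $A.error_cnt > 10) are evaluated once per service.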

How it works#

The engine executes the SQL query and retrieves tabular data. It groups data by label fields, then extracts value field values for comparison against threshold expressions.
Note: The combination of label fields uniquely identifies an alert object, so the query result must not contain multiple rows with the same combination of label values; make sure each alert object corresponds to exactly one row of data. In the example above, service_name values must be unique: if two rows share the same service_name, the alert engine cannot properly distinguish between the alert objects.
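
For illustration (the row values below are made up), suppose the query returns:

```text
service_name | error_cnt
order        | 72
payment      | 18
checkout     | 3
```

With Critical: $A.error_cnt > 50 and Warning: $A.error_cnt > 10, the engine raises a Critical alert for order, a Warning alert for payment, and no alert for checkout; each service_name is one alert object.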

Recovery logic#

Similar to Prometheus data sources, ElasticSearch threshold mode also supports flexible recovery strategies:
Auto recovery: When the latest SQL query result shows that a data group's value no longer meets any alert threshold (Critical/Warning), a recovery event is automatically generated.
Specific recovery conditions: Configure additional recovery expressions (e.g., $A.error_cnt < 5). Recovery is only confirmed when the value falls below this threshold, preventing alert flapping.
Recovery query:
Scenario: Sometimes alert queries and recovery queries have different logic. For example, the alert checks for "error log count > 10", while recovery might check for "success log count > 100" or query a different status index.
Configuration: Write an independent SQL statement for recovery evaluation.
Variable support: Recovery SQL supports using ${label_name} to reference alert event label values.
Example: The alert SQL found that the network interface with network_host="a", interface="b" is down, and the recovery SQL can be a query like the sketch below. The engine replaces ${network_host} and ${interface} with the actual label values before executing the query; if the query returns data, recovery is confirmed.
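
A minimal sketch of such a recovery query, assuming a status index net-status-* with a status field (index and field names are assumptions):

```sql
-- Recovery check: is the interface reporting 'up' again in the last 5 minutes?
-- ${network_host} and ${interface} are replaced with the alert's label values.
SELECT network_host, interface
FROM "net-status-*"
WHERE network_host = '${network_host}'
  AND interface = '${interface}'
  AND status = 'up'
  AND "@timestamp" > NOW() - INTERVAL 5 MINUTES
```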

2. Data exists mode#

This mode is suitable for scenarios where filtering logic is written directly in SQL, or when you only need to check "whether any data is returned".

Configuration#

1. Query: Use a HAVING clause in SQL to directly filter out anomalous data.
   Example: Directly query services with more than 50 errors (see the sketch after this list).
2. Field mapping:
   In this mode, label fields and value fields are optional. If both are left empty, the engine treats all fields in the query result as label fields, which can be referenced in rule descriptions.
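
A minimal sketch of such a query, reusing the illustrative app-log-* index from the threshold mode example:

```sql
-- Only services with more than 50 ERROR logs in the last 5 minutes are returned;
-- in data exists mode, any returned row triggers an alert.
SELECT service_name, COUNT(*) AS error_cnt
FROM "app-log-*"
WHERE level = 'ERROR'
  AND "@timestamp" > NOW() - INTERVAL 5 MINUTES
GROUP BY service_name
HAVING COUNT(*) > 50
```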

Recovery logic#

Data disappearance means recovery: When the SQL query result is empty (i.e., the HAVING condition is no longer met), the engine determines the incident has recovered. This is the most common recovery method.
Recovery query:
Scenario: Sometimes "no data found" doesn't mean recovery (it could be that log collection failed), or stricter recovery conditions are needed (e.g., no errors for N consecutive minutes).
Configuration: Write an independent SQL statement for recovery evaluation. If this query finds data, the incident is considered recovered.
Variable support: Recovery SQL supports using ${label_name} to reference alert event label values for precise recovery detection.
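
As a hedged sketch of such a recovery query, assuming the alert event carries a service_name label and that a healthy service keeps writing non-error logs (the index, fields, window, and threshold are all illustrative):

```sql
-- Recovery check: a row is returned (recovery confirmed) only when the service
-- has logged more than 100 non-ERROR lines in the last 10 minutes.
SELECT service_name, COUNT(*) AS ok_cnt
FROM "app-log-*"
WHERE service_name = '${service_name}'
  AND level <> 'ERROR'
  AND "@timestamp" > NOW() - INTERVAL 10 MINUTES
GROUP BY service_name
HAVING COUNT(*) > 100
```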

Pros and cons#

Pros: Leverages the ES cluster's computing power for filtering, reducing network transmission and improving performance.
Cons: Cannot distinguish between multiple severity levels (e.g., Info/Warning), because SQL can only return data meeting specific conditions.

3. No data mode#

This mode monitors scenarios where data is expected but actually missing, commonly used to monitor log collection pipeline interruptions or scheduled tasks not executing.

Configuration#

1. Query: Write a SQL query that should continuously return data.
   Example: Query heartbeat logs from all hosts (see the sketch after this list).
2. Evaluation rule:
   The engine periodically executes this SQL.
   If a host_name appeared in previous cycles but no longer appears in the current cycle (and for N consecutive cycles), a "no data" alert is triggered.
   Note: This is the opposite of data exists mode: data exists triggers an alert when data is found; no data triggers an alert when data is not found.
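
A minimal sketch of such a query, assuming a heartbeat-* index in which every host periodically writes a heartbeat log (index and field names are assumptions):

```sql
-- Every healthy host should appear in this result set each cycle;
-- a host_name that drops out of the results triggers a "no data" alert.
SELECT host_name, COUNT(*) AS beat_cnt
FROM "heartbeat-*"
WHERE "@timestamp" > NOW() - INTERVAL 5 MINUTES
GROUP BY host_name
```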

Recovery logic#

Data appearance means recovery: Once the host_name reappears in query results, the alert automatically recovers.
Auto recovery time: Configure an auto recovery time (e.g., 24 hours). If not recovered after this time, the engine automatically closes the alert. This is typically used for handling decommissioned machines that no longer need monitoring.

4. Use case example#

Log alerting often requires: counting ERROR logs in the last 5 minutes, triggering an alert if the count exceeds a threshold, and displaying the most recent ERROR log as a sample in the alert message. Here's the configuration:
Main alert condition: Use threshold mode with a SQL statement counting ERROR logs in the last 5 minutes and configure threshold conditions.
Enrichment query: Configure an enrichment query with a SQL statement that retrieves the most recent ERROR log, using variables like ${service_name} to limit to specific services.
Rule description: Reference enrichment query results in the alert rule's description field using the $relates variable to render the original log content.
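
As a hedged sketch of the two SQL statements involved (the index name, fields, and 5-minute window are assumptions carried over from the earlier examples):

```sql
-- Main alert condition (threshold mode): ERROR count per service in the last 5 minutes
SELECT service_name, COUNT(*) AS error_cnt
FROM "app-log-*"
WHERE level = 'ERROR'
  AND "@timestamp" > NOW() - INTERVAL 5 MINUTES
GROUP BY service_name
```

```sql
-- Enrichment query: the most recent ERROR log line for the alerting service;
-- ${service_name} is replaced with the alert's label value.
SELECT "@timestamp", message
FROM "app-log-*"
WHERE level = 'ERROR'
  AND service_name = '${service_name}'
ORDER BY "@timestamp" DESC
LIMIT 1
```

The enrichment query's result can then be rendered in the rule description via the $relates variable, as described above.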
