Prometheus

This document provides detailed instructions on configuring alert rules for Prometheus data sources in Monitors. Monitors is compatible with Prometheus PromQL query syntax and offers flexible alert evaluation and recovery mechanisms to meet various monitoring requirements.

Core concepts

Before configuring alert rules, it's essential to understand how Monitors processes Prometheus data. The alert engine supports three core evaluation modes:

1. Threshold: The query returns raw metric values, and the alert engine performs threshold comparisons in memory.
2. Data exists: The query itself contains filter conditions and only returns anomalous data. The engine triggers an alert when data is found.
3. No data: Used to monitor scenarios where data reporting is interrupted.

1. Threshold mode

This mode is suitable for scenarios requiring multi-level alerts on the same metric (e.g., Info/Warning/Critical) or when precise recovery values are needed.

Configuration

- Query (PromQL): Write a PromQL query without comparison operators that returns only metric values.
  Example: query memory usage:
  (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
- Threshold conditions: Define threshold expressions for different severity levels in the rule configuration; the variable $A represents the query result value (see the combined sketch after this list).
  - Critical: $A > 90 (triggers a critical alert when memory usage exceeds 90%)
  - Warning: $A > 80 (triggers a warning alert when memory usage exceeds 80%)
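
Putting the two pieces together, a minimal threshold-mode sketch (the $A conditions are rule-configuration expressions evaluated by the alert engine, not PromQL):

# Query A (plain PromQL, no comparison operator): memory usage per node
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100

# Threshold conditions configured on the rule, evaluated against $A:
#   Critical: $A > 90
#   Warning:  $A > 80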

Multiple queries and data correlation

Monitors supports configuring multiple query statements in a single alert rule (named A, B, C...) and referencing their results together in threshold expressions (e.g., $A > 90 and $B < 50).

- Auto join: The alert engine automatically correlates results from different queries based on labels.
- Alignment requirement: Two queries can only be correlated within the same context when their returned data carries exactly the same label sets.
  Example: If query A returns cpu_usage_percent{instance="host-1", job="node"} and query B returns mem_usage_percent{instance="host-1", job="node"}, then $A and $B correlate successfully.
  Note: If query A carries an additional label (e.g., disk="/") that query B does not, the two cannot be correlated. It's recommended to use aggregation operations like sum by (...) or avg by (...) in PromQL to explicitly control the returned labels and keep them consistent across queries, as in the sketch below.
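
A hedged sketch, assuming per-instance metrics named cpu_usage_percent and mem_usage_percent as in the example above:

# Query A: average CPU usage per instance
avg by (instance) (cpu_usage_percent)

# Query B: average memory usage per instance
avg by (instance) (mem_usage_percent)

# Both results carry exactly the label set {instance}, so the engine can
# join them row by row and evaluate a condition such as: $A > 90 and $B < 50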

How it works

The alert engine periodically executes PromQL and retrieves current values for all Series. Then it iterates through each Series, matching Critical, Warning, and Info conditions in order. Once a condition is met, an alert event of the corresponding severity is generated.

Recovery logic

Multiple recovery strategies are supported:

- Auto recovery: When the latest query result no longer meets any alert threshold, a recovery event is generated automatically.
- Specific recovery conditions: Configure additional recovery expressions (e.g., $A < 75). Recovery is confirmed only when the value falls back to the specified level, preventing frequent flapping near the threshold.
- Recovery query: Define a custom PromQL query used solely for recovery evaluation.
  Principle: After an alert triggers, the engine periodically executes the recovery query. If it returns data (i.e., the result is not empty), the incident is considered recovered.
  Variable support: Recovery queries support embedded variables (in the format ${label_name}), which are automatically replaced with the corresponding label values from the alert event. This lets the recovery query precisely target a specific alert object (such as a particular instance or device), as in the sketch below.
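
A minimal recovery-query sketch for the memory example above, assuming a recovery threshold of 75% (${instance} is filled in from the alert event's labels):

# Returns data only when the alerting instance's memory usage has fallen
# below 75%; a non-empty result marks the incident as recovered.
(node_memory_MemTotal_bytes{instance="${instance}"}
  - node_memory_MemAvailable_bytes{instance="${instance}"})
  / node_memory_MemTotal_bytes{instance="${instance}"} * 100 < 75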

2. Data exists mode

This mode behaves consistently with Prometheus native alerting rules. It's suitable for users who prefer defining thresholds directly in PromQL or scenarios requiring high-performance processing of large numbers of Series.

Configuration

- Query (PromQL): Write a PromQL query with comparison operators so that only anomalous data is returned.
  Example: query nodes with memory usage exceeding 90%:
  (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
- Evaluation rule: No additional threshold expressions are needed. The engine triggers an alert as soon as the query returns results; the number of alert events equals the number of data rows returned.

Recovery logic

- Data disappearance means recovery: The engine queries periodically. If a given Series can no longer be found, the engine determines that the corresponding alert has recovered. Note: data identification is based on label sets.
- Recovery query (optional): Configure an independent query for recovery evaluation (e.g., up{instance="${instance}"} == 1 to confirm service recovery). Recovery is confirmed only when data is found. The variable ${instance} in this query is replaced with the actual label value from the alert event.

Pros and cons

Pros:
- Better performance: Filtering logic is pushed down to the Prometheus server, reducing the data transmitted to the alert engine.
- Low migration cost: Existing Prometheus rule statements can be reused directly.

Cons:
- Single severity level: One rule typically corresponds to one severity level. To distinguish between >90 and >80, two rules must be configured, as in the sketch after this list.
- Recovery value retrieval: At recovery time Prometheus no longer returns data (the threshold is no longer met), so the alert engine cannot directly obtain the specific value at recovery. You can, however, use enrichment queries to fetch real-time values.
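
For example, splitting the memory example into two data-exists rules might look like this (a sketch; chaining > 80 < 90 is one way to keep the Warning rule from also matching Critical values):

# Rule 1, severity Critical: memory usage above 90%
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100 > 90

# Rule 2, severity Warning: memory usage between 80% and 90%
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100 > 80 < 90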

3. No data mode

This mode is specifically designed to monitor whether monitored objects are alive or whether data reporting pipelines are functioning properly.

Configuration

- Query: Write a query for metrics that should always exist.
  Example: up{job="my-service"}
- Evaluation rule: If no data at all can be queried for a Series for N consecutive cycles (truly no data, not a value of 0), a "no data" alert is triggered (see the sketch after this list).
- Engine restart: If the alert engine (monitedge) restarts, its in-memory state is lost and the missing-cycle counter starts over. If data that was queryable before the restart happens to be unavailable afterwards, no alert is triggered.
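
A short sketch of the per-Series semantics:

# One query; the engine tracks each returned Series independently
# (one per instance of my-service).
up{job="my-service"}

# If a Series that used to return data, e.g.
#   up{job="my-service", instance="host-1"}
# returns nothing for N consecutive cycles, a "no data" alert fires
# for that Series alone.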

Typical applications

- Monitoring Exporter crashes.
- Monitoring interruptions in instrumented reporting services.
- Monitoring batch jobs that fail to execute on schedule.

Comparison with Prometheus's native absent() function

- Prometheus's absent() function requires listing every combination of identifying labels for a metric, e.g., absent(up{instance="host-1"}); monitoring multiple instances requires multiple statements.
- Monitors' no data mode requires only one query: the alert engine automatically tracks all returned Series, and any Series with missing data triggers an alert. The sketch below contrasts the two approaches.
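
A side-by-side sketch (the instance labels here are illustrative):

# Prometheus native: one absent() expression per monitored object
absent(up{job="my-service", instance="host-1"})
absent(up{job="my-service", instance="host-2"})

# Monitors no data mode: a single query covers every instance; any
# Series that stops reporting triggers its own alert
up{job="my-service"}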

Advanced configuration

Labels and variables

Monitors automatically parses the labels returned by Prometheus. In recovery queries or enrichment queries, you can use ${label_name} to reference label values.
For example, if a query result contains the labels instance="host-1" and job="node", you can write the recovery query as:
up{instance="${instance}", job="${job}"} == 1
At execution time, the alert engine replaces ${instance} with host-1 and ${job} with node, the actual values taken from the alert event's labels.

Enrichment query

To enrich alert notification content, you can configure enrichment queries. Enrichment queries do not participate in alert evaluation; they are used only to retrieve additional information.

- Scenario: During a CPU alert, also query the machine's mem_usage_percent and display it in the alert details to aid troubleshooting.
- Variables: Enrichment queries support ${label_name} variables for targeting the specific alert object, as in the sketch below.
- Result display: Enrichment query results can be referenced in the alert rule's description field via the $relates variable. Refer to the description field documentation for detailed usage.
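
A minimal enrichment-query sketch for the CPU-alert scenario above, assuming a per-instance metric named mem_usage_percent:

# Fetches the alerting host's current memory usage; the result does not
# affect evaluation and can be shown in the description via $relates.
mem_usage_percent{instance="${instance}"}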
