Prometheus

This document provides detailed instructions on configuring alert rules for Prometheus data sources in Monitors. Monitors is compatible with Prometheus PromQL query syntax and offers flexible alert evaluation and recovery mechanisms to meet various monitoring requirements.

Core concepts

Before configuring alert rules, it's essential to understand how Monitors processes Prometheus data. The alert engine supports three core evaluation modes:

1. Threshold: The query returns raw metric values, and the alert engine performs threshold comparisons in memory.
2. Data exists: The query itself contains filter conditions and only returns anomalous data. The engine triggers an alert when data is found.
3. No data: Used to monitor scenarios where data reporting is interrupted.

1. Threshold mode

This mode is suitable for scenarios requiring multi-level alerts on the same metric (e.g., Info/Warning/Critical) or when precise recovery values are needed.

Configuration

- Query (PromQL): Write a PromQL query without comparison operators that returns only metric values.
  Example: query memory usage:
  (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
- Threshold conditions: Define threshold expressions for different severity levels in the rule configuration; the variable $A represents the query result value (see the combined sketch after this list).
  - Critical: $A > 90 (triggers a critical alert when memory usage exceeds 90%)
  - Warning: $A > 80 (triggers a warning alert when memory usage exceeds 80%)
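
Putting the two pieces together, a minimal threshold-mode sketch (the $A conditions are rule-configuration expressions evaluated by the alert engine, not PromQL):

# Query A (plain PromQL, no comparison operator): memory usage per node
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100

# Threshold conditions configured on the rule, evaluated against $A:
#   Critical: $A > 90
#   Warning:  $A > 80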

Multiple queries and data correlation

Monitors supports configuring multiple query statements in a single alert rule (named A, B, C...) and referencing their results together in threshold expressions (e.g., $A > 90 and $B < 50).

- Auto join: The alert engine automatically correlates results from different queries based on labels.
- Alignment requirement: Two queries can only be correlated within the same context when their returned data carries exactly the same label sets.
  Example: If query A returns cpu_usage_percent{instance="host-1", job="node"} and query B returns mem_usage_percent{instance="host-1", job="node"}, then $A and $B correlate successfully.
  Note: If query A carries an additional label (e.g., disk="/") that query B does not, the two cannot be correlated. It's recommended to use aggregation operations like sum by (...) or avg by (...) in PromQL to explicitly control the returned labels and keep them consistent across queries, as in the sketch below.
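
A hedged sketch, assuming per-instance metrics named cpu_usage_percent and mem_usage_percent as in the example above:

# Query A: average CPU usage per instance
avg by (instance) (cpu_usage_percent)

# Query B: average memory usage per instance
avg by (instance) (mem_usage_percent)

# Both results carry exactly the label set {instance}, so the engine can
# join them row by row and evaluate a condition such as: $A > 90 and $B < 50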

How it works

The alert engine periodically executes PromQL and retrieves current values for all Series. Then it iterates through each Series, matching Critical, Warning, and Info conditions in order. Once a condition is met, an alert event of the corresponding severity is generated.

Recovery logic

Multiple recovery strategies are supported:

- Auto recovery: When the latest query result no longer meets any alert threshold, a recovery event is generated automatically.
- Specific recovery conditions: Configure additional recovery expressions (e.g., $A < 75). Recovery is confirmed only when the value falls back to the specified level, preventing frequent flapping near the threshold.
- Recovery query: Define a custom PromQL query used solely for recovery evaluation.
  Principle: After an alert triggers, the engine periodically executes the recovery query. If it returns data (i.e., the result is not empty), the incident is considered recovered.
  Variable support: Recovery queries support embedded variables (in the format ${label_name}), which are automatically replaced with the corresponding label values from the alert event. This lets the recovery query precisely target a specific alert object (such as a particular instance or device), as in the sketch below.
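
A minimal recovery-query sketch for the memory example above, assuming a recovery threshold of 75% (${instance} is filled in from the alert event's labels):

# Returns data only when the alerting instance's memory usage has fallen
# below 75%; a non-empty result marks the incident as recovered.
(node_memory_MemTotal_bytes{instance="${instance}"}
  - node_memory_MemAvailable_bytes{instance="${instance}"})
  / node_memory_MemTotal_bytes{instance="${instance}"} * 100 < 75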

2. Data exists mode

This mode behaves consistently with Prometheus native alerting rules. It's suitable for users who prefer defining thresholds directly in PromQL or scenarios requiring high-performance processing of large numbers of Series.

Configuration

- Query (PromQL): Write a PromQL query with comparison operators so that only anomalous data is returned.
  Example: query nodes with memory usage exceeding 90%:
  (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
- Evaluation rule: No additional threshold expressions are needed. The engine triggers an alert as soon as the query returns results; the number of alert events equals the number of data rows returned.

Recovery logic

- Data disappearance means recovery: The engine queries periodically. If a given Series can no longer be found, the engine determines that the corresponding alert has recovered. Note: data identification is based on label sets.
- Recovery query (optional): Configure an independent query for recovery evaluation (e.g., up{instance="${instance}"} == 1 to confirm service recovery). Recovery is confirmed only when data is found. The variable ${instance} in this query is replaced with the actual label value from the alert event.

Pros and cons

Pros:
- Better performance: Filtering logic is pushed down to the Prometheus server, reducing the data transmitted to the alert engine.
- Low migration cost: Existing Prometheus rule statements can be reused directly.

Cons:
- Single severity level: One rule typically corresponds to one severity level. To distinguish between >90 and >80, two rules must be configured, as in the sketch after this list.
- Recovery value retrieval: At recovery time Prometheus no longer returns data (the threshold is no longer met), so the alert engine cannot directly obtain the specific value at recovery. You can, however, use enrichment queries to fetch real-time values.
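
For example, splitting the memory example into two data-exists rules might look like this (a sketch; chaining > 80 < 90 is one way to keep the Warning rule from also matching Critical values):

# Rule 1, severity Critical: memory usage above 90%
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100 > 90

# Rule 2, severity Warning: memory usage between 80% and 90%
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100 > 80 < 90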

3. No data mode

This mode is specifically designed to monitor whether monitored objects are alive or whether data reporting pipelines are functioning properly.

Configuration

- Query: Write a query for metrics that should always exist.
  Example: up{job="my-service"}
- Evaluation rule: If no data at all can be queried for a Series for N consecutive cycles (truly no data, not a value of 0), a "no data" alert is triggered (see the sketch after this list).
- Engine restart: If the alert engine (monitedge) restarts, its in-memory state is lost and the missing-cycle counter starts over. If data that was queryable before the restart happens to be unavailable afterwards, no alert is triggered.
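
A short sketch of the per-Series semantics:

# One query; the engine tracks each returned Series independently
# (one per instance of my-service).
up{job="my-service"}

# If a Series that used to return data, e.g.
#   up{job="my-service", instance="host-1"}
# returns nothing for N consecutive cycles, a "no data" alert fires
# for that Series alone.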

Typical applications

- Monitoring Exporter crashes.
- Monitoring interruptions in instrumented reporting services.
- Monitoring batch jobs that fail to execute on schedule.

Comparison with Prometheus's native absent() function

- Prometheus's absent() function requires listing every combination of identifying labels for a metric, e.g., absent(up{instance="host-1"}); monitoring multiple instances requires multiple statements.
- Monitors' no data mode requires only one query: the alert engine automatically tracks all returned Series, and any Series with missing data triggers an alert. The sketch below contrasts the two approaches.
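
A side-by-side sketch (the instance labels here are illustrative):

# Prometheus native: one absent() expression per monitored object
absent(up{job="my-service", instance="host-1"})
absent(up{job="my-service", instance="host-2"})

# Monitors no data mode: a single query covers every instance; any
# Series that stops reporting triggers its own alert
up{job="my-service"}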

Advanced configuration

Labels and variables

Monitors automatically parses the labels returned by Prometheus. In recovery queries or enrichment queries, you can use ${label_name} to reference label values.
For example, if a query result contains the labels instance="host-1" and job="node", you can write the recovery query as:
up{instance="${instance}", job="${job}"} == 1
At execution time, the alert engine replaces ${instance} with host-1 and ${job} with node, the actual values taken from the alert event's labels.

Enrichment query

To enrich alert notification content, you can configure enrichment queries. Enrichment queries do not participate in alert evaluation; they are used only to retrieve additional information.

- Scenario: During a CPU alert, also query the machine's mem_usage_percent and display it in the alert details to aid troubleshooting.
- Variables: Enrichment queries support ${label_name} variables for targeting the specific alert object, as in the sketch below.
- Result display: Enrichment query results can be referenced in the alert rule's description field via the $relates variable. Refer to the description field documentation for detailed usage.
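
A minimal enrichment-query sketch for the CPU-alert scenario above, assuming a per-instance metric named mem_usage_percent:

# Fetches the alerting host's current memory usage; the result does not
# affect evaluation and can be shown in the description via $relates.
mem_usage_percent{instance="${instance}"}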
