Flashduty Docs

Loki

This document provides detailed instructions on configuring alert rules for Loki data sources in Monitors. Monitors supports Loki's LogQL query syntax and can perform aggregation analysis on log data to trigger alerts.

Core concepts

Loki's query language, LogQL, is divided into two categories:
1. Log queries: return log line content (Stream).
2. Metric queries: count or aggregate logs and return numeric values (Vector).
The Monitors alert engine primarily uses metric queries. Always use functions such as count_over_time, rate, or sum to convert logs into numeric series for threshold evaluation.
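For illustration, the two query styles look like this (the nginx job label and the timeout keyword are placeholders only):
Log query, returns the matching log lines themselves: {job="nginx"} |= "timeout"
Metric query, returns a numeric count per series that the engine can evaluate: count_over_time({job="nginx"} |= "timeout" [5m])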

1. Threshold mode

This mode is suitable for scenarios requiring multi-level threshold evaluation on log aggregation values (e.g., Info/Warning/Critical).

Configuration

Query: Write a LogQL query that returns numeric vectors.
Example: Count log entries containing the error keyword in the mysql job over the last 5 minutes.
count_over_time({job="mysql"} |= "error" [5m])
Threshold conditions:
Critical: $A > 50 (more than 50 error logs in 5 minutes)
Warning: $A > 10 (more than 10 error logs in 5 minutes)

How it works

The engine executes the LogQL query and retrieves time series data with labels (Vector). It iterates through each series, extracting values to compare against the configured threshold expressions.
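As a sketch, a grouped query such as the following (assuming your log streams carry an instance label) yields one value per host, and each host's value is compared against the thresholds independently:
sum by (instance) (count_over_time({job="mysql"} |= "error" [5m]))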

Recovery logic

Auto recovery: The alert recovers automatically when the query result value falls below the threshold.
Specific recovery conditions: Configure a condition such as $A < 5 to prevent flapping near the threshold.
Recovery query:
Supports configuring an independent LogQL query for recovery evaluation.
Supports ${label_name} variable substitution.
Example: The alert checks for error logs, while recovery checks for dedicated recovery logs, e.g. count_over_time({job="mysql"} |= "recovered" [5m]).
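As a sketch of variable substitution, assuming the alerting series carries an instance label, a recovery query can reuse that label value like this:
count_over_time({job="mysql", instance="${instance}"} |= "recovered" [5m])
Here ${instance} is replaced with the label value of the series that triggered the alert, so each host's recovery is evaluated against its own logs.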

2. Data exists mode

This mode is suitable for users who prefer writing filter conditions directly in LogQL, or for scenarios where you only care about whether anomalous data exists.

Configuration

Query: Write a LogQL query with comparison operators so that it returns data only when the conditions are met.
Example: Directly filter services with error rates exceeding 5%.
rate({job="ingress"} |= "500" [1m]) / rate({job="ingress"} [1m]) * 100 > 5
Evaluation rule: An alert is triggered as soon as the LogQL query returns data.

Pros and cons

Pros: Computation logic is pushed down to the Loki server, reducing data transmission.
Cons: Cannot distinguish between alert severity levels; only a single-level alert can be triggered.

Recovery logic

Data disappearance means recovery: Recovery is confirmed when the LogQL query result is empty (i.e., the > 5 condition is no longer met).
Recovery query: Supports configuring additional query statements to assist in determining recovery status.

3. No data mode

This mode monitors whether log reporting pipelines are interrupted, or whether logs that should be continuously generated have stopped.

Configuration

Query: Write a query that should always return data.
Example: Calculate the log reporting rate for all hosts.
rate({job="node-logs"} [1m])
Evaluation rule: If a series (uniquely identified by its labels, e.g., instance="host-1") existed in previous cycles but is missing for the current and N consecutive cycles, a "no data" alert is triggered.
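To make the per-host series identity explicit, you can group by the label you care about (the instance label here is an assumption about your log labels):
sum by (instance) (rate({job="node-logs"} [1m]))
If a host such as instance="host-1" stops reporting, its series disappears from the result and the "no data" alert fires for that series.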

Typical applications

Monitor whether collection agents like Promtail/Fluentd have stopped working.
Monitor whether critical business logs (such as order creation logs) have been abnormally interrupted.

4. Best practices and considerations

Avoid querying raw logs

Do not use LogQL that only returns log streams in alert rules (e.g., {job="mysql"} |= "error").
Reason: The alert engine needs numeric values for calculation and evaluation; raw log streams cannot be compared against thresholds directly.
Correct approach: Wrap the query with an aggregation function such as count_over_time(...), as contrasted below.
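A concrete contrast, using the same stream selector as the examples above:
Wrong, returns log lines and cannot be compared with a threshold: {job="mysql"} |= "error"
Correct, returns a number the engine can evaluate: count_over_time({job="mysql"} |= "error" [5m])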

Performance optimization

Time range: The time range in LogQL (e.g., [5m]) should be moderate. Too large a range leads to slow queries, while too small a range may cause high data volatility.
Label filtering: Use precise label filters in the LogQL Stream Selector section (within braces {...}) as much as possible to reduce the amount of data scanned.
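As a sketch of the label-filtering advice (the app="payments" label is only an example of a label your streams might carry), prefer pushing constraints into the stream selector rather than relying on line filters:
Scans every line in the ingress job: {job="ingress"} |= "payments" |= "error"
Scans only the payments streams: {job="ingress", app="payments"} |= "error"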
