Flashduty Docs

Loki

This document provides detailed instructions on configuring alert rules for Loki data sources in Monitors. Monitors supports Loki's LogQL query syntax and can perform aggregation analysis on log data to trigger alerts.

Core concepts

Loki's query language, LogQL, is divided into two categories:
1. Log queries: return log line content (Stream).
2. Metric queries: count or aggregate logs and return numeric values (Vector).
The Monitors alert engine primarily uses metric queries. Always use functions such as count_over_time, rate, or sum to convert logs into numeric series for threshold evaluation.
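For illustration, the two query styles look like this (the nginx job label and the timeout keyword are placeholders only):
Log query, returns the matching log lines themselves: {job="nginx"} |= "timeout"
Metric query, returns a numeric count per series that the engine can evaluate: count_over_time({job="nginx"} |= "timeout" [5m])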

1. Threshold mode

This mode is suitable for scenarios requiring multi-level threshold evaluation on log aggregation values (e.g., Info/Warning/Critical).

Configuration

Query: Write a LogQL query that returns numeric vectors.
Example: Count log entries containing the error keyword in the mysql job over the last 5 minutes.
count_over_time({job="mysql"} |= "error" [5m])
Threshold conditions:
Critical: $A > 50 (more than 50 error logs in 5 minutes)
Warning: $A > 10 (more than 10 error logs in 5 minutes)

How it works

The engine executes the LogQL query and retrieves time series data with labels (Vector). It iterates through each series, extracting values to compare against the configured threshold expressions.
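As a sketch, a grouped query such as the following (assuming your log streams carry an instance label) yields one value per host, and each host's value is compared against the thresholds independently:
sum by (instance) (count_over_time({job="mysql"} |= "error" [5m]))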

Recovery logic

Auto recovery: The alert recovers automatically when the query result value falls below the threshold.
Specific recovery conditions: Configure a condition such as $A < 5 to prevent flapping near the threshold.
Recovery query:
Supports configuring an independent LogQL query for recovery evaluation.
Supports ${label_name} variable substitution.
Example: The alert checks for error logs, while recovery checks for dedicated recovery logs, e.g. count_over_time({job="mysql"} |= "recovered" [5m]).
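As a sketch of variable substitution, assuming the alerting series carries an instance label, a recovery query can reuse that label value like this:
count_over_time({job="mysql", instance="${instance}"} |= "recovered" [5m])
Here ${instance} is replaced with the label value of the series that triggered the alert, so each host's recovery is evaluated against its own logs.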

2. Data exists mode

This mode is suitable for users who prefer writing filter conditions directly in LogQL, or for scenarios where you only care about whether anomalous data exists.

Configuration

Query: Write a LogQL query with comparison operators so that it returns data only when the conditions are met.
Example: Directly filter services with error rates exceeding 5%.
rate({job="ingress"} |= "500" [1m]) / rate({job="ingress"} [1m]) * 100 > 5
Evaluation rule: An alert is triggered as soon as the LogQL query returns data.

Pros and cons

Pros: Computation logic is pushed down to the Loki server, reducing data transmission.
Cons: Cannot distinguish between alert severity levels; only a single-level alert can be triggered.

Recovery logic

Data disappearance means recovery: Recovery is confirmed when the LogQL query result is empty (i.e., the > 5 condition is no longer met).
Recovery query: Supports configuring additional query statements to assist in determining recovery status.

3. No data mode

This mode monitors whether log reporting pipelines are interrupted, or whether logs that should be continuously generated have stopped.

Configuration

Query: Write a query that should always return data.
Example: Calculate the log reporting rate for all hosts.
rate({job="node-logs"} [1m])
Evaluation rule: If a series (uniquely identified by its labels, e.g., instance="host-1") existed in previous cycles but is missing for the current and N consecutive cycles, a "no data" alert is triggered.
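To make the per-host series identity explicit, you can group by the label you care about (the instance label here is an assumption about your log labels):
sum by (instance) (rate({job="node-logs"} [1m]))
If a host such as instance="host-1" stops reporting, its series disappears from the result and the "no data" alert fires for that series.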

Typical applications

Monitor whether collection agents like Promtail/Fluentd have stopped working.
Monitor whether critical business logs (such as order creation logs) have been abnormally interrupted.

4. Best practices and considerations

Avoid querying raw logs

Do not use LogQL that only returns log streams in alert rules (e.g., {job="mysql"} |= "error").
Reason: The alert engine needs numeric values for calculation and evaluation; raw log streams cannot be compared against thresholds directly.
Correct approach: Wrap the query with an aggregation function such as count_over_time(...), as contrasted below.
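A concrete contrast, using the same stream selector as the examples above:
Wrong, returns log lines and cannot be compared with a threshold: {job="mysql"} |= "error"
Correct, returns a number the engine can evaluate: count_over_time({job="mysql"} |= "error" [5m])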

Performance optimization

Time range: The time range in LogQL (e.g., [5m]) should be moderate. Too large a range leads to slow queries, while too small a range may cause high data volatility.
Label filtering: Use precise label filters in the LogQL Stream Selector section (within braces {...}) as much as possible to reduce the amount of data scanned.
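As a sketch of the label-filtering advice (the app="payments" label is only an example of a label your streams might carry), prefer pushing constraints into the stream selector rather than relying on line filters:
Scans every line in the ingress job: {job="ingress"} |= "payments" |= "error"
Scans only the payments streams: {job="ingress", app="payments"} |= "error"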
