Grafana Alerting: Your Unified Engine for Rock-Solid Alerting
Ever been jolted awake by a frantic alert notification? We’ve all been there! Sonia Aguilar and Alexander from Grafana Labs are here to share how Grafana Alerting has evolved into a powerful, unified engine ready to handle all your alerting needs. Forget juggling multiple systems; Grafana is making it easier than ever to manage, migrate, and understand your alerts.
From Dashboards to Dominance: The Evolution of Grafana Alerting
Grafana’s journey began with a focus on stunning dashboards and visualizations. Alerting, initially, was an afterthought, relegated to external systems like Prometheus.
- Grafana 4.0: Introduced the first alerting engine, a basic, one-dimensional system primarily for Graphite.
- Unified Alerting: As Prometheus, Mimir, and Loki gained traction, Grafana introduced unified alerting, allowing users to manage alert rules from various systems in one place and enabling multi-dimensional alerts.
- The Challenge of Duality: Having two types of alerts (Grafana-managed and data source-managed) led to complexity and user confusion. Inside Grafana Labs the picture was just as mixed: some teams ran other alerting systems, while teams like security were adopting Grafana Alerting. This divergence raised the question: “Which system should I use?”
- The Big Decision: Grafana Labs decided to double down on Grafana Alerting, consolidating the best features of existing systems into a single, mature, scalable, and fully-featured engine.
Today, Grafana Alerting is the “big tent” of alerting, supporting over 50 different data sources and 22 types of integrations (and counting!).
Tackling the Big Challenges: Scaling and Organization
To achieve this unified vision, Grafana Alerting had to overcome significant challenges, particularly around scaling and managing alerts across growing organizations.
Scaling the System: From Single Instance to High Availability
- The Problem: A single Grafana instance running alerting, querying data sources, and writing to a database is a single point of failure. High availability setups with multiple instances multiply the load on data sources and databases, because each replica evaluates all rules independently.
- Early Optimizations: Grafana introduced features to mitigate database pressure:
- Compressed State: Compresses the state of an entire rule for a single database write.
- Evaluation Jitter: Spreads rule evaluations evenly across an interval to avoid “thundering herds.”
- Periodic Saves: Buffers state in memory and saves to the database periodically (e.g., every 5 minutes), with the trade-off of potential data loss if Grafana crashes during that period.
- The Breakthrough (Grafana 13+): Grafana 13 introduced a revolutionary scaling solution:
- Primary/Standby Replica Model: One replica evaluates alert rules while others stand by.
- Impact: Data source and database loads remain constant regardless of the number of replicas.
- Failover: Uses cluster membership to automatically switch to a standby replica if the primary fails.
- Trade-off: A brief gap in rule evaluations occurs during failover. Users must choose between zero-gap redundancy and constant load. This feature is open-source and can be enabled with a single configuration flag.
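The load arithmetic and the jitter idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the source does not describe how Grafana computes jitter, so the hash-based offset here is a stand-in, and `jitter_offset` and `evaluations_per_interval` are hypothetical names.

```python
import hashlib


def jitter_offset(rule_uid: str, interval_s: int) -> int:
    """Deterministic offset in [0, interval_s) derived from the rule's UID.

    Assumption: hashing the UID is one simple way to spread rules across
    an interval; it is not claimed to be Grafana's actual algorithm.
    """
    digest = hashlib.sha256(rule_uid.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % interval_s


def evaluations_per_interval(rules: int, replicas: int, primary_standby: bool) -> int:
    """Rule evaluations (and so data source queries) issued per interval."""
    # With primary/standby only one replica evaluates; otherwise every replica does.
    evaluating_replicas = 1 if primary_standby else replicas
    return rules * evaluating_replicas


if __name__ == "__main__":
    for uid in ("rule-a", "rule-b", "rule-c"):
        print(uid, "starts", jitter_offset(uid, 60), "s into each 60 s interval")
    print("all replicas evaluate:", evaluations_per_interval(1000, 3, False))
    print("primary/standby:      ", evaluations_per_interval(1000, 3, True))
```

With 1,000 rules and 3 replicas, the first model issues 3,000 evaluations per interval and the second issues 1,000, which is the "constant load" property described above.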
Scaling with Teams: Empowering Collaboration and Autonomy
- Role-Based Access Control (RBAC): Grafana already offered RBAC, allowing teams to own rules in folders with fine-grained permissions. Contact points also had access control.
- The Missing Piece: Notification Policies: Previously, all teams shared a single notification policy tree. At scale, this became one large, complex tree that was hard to manage and prone to accidental breakage.
- The Solution: Multi-Policy Trees:
- Each team now gets its own independent notification policy tree.
- This allows teams to manage their own routing, configurations, and changes without impacting others.
- The visual analogy of a complex, AI-generated tree highlights the previous difficulty in managing a single, shared policy.
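To make the idea of independent trees concrete, here is a minimal sketch of label-based routing with one tree per team. It is a simplified model, assuming exact-match label matchers and "first matching child wins"; the `Policy` class, the `trees` mapping, and the contact point names are all hypothetical and are not Grafana's data model.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Policy:
    contact_point: str
    matchers: dict[str, str] = field(default_factory=dict)
    children: list[Policy] = field(default_factory=list)

    def matches(self, labels: dict[str, str]) -> bool:
        # A policy with no matchers matches everything (like a root policy).
        return all(labels.get(k) == v for k, v in self.matchers.items())


def route(policy: Policy, labels: dict[str, str]) -> Optional[str]:
    """Return the contact point of the deepest matching policy, or None."""
    if not policy.matches(labels):
        return None
    for child in policy.children:
        found = route(child, labels)
        if found is not None:
            return found
    return policy.contact_point


# One independent tree per team: editing one cannot affect the other.
trees = {
    "platform": Policy(
        "platform-slack",
        children=[Policy("platform-pager", {"severity": "critical"})],
    ),
    "payments": Policy("payments-email"),
}


def notify_target(tree_name: str, labels: dict[str, str]) -> Optional[str]:
    return route(trees[tree_name], labels)
```

Because each rule is routed through exactly one named tree, a change to the `platform` tree has no way to alter where `payments` alerts go.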
Migrating Your Alerts: A Seamless Transition
Grafana Alerting is ready for your existing alerts. You have two primary ways to get started:
- Create New Rules from Scratch:
- Utilize the intuitive UI.
- Leverage provisioning via files, APIs, or Terraform.
- Import Existing Alerts: If you’re already using Prometheus, Mimir, or Loki for alerting, Grafana offers a powerful import tool.
- Prometheus Compatibility Layer: Grafana’s built-in compatibility means you can import alert rules and Alertmanager configurations directly. No copy-pasting or rewriting expressions required!
- Leveraging Multi-Policy Trees: During import, a new policy tree is created in Grafana for your imported policies, keeping your existing Grafana policies untouched.
Importing with Ease: API vs. UI Wizard
- API Import: Ideal for CI/CD pipelines or importing many data sources.
- Fully compatible with mimirtool and cortextool, allowing you to use existing workflows and commands.
- Simply target the new import endpoint.
- UI Import Wizard: Offers a guided, step-by-step experience for smaller migrations or those preferring a visual approach.
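For the API route, a CI job might assemble a mimirtool invocation along these lines. This is a hedged sketch: `rules load`, `--address`, and `--id` are standard mimirtool options, but the source does not give the path of Grafana's import endpoint, the expected tenant ID, or any required headers, so the address below is a placeholder and `build_import_command` is a hypothetical helper. The command is only built and printed, not executed.

```python
import shlex


def build_import_command(rule_files, import_address, tenant_id="1"):
    """Build (but do not run) a mimirtool command that loads rule files.

    Assumptions: import_address is whatever URL your Grafana instance
    documents for imports, and tenant_id is a placeholder value.
    """
    if not rule_files:
        raise ValueError("at least one rule file is required")
    return [
        "mimirtool", "rules", "load",
        f"--address={import_address}",
        f"--id={tenant_id}",
        *rule_files,
    ]


if __name__ == "__main__":
    cmd = build_import_command(
        ["rules/api.yaml", "rules/db.yaml"],
        "https://grafana.example.com/<import-endpoint>",  # placeholder, not a real path
    )
    # Print a shell-quoted form for review; a pipeline could pass cmd to subprocess.run.
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a string avoids shell-quoting problems if a rule file path contains spaces.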
Live Demo: The Import Wizard in Action
The import wizard simplifies the process into three intentional steps:
- Import Notification Resources:
- Choose your import source (YAML file or data source).
- Define a name for the new policy tree that will be created in Grafana.
- The UI automatically handles deduplication and renames resources to avoid conflicts with existing Grafana configurations.
- Import Alert Rules:
- Select the policy tree created in the previous step for your imported rules.
- Choose the import source for your alert rules (YAML or data source).
- Optionally filter by namespace and group.
- Select the target folder in Grafana where imported rules will reside.
- Key Feature: Imported rules are paused by default, giving you time to review before evaluation begins.
- Review and Confirm:
- Review all notification resources and alert rules to be imported.
- Click “Start Importing” and confirm.
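The decisions the wizard makes in these steps can be modeled roughly as follows. This is an illustrative sketch, not Grafana's implementation: the source does not say how conflicting names are rewritten, so the numeric suffix is an assumption, and `ImportedRule`, `unique_name`, and `plan_rule_import` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ImportedRule:
    namespace: str
    group: str
    name: str
    folder: str
    policy_tree: str
    paused: bool = True  # imported rules start paused until reviewed


def unique_name(name, taken):
    """Return name unchanged, or with a numeric suffix if it is already taken."""
    if name not in taken:
        return name
    n = 2
    while f"{name}-{n}" in taken:
        n += 1
    return f"{name}-{n}"


def plan_rule_import(source_rules, folder, policy_tree, namespace=None, group=None):
    """Select rules by optional namespace/group and attach folder and policy tree."""
    planned = []
    for rule in source_rules:
        if namespace is not None and rule["namespace"] != namespace:
            continue
        if group is not None and rule["group"] != group:
            continue
        planned.append(
            ImportedRule(rule["namespace"], rule["group"], rule["alert"], folder, policy_tree)
        )
    return planned


if __name__ == "__main__":
    source = [
        {"namespace": "prod", "group": "api", "alert": "HighLatency"},
        {"namespace": "prod", "group": "db", "alert": "ReplicationLag"},
        {"namespace": "dev", "group": "api", "alert": "HighLatency"},
    ]
    print(unique_name("slack", {"slack", "email"}))
    for rule in plan_rule_import(source, "Imported", "prometheus-import", namespace="prod"):
        print(rule)
```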
Upon completion, you’re redirected to the alert list, showing your imported rules filtered by the target folder. All imported rules are initially paused. You can then bulk resume them at the folder level.
Crucially, imported rules are automatically associated with the newly created policy tree, ensuring they follow the imported routing logic. The policies view now lists multiple policy trees, allowing teams to create their own independent routing without interference.
Understanding Your Alerts: The Power of Alert Activity
With your alerts imported or created, how do you understand what’s happening? Grafana Alerting introduces powerful new features for visibility:
Alert Activity Page: Your Centralized Hub
- Overview: Displays alert rules that fired in the last 15 minutes, along with an alert volume chart over time.
- Pattern Recognition: Easily spot patterns like alerts firing simultaneously or sudden spikes.
- Filtering: Filter by state, labels, or team names to focus on specific alerts.
Deep Dive into Alert Instances
- Instance Details: Click into a specific alert to see:
- The actual query Grafana runs for evaluation.
- A history timeline combining state transitions and notification events.
- Confirmation of successful notification delivery or explicit error messages if delivery fails.
- Actions: Silence alerts or declare incidents directly from this view.
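A combined history timeline like the one described above amounts to merging two time-ordered streams. The sketch below shows one way to do that with the standard library's `heapq.merge`; the tuple layout and the `build_timeline` name are assumptions made for illustration.

```python
import heapq


def build_timeline(state_transitions, notifications):
    """Merge two timestamp-sorted lists of (timestamp, detail) into one timeline.

    Both inputs must already be sorted by timestamp. Each output entry is
    (timestamp, kind, detail), where kind is "state" or "notification".
    """
    tagged_states = ((ts, "state", detail) for ts, detail in state_transitions)
    tagged_notes = ((ts, "notification", detail) for ts, detail in notifications)
    # Merge on the timestamp only, so details are never compared with each other.
    return list(heapq.merge(tagged_states, tagged_notes, key=lambda entry: entry[0]))


if __name__ == "__main__":
    transitions = [(100, "Normal -> Pending"), (160, "Pending -> Alerting")]
    sent = [(165, "delivered to platform-slack"), (465, "delivery failed: timeout")]
    for entry in build_timeline(transitions, sent):
        print(entry)
```

Because the merge is lazy and linear, it stays cheap even when an alert instance has a long history.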
Notification History: Tracking Every Message
- Contact Point History: Navigate to a contact point to view its notification history.
- Global History: See all notifications sent across your entire Grafana instance.
- Filtering & Details: Filter by status or outcome and expand to see the exact content of sent notifications. This provides clear insight into delivery success or failure.
AI-Powered Triage Assistant: Unlocking Insights
- Leveraging System Data: Grafana 13+ can analyze alerts using AI, drawing on comprehensive system data.
- Pattern Analysis & Insights: The assistant identifies patterns, spots issues, and suggests investigation priorities.
- Actionable Recommendations: Receive concrete suggestions on which alerts require immediate attention and where to focus your investigation efforts.
The Future of Alerting is Unified!
Grafana Alerting, with its import tool, multi-policy trees, and advanced triage features, is designed to be the only alerting system you’ll ever need. Whether you’re migrating from existing systems or starting fresh, Grafana provides the power, flexibility, and visibility to keep your systems running smoothly and your teams informed.
Embrace Grafana Alerting and experience the peace of mind that comes with truly unified and intelligent alerting!