Grafana’s Hard-Earned Lessons: How Incidents Shaped Our Databases 🚀

Hey tech enthusiasts! Ever wondered what goes on behind the scenes of massive, scalable databases like Grafana Tempo and Mimir? Well, buckle up, because Marty and Marco from Grafana are pulling back the curtain on real incidents that led to major improvements in their distributed tracing and time-series databases. It’s a story of learning the hard way, embracing transparency, and building more robust systems. ✨

The “Query of Death” 💀: When Regular Expressions Overload the System

Our first story, shared by Marco, takes us back to March 2023, when Grafana Mimir, their distributed time-series database, experienced a critical incident. The ingesters in the largest Mimir cluster at the time were pushed to their limits: CPUs were saturated, the garbage collector could not keep up, and the ingesters ultimately went down with out-of-memory errors. 💥

The Old Architecture’s Achilles’ Heel 🔗

Marco explains that in the older architecture, both the write path (ingestion) and the read path (queries) directly impacted the ingesters. If ingesters were down, both ingestion and querying became unavailable. This was a significant bottleneck.

The Culprit: Gigantic Regular Expressions 🤯

After digging into CPU profiles, the team discovered the unexpected culprit: the regular expression engine consuming all CPU and memory. It turned out a customer was running numerous PromQL queries with incredibly long regular expressions, each around 40 kilobytes!

The Unexpected Twist: Unfinished Work ⏳

Even more surprising, when these queries were canceled or timed out, the regular expression engine continued to run until all label values were checked. This meant even a stopped query could tie up resources for a prolonged period. The initial fix involved a simple check for cancellation, but it highlighted a critical oversight in their initial design.
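
The cancellation check described above can be illustrated with a minimal Python sketch. This is not Mimir's code (Mimir is written in Go); the names `match_label_values`, `QueryCancelled`, and `check_every` are hypothetical, and a `threading.Event` stands in for whatever cancellation signal the real query path carries. The loop polls the flag periodically, so an abandoned query stops consuming CPU instead of scanning every label value.

```python
import re
import threading


class QueryCancelled(Exception):
    """Raised when the caller has abandoned the query (hypothetical name)."""


def match_label_values(pattern, values, cancelled, check_every=1000):
    """Return the values fully matched by pattern, stopping early on cancellation.

    `cancelled` is a threading.Event that the caller sets on timeout or cancel.
    `check_every` is an illustrative polling interval, not a tuned value.
    """
    regex = re.compile(pattern)
    matched = []
    for i, value in enumerate(values):
        # Poll the flag periodically so a cancelled query releases the CPU.
        if i % check_every == 0 and cancelled.is_set():
            raise QueryCancelled(f"stopped after {i} of {len(values)} values")
        # fullmatch models a fully anchored matcher.
        if regex.fullmatch(value):
            matched.append(value)
    return matched
```

Checking only every `check_every` values keeps the overhead of the check small while still bounding how long a cancelled query can keep running.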

Long-Term Solutions for a Stronger Mimir 💪

To prevent future “query of death” scenarios, Grafana implemented several key improvements:

  • Regular Expression Unrolling 🪄: Many complex regexes can be rewritten into simpler operations (equality, prefix, suffix matchers). This allows them to completely skip the regex engine, speeding up index lookups and drastically reducing resource usage. Today, an impressive 97% of regexes in PromQL skip the engine in Grafana Cloud!
  • Cost-Based Planner 🧠: Queries are now evaluated based on the cost of each label matcher. The cheapest and most selective matchers run first, allowing complex regexes to be evaluated later on a much smaller, filtered set of values.
  • Overload Protection Mechanism 🛡️: Ingesters now monitor their CPU and memory. If overloaded, they can reject queries for a short period, preventing complete system failure and allowing for faster recovery.
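
The first two ideas can be sketched together in Python. This is a simplified illustration under stated assumptions, not Mimir's implementation: `compile_matcher` and `select_series` are hypothetical names, the cost weights are arbitrary, and only static cost is used for ordering (the selectivity estimates mentioned above are left out). Patterns are treated as fully anchored.

```python
import re

# Characters that have special meaning in a regex.
_META = set(r".^$*+?{}[]\|()")


def _is_literal(s):
    return not any(ch in _META for ch in s)


def compile_matcher(pattern):
    """Return (kind, cost, predicate) for a fully anchored pattern.

    The cost numbers are arbitrary illustrative weights, not measurements.
    """
    alternatives = pattern.split("|")
    if all(_is_literal(a) for a in alternatives):
        # "a|b|c" with no metacharacters becomes a set-membership test.
        wanted = frozenset(alternatives)
        return ("equals", 1, lambda v: v in wanted)
    if pattern.endswith(".*") and _is_literal(pattern[:-2]):
        prefix = pattern[:-2]
        return ("prefix", 2, lambda v: v.startswith(prefix))
    if pattern.startswith(".*") and _is_literal(pattern[2:]):
        suffix = pattern[2:]
        return ("suffix", 2, lambda v: v.endswith(suffix))
    # Fallback: the real regex engine. DOTALL keeps "." consistent with
    # the prefix/suffix shortcuts, which also accept newlines.
    regex = re.compile(pattern, re.DOTALL)
    return ("regex", 100, lambda v: regex.fullmatch(v) is not None)


def select_series(series, matchers):
    """Filter series (dicts of label -> value) by {label: pattern} matchers.

    Cheaper matchers run first, so expensive regexes only see the survivors.
    Returns the surviving series and the number of regex evaluations.
    """
    plan = sorted(
        ((label, *compile_matcher(p)) for label, p in matchers.items()),
        key=lambda step: step[2],
    )
    regex_calls = 0
    for label, kind, _cost, predicate in plan:
        if kind == "regex":
            regex_calls += len(series)
        series = [s for s in series if predicate(s.get(label, ""))]
    return series, regex_calls
```

In this sketch, a query with matchers `{"pod": "a[0-9]+", "job": "api"}` evaluates the cheap equality on `job` first, so the regex on `pod` runs only against the series that survived it.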

Decoupling for Resilience: The Mimir 3.0 Revolution 🔄

The “query of death” incidents revealed a fundamental architectural flaw: ingestion and querying were too tightly coupled. Marty explains that the upcoming Mimir 3.0 release introduces a new Kafka-based architecture. This completely decouples ingestion from the read path, ensuring that a burst of queries can no longer take down ingestion, and vice versa. The new architecture relies on just two dependencies: object storage and Kafka (or a Kafka-compatible backend like WarpStream, which Grafana Cloud uses).

Tiny Traces, Gigantic Memory Leaks: The Tempo Incident 👻

Next up, Marty shares a puzzling incident with Grafana Tempo, their distributed tracing database. Alerts fired indicating queriers were running out of memory and crashing (OOMing). The request rate and ingestion hadn’t changed, making the problem hard to pinpoint.

The Mystery of Large Memory Allocations 🔍

Memory profiles revealed large allocations when reading Parquet dictionaries. Initially, they suspected very large traces, but upon investigation, they found the opposite: tiny traces with only about 500 spans.

The Slow Trickle and the JSON Blob 💧

The real kicker? These small traces weren’t received all at once but were trickled in over seven days, resulting in many small Parquet files. The “why” behind the massive memory allocations remained elusive until they examined the trace data itself.

The Hidden Cost of High-Cardinality JSON 💎

The problematic traces contained very large and high-cardinality JSON attached to them. This pattern was widespread across their workload. When Tempo stored this data, it created huge dictionaries in every block. Consequently, to read even a tiny trace, the system had to unpack gigabytes of dictionaries in memory, which was the direct cause of the OOMs.
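
To see why high-cardinality values make dictionaries so expensive, here is a toy Python model of dictionary encoding, the general technique in which each distinct value is stored once and rows refer to it by index. It is only an illustration: `dictionary_encode` and `dictionary_bytes` are hypothetical helpers, the data is synthetic, and the talk summary does not detail how Tempo's Parquet reader actually materializes dictionaries.

```python
def dictionary_encode(values):
    """Encode a column as (dictionary, indices): distinct values stored once."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_bytes(dictionary):
    """Bytes that must be decoded before any single index can be resolved."""
    return sum(len(v.encode("utf-8")) for v in dictionary)


# Low cardinality: 10,000 rows share three short values.
methods = (["GET", "POST", "PUT"] * 3334)[:10_000]
low_dict, _ = dictionary_encode(methods)

# High cardinality: 10,000 rows, each with a unique ~1 KB JSON-like blob.
blobs = ['{"id": %d, "payload": "%s"}' % (i, "x" * 1000) for i in range(10_000)]
high_dict, high_idx = dictionary_encode(blobs)

print(dictionary_bytes(low_dict))   # a handful of bytes
print(dictionary_bytes(high_dict))  # roughly 10 MB

# Resolving a single row still needs the whole dictionary available.
one_row = high_dict[high_idx[42]]
```

When every value is unique, the dictionary is as large as the data itself, so a lookup for one small trace pays for all of it.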

Stabilizing Tempo and Evolving Storage 🛠️

The immediate fixes involved blocking problematic queries and tuning server-side concurrency. However, this incident highlighted a lack of defined limits on attribute length in Tempo.

  • Attribute Length Limit 📏: An attribute length limit was introduced, truncating anything longer. This had the intended effect of keeping dictionaries small.
  • Tempo 3.0: A New Block Format for Precision 💡: The long-term solution is a new block format in Tempo 3.0, still based on Parquet. It provides more precise control over dictionary usage on a per-attribute basis. For the types of traces that caused the issue, this has resulted in a 95% memory reduction on lookups and would have prevented the incident entirely. It also reduces overreading from object storage. The ultimate goal is to support larger attributes and this workload more effectively.
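
A length limit of this kind might look like the following Python sketch. The function name `truncate_attributes` and the `max_bytes` default are assumptions made for illustration; they are not Tempo's actual configuration or limit.

```python
def truncate_attributes(attributes, max_bytes=2048):
    """Return a copy of attributes with string values capped at max_bytes of UTF-8.

    max_bytes is an illustrative default, not Tempo's actual limit.
    """
    truncated = {}
    for key, value in attributes.items():
        if isinstance(value, str):
            encoded = value.encode("utf-8")
            if len(encoded) > max_bytes:
                # errors="ignore" drops a multi-byte character cut in half.
                value = encoded[:max_bytes].decode("utf-8", errors="ignore")
        truncated[key] = value
    return truncated
```

Measuring the limit in bytes rather than characters keeps the bound aligned with what is actually stored, at the cost of occasionally dropping a partial trailing character.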

Slow Queries Blocking Fast Queries: The Mimir Scheduler Bottleneck 🚦

Marco returns to discuss another Mimir incident where a few slow queries ended up slowing down many fast queries. The problem manifested as high query latency for a specific customer, with most queries stuck waiting in the query scheduler queue.

The FIFO Trap 🏃‍♂️💨

The query scheduler at the time used a per-tenant, first-in-first-out (FIFO) queue. While fair across tenants, within a single tenant, all queries were treated equally. This meant slow queries, even at a low rate, could occupy query workers indefinitely, causing even very fast subsequent queries to get stuck waiting. This issue, particularly with slow store gateway queries, had recurred.

Redesigning the Scheduler for Lanes 🛣️

The solution was a significant redesign of the query scheduler queue, introducing a multidimensional queue. This splits the per-tenant queue into multiple lanes based on which Mimir component the query will hit (ingesters only, store gateways only, or both). The change dramatically improved performance by preventing slow store gateway queries from blocking fast ingester queries.
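
The lane idea can be sketched in Python as a per-tenant queue that rotates across lanes when a worker asks for the next query. This is a toy model: the `LaneQueue` class, the lane names, and the round-robin policy are assumptions for illustration, and the summary does not say how Mimir's scheduler actually chooses between lanes.

```python
from collections import deque


class LaneQueue:
    """One tenant's queue, split into lanes that are served round-robin."""

    def __init__(self, lanes):
        self._lanes = {name: deque() for name in lanes}
        self._order = deque(lanes)

    def enqueue(self, lane, query):
        self._lanes[lane].append(query)

    def dequeue(self):
        """Return the next query, rotating across lanes; None if all are empty."""
        for _ in range(len(self._order)):
            lane = self._order[0]
            self._order.rotate(-1)
            if self._lanes[lane]:
                return self._lanes[lane].popleft()
        return None


queue = LaneQueue(["ingester", "store-gateway", "both"])
for i in range(3):
    queue.enqueue("store-gateway", f"slow-{i}")
queue.enqueue("ingester", "fast-0")  # arrives last

order = [queue.dequeue() for _ in range(4)]
print(order)  # ['fast-0', 'slow-0', 'slow-1', 'slow-2']
```

With a single FIFO queue the fast query would have been served last, behind all three slow ones; with lanes it is picked up as soon as its lane's turn comes around.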

Tackling Store Gateway Slowness 🐢

The incident also spurred improvements to address why querying long-term data from store gateways was slow:

  • Eliminating Memory Mapping Issues 🚫: The store gateway’s on-disk cache used memory mapping. In Go, a page fault on a memory-mapped region blocks the underlying OS thread without the Go scheduler being aware of it, so goroutines could stall and add latency. Grafana replaced this with direct disk I/O syscalls.
  • Streaming Query Engine for Efficiency 🌊: The Mimir query engine was rewritten from scratch to be streaming-based. Data is streamed from object storage or cache to store gateways, and then to queriers, which process it as it arrives, so the data for a query is never fully loaded into memory at once. This significantly reduced resource utilization and dramatically reduced query latency for long time-range and high-cardinality queries.
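
The difference between buffering and streaming can be shown with a small Python sketch. It is a conceptual illustration only: `fetch_chunks` is a hypothetical stand-in for reading chunks from object storage or cache, and a simple sum stands in for real query evaluation.

```python
def fetch_chunks(num_chunks, samples_per_chunk):
    """Hypothetical stand-in for reading chunks from object storage or cache."""
    for c in range(num_chunks):
        yield [float(c * samples_per_chunk + i) for i in range(samples_per_chunk)]


def sum_buffered(chunks):
    """Materialize every sample before aggregating: memory grows with the range.

    Returns (total, number of samples held in memory at once).
    """
    samples = [s for chunk in chunks for s in chunk]
    return sum(samples), len(samples)


def sum_streaming(chunks):
    """Fold each chunk into a running total: only one chunk is held at a time.

    Returns (total, largest number of samples held in memory at once).
    """
    total, peak = 0.0, 0
    for chunk in chunks:
        total += sum(chunk)
        peak = max(peak, len(chunk))
    return total, peak


buffered_total, buffered_held = sum_buffered(fetch_chunks(100, 1000))
streaming_total, streaming_held = sum_streaming(fetch_chunks(100, 1000))
```

Both functions produce the same total, but the buffered version holds all 100,000 samples at once while the streaming version never holds more than one 1,000-sample chunk.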

Key Takeaways for Building Resilient Systems 💡

Marty wraps up by sharing overarching lessons learned:

  • Embrace Isolation and Minimize Blast Radius 🎯: In the design phase, focus on architectural choices that strongly isolate components and limit the impact of failures. The new decoupled architectures in Mimir and Tempo are prime examples.
  • “You Build It, You Run It” Culture 🧑‍💻: Grafana fosters a healthy on-call culture where the engineers who build the systems are also responsible for running them in the cloud. This includes follow-the-sun rotations and prioritizing on-call duties.
  • Equip Yourself for Incidents 🧰: Have a robust set of tools and runbooks ready to go. Implement mechanisms to “stop the bleeding” quickly, such as feature flags, circuit breakers, or the ability to block problematic queries.
  • Deeply Understand Root Causes 🔎: Continuously ask questions to uncover the real root cause of incidents. Use these learnings to drive continuous improvement in your systems.

Grafana’s journey is a powerful reminder that even the most sophisticated systems face unexpected challenges. By embracing transparency, learning from failures, and investing in robust architecture and tooling, they continue to build and operate some of the world’s most scalable databases. Hats off to their dedication to improvement! 👏
