Navigating the Observability Universe: Google’s Planet-Scale Dashboards & Grafana’s Scope Revolution 🚀✨

Ever felt lost in a sea of dashboards, desperately searching for the metrics that actually matter? At Google, this isn’t just a minor inconvenience; it’s a monumental challenge. With hundreds of thousands of systems and a codebase spanning billions of lines, maintaining visibility can become prohibitively expensive and overwhelmingly complex. But what if you could have a single pane of glass that intelligently surfaces exactly what you need, when you need it?

That’s the promise of Google’s internal “planet-scale dashboard” system, which is built around reuse and scalability and offers a glimpse into the future of observability. And the good news? This isn’t just a Google-only marvel; its core concepts are now making their way into the wider world through a powerful partnership with Grafana.

The Toil of Too Many Dashboards 😩

Imagine this: your team deploys a shiny new system, you’ve been pestered about setting up monitoring, but you put it off, confident in the system’s stability. Then, at 3 AM, your pager screams. You wake up in a panic, not knowing where to start, dreading the manual toil ahead to find the right dashboard and debug the issue.

This is the reality many engineers face. At Google, the scale is mind-boggling:

  • 190K+ Googlers: Equivalent to a large urban center.
  • Single Monolithic Codebase: Billions of lines of code, fostering code reuse.
  • Internet-Scale Applications: Billions of distributed apps globally.

Now, picture an engineer creating a dashboard for a specific metric for their front-end service. Simple enough. But then they need a similar dashboard for their back-end, middleware, or any other component. Copy-pasting dashboards at this scale leads to an astronomical number of redundant dashboards, potentially hundreds of thousands for a single metric! This is unsustainable and leaves engineers lost in the noise.

The Magic of Reusability: Enter Dimensions and Scopes 💡

The key to taming this complexity lies in reusability. Google’s solution hinges on a concept they call dimensions, which are similar to Grafana’s variables or template variables. By injecting these variables at runtime, dashboards become dynamically provisioned and tailored to specific needs.
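To make the idea concrete, here is a minimal Python sketch of one dashboard query being reused across systems by filling in a dimension at view time. The metric name, the `$job` placeholder, and the `render_query` helper are illustrative assumptions, not Google's or Grafana's actual implementation.

```python
from string import Template

# Hypothetical panel query with a single dimension placeholder ($job).
PANEL_QUERY = Template('sum(rate(http_requests_total{job="$job"}[5m]))')


def render_query(template: Template, dimensions: dict[str, str]) -> str:
    """Fill the dimension placeholders with the values chosen at view time."""
    return template.substitute(dimensions)


# The same dashboard definition serves two different systems.
frontend_query = render_query(PANEL_QUERY, {"job": "frontend"})
backend_query = render_query(PANEL_QUERY, {"job": "backend"})
print(frontend_query)
print(backend_query)
```

One definition, many views: nothing is copied, so there is only one dashboard to maintain.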

But simply reusing dashboards isn’t enough. What if your system runs on a Java Virtual Machine (JVM), but the dashboard is designed for something else? You need to filter out irrelevant dashboards. This is where the planet-scale dashboard system truly shines.

The system elevates one dimension to a special status: scope. Think of scope as a powerful filter for your job, or whatever you’re investigating. When you select a scope, the dashboard automatically provisions itself with navigation fields relevant to that scope.

But how do you pick the right dashboards out of the vast list? The system ensures that jobs expose properties about themselves. For instance, a job running on a JVM will expose a metric like “runs on JVM.” When a scope is selected, the JVM dashboards are shown only if the selected job exposes that property. This creates a filtered list of canonical dashboards, usable across the entire organization.
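The filtering step can be sketched roughly as follows. This is a simplified illustration, assuming each dashboard declares the properties it needs and each job exposes a set of property names; the `Dashboard` class, the property names, and the catalog entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dashboard:
    name: str
    # Properties a job must expose for this dashboard to be relevant.
    requires: frozenset[str] = frozenset()


def dashboards_for(job_properties: set[str], catalog: list[Dashboard]) -> list[str]:
    """Keep only dashboards whose requirements the job satisfies."""
    return [d.name for d in catalog if d.requires <= job_properties]


# Hypothetical catalog of canonical dashboards.
CATALOG = [
    Dashboard("CPU and memory"),
    Dashboard("JVM garbage collection", frozenset({"runs_on_jvm"})),
    Dashboard("Go runtime", frozenset({"runs_on_go"})),
]

jvm_job_dashboards = dashboards_for({"runs_on_jvm"}, CATALOG)
print(jvm_job_dashboards)
```

A dashboard with no requirements applies to every job, which is how generic dashboards stay visible regardless of scope.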

The Wins So Far:

  • Reusable Dashboards: A single dashboard serves any system, promoting a “single dashboard per concern” philosophy.
  • Navigable Dashboards: The list of dashboards is dynamically filtered based on the system you’re examining, making them generic and adaptable.

This reusability also offers practical benefits like deterministic links to dashboards via URL parameters, saving engineers valuable time and reducing maintenance overhead.
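As a rough illustration of deterministic links, the sketch below builds a dashboard URL from a dashboard identifier and a set of variable values. It borrows Grafana's `var-<name>` query parameter convention; the host, the dashboard UID, and the `dashboard_link` helper are made up for the example, and Google's internal URL format is not described in the talk.

```python
from urllib.parse import urlencode


def dashboard_link(base_url: str, dashboard_uid: str, variables: dict[str, str]) -> str:
    """Build a stable link: the same inputs always yield the same URL."""
    # Sorting the parameters keeps the link identical regardless of dict order.
    params = sorted((f"var-{name}", value) for name, value in variables.items())
    return f"{base_url}/d/{dashboard_uid}?{urlencode(params)}"


link = dashboard_link(
    "https://grafana.example.com",  # hypothetical host
    "jvm-gc",                       # hypothetical dashboard UID
    {"job": "frontend", "cluster": "eu-west"},
)
print(link)
```

Because the link is derived purely from the inputs, alerts and runbooks can generate it without looking anything up.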

Grafana Steps In: Unifying Industry Standards 🤝

While Google built its observability infrastructure from the ground up, the world has evolved. Industry standards have emerged, with Grafana leading the charge. This has prompted a strategic shift for Google, recognizing the benefits of aligning with these standards:

  • Unified Experience: Merging Google’s internal insights with Grafana’s established platform creates a richer, more cohesive experience.
  • Enhanced Collaboration: Teams familiar with the same technology stack can cooperate more easily, share experiences, and collaborate on templates.
  • Simplified Onboarding: Engineers already acquainted with Grafana can onboard faster.

Google and Grafana have formed a partnership, with Google now running Grafana Enterprise for its internal usage. Grafana’s modular architecture, extensive visualization capabilities, and alignment with industry standards make it an excellent fit. This allows Google to focus on developing unique UI features while leveraging existing industry solutions and sharing its expertise with the broader community.

The Proof is in the Growth: As Katia Giarda highlights, the growth in the number of systems monitored at Google has been more than linear, approaching exponential growth. Without a system like planet-scale dashboards, managing this scale would be impossible.

Scopes in Action: A Grafana Deep Dive 🕵️‍♂️

Carl Bergquist dives into how Google’s “scope” concept has been implemented within Grafana. At its core, a scope in Grafana is a named set of filters, and those filters can be any combination of Prometheus label filters.

The user journey with scopes begins before opening a dashboard. Users select a scope (e.g., a namespace or a system they care about), and Grafana injects these filters into dashboard queries at runtime. This decouples the dashboard’s design from the specific labels of the underlying metrics, making dashboards incredibly reusable.
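The injection step might look something like the sketch below. It is deliberately naive: it assumes every selector in the query is written with braces and uses a regular expression, whereas Grafana parses the Prometheus query properly. The `inject_scope` function and the sample query are illustrative only.

```python
import re


def inject_scope(query: str, scope_filters: dict[str, str]) -> str:
    """Append the scope's label matchers to every {...} selector in the query.

    Simplifying assumptions: selectors always use braces, and no string
    literal in the query contains a brace.
    """
    def escape(value: str) -> str:
        return value.replace("\\", "\\\\").replace('"', '\\"')

    extra = ",".join(
        f'{label}="{escape(value)}"' for label, value in sorted(scope_filters.items())
    )
    if not extra:
        return query

    def add_matchers(match: re.Match) -> str:
        existing = match.group(1).strip().rstrip(",")
        merged = f"{existing},{extra}" if existing else extra
        return "{" + merged + "}"

    return re.sub(r"\{([^{}]*)\}", add_matchers, query)


scoped = inject_scope(
    'sum(rate(http_requests_total{code="500"}[5m])) / sum(rate(http_requests_total{}[5m]))',
    {"namespace": "checkout"},
)
print(scoped)
```

The dashboard author never mentions `namespace` in the query; the scope supplies it when the dashboard is viewed.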

Key Features of Scopes in Grafana:

  • Dynamic Filtering: Injecting selected scopes into queries ensures you only see telemetry data relevant to your chosen system.
  • Runtime Query Modification: Grafana parses Prometheus queries and injects labels based on the selected scope.
  • Ad Hoc Filters: A new filter box allows users to inject additional filters on the fly.
  • On-the-Fly Group By: Users can now group metrics by labels like “job” at runtime.
  • Scope-Aware Navigation: The left-hand navigation menu dynamically displays dashboards relevant to the selected scope. When you navigate between these dashboards, the scope and applied filters remain active.
  • Quick Navigation: Scopes can be found and applied through Grafana’s quick navigation, allowing for rapid drilling down into infrastructure.
  • Automated Scope Management: Scopes can be automatically generated based on metrics queries (e.g., clusters, services, namespaces) and can even be deleted based on TTL when infrastructure is decommissioned.
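Automated scope management could be approximated like this: scopes are created from the label values currently seen in metrics and dropped once they have not been seen for longer than a TTL. The `Scope` class, the `refresh_scopes` function, and the naming scheme are assumptions for illustration; the talk does not describe Grafana's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Scope:
    name: str
    filters: dict[str, str]
    last_seen: float  # timestamp in seconds


def refresh_scopes(
    scopes: dict[str, Scope],
    label: str,
    observed_values: set[str],
    now: float,
    ttl_seconds: float,
) -> dict[str, Scope]:
    """Create or refresh a scope per observed label value, then expire stale ones."""
    refreshed = dict(scopes)
    for value in observed_values:
        name = f"{label}:{value}"
        refreshed[name] = Scope(name, {label: value}, last_seen=now)
    # Drop scopes whose infrastructure has not reported within the TTL.
    return {n: s for n, s in refreshed.items() if now - s.last_seen <= ttl_seconds}


# First pass: two namespaces are reporting metrics.
scopes_initial = refresh_scopes({}, "namespace", {"checkout", "search"}, now=0, ttl_seconds=3600)
# Later pass: "search" has been decommissioned and ages out.
scopes_later = refresh_scopes(scopes_initial, "namespace", {"checkout"}, now=4000, ttl_seconds=3600)
print(sorted(scopes_initial))
print(sorted(scopes_later))
```

In practice `observed_values` would come from a metrics query, for example the distinct values of a cluster, service, or namespace label.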

The Power of Scopes:

  • Zooms In: Enables users to see only the metrics and dashboards relevant to them.
  • Increases Reusability: Dashboards built by experts can be used by many without modification.
  • Higher-Order Management: Observability platform teams can focus on generating scopes and connecting them to dashboards, rather than managing each dashboard individually.

When Are Scopes the Right Fit? 🤔

Scopes are a feature designed for high scale. They are most suitable for organizations that:

  • Use a Single Metric Database: This simplifies data aggregation and query processing.
  • Have a Central Observability Team: This team can manage configurations and understand the infrastructure enough to define scopes.
  • Employ Expert Dashboard Builders: When experts create dashboards for other teams, the reusability of scopes becomes highly valuable.

While scopes are currently experimental and will be an Enterprise and Cloud feature, many of the underlying improvements (such as better ad hoc filters, variables, and group-by functionality) are being integrated into open-source Grafana, benefiting the entire community.

This collaboration between Google and Grafana is a testament to the power of shared expertise. By bringing Google’s deep understanding of planet-scale observability into Grafana, both organizations and the broader community stand to benefit from significantly enhanced dashboarding capabilities, making observability more accessible and effective for everyone.
