Grafana Assistant: From Promising Prototype to Production Powerhouse 🚀

GrafanaCON attendees, get ready for a deep dive into the evolution of our AI agent, Grafana Assistant! Last year, we introduced this game-changer, capable of querying data, building dashboards from scratch, and generally making your life easier in Grafana, all through natural language. The response has been incredible, with thousands of you actively using Assistant, sharing your successes on social media, and even creating YouTube content showcasing its speed. We’ve seen hours of work slashed into mere minutes, transforming how users interact with Grafana.

The Growth Spurt: From 9 to 90+ Contributors 📈

A year later, Assistant has experienced phenomenal growth. We’ve gone from a single team of nine to over 90 contributors, merging nearly 4,000 pull requests. While PRs aren’t the perfect metric, they represent tangible improvements: enhanced tools, support for new data sources like AWS CloudWatch and SQL, and expanded capabilities. Assistant can now leverage knowledge graphs, assist with k6 load testing, and even help manage IRM incidents. And we’re not stopping here; its reach within Grafana continues to expand.

However, this rapid growth brings challenges. As Assistant became more complex, its behavior grew less predictable, and making changes became more difficult. Today, we’ll explore how we tackled three key challenges: making Assistant more useful with context engineering, iterating with confidence using coding agents and self-improvement loops, and understanding and trusting it in production with AI Observability.

Context is King: Making Assistant Smarter 🧠

An AI assistant’s usefulness hinges on the context it possesses. Without understanding your system, it wastes precious time deciphering what you have instead of helping you with what you need. We’ve invested heavily in context engineering, focusing on three crucial layers:

1. What You Have: Your Data and Infrastructure 📊

This layer encompasses your dashboards, data sources, services, and tables. We believe you shouldn’t have to re-explain your entire environment every time you interact with Assistant. Features like dashboard scans and knowledge graph-powered table discovery bring this context in automatically.

Assistant Memories ✨: With this new capability, Assistant scans your observability data sources (Prometheus, Loki, Tempo) every week. It discovers services, groups them into logical domains, and maintains a persistent memory for each domain: where services run, how they’re monitored, what they depend on, and their top metrics. The next time you ask a question, Assistant doesn’t start from scratch; it consults its memories and goes straight to querying logs, metrics, and traces to answer.

2. What You Know: Team and Company Knowledge 🤝

This is the invaluable context residing in your runbooks, triage guides, internal APIs, and documentation – the collective knowledge of your team and organization.

Skills 🛠️: To capture this, we built Skills. This feature allows you to embed this knowledge directly into Assistant and even import it from sources like GitHub repositories. Skills are more than just instructions; they are repeatable workflows that can be triggered as commands or utilized by the agent when needed. For instance, when asked about slow services, Assistant can search this shared knowledge base, find relevant runbooks (even if not explicitly mentioned), and execute tasks faster, more reliably, and repeatably. This empowers senior team members to document their expertise once, benefiting the entire on-call team. With MCP integrations, Skills can extend beyond Grafana, referencing AWS or internal APIs, or even taking actions like creating an entry in your internal wiki.

3. What You’re Doing: In-the-Moment Context 🎯

This layer focuses on your immediate task and objective when using Assistant.

Hooks and One-Click Actions 🖱️: Grafana app developers can use hooks to pass context automatically. For example, if you open Assistant from the Explore page, it already knows your selected data source and current query, so you can continue where you left off. For users, we’ve added one-click actions like “analyze this trace” or “explain this log line.” On dashboards, a literal pointer lets you indicate the panel to update without having to describe it. You can even attach images, allowing Assistant to recreate dashboards from screenshots, whether they’re Grafana dashboards or ones found online.

Managing Context Windows 💾: While more context is beneficial, context windows are not unlimited. We employ several techniques:

  • Deferred Loading: Less commonly used tools are loaded on demand, saving tokens.
  • Context Compaction: For long conversations, we retain only the most relevant parts of the context.
  • Summarization: Large tool outputs are summarized to keep only essential information, again saving tokens.
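As a rough illustration of the three techniques above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names (`estimate_tokens`, `summarize_tool_output`, `compact`, `ToolRegistry`), the four-characters-per-token estimate, truncation as a stand-in for model-based summarization, and recency as a stand-in for relevance. It is not how Assistant implements these techniques.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about four characters per token (assumption)."""
    return max(1, len(text) // 4)


def summarize_tool_output(output: str, max_tokens: int) -> str:
    """Stand-in for summarization: keep the head of the output and mark the cut.

    A real system would more likely ask a model for a summary.
    """
    if estimate_tokens(output) <= max_tokens:
        return output
    return output[: max_tokens * 4] + "\n[truncated]"


def compact(messages: list[str], budget: int) -> list[str]:
    """Keep the first message (the task) plus the newest messages that fit.

    Recency is used here as a simple proxy for relevance.
    """
    if not messages:
        return []
    head, tail = messages[0], messages[1:]
    remaining = budget - estimate_tokens(head)
    kept: list[str] = []
    for msg in reversed(tail):
        cost = estimate_tokens(msg)
        if cost > remaining:
            break
        kept.append(msg)
        remaining -= cost
    return [head] + kept[::-1]


class ToolRegistry:
    """Deferred loading: a tool definition is built only on first use."""

    def __init__(self, loaders: dict[str, Callable[[], str]]):
        self._loaders = loaders
        self.loaded: dict[str, str] = {}

    def get(self, name: str) -> str:
        if name not in self.loaded:
            self.loaded[name] = self._loaders[name]()
        return self.loaded[name]
```

The common thread is that each technique trades a small amount of fidelity (a dropped message, a shortened output, a tool description that is not yet loaded) for tokens that can be spent on the current task.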

More relevant context makes Assistant useful, but it also introduces complexity. Our goal is to translate this context into predictable and reliable behavior.

Iterating with Confidence: Self-Improvement Loops 🔄

What happens when an agent doesn’t behave as expected, despite all our efforts? We put our AI agents to the test with a suite of Observability tasks. Initially, an 82% pass rate seemed promising. However, when we looked at how consistently the agents passed those tasks, the numbers dropped significantly. This “flakiness” makes iteration challenging, because any change can affect both behavior and reliability. Simply adding more instructions often fails: programming behavior in natural language, especially at scale, is incredibly difficult.
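To see why a healthy-looking pass rate can hide flakiness, consider the small Python sketch below. The two metrics, the task names, and the results are all invented for illustration; the talk does not say how consistency was measured, so "a task counts only if it passes on every repeated run" is an assumption.

```python
def pass_rate(runs: dict[str, list[bool]]) -> float:
    """Fraction of all individual runs that passed."""
    outcomes = [ok for results in runs.values() for ok in results]
    return sum(outcomes) / len(outcomes)


def consistency(runs: dict[str, list[bool]]) -> float:
    """Fraction of tasks that passed on every repeated run (assumed definition)."""
    return sum(all(results) for results in runs.values()) / len(runs)


# Hypothetical results: four tasks, each run four times.
runs = {
    "build-dashboard": [True, True, True, True],
    "find-root-cause": [True, False, True, True],
    "write-query": [True, True, False, True],
    "explain-trace": [True, True, True, False],
}

print(f"pass rate:   {pass_rate(runs):.0%}")    # 81%
print(f"consistency: {consistency(runs):.0%}")  # 25%
```

In this made-up data, 13 of 16 runs pass, yet only one task in four passes every time, which is the kind of gap that makes changes hard to evaluate.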

The AI Improving AI Hypothesis 🤔

This led us to a bold idea: what if we could use AI to improve our AI?

The Self-Improvement Loop 💡: We devised a three-step self-improvement loop:

  1. Introspection: Discovering what’s working and what’s not.
  2. Reflection: Learning from experiences.
  3. Change: Making improvements.

This loop is powered by coding agents (like Claude Code) paired with internal tools and custom agent skills, which make analyzing transcripts and code more efficient.

Introspection: Defining “Good” ✅

Evaluating agents, especially for subtle Observability task failures, is tough. We use three methods:

  • Deterministic Checks: Verifying outcomes like the exact number of dashboard panels or the existence of a trace ID.
  • LLM Rubrics: Using an LLM judge with task-specific criteria (e.g., did it follow instructions, find the root cause?) to validate semantic correctness.
  • Fact-Based LLM Rubrics: Defining known good queries, running them in a controlled environment, and comparing the results to the agent’s output. This verifies actual work and accuracy against data.
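The deterministic and fact-based checks above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function names and the 5% tolerance are hypothetical, the dashboard is assumed to be a dict with a `panels` list, and the fact-based check compares numbers directly instead of using an LLM judge as the talk describes.

```python
import math


def check_panel_count(dashboard: dict, expected: int) -> bool:
    """Deterministic check: the dashboard has exactly the expected number of panels."""
    return len(dashboard.get("panels", [])) == expected


def check_trace_id_present(answer: str, trace_id: str) -> bool:
    """Deterministic check: the agent's answer mentions the expected trace ID."""
    return trace_id in answer


def check_against_reference(
    agent_value: float, reference_value: float, rel_tol: float = 0.05
) -> bool:
    """Fact-based check (simplified): compare the agent's reported number with the
    result of a known-good query run in the same controlled environment."""
    return math.isclose(agent_value, reference_value, rel_tol=rel_tol)


# Hypothetical task output and reference value.
dashboard = {"title": "Checkout service", "panels": [{"id": 1}, {"id": 2}, {"id": 3}]}
print(check_panel_count(dashboard, expected=3))
print(check_trace_id_present("Slow span found in trace abc123.", "abc123"))
print(check_against_reference(agent_value=104.0, reference_value=100.0))
```

Checks like these are cheap and repeatable, which is why they pair well with the more expensive LLM-based rubrics for semantic questions.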

These tasks form a benchmark suite, leading to a leaderboard that allows us to compare models and reliably measure agent performance, moving beyond handpicked demos.

o11y-bench: Our Open Benchmark 🌐

We’re proud to announce o11y-bench, our first open benchmark for Observability agents. It includes tasks, environments, grading logic, and our own results, empowering the community to reproduce our findings, define what “good” looks like, and contribute to building better Observability agents.

Reflection: Learning from the Evidence 🧐

The benchmark results provide detailed insights. We can see which tasks failed, analyze agent steps, tool calls, conversation transcripts, and the reasons for success or failure. This data, processed by coding agents, maps the system’s strengths, weaknesses, and opportunities for improvement.

Change: Targeted Improvements 🛠️

Growth requires change. Based on our learnings, we can make targeted adjustments to prompts, instructions, tool layers, or fix bugs. These proposed changes are then fed back into the introspection step to verify their effectiveness. Humans retain the final say in reviewing and validating these changes, ensuring we improve agents with confidence. This continuous loop optimizes for cost, latency, quality, and reliability, powered by coding agents and feedback.
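One iteration of the loop described above could be sketched like this in Python. The function and parameter names (`improvement_step`, `propose`, `evaluate`, `approve`) and the dict-based configuration are hypothetical; the sketch only shows the control flow of proposing a change, re-running the benchmark, and keeping a human in the final decision.

```python
from typing import Callable, Tuple

# Hypothetical: a configuration bundles prompts, instructions, and tool settings.
Config = dict


def improvement_step(
    current: Config,
    propose: Callable[[Config], Config],
    evaluate: Callable[[Config], float],
    approve: Callable[[Config, float, float], bool],
) -> Tuple[Config, float]:
    """Run one introspection -> reflection -> change cycle.

    The candidate is kept only if the benchmark score improves AND a human
    reviewer approves it; otherwise the current configuration is retained.
    """
    baseline = evaluate(current)     # introspection: how good are we now?
    candidate = propose(current)     # reflection + change: a targeted adjustment
    score = evaluate(candidate)      # introspection again: did it actually help?
    if score > baseline and approve(candidate, baseline, score):
        return candidate, score
    return current, baseline
```

Requiring both a measured improvement and an explicit approval means a change that merely looks plausible, or one that scores well for the wrong reasons, does not ship automatically.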

AI Observability: Trust in Production 📡

Even the best benchmarks operate in controlled environments. The real world, with its unpredictable user interactions, presents a final frontier.

Grafana AI Observability: Observing Our Agents 🧐

To conquer this, we’ve built Grafana AI Observability, our new platform for observing AI agents in production. It connects our agent-building experience with our Observability expertise, providing the insights we need to understand and manage agents like Grafana Assistant in the real world.

Public Preview Available Now! ✨

Grafana AI Observability is in public preview, allowing you to test it with your own agents today. Setup is straightforward, with clear instructions and agent skills for popular tools like Cursor, Claude Code, and Copilot, so you can have instrumentation in place within minutes.

Key Features: What We’re Monitoring 📊

Grafana AI Observability tracks essential metrics like token usage, error rates, time to first token, and, crucially, the costs associated with your agents.

Online Evaluations: Real-User Behavior 🧑‍💻

A core feature is online evaluations, answering the critical question: how is Assistant behaving with real users? We set up evaluators to define correct behavior:

  • LLM Judge: Ensures Assistant grounds its answers in real data and avoids hallucinations.
  • Security Evaluator: Detects prompt injection attempts and ensures secure responses.
  • Heuristic Evaluator: Monitors response length, ensuring concise and readable output.

You can create custom evaluators or use templates. We then decide where these evaluations run, for example, on 20% of Grafana Assistant’s user-visible generations. Alerting is simple to integrate, so you can be notified if the pass rate falls below a set threshold (e.g., 80%).
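The sampling and alerting logic described here is simple enough to sketch in Python. This is an illustration only, not the Grafana AI Observability API: the function names and the 2,000-character length limit are assumptions, while the 20% sample rate and 80% threshold come from the examples in the text.

```python
import random


def should_evaluate(rng: random.Random, sample_rate: float) -> bool:
    """Decide whether a given generation is sampled for online evaluation."""
    return rng.random() < sample_rate


def length_evaluator(response: str, max_chars: int = 2000) -> bool:
    """Heuristic evaluator: pass if the response stays within a length limit."""
    return len(response) <= max_chars


def pass_rate_alert(results: list[bool], threshold: float) -> bool:
    """Return True when the pass rate has fallen below the alert threshold."""
    if not results:
        return False  # nothing evaluated yet, so nothing to alert on
    return sum(results) / len(results) < threshold


# Evaluate roughly 20% of generations, and alert below an 80% pass rate.
rng = random.Random(42)
responses = ["short answer"] * 1000
results = [length_evaluator(r) for r in responses if should_evaluate(rng, 0.20)]
print(pass_rate_alert(results, threshold=0.80))
```

Sampling keeps evaluation cost proportional to a fraction of traffic, while the threshold turns a stream of individual pass/fail results into a single signal that can page someone.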

Measuring and Responding 📈

With evaluations running, we gain measurable data on Assistant’s behavior. We can monitor trends, receive alerts, and drill down into specific conversations that failed evaluations. The conversation view provides all details: user interaction, metadata, and evaluation results. If an evaluation fails (e.g., “groundedness”), we get insights into why. This allows us to determine if it’s a one-off issue or a pattern. If it’s a pattern, we turn it into a test case for our self-improvement loop.

This is how we observe our AI agents in Grafana in a truly useful way. We encourage you to test Grafana AI Observability and join us for AI demonstrations to explore more features.

The Future of Agentic Systems 🤖

In scaling up Assistant, we addressed three fundamental challenges: making it more useful with context engineering, iterating with confidence through self-improvement loops, and building trust in production with AI Observability.

The path to better agents and agentic systems isn’t solely about bigger, more expensive models. It’s about providing more relevant context, building and utilizing effective improvement loops for tuning, and diligently observing agents in production. This is how we transform impressive demos into truly useful tools. Thank you!
