From Chaos to Clarity: Revolutionizing Incident Response with AI and MCP 🚀

In the world of modern software, the transition from monoliths to microservices has brought unparalleled scale but also a massive headache for on-call engineers. If you have ever been paged at 2:00 AM, you know the drill: a single user request might touch five to ten microservices, each with its own database, cache, and deployment pipeline. When things break, the noise is deafening.

Makarand Gujarathi, a senior software engineer and researcher, proposes a game-changing shift: using Large Language Models (LLMs) and the Model Context Protocol (MCP) to transform how we handle distributed system failures.


😫 The High Cost of Manual Investigation

Today, diagnosing an incident is a grueling manual process. Makarand highlights several critical pain points that plague modern SRE (Site Reliability Engineering) teams:

  • The Investigation Gap: Engineers often spend two to four hours just trying to locate the source of a non-trivial incident. This is time spent searching, not fixing.
  • Context Switching: On-call responders must manually correlate data across fragmented tools like Kibana, Splunk, Grafana, DataDog, and distributed tracing UIs.
  • The Expertise Bottleneck: Senior engineers possess the tribal knowledge of which dashboards to trust, while junior engineers often sit idle, waiting for an expert to join the call.
  • Data Overload: The problem is rarely a lack of data; it is the overwhelming volume of logs and metrics that lack a common mental model.

๐Ÿ› ๏ธ Enter the AI-Assisted Solution: LLMs + MCP

The proposed solution combines the reasoning power of LLMs with the Model Context Protocol (MCP). Think of MCP as the standardized, safe interface that allows an AI to “speak” to your telemetry systems without giving it the keys to the kingdom.

What is MCP? 🌐

MCP acts as a bridge between the AI model and your data sources (like BigQuery, Snowflake, ClickHouse, or Prometheus). It provides:

  1. Schema Introspection: The AI asks what tables and metrics exist rather than guessing.
  2. Controlled Execution: It enforces read-only access and masks sensitive PII (Personally Identifiable Information).
  3. Standardization: It provides a consistent pattern across different tools, so you don’t have to rebuild the integration for every new database.
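These three guarantees can be pictured as a thin wrapper around a warehouse client. The sketch below is purely illustrative and does not use the official MCP SDK: `TelemetryBridge`, the `http_requests` schema, and the stubbed query results are all assumptions made for the example.

```python
import re

# Illustrative stand-in for an MCP server exposing telemetry safely:
# schema introspection, read-only execution, and PII masking.

SCHEMA = {
    "http_requests": ["timestamp", "region", "status_code", "endpoint", "latency_ms"],
}

# Mask anything that looks like an email address before results leave the bridge.
PII_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

class TelemetryBridge:
    def list_tables(self) -> dict:
        """Schema introspection: the AI asks what exists instead of guessing."""
        return {table: list(cols) for table, cols in SCHEMA.items()}

    def run_query(self, sql: str) -> list:
        """Controlled execution: reject anything that is not a read."""
        if not sql.lstrip().lower().startswith("select"):
            raise PermissionError("read-only access: only SELECT is allowed")
        rows = self._execute(sql)  # delegate to the real warehouse client
        return [self._mask(row) for row in rows]

    def _mask(self, row: dict) -> dict:
        return {k: PII_PATTERN.sub("<redacted>", v) if isinstance(v, str) else v
                for k, v in row.items()}

    def _execute(self, sql: str) -> list:
        # Stub standing in for BigQuery/Snowflake/ClickHouse execution.
        return [{"endpoint": "/pay", "user_email": "jane@example.com", "status_code": 502}]

bridge = TelemetryBridge()
print(bridge.list_tables())
print(bridge.run_query("SELECT * FROM http_requests"))
```

The point of the pattern is that the model never holds credentials or write access; every request passes through the same narrow, auditable interface regardless of which backend sits behind it.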

🔄 The New Incident Workflow: Step-by-Step 🎯

With an AI agent powered by MCP, the investigation workflow changes fundamentally:

  1. Natural Language Input: An engineer describes the problem: “We see a spike in 5xx errors in the us-west-2 region after the 10:15 UTC deployment.”
  2. Query Generation: The AI extracts key details and generates SQL or telemetry queries automatically. It doesn’t need to remember table names; it looks them up via MCP.
  3. Iterative Analysis: The AI doesn’t just run one query. It analyzes the results, notices that errors are only on the /pay endpoint, and automatically runs a follow-up query to check the payment provider’s latency.
  4. Hypothesis Proposing: Finally, the AI presents a summary: “The error spike correlates with version 1.2.4 and downstream timeout errors from the payment provider.”
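The four steps above form a loop: each step’s output becomes the next step’s input until a hypothesis emerges. Here is a minimal sketch with both the LLM and the telemetry backend stubbed out; `propose_query`, `run`, `analyze`, and the canned results are hypothetical stand-ins for model calls routed through an MCP bridge.

```python
def propose_query(observation: str) -> str:
    """Step 2: turn the latest observation into the next telemetry query."""
    if "5xx" in observation:
        return ("SELECT endpoint, count(*) FROM errors "
                "WHERE region = 'us-west-2' GROUP BY endpoint")
    return "SELECT p99_latency_ms FROM downstream WHERE service = 'payment-provider'"

def run(sql: str) -> dict:
    """Stand-in for read-only execution via the MCP bridge (canned results)."""
    if "FROM errors" in sql:
        return {"/pay": 1840, "/cart": 3}
    return {"p99_latency_ms": 9200}

def analyze(result: dict) -> str:
    """Step 3: interpret results; emit a hypothesis once the evidence allows."""
    if result.get("/pay", 0) > 1000:
        return "errors concentrated on the /pay endpoint"
    if result.get("p99_latency_ms", 0) > 5000:
        return "hypothesis: downstream payment-provider timeouts after the 1.2.4 deploy"
    return "no clear signal"

def investigate(report: str, max_steps: int = 3) -> str:
    """Step 1 in, step 4 out: iterate until a hypothesis forms or steps run out."""
    finding = report
    for _ in range(max_steps):
        finding = analyze(run(propose_query(finding)))
        if finding.startswith("hypothesis:"):
            break
    return finding

print(investigate("We see a spike in 5xx errors in us-west-2 after the 10:15 UTC deploy"))
```

A real agent replaces the stubs with model calls and live queries, but the control flow is the same: observe, query, narrow, repeat.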

๐Ÿ—๏ธ The Secret Sauce: Context Engineering

Makarand emphasizes that an AI is only as good as the context you provide. To make this work, teams must invest in Context Engineering:

  • Schema Documentation: You must define what columns mean and what units (seconds vs. milliseconds) they use. 📝
  • Curated Query Libraries: Instead of letting the AI “hallucinate” queries, provide a library of pre-tested, reliable SQL patterns for common investigations. 📚
  • Architecture Diagrams: When the AI understands service dependencies (e.g., Checkout depends on Inventory), it can suggest smarter investigative paths. 🗺️
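These three artifacts can live as plain data that gets assembled into the agent’s context at investigation time. A hedged sketch, with every column doc, query, and dependency invented for illustration:

```python
# Illustrative context artifacts: schema docs, a curated query library,
# and a dependency map, assembled into one prompt context.

SCHEMA_DOCS = {
    "http_requests.latency_ms": "request latency in milliseconds (not seconds)",
    "http_requests.status_code": "HTTP status; 5xx indicates server errors",
}

QUERY_LIBRARY = {
    "error_spike_by_endpoint": (
        "SELECT endpoint, count(*) AS errs FROM http_requests "
        "WHERE status_code >= 500 GROUP BY endpoint ORDER BY errs DESC"
    ),
}

DEPENDENCIES = {"checkout": ["inventory", "payment-provider"]}

def build_context(incident: str) -> str:
    """Assemble the context an MCP-backed agent (or a human) would receive."""
    parts = ["# Incident", incident, "# Column docs"]
    parts += [f"{col}: {doc}" for col, doc in SCHEMA_DOCS.items()]
    parts += ["# Vetted queries"]
    parts += [f"{name}: {sql}" for name, sql in QUERY_LIBRARY.items()]
    parts += ["# Service dependencies"]
    parts += [f"{svc} -> {', '.join(deps)}" for svc, deps in DEPENDENCIES.items()]
    return "\n".join(parts)

print(build_context("5xx spike on /pay after the 10:15 UTC deploy"))
```

Keeping this material in version control means the same documents serve human onboarding and AI prompting, and reviews of the query library double as reviews of what the agent is allowed to suggest.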

📊 Impact: Before vs. After

| Feature | Before AI Assistance | After AI Assistance |
| --- | --- | --- |
| Investigation effort | 2 to 4 hours of manual searching | Significant reduction; hypothesis in minutes |
| Tooling | Manual context switching between 5+ tools | Unified natural language interface |
| Knowledge | Senior experts are the bottleneck | Junior engineers are empowered to lead |
| Querying | Manual SQL/filter building | Automated SQL generation and refinement |

โš ๏ธ Challenges and Tradeoffs

While promising, this approach is not a “magic button.” Makarand identifies several hurdles:

  • Telemetry Heterogeneity: Different teams use different logging formats and naming conventions, making correlation difficult.
  • Data Quality: If metrics are aggregated too aggressively or logs are delayed, the AI’s reasoning will suffer.
  • Automation vs. Judgment: We must balance automation with human oversight. The AI should recommend and explain, but humans must make the final call on high-risk actions like rolling back a deployment. ⚖️

🚀 How to Get Started

If you want to bring AI-assisted incident response to your organization, Makarand suggests a three-step path:

  1. Assess Telemetry Maturity: Fix your basic telemetry hygiene first. Ensure your logs, metrics, and traces are consistent. 🧹
  2. Develop Context Materials: Build the “knowledge base” of schema docs and architecture diagrams that both humans and AI can use. 📖
  3. Pilot AI Diagnostics: Start small. Give an MCP-enabled agent read-only access to one or two services. Collect feedback, iterate on the prompts, and expand as confidence grows. 🧪
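The scoping in step 3 can be as simple as an explicit allowlist the MCP bridge consults before every call, widened only as confidence grows. A sketch with made-up service names:

```python
# Illustrative pilot scope: the agent may only read, and only from
# services explicitly enrolled in the pilot.

PILOT_ALLOWLIST = {"checkout", "payments"}

def is_authorized(service: str, operation: str) -> bool:
    """Permit only reads, and only against services in the pilot."""
    return operation == "read" and service in PILOT_ALLOWLIST

def expand_pilot(service: str) -> None:
    """Enroll another service once feedback on the current pilot is positive."""
    PILOT_ALLOWLIST.add(service)

print(is_authorized("checkout", "read"))    # enrolled, read-only: allowed
print(is_authorized("checkout", "write"))   # writes are always refused
print(is_authorized("inventory", "read"))   # not yet enrolled: refused
```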

The Goal: We aren’t removing the human from the loop. We are making the loop faster, more consistent, and less stressful for everyone involved. 🦾✨
