🕵️‍♂️ When the Stack Lies: Unmasking Root Cause in Distributed Systems

In the high-stakes world of modern enterprise technology, the “stack” often tells half-truths. As we migrate from classic three-tiered architectures to hyper-distributed environments, the complexity of our systems is outstripping our ability to monitor them. Daniel Raskin (CMO at Vertana) and David McNerney (Director of Product Management at Vertana) recently sat down to dissect why traditional observability is failing and how a system-aware approach can save the day.

🏗️ The Deconstruction of the Modern Application

The enterprise landscape has shifted dramatically. We no longer manage self-contained applications; we manage ecosystems. Today’s applications live across every axis of the enterprise—spanning on-prem data centers, public clouds, and microservices architectures.

These systems rely on a web of dependencies:

  • Identity and Access Management (IAM) services.
  • AI Factories running machine learning models and Large Language Models (LLMs).
  • Distributed databases and third-party APIs.

When an airline reservation system or a global ATM network fails, the impact is catastrophic. It destroys brand equity and carries enormous financial cost. In these distributed environments, failures are systemic. A database latency issue or a pipeline stall might not technically “break” a component, but it can stop the entire system from functioning.

🏎️ The AI Paradox: 150 MPH into the Fog

There is a massive dissonance between the boardroom and the server room. Vertana commissioned a study of Global 2000 businesses that revealed a startling paradox:

  1. The Executive View: Leaders are plowing forward at 150 mph, refactoring applications for AI and assuming the organization is ready for “AI Factories.”
  2. The Practitioner View: SREs and IT professionals see visibility gaps, fragmented tools, and significant operational risk.

This dissonance creates a dangerous environment. If we lack the operational sophistication to handle current complexity, adding AI workloads will only accelerate our path toward disaster. We must close this gap by shifting our thinking from siloed monitoring to full-stack system-aware observability.

📉 Why Your Current Dashboards Are Lying

Historically, observability has lived in silos: application observability, infrastructure observability, and now, AI factory observability.

The problem? Silos show symptoms, not causes. 🚫

If an application struggles, your Application Performance Monitoring (APM) tool might alert you that the app is down. However, it cannot see that the root cause is an LLM inference delay or GPU resource contention. You end up with a “war room” full of teams pointing at green dashboards while the customer experience is failing.

To find the truth, we must move beyond code-level observability and embrace system-level observability. This means seeing the end-to-end journey from the application layer down through the service layer (API gateways), the AI factory (tokens and GPUs), and the underlying network and storage infrastructure.
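To make the idea of an end-to-end journey concrete, here is a minimal, purely illustrative sketch: telemetry events from every layer share a trace ID, and stitching them together in stack order reveals the whole path of a single request. The layer names, event records, and `end_to_end_journey` helper are assumptions for illustration, not any vendor's API.

```python
# Hypothetical sketch: stitching one request's telemetry across layers by a
# shared trace ID. Layer names and event records are illustrative.

LAYERS = ["application", "service", "ai_factory", "network", "storage"]

events = [
    {"trace_id": "req-42", "layer": "service", "detail": "API gateway routed"},
    {"trace_id": "req-42", "layer": "application", "detail": "checkout request"},
    {"trace_id": "req-42", "layer": "ai_factory", "detail": "LLM inference took 2.8s"},
    {"trace_id": "req-99", "layer": "application", "detail": "unrelated request"},
]

def end_to_end_journey(trace_id, events):
    """Return one request's events ordered from top of the stack to bottom."""
    hops = [e for e in events if e["trace_id"] == trace_id]
    return sorted(hops, key=lambda e: LAYERS.index(e["layer"]))

for hop in end_to_end_journey("req-42", events):
    print(f'{hop["layer"]:>12}: {hop["detail"]}')
```

With a shared trace ID, a 3-second checkout timeout at the application layer can be read in the same view as the 2.8-second inference hop that actually caused it.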

🧠 The Architecture of Operational Truth

To build a system that actually works, we need more than just a data lake of Metrics, Events, Logs, and Traces (MELT). We need a System Dependency Graph.

  • The Logic Layer: A system dependency graph continually analyzes data to discover relationships across the stack.
  • Smarter AI: Autonomous IT operations require a high-fidelity data model. If you put an AI agent on top of a single silo, it will make the same mistakes as a human, just at machine speed.
  • Democratized Data: By using an MCP (Model Context Protocol) Server, organizations can expose system data to LLMs such as Claude, OpenAI’s GPT models, or Gemini. This allows anyone on the team to use natural language to query complex system states.
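The dependency-graph idea above can be sketched in a few lines: edges point from each component to what it depends on, and a breadth-first walk over reversed edges surfaces everything a failing dependency can take down. The component names and the `blast_radius` helper are hypothetical, chosen only to illustrate the relationship discovery the logic layer performs.

```python
# Hypothetical system dependency graph: component -> direct dependencies.
# A breadth-first walk finds every component a failure can propagate to.
from collections import deque

deps = {
    "checkout-app": ["api-gateway", "cart-service"],
    "cart-service": ["db-service", "llm-inference"],
    "llm-inference": ["gpu-pool"],
    "api-gateway": [],
    "db-service": ["storage-array"],
}

def blast_radius(failed, deps):
    """Everything that transitively depends on the failed component."""
    impacted, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for component, needs in deps.items():
            if node in needs and component not in impacted:
                impacted.add(component)
                queue.append(component)
    return impacted

print(blast_radius("db-service", deps))
# the db failure reaches cart-service and checkout-app
```

This is exactly why a siloed view fails: an agent watching only `checkout-app` sees a symptom, while the graph shows the failure entering from two layers below.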

🛠️ Real-World Resolution: The Vertana Approach

David McNerney demonstrated how this works in practice using the Vertana platform. Imagine a scenario where a checkout service is timing out.

1. Identifying the Breach 🚨

The system detects that response times have spiked to 3 seconds, while the Service Level Objective (SLO) is a crisp 0.09 seconds. Instead of a cryptic error code, the platform provides a plain-English summary of the breach.
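The breach check itself is simple to picture. The sketch below uses the demo's numbers (a 3-second response against a 0.09-second SLO) but the `check_slo` function and its message format are illustrative assumptions, not Vertana's implementation.

```python
# Minimal sketch of SLO breach detection with a plain-English summary.
# The service name and figures mirror the demo; the function is illustrative.

def check_slo(service, observed_s, slo_s):
    if observed_s <= slo_s:
        return f"{service} healthy: {observed_s:.2f}s within the {slo_s:.2f}s SLO"
    factor = observed_s / slo_s
    return (f"SLO breach on {service}: responses at {observed_s:.2f}s, "
            f"{factor:.0f}x over the {slo_s:.2f}s objective")

print(check_slo("checkout-service", 3.0, 0.09))
```

The point is the output format: a sentence an on-call engineer can act on, rather than a bare error code.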

2. Vertical Topology 🌐

Using Vertana’s Trace Topology, David visualized the application’s path across a hybrid environment—including GKE (Google Kubernetes Engine), Azure VMs, and on-prem storage. The vertical topology maps the application “hops” directly to the underlying Kubernetes namespaces and nodes.

3. Automated Root Cause Analysis (RCA) 🔍

Vertana’s RCA engine uses a Fishbone Diagram to separate symptoms from causes. In this demo:

  • Symptoms: Cart service and web server errors.
  • Root Cause: The DB Service experienced disk errors and a certificate expiration warning.
  • Result: David identified the true cause in just 2 minutes, a process that typically takes hours across multiple teams.
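One simple way to picture the symptom-versus-cause separation: on a dependency graph, an erroring component whose dependencies are all healthy is a candidate root cause, while erroring components downstream of it are symptoms. The sketch below is an illustrative heuristic in that spirit, not Vertana's RCA engine, and all names and error data are made up.

```python
# Illustrative symptom-vs-cause separation on a dependency graph.
# Root cause: an erroring component with no erroring dependency beneath it.

deps = {                      # component -> what it depends on
    "web-server": ["cart-service"],
    "cart-service": ["db-service"],
    "db-service": [],
}
erroring = {"web-server", "cart-service", "db-service"}

def classify(erroring, deps):
    causes = {c for c in erroring
              if not any(d in erroring for d in deps.get(c, []))}
    return causes, erroring - causes

causes, symptoms = classify(erroring, deps)
print("root causes:", causes)    # db-service
print("symptoms:   ", symptoms)  # web-server, cart-service
```

In the demo scenario this is the shape of the answer: the cart service and web server errors sit downstream of the one component, the DB service, whose own dependencies were clean.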

4. Taking Action ⚡

Once the cause is known, the Action Engine takes over. You can:

  • Automate low-risk fixes (like deleting unattached EBS volumes).
  • Keep a Human in the Loop for production changes.
  • Use the Vertana Co-pilot to generate Jira tickets or Ansible playbooks using natural language. For example, David commanded the co-pilot to “Create a Jira ticket to update the certificate,” and the system instantly generated the YAML-based policy.
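The human-in-the-loop gate described above can be sketched as a small dispatch policy: actions on an allow-list run automatically, and anything else queues for approval. The action names, the `LOW_RISK` set, and the `dispatch` function are hypothetical illustrations of the pattern, not the Action Engine's actual interface.

```python
# Hypothetical human-in-the-loop gate for an action engine:
# low-risk actions run automatically, everything else waits for approval.

LOW_RISK = {"delete-unattached-ebs-volume", "restart-log-forwarder"}

def dispatch(action, target, approved=False):
    if action in LOW_RISK:
        return f"AUTO: {action} on {target}"
    if approved:
        return f"RUN (approved): {action} on {target}"
    return f"QUEUED for human approval: {action} on {target}"

print(dispatch("delete-unattached-ebs-volume", "vol-0abc"))
print(dispatch("rotate-certificate", "db-service"))
print(dispatch("rotate-certificate", "db-service", approved=True))
```

The design choice here is that risk classification, not automation capability, decides what runs unattended.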

🤖 Bringing Intelligence to the Ecosystem

The future of observability is portable. By leveraging an MCP Server and open-source orchestrators like Goose, SREs can fetch Vertana’s intelligence directly into their preferred AI tools. This allows for cross-tool investigation, fetching alerts from Vertana and correlating them with data in ServiceNow or other platforms seamlessly.
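At its core, cross-tool investigation is a join: alerts fetched from one platform are correlated with records from another on a shared key such as the service name. The sketch below fakes both data sources as literals; in a real setup each list would come back from an MCP tool call, and all field names here are assumptions.

```python
# Illustrative cross-tool correlation: join alerts from one platform with
# open tickets from another on the service name. Records are made up; a
# real setup would fetch both sides through MCP tool calls.

alerts = [{"service": "db-service", "alert": "disk errors"}]
tickets = [{"service": "db-service", "ticket": "SNOW-1234", "state": "open"},
           {"service": "web-server", "ticket": "SNOW-1235", "state": "closed"}]

def correlate(alerts, tickets):
    open_by_service = {t["service"]: t for t in tickets if t["state"] == "open"}
    return [{**a, "ticket": open_by_service.get(a["service"], {}).get("ticket")}
            for a in alerts]

print(correlate(alerts, tickets))
```

Once both sides are reachable through a common protocol, the same question can be asked in natural language and answered by an orchestrator performing this join on the engineer's behalf.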

The message is clear: To survive the era of distributed systems and AI, we must stop looking at the stack in pieces. We need to see the entire system to find the truth behind the lies. 🦾🌐🎯
