The Silent Killers: Unmasking Failures Beyond Your Dashboards 🕵️‍♀️

We’ve all been there. Alarms blaring, graphs spiking, the adrenaline rush of an incident. As engineers, we excel at fighting outages, diving into the chaos, and emerging victorious. But what if the most expensive reliability failures aren’t the ones that make noise? What if they’re the quiet, insidious ones that slow us down, all while our dashboards gleam with a reassuring green?

Abhimanyu Narwal, Engineering Team Lead for Security Services at Bloomberg, dives deep into this often-overlooked aspect of reliability in his compelling case study. He argues that the true cost of failure lies not in system downtime, but in the erosion of human certainty and the amplification of workflow latency.

The Fog of War: When Green Dashboards Lie 🌫️

Imagine a scenario where all your Service Level Indicators (SLIs) – latency, traffic, errors, saturation – look perfectly healthy. Your system is technically “up.” Yet, something feels off. This isn’t the panic of an outage; it’s the ambiguity of the “fog of war.” You’re not fighting a direct threat, but battling uncertainty. The hardest question becomes: “Are we seeing everything we should be seeing?” If the answer is “no” or “maybe,” every conclusion becomes provisional, and human response time stretches.

This talk isn’t about a breach, a hack, or proprietary secrets. It’s an anonymized case study and a reusable SRE pattern, extending familiar practices to complex, human-in-the-loop workflows.

The Case of the Diverging Counts: A Subtle Breakdown 📉

Abhimanyu’s team faced this exact challenge with a high-volume security pipeline. The workflow: events enter, get processed, results appear in a UI, and a human analyst validates and acts. For a long time, it ran smoothly. Then, a low-severity alert from an analyst flagged a divergence: the edge gateway reported receiving 100,000 events, but the downstream analysis engine was only showing 90,000.

Here’s the chilling part: nothing had crashed. The system was technically operational. But a gap existed, and in these workflows, uncertainty is operationally expensive. The crucial question arose: was this expected transformation lag, or a critical visibility gap? This ambiguity created two different realities for the same system.

The Truth Uncovered: A Silent Telemetry Failure 🤫

After a period of anxious investigation, the truth emerged. The processing engine was working flawlessly, and safety checks were in place. The culprit? A telemetry agent had silently failed, dropping logs before they reached the UI. Compute health was at 100%, but the team's cognitive capacity was overwhelmed: they couldn't prove the system was working as intended, even though it technically was.

This incident sparked a critical question: Why didn’t standard SRE alarms catch this? The answer was stark: their alarms were looking for the wrong thing. The golden signals – latency, traffic, errors, saturation – were all fine. They confirmed the system was alive, but failed to address the most vital question for this workflow: Is it explainable?

Redefining Latency: From Milliseconds to Meaningful Action ⏱️

The team realized they were measuring the wrong type of latency. Service latency measures milliseconds between request and response. What they desperately needed was workflow latency: the time it takes for a human to go from seeing a signal to confidently taking action. When telemetry drops, workflow latency skyrockets.
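
To make the distinction concrete, here is a minimal sketch of measuring workflow latency as the gap between a signal becoming visible to an analyst and the analyst confidently acting on it. The record fields and numbers are hypothetical illustrations, not details from the talk.

```python
from datetime import datetime

# Hypothetical analyst-action records: when a signal appeared in the UI
# versus when the analyst confidently acted on it.
records = [
    {"signal_visible_at": datetime(2024, 1, 1, 9, 0),  "action_taken_at": datetime(2024, 1, 1, 9, 4)},
    {"signal_visible_at": datetime(2024, 1, 1, 9, 10), "action_taken_at": datetime(2024, 1, 1, 9, 55)},
    {"signal_visible_at": datetime(2024, 1, 1, 9, 20), "action_taken_at": datetime(2024, 1, 1, 9, 31)},
]

def percentile(values, pct):
    """Nearest-rank percentile; precise enough for a sketch."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Workflow latency: signal seen -> confident action, not request -> response.
latencies = [(r["action_taken_at"] - r["signal_visible_at"]).total_seconds() for r in records]
print(f"workflow latency p50={percentile(latencies, 50):.0f}s p95={percentile(latencies, 95):.0f}s")
```

Applied to service latency, the same few lines would only measure request/response time; the point of this metric is that the clock keeps running while a human is still deciding.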

The New Reliability Playbook: SLIs and Design Tactics 🛠️

To combat this, they shifted their mindset, treating the entire human workflow as a distributed trace. They implemented three specific SLIs and three design tactics:

New SLIs for True Workflow Health:

  1. End-to-End Duration: Measuring the P50 and P95 time from event injection to UI visibility, alerting on consistent degradation that impacts the workflow. This catches slow degradation even when services appear healthy.
  2. Count Integrity: Comparing counts across pipeline stages with an explicit model of expected deltas and lags. They define expected lag windows, alerting only when divergences exceed defined thresholds for extended periods (see the first sketch after this list).
  3. Pipeline Observability (Synthetic Canary): Injecting a known-good event (a synthetic canary) at regular intervals to prove the workflow is functioning end-to-end (see the second sketch after this list). Abhimanyu’s key takeaway: introduce a canary and verify it at every stage. This proves not only that systems are up but also that they produce observable, trustworthy outputs. 🚀
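
Two hedged sketches of how these SLIs might be wired up. First, count integrity: the stage names, the 2% expected-lag allowance, and the three-window threshold are illustrative assumptions, not Bloomberg's actual values.

```python
from dataclasses import dataclass

@dataclass
class StageCounts:
    """Event counts observed over the same window at two adjacent pipeline stages."""
    upstream: int    # e.g. events accepted by the edge gateway
    downstream: int  # e.g. events visible in the analysis engine

def count_integrity(counts: StageCounts,
                    breach_streak: int,
                    expected_lag_fraction: float = 0.02,
                    max_breach_windows: int = 3) -> tuple[bool, int]:
    """Return (should_alert, updated_breach_streak).

    Divergence inside the expected-lag allowance is treated as normal
    transformation delay; only a sustained breach should page anyone.
    """
    missing = counts.upstream - counts.downstream
    divergence = missing / counts.upstream if counts.upstream else 0.0
    breach_streak = breach_streak + 1 if divergence > expected_lag_fraction else 0
    return breach_streak >= max_breach_windows, breach_streak

# The case study's 100,000 vs 90,000 gap is a 10% divergence, far beyond a 2%
# allowance, so it alerts once it has persisted for three consecutive windows.
alert, streak = False, 0
for _ in range(3):
    alert, streak = count_integrity(StageCounts(100_000, 90_000), streak)
print(alert)  # True
```

Second, the synthetic canary: inject a known-good event with a unique ID and confirm it is observable at every stage before a deadline. The stage-check functions here are placeholders for whatever lookups a real pipeline would expose.

```python
import time
import uuid

# Placeholder stage checks: in a real pipeline these would query the gateway,
# the processing engine, and the UI datastore for the canary's unique ID.
STAGES = {
    "gateway_received": lambda canary_id: True,
    "engine_processed": lambda canary_id: True,
    "ui_visible": lambda canary_id: True,
}

def run_canary(inject, deadline_seconds: float = 60.0) -> dict:
    """Inject one synthetic event, then verify it at every stage before the deadline."""
    canary_id = f"canary-{uuid.uuid4()}"
    inject(canary_id)  # hypothetical injection hook into the pipeline's front door
    deadline = time.monotonic() + deadline_seconds
    seen = {stage: False for stage in STAGES}
    while time.monotonic() < deadline and not all(seen.values()):
        for stage, check in STAGES.items():
            seen[stage] = seen[stage] or check(canary_id)
        time.sleep(1)
    return seen  # any False entry means the workflow is not provably healthy end to end

# Usage: alert on any stage that never surfaced the canary.
seen = run_canary(inject=lambda canary_id: None, deadline_seconds=5)
missing = [stage for stage, ok in seen.items() if not ok]
if missing:
    print("canary failed at:", missing)
else:
    print("canary verified at every stage")
```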

Design Tactics for Explainable Systems:

  1. Invariants as Reliability Contracts: Treating “everything received is eventually surfaced within a known time window and with a known transformation” as a contract. Making these invariants observable turns ambiguity into a measurable state.
  2. Partial Degradation as a First-Class State: Moving beyond just “up” and “down.” If logs drop, the UI should surface “telemetry delayed” or “coverage reduced,” preventing false confidence and wasted time (see the sketch after this list).
  3. Humans as the Reliability Endpoint: Measuring friction. Proxy metrics like repeated queries, pivots, escalations, and reopen rates reveal the true error rate. Before their fix, escalation loops multiplied as Tier 1 analysts couldn’t trust the UI and escalated everything. This friction was their key metric.
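
A hedged sketch of the second tactic: drive the UI state from observable invariants (telemetry freshness and coverage) rather than a boolean up/down, so a dropped telemetry agent shows up as “telemetry delayed” or “coverage reduced” instead of a misleading green. The state names and thresholds below are assumptions for illustration, not the team's actual values.

```python
from enum import Enum

class WorkflowState(Enum):
    HEALTHY = "healthy"
    TELEMETRY_DELAYED = "telemetry delayed"
    COVERAGE_REDUCED = "coverage reduced"

def workflow_state(telemetry_lag_seconds: float,
                   coverage_fraction: float,
                   max_lag_seconds: float = 300.0,
                   min_coverage: float = 0.98) -> WorkflowState:
    """Map invariant measurements onto an explicit, user-visible state.

    The invariant being watched: everything received is eventually surfaced
    within a known time window. When it is violated, the UI says so rather
    than staying green.
    """
    if coverage_fraction < min_coverage:
        return WorkflowState.COVERAGE_REDUCED
    if telemetry_lag_seconds > max_lag_seconds:
        return WorkflowState.TELEMETRY_DELAYED
    return WorkflowState.HEALTHY

# In the 90,000-of-100,000 scenario coverage is 0.9, so the banner would read
# "coverage reduced" instead of silently implying completeness.
print(workflow_state(telemetry_lag_seconds=30, coverage_fraction=0.9).value)
```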

Your Actionable Checklist ✅

Abhimanyu offers a simple, actionable checklist you can apply starting next week:

  • Pick one key workflow.
  • Define your invariants.
  • Add one synthetic canary.
  • Set up workflow SLOs (a rough sketch follows below).
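
As a rough illustration of that last checklist item (the target, threshold, and numbers are made up, not from the talk), a workflow SLO can be expressed as a compliance check over the end-to-end durations gathered above:

```python
# Hypothetical workflow SLO: 95% of events become visible in the UI within
# 10 minutes, evaluated over whatever window the team agrees on.
SLO_TARGET = 0.95
VISIBILITY_THRESHOLD_SECONDS = 600

def slo_compliance(e2e_durations_seconds):
    """Fraction of events that met the end-to-end visibility threshold."""
    if not e2e_durations_seconds:
        return 1.0  # nothing observed; trivially compliant (or flag separately as "no data")
    ok = sum(1 for d in e2e_durations_seconds if d <= VISIBILITY_THRESHOLD_SECONDS)
    return ok / len(e2e_durations_seconds)

durations = [45.0, 120.0, 900.0, 80.0, 75.0]  # seconds, illustrative
compliance = slo_compliance(durations)
print(f"compliance={compliance:.0%} target={SLO_TARGET:.0%} breached={compliance < SLO_TARGET}")
```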

The True Cost of Silence 💔

Abhimanyu leaves us with a profound thought: Silent failures don’t always take the system down; they take certainty down. If dashboards only measure compute health, we miss cognitive failures. Reliability isn’t just about keeping systems running; it’s about keeping decisions fast, explainable, and trustworthy under pressure.

You don’t need a perfect system from start to finish. You just need one workflow that is measurably explainable.


Interested in learning more from Bloomberg? Check out tech.bloomberg.com. Looking to join their innovative team? Visit bloomberg.com/careers or connect with Abhimanyu on LinkedIn.
