Hello everyone! I’m Uudit Misra, and I was thrilled to speak at Conf42 Cloud Native 2026. My daily life on Salesforce’s Kubernetes platform team involves living and breathing Kubernetes and cloud-native tech, so today’s topic, “Beyond Dashboards: Deep Dive Network Observability with eBPF,” comes straight from the trenches.
Before joining Salesforce, I developed distributed and scalable applications in cloud networking at Microsoft. I know firsthand how incredibly hard it is to debug high-stakes on-call issues. Ever been on call, pulled up your dashboards, seen everything green, yet your users are having a terrible time? That frustrating gap – where your dashboards lie and something is clearly broken – is exactly what we’re tackling. By the end, you’ll understand why this happens at the network layer and what you can do about it. Let’s dive in!
The Dashboard Deception: When “Green” Means Trouble 🚨
Imagine this scenario: you’re on call, a critical support ticket screams in – users are hitting 500 errors, revenue is dropping. You rush to your dashboards: CPU looks normal, memory is fine, error rates are near zero. Everything is green.
You dig deeper. Logs reveal nothing obvious. Traces show requests entering the system, but then… silence. No response, no errors, no alerts. After hours of frantic searching, you discover the culprit: a network policy rule silently dropped traffic between two services. No logs emitted, no alert fired. The packet just vanished.
This isn’t a rare edge case. I’ve personally seen three common failure modes in production:
- Silent Drops: A policy blocks traffic, and there’s no log, no trace, nothing.
- DNS Timeouts: During pod startup, a race condition causes DNS lookups to fail for a few critical seconds, completely undetected by monitoring.
- Wrong Backends: A simple label change in a deployment accidentally routes traffic to the wrong pod, again, with zero information about the misrouting.
These aren’t bugs in Kubernetes; they are inherent limitations when your monitoring only watches what you told it to watch in advance. We need to tackle these major problems head-on.
Monitoring vs. Observability: A Crucial Distinction 💡
Before we jump to solutions, let’s clarify two often-interchanged terms: monitoring and observability.
- Monitoring is like the “check engine” light in your car. Someone decided in advance: if oil pressure drops below X, turn on the light. It works great for known, anticipated problems. But if something unexpected goes wrong that nobody predicted, the light stays off. You remain in the dark.
- Observability, on the other hand, is like having an expert mechanic who can look under the hood and tell you exactly what’s happening, even for the things you didn’t predict.
At the network layer, traditional monitoring provides aggregate counters: total requests per second, error rates, latency. While useful, it doesn’t tell you which specific pod made a request, why a packet was dropped, or the latency of a particular DNS query.
True network observability means flow-level records – every connection individually. It tells you the source pod, the destination pod, what happened, and if a packet was dropped, the exact reason why. That’s the bar we’re aiming for.
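To make the contrast concrete, here is a minimal sketch of the two views over the same traffic. The `FlowRecord` class, its field names, and the sample pods are all hypothetical and made up for illustration; they are not the schema of any particular tool.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlowRecord:
    """One observed connection. Field names are hypothetical, for illustration."""
    source_pod: str
    destination_pod: str
    verdict: str                # "FORWARDED" or "DROPPED"
    drop_reason: Optional[str]  # populated only when verdict == "DROPPED"


# Made-up sample data, not real cluster output.
flows = [
    FlowRecord("checkout-7d9f", "payments-5c2a", "FORWARDED", None),
    FlowRecord("checkout-7d9f", "payments-5c2a", "DROPPED", "POLICY_DENIED"),
    FlowRecord("frontend-1a2b", "checkout-7d9f", "FORWARDED", None),
]

# Monitoring view: one aggregate number per verdict. The "who" and "why" are gone.
verdict_totals = Counter(flow.verdict for flow in flows)

# Observability view: the individual records still answer "which pod, and why?".
dropped = [flow for flow in flows if flow.verdict == "DROPPED"]

print(dict(verdict_totals))
for flow in dropped:
    print(f"{flow.source_pod} -> {flow.destination_pod}: {flow.drop_reason}")
```

The counter is cheap to store and easy to alert on, but only the per-flow records can name the pods involved in the drop.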
The iptables Bottleneck: Why Traditional Approaches Fall Short 🐢
The traditional approach to Kubernetes networking relies heavily on iptables. If you’ve ever peered into an iptables ruleset, you know it’s a giant, linear list. Every packet walks down this list, one rule at a time. This works fine at a small scale, but it breaks down as things grow.
- It’s Slow: More network policy rules mean longer lookups per packet. Published IEEE research confirms that latency measurably degrades as rule count increases. At production scale with hundreds of services, this matters immensely.
- It’s Shallow: iptables only sees IP addresses and port numbers – essentially Layer 3 and Layer 4 information. It’s completely blind to HTTP paths, DNS queries, gRPC service names, and all other Layer 7 application-level details.
- No Identity: iptables sees IPs, not pod names. Even if you know something was dropped, you don’t automatically know which pod was involved.
- Silent Drops: When iptables drops a packet, there’s no log, no event, nothing. The packet just disappears, leaving you clueless during an incident.
These limitations make it impossible to answer critical incident questions like: “Why did traffic between these pods stop?” “What DNS queries are timing out?” or “Is my network policy actually doing what it needs to?” We need something that goes deep.
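The “It’s Slow” point comes down to how many rules a packet has to be compared against. The sketch below is a toy model only, with made-up rules and hypothetical function names: it is not how iptables or any eBPF datapath is implemented, it simply counts comparisons for a top-to-bottom scan and contrasts that with a single keyed lookup.

```python
from typing import Dict, List, Optional, Tuple

Rule = Tuple[str, int, str]  # (destination IP, destination port, action)


def linear_match(rules: List[Rule], dst_ip: str, dst_port: int) -> Tuple[Optional[str], int]:
    """Walk the rule list top to bottom, the way a linear chain is evaluated.

    Returns the matching action (or None) and how many rules were inspected.
    """
    for inspected, (ip, port, action) in enumerate(rules, start=1):
        if ip == dst_ip and port == dst_port:
            return action, inspected
    return None, len(rules)


def build_index(rules: List[Rule]) -> Dict[Tuple[str, int], str]:
    """Index rules by (IP, port) so a lookup is one hash probe; first match wins."""
    index: Dict[Tuple[str, int], str] = {}
    for ip, port, action in rules:
        index.setdefault((ip, port), action)
    return index


# 1,000 made-up rules; the packet we care about matches the very last one.
rules = [(f"10.0.{i // 256}.{i % 256}", 8080, "ACCEPT") for i in range(1000)]
target_ip, target_port = rules[-1][0], rules[-1][1]

action, inspected = linear_match(rules, target_ip, target_port)
indexed_action = build_index(rules).get((target_ip, target_port))

print(f"linear scan inspected {inspected} rules, result {action}")
print(f"hash lookup result {indexed_action}")
```

In the linear version the work per packet grows with the number of rules, while the keyed lookup stays roughly constant no matter how many rules exist.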
Enter eBPF: Your Deep Dive Network Superpower 🦸‍♂️
eBPF is the game-changer for network observability. It unlocks answers to seven previously impossible questions during an incident:
- Which pod initiated this connection? You can answer that now!
- Why did the packet get dropped? You get the exact kernel-level reason, not just a guess.
- What services does pod X depend on? You get a live flow map of what’s actually happening, not just what your config says.
- Is DNS the reason the call is slow? You can see individual DNS queries per pod with latency.
- Is my network policy working? Watch its verdict in real time: forwarded, dropped, or otherwise.
- What HTTP path is generating 500 errors? You get Layer 7 visibility without touching the app.
- Which pods are talking to the internet? Egress visibility is right there.
And here’s the best part: all of this happens without a sidecar on every pod, without changing application code, and without restarting anything! eBPF unlocks three fundamental building blocks: per-flow records, DNS visibility, and drop reason telemetry.
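Once you have per-flow records, several of these questions reduce to simple grouping. Here is a minimal sketch for two of them, the dependency map and egress visibility. The flow tuples and pod names are made up, and using the string "world" to mark traffic leaving the cluster is an assumption borrowed from the label Hubble displays for out-of-cluster endpoints.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, Iterable, Set, Tuple

Flow = Tuple[str, str]  # (source pod, destination)

# Made-up observations; "world" stands for a destination outside the cluster.
observed_flows = [
    ("frontend-1a2b", "checkout-7d9f"),
    ("checkout-7d9f", "payments-5c2a"),
    ("checkout-7d9f", "kube-dns"),
    ("payments-5c2a", "world"),
    ("frontend-1a2b", "checkout-7d9f"),
]


def dependency_map(flows: Iterable[Flow]) -> Dict[str, Set[str]]:
    """Answer "what does pod X depend on?" from traffic that actually happened."""
    deps: DefaultDict[str, Set[str]] = defaultdict(set)
    for source, destination in flows:
        deps[source].add(destination)
    return dict(deps)


def egress_pods(flows: Iterable[Flow]) -> Set[str]:
    """Answer "which pods are talking to the internet?"."""
    return {source for source, destination in flows if destination == "world"}


deps = dependency_map(observed_flows)
internet_talkers = egress_pods(observed_flows)

print(sorted(deps["checkout-7d9f"]))
print(sorted(internet_talkers))
```

The point of the sketch is that the map is derived from observed traffic rather than from configuration, so it reflects what the cluster is really doing.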
How eBPF Works its Magic 🪄
An eBPF agent runs as a DaemonSet in your cluster, meaning Kubernetes schedules one copy on every single node automatically. Below your application pods and the agent, within the Linux kernel, are hook points where lightweight eBPF programs attach. These programs watch everything flowing through the kernel without changing any kernel-level code.
The agent then sends this invaluable data to various places: flow records to Hubble Relay (powering its CLI and UI) and metrics to Prometheus (for Grafana dashboards). You can also watch live flows with commands like `hubble observe`. Your application pods remain completely oblivious, yet the deep visibility simply appears.
eBPF in Action: Retina vs. Cilium 🛠️
Let’s look at two specific eBPF-based tools: Retina and Cilium. I ran benchmarks on a real AWS EKS cluster with almost 500 QPS of sustained data flow between pods to compare them.
Retina: The Lightweight Observer 👁️
- Open-sourced by Microsoft in 2024, Retina is designed only for observability. It doesn’t replace your CNI or enforce policies; it just watches traffic.
- It’s CNI agnostic, meaning it works with any existing CNI.
- Retina configures major kernel hooks: TC hook for every ingress/egress packet, kprobe for TCP events, socket filter for DNS, and XDP for packet drops.
- Data flows from the kernel to user space via a ring buffer, where IP-to-pod mapping occurs. Metrics are then aggregated and sent to Prometheus.
- Cleverly, Retina emits data in Hubble format, allowing you to use the Hubble CLI immediately.
- Key Design Trait & Tradeoff: IP-to-pod mapping happens in user space, not at the kernel level. This makes Retina lightweight and CNI agnostic but creates an identity gap: it cannot reliably attach a pod identity to every flow. For example, in my tests, it often showed “world” instead of a specific pod name for the destination.
- L7 Visibility: Retina offers no Layer 7 visibility.
- Resource Cost: In my benchmark, Retina used 47% less memory per node than Cilium, and its CPU barely moved under the 500 QPS load. It’s incredibly efficient.
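The identity tradeoff above can be pictured with a small sketch of user-space enrichment. Everything here is an assumption made for illustration: the `RawEvent` shape, the `ip_to_pod` cache, and the idea that a cache miss is what produces “world” are a plausible model of the gap, not a description of Retina’s actual code.

```python
from typing import Dict, NamedTuple


class RawEvent(NamedTuple):
    """What a kernel hook might hand to user space: addresses only, no pod names."""
    src_ip: str
    dst_ip: str
    dst_port: int


# Hypothetical user-space cache of pod IPs, e.g. built by watching the Kubernetes API.
ip_to_pod: Dict[str, str] = {
    "10.0.1.12": "default/checkout-7d9f",
    "10.0.2.34": "default/payments-5c2a",
}


def enrich(event: RawEvent, cache: Dict[str, str]) -> Dict[str, object]:
    """Attach pod names in user space; unknown IPs fall back to "world"."""
    return {
        "source": cache.get(event.src_ip, "world"),
        "destination": cache.get(event.dst_ip, "world"),
        "dst_port": event.dst_port,
    }


known = enrich(RawEvent("10.0.1.12", "10.0.2.34", 8443), ip_to_pod)
# An IP the cache has not learned yet (assumed scenario: a pod that just started).
unmapped = enrich(RawEvent("10.0.1.12", "10.0.3.56", 8443), ip_to_pod)

print(known)
print(unmapped)
```

Because the lookup happens after the packet has left the kernel, the quality of the pod names depends entirely on how complete and fresh that cache is.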
Cilium: The Identity-Aware Powerhouse 💪
- Cilium can serve as a standalone CNI or an observability plugin when used in chaining mode with another CNI (e.g., AWS VPC CNI).
- Kernel-Space Identity: When a pod starts, Cilium assigns it a numeric security identity derived from its labels. Every eBPF program in the kernel knows this identity. When a packet moves between pods, the kernel already understands both sides of the flow. This eliminates the identity gap.
- L7 Visibility: Cilium includes an L7 protocol parser, providing rich HTTP method, path, status code, and gRPC service name information. This is where Cilium truly shines for application-layer debugging.
- Resource Cost: Cilium costs more in terms of resources. My tests showed a 31% CPU spike under the same 500 QPS load, and it uses more memory because it processes per-flow identity on every packet.
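The label-based identity idea can be sketched as a toy allocator: one number per distinct label set, so replicas with identical labels share an identity and policy can be checked on numbers instead of IPs. The `IdentityAllocator` class, the starting ID, and the `allowed_pairs` set are hypothetical; Cilium’s real allocation and policy machinery is considerably more involved.

```python
from typing import Dict, FrozenSet, Set, Tuple

LabelSet = FrozenSet[Tuple[str, str]]


class IdentityAllocator:
    """Toy model: one numeric identity per distinct label set."""

    def __init__(self, first_id: int = 1000) -> None:
        self._next_id = first_id
        self._by_labels: Dict[LabelSet, int] = {}

    def identity_for(self, labels: Dict[str, str]) -> int:
        # Label order does not matter, so normalise to a frozenset of pairs.
        key: LabelSet = frozenset(labels.items())
        if key not in self._by_labels:
            self._by_labels[key] = self._next_id
            self._next_id += 1
        return self._by_labels[key]


allocator = IdentityAllocator()
frontend_a = allocator.identity_for({"app": "frontend", "env": "prod"})
frontend_b = allocator.identity_for({"env": "prod", "app": "frontend"})  # second replica
payments = allocator.identity_for({"app": "payments", "env": "prod"})

# Policy expressed as (source identity, destination identity) pairs, not IPs.
allowed_pairs: Set[Tuple[int, int]] = {(frontend_a, payments)}


def is_allowed(source_identity: int, destination_identity: int) -> bool:
    return (source_identity, destination_identity) in allowed_pairs


print(frontend_a, frontend_b, payments)
print(is_allowed(frontend_b, payments), is_allowed(payments, frontend_a))
```

Since the identity follows the labels rather than the IP address, a pod that is rescheduled with a new IP keeps the same policy treatment in this model.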
The Verdict: Choose Your Tool Wisely 🎯
| Feature | Retina | Cilium |
|---|---|---|
| Primary Role | Observability tool | CNI and/or Observability plugin |
| CNI Compatibility | CNI agnostic | Can be standalone CNI or CNI chaining mode |
| Identity Model | Correlates IPs in user space (partial) | Correlates identity in kernel (full) |
| L7 Visibility | None | HTTP, gRPC, etc. |
| Resource Cost (benchmark) | 47% less memory, flat CPU | 31% CPU spike, more memory |
The pattern is simple: both tools give you basics like flows, DNS visibility, and drop reasons. The gap opens at identity and L7.
- Pick Retina if: You’re already on Prometheus, need low-cost network traceability, and L7 visibility isn’t a primary requirement. It’s a great starting point, deployable in 5 minutes.
- Pick Cilium if: You need accurate pod identity on both sides of a flow, frequently debug HTTP/gRPC issues, and require deep L7 insights.
These aren’t permanent choices! You can start with Retina for baseline visibility and then migrate to Cilium as your requirements grow. Choose what’s right for you today.
Your Blueprint for Deep Network Observability ✨
Here’s a simple three-step plan to get deep network observability in your own clusters this week:
- Deploy Retina (5 minutes!): Just two basic Helm commands are all it takes. You’ll immediately start seeing TCP flows and DNS activity in Prometheus (remember to configure Prometheus to scrape Retina’s metrics).
- Uncover Silent Drops with Hubble: Run `hubble observe --from-pod <your-pod-name>` to trace a specific pod’s activity. During an incident, execute `hubble observe --verdict DROPPED` to instantly reveal any silently blocked packets, making debugging dropped packets incredibly easy.
- Level Up with Cilium (in Chaining Mode): If you need to go deeper, especially for identity-aware L7 visibility, add Cilium. If you already use a different CNI (like AWS VPC CNI on EKS), deploy Cilium in chaining mode (e.g., `cni.chainingMode=aws-cni`). This keeps your existing CNI managing IP addresses while Cilium adds its powerful, identity-aware L7 layer on top.
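If you want to post-process dropped flows during an incident, one option is to ask Hubble for JSON output (`-o json`) and filter it in a script. The sketch below assumes each output line is a JSON object with a top-level `flow` key containing `verdict`, `drop_reason_desc`, and `source`/`destination` objects with a `pod_name`; check this against your Hubble version. The sample lines and the `dropped_flows` helper are made up for illustration.

```python
import json
from typing import Dict, Iterable, List

# Made-up sample in the assumed JSON-lines layout; not real cluster output.
SAMPLE_OUTPUT = """\
{"flow": {"verdict": "FORWARDED", "source": {"pod_name": "frontend-1a2b"}, "destination": {"pod_name": "checkout-7d9f"}}}
{"flow": {"verdict": "DROPPED", "drop_reason_desc": "POLICY_DENIED", "source": {"pod_name": "checkout-7d9f"}, "destination": {"pod_name": "payments-5c2a"}}}
"""


def dropped_flows(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Keep only DROPPED flows and pull out who was involved and why."""
    results: List[Dict[str, str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        flow = json.loads(line).get("flow", {})
        if flow.get("verdict") != "DROPPED":
            continue
        results.append({
            "source": flow.get("source", {}).get("pod_name", "unknown"),
            "destination": flow.get("destination", {}).get("pod_name", "unknown"),
            "reason": flow.get("drop_reason_desc", "unknown"),
        })
    return results


drops = dropped_flows(SAMPLE_OUTPUT.splitlines())
for drop in drops:
    print(f"{drop['source']} -> {drop['destination']}: {drop['reason']}")
```

In practice you would feed the function lines read from the CLI’s standard output instead of the embedded sample string.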
The Future is Observability: A Necessity, Not an Option 🌐
eBPF is a lightweight way to see the full picture of the network layer: no sidecars, no application changes, and a resource cost that ranged from nearly flat CPU for Retina to a 31% CPU spike for Cilium in my benchmark. Retina offers a fantastic, lightweight starting point for network observability. Cilium is smarter, more identity-aware, and provides crucial Layer 7 information when you need to go deeper into application-level debugging.
Network observability is not an option anymore. As clusters grow and scale, pre-built dashboards simply stop being enough. You need a deeper level of information. eBPF-based network observability is becoming a piece of infrastructure in its own right. The question isn’t whether you need it, but which tool fits where you are right now.
Thank you so very much, everyone! I hope you found this informative. If you have any questions, concerns, or want to discuss this further, please feel free to connect with me on LinkedIn or via email. Thanks again!