🚀 Unmasking Kubernetes’ Hidden Conversations: Tracing gRPC for Crystal Clear Visibility
Hey tech enthusiasts! Ever wonder what truly goes on under the hood of your Kubernetes cluster? We’re Ajit Chaudhari and Sanskar Agorola, Site Reliability Engineers at Quatic AI (and formerly DevOps Engineers at Robq), and we’re here to pull back the curtain on one of Kubernetes’ most mysterious layers: its internal gRPC communications.
You deploy pods, manage workflows, and everything seems to run smoothly. But
beneath that calm surface, Kubernetes components are constantly chattering away
using a complex web of gRPC calls. Every kubectl command triggers a cascade of
these internal interactions. When things go wrong, it often feels like a black
box. You get generic error messages, and suddenly, you’re playing detective
for hours. Sound familiar? Let’s dive in and see how we can make this invisible
layer brilliantly transparent!
🕵️‍♂️ The Hidden World of Kubernetes gRPC Calls
Imagine deploying a pod with a mounted volume. If that volume mount fails,
Kubernetes might simply tell you: “RPC error mount volume failed.”
Frustrating, right? This single, vague message leaves you clueless. Was it an
NFS backend issue? A CSI driver problem? Or maybe a kubelet configuration
hiccup? You then embark on a time-consuming journey: checking CSI controller
logs, inspecting node permissions, or even digging into driver source code. What
should be a simple fix becomes an hours-long debugging marathon.
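To make the scenario concrete, here is a minimal, hypothetical manifest of the kind that can surface exactly this error; the names and the csi-nfs StorageClass are illustrative assumptions, not details from a real incident.

```yaml
# Hypothetical pod + PVC whose mount failure surfaces only as
# "rpc error ... mount volume failed". All names and the StorageClass
# are placeholders for illustration.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: csi-nfs        # assumed CSI-backed StorageClass
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
  - name: app
    image: nginx:1.25
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: demo-data
```

If the CSI driver or its NFS backend misbehaves, the pod simply sits in ContainerCreating with that one-line RPC error in its events.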
This problem stems from Kubernetes’ reliance on gRPC for internal communication. For instance, when you deploy a pod:
- The API server receives your YAML request.
- The scheduler decides which node hosts the pod.
- The kubelet on that node then talks to containerd (or CRI) using a series of gRPC calls to pull the image, create the container, start it, and fetch its status.
- Finally, the API server updates the etcd database with the cluster’s new status, again using gRPC.
These interactions form a critical chain. While this gRPC-driven architecture makes Kubernetes highly scalable and modular, it also creates that dreaded black box for developers, offering limited visibility when things break.
🚫 Why Traditional Logs Fall Short
You might think, “Can’t I just check the kubelet or API server logs?” You’re
partially right, but these logs alone are often insufficient for several
reasons:
- Static and Isolated: Logs from the API server, kubelet, and other components live in their own silos. They aren’t inherently connected, offering no holistic view of the call flow across different Kubernetes components.
- Missing Performance Insights: Logs typically don’t show crucial performance metrics like latency, status codes, or other information vital for diagnosing and fixing issues.
Because of these limitations, the root cause often remains hidden. To truly understand gRPC behavior within Kubernetes, we need runtime tracing, and logs simply don’t cut it.
💡 Shining a Light: Two Approaches to gRPC Tracing
We explored a couple of powerful approaches to gain visibility into these internal gRPC calls, aiming for end-to-end observability across Kubernetes' gRPC layers.
- Custom gRPC Visibility Exporter: This approach uses the OpenTelemetry SDK as a hook between Kubernetes components. It intercepts gRPC calls, fetches metadata (latency, status code, payload), converts it into meaningful spans via an OpenTelemetry exporter, and stores them in a trace backend such as Grafana Tempo. Grafana then visualizes these traces.
- Kubernetes Built-in Native gRPC Telemetry: Recognizing the need for better tracing, Kubernetes itself (from version 1.22 onwards) started offering native OpenTelemetry exporters. This approach leverages these built-in capabilities by enabling specific feature flags during cluster setup, eliminating the need for external agents to intercept calls.
For our demonstration, we chose to dive deep into Approach 2, leveraging Kubernetes’ native capabilities for a more integrated solution.
🛠️ Deep Dive: Kubernetes Native gRPC Telemetry
Our objective was clear: gain end-to-end visibility into Kubernetes'
internal gRPC communication between components like the API server, kubelet,
and etcd.
🚀 Setup Highlights
- Kubernetes Version: We used Kubernetes 1.23 for our demonstration.
- Cluster Type: We opted for a Kind cluster. Its lightweight, reproducible nature makes it ideal for rapid experimentation and demos, as the entire Kubernetes control plane runs inside Docker containers, giving us full access to component manifest files (a kind configuration sketch follows after this list).
- Tracing: We utilized the native OpenTelemetry gRPC exporter built into Kubernetes.
- Span Collection: We deployed the OpenTelemetry Collector backend as a DaemonSet within our Kubernetes cluster.
- Visualization: We chose Jaeger for its powerful and developer-friendly visualization capabilities.
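As referenced in the cluster-type note above, here is a minimal kind configuration sketch that wires the APIServerTracing feature gate and a tracing configuration file into the control plane at creation time. The ./tracing host path and file names are our own assumptions for illustration.

```yaml
# kind-config.yaml: sketch of a kind cluster whose API server starts with
# tracing enabled. Paths and file names are assumptions, not from the talk.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
  - hostPath: ./tracing                       # assumed local directory holding the tracing config
    containerPath: /etc/kubernetes/tracing
  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
        feature-gates: "APIServerTracing=true"
        tracing-config-file: /etc/kubernetes/tracing/apiserver-tracing.yaml
      extraVolumes:
      - name: tracing
        hostPath: /etc/kubernetes/tracing
        mountPath: /etc/kubernetes/tracing
        readOnly: true
```

Running kind create cluster --config kind-config.yaml then brings up a control plane whose API server is already exporting spans.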
⚙️ Enabling Native Tracing
Kubernetes provides a native tracing configuration API. We supplied each component with a TracingConfiguration file and used feature gates to enable tracing per component (a configuration sketch follows below):
- For the API server, we passed APIServerTracing=true as a feature gate.
- Similarly, for the kubelet, we passed KubeletTracing=true.
Once enabled, these components export their traces. The
OpenTelemetry Collector (running as a DaemonSet) listens for these traces on
port 4317, acting as the designated endpoint for our tracing configuration.
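Here is a sketch of what those two pieces of configuration can look like, assuming the node-local collector is reachable on port 4317 as just described. The apiVersion values match recent Kubernetes releases (earlier releases exposed these as alpha APIs, and kubelet tracing arrived later than API server tracing), and the sampling rates are arbitrary illustrative values.

```yaml
# apiserver-tracing.yaml: referenced by the API server's --tracing-config-file flag
apiVersion: apiserver.config.k8s.io/v1beta1
kind: TracingConfiguration
endpoint: localhost:4317          # OTLP/gRPC endpoint of the node-local collector
samplingRatePerMillion: 10000     # sample roughly 1% of requests
---
# KubeletConfiguration fragment: enables the feature gate and points the
# kubelet at the same collector endpoint
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletTracing: true
tracing:
  endpoint: localhost:4317
  samplingRatePerMillion: 10000
```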
🌉 The Mighty OpenTelemetry Collector
The OTel Collector plays a crucial role as the central bridge between Kubernetes’ internal gRPC protocols and our Jaeger backend.
- Deployment: We deploy it as a DaemonSet, ensuring a collector pod runs on each node to gather all spans from that particular node.
- Data Flow: When the collector receives spans from the Kubernetes component exporters, it handles batching and sampling. It then converts this raw OpenTelemetry data into the Jaeger format using its inbuilt Jaeger exporter and forwards it to the Jaeger collector.
You might ask: Why do we need the OTel Collector? Can’t Kubernetes send traces directly to Jaeger? The answer is no! While Kubernetes pushes raw OpenTelemetry data, Jaeger doesn’t natively understand this format; it expects data in its own Jaeger format. The OTel Collector bridges this gap, transforming the data.
An added benefit? The OTel Collector offers flexibility. If you ever decide to switch visualization backends (e.g., to Grafana Tempo or Datadog), you can simply change the exporter configuration within the OTel Collector, making your tracing setup future-proof.
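Putting those pieces together, a minimal collector configuration sketch for the pipeline described above might look like this. It assumes a collector distribution that still ships the Jaeger exporter (recent releases dropped it in favor of OTLP) and a Jaeger collector Service reachable at jaeger-collector:14250; both are assumptions, not details from the talk.

```yaml
# otel-collector-config.yaml: receive OTLP spans from the Kubernetes
# components, batch them, and forward them to Jaeger.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317          # port the components' exporters target
processors:
  batch: {}                             # batch spans before export
exporters:
  jaeger:
    endpoint: jaeger-collector:14250    # Jaeger's gRPC collector port (assumed Service name)
    tls:
      insecure: true                    # demo only; enable TLS in production
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
```

Switching to Grafana Tempo or another backend later means editing only the exporters section and the pipeline's exporter list.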
📊 Visualizing with Jaeger
Jaeger serves as our visualization layer, providing a clean, developer-friendly interface for our internal gRPC traces. It offers:
- Traces and Timelines: Detailed views of call sequences and their durations.
- Service Graphs: Visual representation of service dependencies.
- Detailed gRPC Call Insights: Crucial information like latency, status codes, and operations invoked.
Within the Jaeger UI, you can select specific Kubernetes services (like
kubelet or API server) and operations (e.g., sync pod) and filter by time
range to pinpoint the exact traces you need.
A quick note on storage: By default, Jaeger stores traces in memory. This is perfectly fine for testing and demo purposes. However, for production environments, you’ll want a persistent backend such as Elasticsearch (most common), Kafka (for buffering large trace volumes), or Grafana Tempo if you prefer direct integration with Grafana dashboards.
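For a demo of this kind, the Jaeger side can be as small as an all-in-one deployment using that default in-memory storage; the image tag, labels, and the jaeger-collector Service name below are illustrative assumptions chosen to match the collector configuration sketched earlier.

```yaml
# jaeger.yaml: single-replica all-in-one Jaeger with in-memory storage (demo only)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  labels:
    app: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:1.50   # assumed image tag
        ports:
        - containerPort: 14250                 # gRPC port targeted by the OTel Collector
        - containerPort: 16686                 # Jaeger UI
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger-collector    # matches the endpoint assumed in the collector config
spec:
  selector:
    app: jaeger
  ports:
  - name: grpc
    port: 14250
  - name: ui
    port: 16686
```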
✅ The Payoff: From Guesswork to Understanding
With this setup complete, we can finally trace what’s actually happening inside our Kubernetes clusters!
- We can see traces of communication between the API server and etcd, complete with latency details and status codes, invaluable for diagnosing issues.
- We gain full visibility into the communication between the kubelet and containerd, seeing every operation invoked, from fetching image status to pulling the image, creating the container, and starting it.
Remember that frustrating volume mount failure example? Now, if anything goes wrong in that layer, it’s clearly visible in the trace. Developers no longer have to spend hours debugging a simple issue; the problem’s root cause is laid bare.
What began as an invisible layer in Kubernetes has now transformed into a
transparent and traceable system. By capturing these gRPC traces between
components like kubelet, containerd, and the API server, we’ve replaced
guesswork with genuine understanding. Instead of chasing vague log messages, we can now pinpoint exactly where and what is going wrong.
This is the power of end-to-end gRPC tracing in Kubernetes – turning complex debugging into clear, actionable insights! Thank you!