Goodbye Sidecars, Hello Node-Level Envoy! Google Cloud’s Bold Leap in Service Networking 🚀
Are you tired of the overhead, the complexity, and the sheer bloat that comes with the traditional Envoy sidecar pattern in Kubernetes? Get ready for a breath of fresh air because Google Cloud is charting a new course, and it’s all about ditching those per-pod proxies for a more elegant, efficient, and glorious service networking future! 🌐
For years, the Kubernetes world has been accustomed to the ubiquitous Envoy sidecar, faithfully accompanying every pod. While this approach has served us well, providing excellent integration and native support, its limitations are becoming increasingly glaring. It’s time to talk about what’s not working and, more importantly, what’s coming next.
The Sidecar Stranglehold: A Legacy of Bloat and Complexity 😩
Let’s face it, the sidecar pattern, while functional, has brought its fair share of headaches:
- Resource Inefficiency: Imagine reserving CPU for peak loads that rarely arrive. Sidecars often do just that, leading to significant wasted resources and contributing to overall cluster bloat. This “occupied” capacity is a silent drain on your infrastructure.
- Operational Complexity: Modifying pod specs for every sidecar deployment and upgrade is a tangled web. Need to patch a critical CVE in your sidecar? Prepare for a disruptive, pod-wide rollout that sends shivers down the spine of both operators and customers. 🥶
- Invasiveness: Application developers are increasingly finding themselves bogged down by proxy complexities, diverting their precious focus from building innovative features to managing networking infrastructure. It’s like having an uninvited guest constantly peering over your shoulder.
The Node-Level Solution: A DaemonSet Revolution! 💡
Enter Google Cloud’s game-changing proposal: a DaemonSet that deploys a single, powerful Envoy proxy per node. This centralized approach promises a paradigm shift, offering a cascade of benefits:
- Reduced Resource Footprint: By consolidating proxying onto a single Envoy per node, we see a dramatic reduction in resource consumption. The unused capacity from one pod can now intelligently benefit all others on that node. It’s like unlocking hidden potential! ✨
- Simplified Operations: Say goodbye to intricate pod spec modifications! Upgrades become a simple proxy update, eliminating the need for disruptive pod restarts. This means smoother operations and happier teams. 🛠️
- Non-Invasive Design: Application developers can finally breathe easy, focusing on their core competencies without the burden of managing proxy intricacies. It’s about empowering developers, not encumbering them. 👨‍💻
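To make the node-level model concrete, here is a rough sketch (using the Kubernetes Go API types) of what such a per-node Envoy DaemonSet could look like. The names, namespace, image tag, mounts, and capabilities are illustrative assumptions; the talk did not share Google Cloud’s actual manifest.

```go
// A rough sketch, not Google Cloud's actual manifest: a DaemonSet that runs
// one Envoy per node. Image tag, namespace, mounts, and capabilities are
// illustrative assumptions about what such a deployment would need.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	labels := map[string]string{"app": "node-envoy"}
	ds := appsv1.DaemonSet{
		TypeMeta:   metav1.TypeMeta{APIVersion: "apps/v1", Kind: "DaemonSet"},
		ObjectMeta: metav1.ObjectMeta{Name: "node-envoy", Namespace: "kube-system"},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					// Run in the host network so one proxy can serve every pod on the node.
					HostNetwork: true,
					Containers: []corev1.Container{{
						Name:  "envoy",
						Image: "envoyproxy/envoy:v1.31.0", // placeholder image tag
						Args:  []string{"-c", "/etc/envoy/envoy.yaml"},
						SecurityContext: &corev1.SecurityContext{
							// Entering pod network namespaces (setns) needs elevated privileges.
							Capabilities: &corev1.Capabilities{
								Add: []corev1.Capability{"NET_ADMIN", "SYS_ADMIN"},
							},
						},
						VolumeMounts: []corev1.VolumeMount{{
							// Pod network namespace handles, so the proxy can open sockets inside them.
							Name:      "netns",
							MountPath: "/var/run/netns",
							ReadOnly:  true,
						}},
					}},
					Volumes: []corev1.Volume{{
						Name: "netns",
						VolumeSource: corev1.VolumeSource{
							HostPath: &corev1.HostPathVolumeSource{Path: "/var/run/netns"},
						},
					}},
				},
			},
		},
	}
	out, err := yaml.Marshal(ds)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The key design point is that a single proxy pod per node, with access to the pods’ network namespace handles, replaces one sidecar per application pod.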
The “Ambient Mesh” Inspiration and Envoy’s Strategic Adaptation 🧠
While the DaemonSet approach might sound like a brand-new invention, the speaker humorously points out its conceptual lineage. It draws inspiration from patterns like Istio’s “ambient mesh” and its “Ztunnel” component. However, Google Cloud’s team has opted for a strategic adaptation of Envoy, rather than a wholesale adoption of Ztunnel. Why?
- Leveraging Existing Investment: Google Cloud has a deep and robust infrastructure built around Envoy. This includes custom filters, kernel optimizations, security integrations, and a dedicated Envoy platform team. Abandoning these valuable assets would be… well, impractical!
- Envoy’s Unmatched Extensibility: Envoy’s inherently extensible nature makes it a perfect canvas for customization. It allows for deep modifications and even reimplementations of core components, perfectly suiting Google Cloud’s specific needs.
- Avoiding Tool Sprawl: Introducing and managing yet another proxy like Ztunnel would only add to the complexity. When you have a highly capable, battle-tested solution like Envoy already in your ecosystem, it makes perfect sense to leverage its power.
Technical Hurdles and Envoy Modifications: Engineering for the Future ⚙️
This ambitious transition wasn’t without its technical challenges. The Google Cloud team had to roll up their sleeves and engineer significant modifications within Envoy itself:
- Network Namespace Awareness: To enable Envoy to proxy traffic from pods into its own network namespace, Envoy needed a deeper understanding of Linux network namespaces. This was achieved by introducing a network_namespace_file_path field to the socket_address proto, which allows Envoy to use the setns syscall to enter the correct namespace and create sockets there (a sketch of this mechanism follows the list below).
- Upstream Connection Binding: Ensuring outbound traffic from pods appears to originate from the pod itself is crucial for network policy enforcement (think Cilium!). To achieve this, Envoy’s upstream configuration was updated with a bind_config that can specify a network namespace, enabling Envoy to establish outbound connections from the desired pod’s namespace.
- Listener Configuration for Dynamic Pods: To avoid disruptive configuration changes as pods are created and destroyed, a per-pod listener approach was adopted. While not the most optimal strategy, it elegantly prevents connection draining during pod lifecycle events.
- The Traffic Flow: Imagine this: network filter rules intercept outbound traffic, redirecting it to a local Envoy listener within the pod’s network namespace. Envoy then meticulously processes this traffic and sends it back out through the pod’s network interface. It’s a symphony of networking! 🎶
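To ground the namespace mechanics above, here is a minimal Go sketch of the underlying technique that the network_namespace_file_path and bind_config changes enable: pinning an OS thread, calling setns to enter a pod’s network namespace, and creating a socket there. Envoy itself does this in C++ driven by its configuration; the paths, port, and helper name below are illustrative assumptions.

```go
// Conceptual sketch (not Envoy's actual C++ implementation): creating a
// listening socket inside a pod's network namespace via setns(2).
package main

import (
	"fmt"
	"net"
	"runtime"

	"golang.org/x/sys/unix"
)

// listenInNetns opens a TCP listener inside the network namespace referenced
// by nsPath (e.g. a bind-mounted netns file under /var/run/netns), then
// restores the original namespace so the rest of the process is unaffected.
func listenInNetns(nsPath, addr string) (net.Listener, error) {
	// setns affects the calling OS thread only, so pin the goroutine to it.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// Remember the namespace we started in so we can switch back afterwards.
	origNS, err := unix.Open("/proc/self/ns/net", unix.O_RDONLY|unix.O_CLOEXEC, 0)
	if err != nil {
		return nil, fmt.Errorf("open current netns: %w", err)
	}
	defer unix.Close(origNS)

	podNS, err := unix.Open(nsPath, unix.O_RDONLY|unix.O_CLOEXEC, 0)
	if err != nil {
		return nil, fmt.Errorf("open pod netns: %w", err)
	}
	defer unix.Close(podNS)

	// Enter the pod's network namespace; sockets created now belong to it.
	if err := unix.Setns(podNS, unix.CLONE_NEWNET); err != nil {
		return nil, fmt.Errorf("setns into pod netns: %w", err)
	}
	// Always switch back, even if listening fails.
	defer unix.Setns(origNS, unix.CLONE_NEWNET)

	// The listener socket is created in the pod namespace and remains bound
	// to it after we return to the original namespace.
	return net.Listen("tcp", addr)
}

func main() {
	// Placeholder namespace path and port for illustration only.
	ln, err := listenInNetns("/var/run/netns/pod-1234", "127.0.0.1:15001")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	fmt.Println("listening inside pod namespace on", ln.Addr())
}
```

Because a socket keeps the namespace it was created in, the proxy can switch back to its own namespace and keep serving the pod-scoped listener from a single node-level process.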
Performance Gains and Optimization Wins: Speed and Efficiency Unleashed! ⚡
The results of this architectural shift are nothing short of impressive. Rigorous iperf tests were conducted, revealing initial memory consumption challenges primarily due to large default buffer sizes. The solution? A critical optimization:
- Buffer Size Reduction: By aggressively limiting downstream and upstream buffer sizes to a lean 32KB, memory allocation plummeted. This single optimization was a game-changer, slashing memory usage from a hefty 5-6GB to a stable 3.5GB while handling an astonishing 25,000 connections!
- Kernel Buffering Awareness: Further deep dives revealed that kernel-level socket buffering was also a significant factor. Adjusting those kernel parameters brought aggregate memory usage down to approximately 4.75GB across four Envoys, or just over 1GB per Envoy, while pushing an impressive 12 gigabits per second of traffic across those same 25,000 active connections. Talk about optimization! 💪
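For a sense of what the 32KB cap corresponds to in open-source Envoy, the closest standard knobs are the per_connection_buffer_limit_bytes settings on listeners (downstream) and clusters (upstream). The go-control-plane sketch below is illustrative only; the talk did not share the exact configuration Google Cloud used.

```go
// A sketch using Envoy's go-control-plane types: capping per-connection
// buffers at 32KB on both the listener (downstream) and cluster (upstream)
// side. Standard open-source Envoy knobs; values are illustrative.
package main

import (
	"fmt"

	clusterv3 "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	listenerv3 "github.com/envoyproxy/go-control-plane/envoy/config/listener/v3"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

const bufferLimitBytes = 32 * 1024 // 32KB, per the optimization described above

func main() {
	// Downstream side: bound how much Envoy buffers per accepted connection.
	lis := &listenerv3.Listener{
		Name:                          "per-pod-listener",
		PerConnectionBufferLimitBytes: wrapperspb.UInt32(bufferLimitBytes),
	}

	// Upstream side: bound buffering on connections to backends as well.
	cl := &clusterv3.Cluster{
		Name:                          "backend",
		PerConnectionBufferLimitBytes: wrapperspb.UInt32(bufferLimitBytes),
	}

	fmt.Println(lis.GetPerConnectionBufferLimitBytes().GetValue(),
		cl.GetPerConnectionBufferLimitBytes().GetValue())
}
```

The kernel-side buffering mentioned above is tuned outside Envoy, for example via the standard net.ipv4.tcp_rmem / net.ipv4.tcp_wmem sysctls or per-socket SO_RCVBUF / SO_SNDBUF options.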
Addressing Key Concerns: Isolation and Seamless Upgrades 🛡️
Of course, any major architectural shift brings questions. Two key concerns were addressed:
- Isolation: While Envoy itself doesn’t natively offer noisy neighbor isolation, the team has a plan! They’re developing an extension that leverages the overload manager and a fair queuing mechanism to gracefully kill expensive connections, ensuring fairness.
- Upgrades: The DaemonSet approach makes upgrades a breeze. By running two Envoys side-by-side on the same port using SO_REUSEPORT, the old Envoy can gracefully drain traffic while the new one seamlessly takes over. It’s a smooth transition that minimizes disruption.
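As a small illustration of the SO_REUSEPORT handoff, the Go sketch below binds a listener with the option set so that a second process can bind the same address at the same time. It shows only the socket-level mechanism, not Envoy’s actual drain or hot-restart logic, and the address and port are placeholders.

```go
// Minimal SO_REUSEPORT sketch: two independent processes (old and new proxy)
// can each bind the same address, and the kernel spreads new connections
// across them while the old one drains.
package main

import (
	"context"
	"fmt"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// reusePortListener binds addr with SO_REUSEPORT set, so another process
// (e.g. the replacement proxy) can bind the same addr concurrently.
func reusePortListener(addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			})
			if err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.Listen(context.Background(), "tcp", addr)
}

func main() {
	// Both the "old" and the "new" proxy would hold their own socket on the
	// same port like this during the upgrade window.
	ln, err := reusePortListener("0.0.0.0:15001")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	fmt.Println("listening with SO_REUSEPORT on", ln.Addr())
}
```

With both proxies bound, the kernel distributes new connections between them, so the old instance can stop accepting and drain its existing connections while the new one takes over.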
The Verdict: Outperforming Ztunnel Out-of-the-Box! 🏆
The numbers don’t lie. In a direct comparison, the optimized node-level Envoy significantly outperformed Istio’s Ztunnel in its out-of-the-box configuration. While Ztunnel guzzled a staggering 10GB per instance, the optimized Envoy ran lean at just over 1GB per instance. To add insult to injury, the Ztunnel test even exhibited a memory leak, underscoring the maturity and efficiency of Google Cloud’s adapted Envoy solution.
This presentation is a powerful testament to the future of service networking. It showcases the incredible potential of adapting and optimizing existing, battle-tested technologies to create a more efficient, scalable, and developer-friendly ecosystem. Get ready for a world of service networking that’s lighter, faster, and smarter! ✨