Navigating the Envoy Labyrinth: Netflix’s Journey to Distributed System Mastery 🚀

Hey tech enthusiasts! Ever found yourself staring at a complex microservices architecture, wondering how to connect all the dots while maintaining sanity and performance? Well, you’re not alone! Today, we’re diving deep into Netflix’s experience with Envoy, the powerful open-source edge and service proxy, and how they’ve wrangled it to manage their vast and dynamic ecosystem. Kevin Bichu, an Envoy maintainer and Netflix engineer, shared some invaluable insights at a recent conference, and we’re here to break it all down for you.

Why Envoy? The Netflix Predicament 🌐

Netflix, like many of us, deals with a polyglot environment – a delightful mix of different programming languages and technologies powering their services. Their challenge? To provide uniform capabilities across this diverse landscape. Their solution? The trusty Envoy-based sidecar.

These aren’t just for servers; Netflix deploys Envoy sidecars on:

  • Virtual Machines (VMs) 💻
  • Containers 📦
  • Even Laptops! 🧑‍💻 (Though Kevin humorously noted they won’t be debugging your personal machines today!)

What made Netflix’s adoption particularly interesting were two unique pieces:

  • On-Demand CDS (ODCDS): This allows Envoy to fetch configuration on demand, making it incredibly dynamic.
  • Incremental Adoption: The ability to gradually introduce Envoy without causing immediate disruption.

How it Works: The Magic Behind ODCDS and Incremental Adoption ✨

Imagine a client needing to connect to a service. In the traditional setup, instances register in a service registry. When an application wants to communicate, it sends a specific header to its Envoy sidecar. This header is then used by ODCDS to fetch the necessary Envoy cluster configuration. Once received, Envoy can seamlessly route the request to the target service.
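The flow above can be sketched in a few lines. This is a conceptual model only, not Envoy's actual API: the class names, the `x-target-service` header, and the fake control plane are all illustrative stand-ins.

```python
# Conceptual sketch of on-demand cluster discovery (ODCDS).
# All names here are illustrative, not Envoy's real interfaces.

class FakeControlPlane:
    """Stand-in for the xDS server that serves cluster configs."""
    def fetch_cluster(self, name):
        return {"endpoints": [f"{name}.instance-1:8080"]}

class OnDemandSidecar:
    def __init__(self, control_plane):
        self.control_plane = control_plane
        self.clusters = {}  # cluster configs fetched so far

    def route(self, request):
        # The application names its target via a header; the sidecar
        # fetches that cluster's config only the first time it is needed.
        target = request["headers"]["x-target-service"]
        if target not in self.clusters:
            self.clusters[target] = self.control_plane.fetch_cluster(target)
        endpoints = self.clusters[target]["endpoints"]
        return f"routed to {endpoints[0]}"  # real Envoy load-balances here

sidecar = OnDemandSidecar(FakeControlPlane())
print(sidecar.route({"headers": {"x-target-service": "recommendations"}}))
```

The key property is the lazy fetch: a sidecar never needs the full fleet's configuration up front, only the clusters its application actually calls.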

For incremental adoption, especially when Envoy was missing certain capabilities a service needed, clients had a fallback: they could bypass Envoy entirely and communicate directly with the service. This was a crucial feature for smooth onboarding!
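The fallback logic amounts to a capability check before each call. A minimal sketch, with hypothetical function names:

```python
# Sketch of the incremental-adoption fallback: try the Envoy sidecar
# first, and bypass it for services whose required capability the
# sidecar does not yet support. Names are hypothetical.

def call_service(target, sidecar_supports):
    if sidecar_supports(target):
        return f"via-sidecar:{target}"
    # Capability missing in Envoy for this target: go direct instead.
    return f"direct:{target}"

# Pretend the sidecar supports everything except "billing".
supports = lambda t: t != "billing"
print(call_service("recommendations", supports))  # via-sidecar:recommendations
print(call_service("billing", supports))          # direct:billing
```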

For ingress adoption, the game changes slightly. Applications can program the ingress Envoy, fetch runtime configurations, and then hot-restart Envoy to apply these changes before registering themselves for service discovery. This ensures the ingress is perfectly tuned from the get-go!

Lessons Learned: The Bumpy Road to Envoy Mastery 🛠️

Adopting a powerful tool like Envoy isn’t always a walk in the park. Netflix encountered several challenges, and their solutions offer a goldmine of practical advice:

1. The “Too Few Connections” Conundrum 🤏

The Problem: Initially, they used just one connection between the application and the egress Envoy for a specific target service. Because each connection is pinned to a single Envoy worker, the achievable requests per second (RPS) were bottlenecked by that one worker, which wasn’t enough for high-traffic scenarios.

The Solution:

  • Use more than one connection per target service.
  • Multiplex services across connections.

The Twist: Even with multiple connections, they discovered that simply balancing connections doesn’t automatically balance load. A graph clearly showed significant RPS imbalance across Envoy workers, with some handling 10x more requests than others!

The Real Solution: Use enough connections per target that load, not just connection count, spreads evenly across workers, giving finer-grained load balancing.

2. The “Too Many Connections” Overload 🤯

The Problem: This time, the issue was an explosion of connections within Envoy itself during egress. This is a classic case of high fanout. With m caller instances, n target instances, and c workers per caller, you could end up with a staggering m * n * c connections, risking fleet-wide outages!

The Solution: Implement connection subsetting. This involves informing caller Envoys of only a subset of the target instances, drastically reducing the total number of connections.

The Impact: A rollout of subsetting showed a remarkable reduction from 100 million connections to 50 million connections! 📉
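The arithmetic behind subsetting is straightforward: full mesh scales with the target fleet size n, while subsetting replaces n with a fixed subset size. The fleet sizes below are made up for illustration:

```python
# Fanout arithmetic from the talk: m caller instances, n target
# instances, c workers per caller.

def full_mesh_connections(m, n, c):
    # Every caller worker connects to every target instance.
    return m * n * c

def subset_connections(m, subset_size, c):
    # Each caller is told about only subset_size targets.
    return m * subset_size * c

# Illustrative numbers (not Netflix's actual fleet sizes):
m, n, c = 1000, 500, 8
print(full_mesh_connections(m, n, c))  # 4000000
print(subset_connections(m, 50, c))    # 400000 with a 50-instance subset
```

Crucially, the subsetted total no longer grows when the target service scales out, which is what turns an m * n blowup into something bounded.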

3. The “Not Enough Connections” Bottleneck 🚧

The Problem: This challenge arose between the ingress Envoy and the application, particularly during traffic spikes. Envoy lazily opens ingress connections per worker, leading to request queuing when demand suddenly surged. This queuing could then degrade the success rate of requests.

The Solution: Increase the connection pool cardinality. By keeping a minimum of num_workers * cardinality connections open, they ensured enough connections were pre-warmed to handle spikes without queuing.

The Proof: Before the fix, ingress connections showed significant queuing points. Afterward, a stable 80 pre-warmed connections were ready to absorb any load!
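The sizing rule is simple multiplication. The split of 8 workers times a cardinality of 10 below is an assumed example that happens to match the 80 stable connections mentioned above; the talk did not give the exact breakdown:

```python
# Pre-warming ingress connections: keep at least
# num_workers * cardinality connections open so a traffic spike is
# absorbed without queuing. Values are illustrative.

def prewarmed_connections(num_workers, cardinality):
    return num_workers * cardinality

# e.g. 8 workers with a per-worker cardinality of 10 keeps
# 80 connections warm.
print(prewarmed_connections(8, 10))  # 80
```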

4. Wasm Filter Woes and Restartable VMs 💥

The Problem: When using Wasm (WebAssembly) filters, a crash in a shared WebAssembly VM could bring down all filters depending on that VM, leading to requests being black-holed.

The Solution: If you’re using Wasm filters, configure the Wasm VMs to be restartable. This ensures that even if one filter crashes, the VM can recover, and the system remains operational.

5. Huffman Encoding: A CPU Drain? ⚡

The Problem: Huffman encoding (part of HTTP/2’s HPACK header compression) saves bytes on the wire, which pays off on remote links. For local connections between an application and its sidecar, though, it’s an unnecessary waste of CPU.

The Solution: Remove Huffman encoding for local connections. Netflix saw significant CPU savings, in the range of 5 to 10 percentage points!

6. Envoy’s Stat System: A Double-Edged Sword 🗡️

Envoy’s stat system is incredibly powerful, providing immense observability. However, it can also consume a lot of memory and requires careful handling.

The Problem: Excessively stringifying Envoy’s native stat names led to severe lock contention. Off-CPU flame graphs revealed blocking due to locks taken for stat name stringification.

The Solution: Avoid stringifying native stat names. Use them directly and with care, especially when dealing with high-volume metrics.
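The underlying idea is string interning: pay the string cost once per distinct stat name, then key hot-path updates by a cheap symbol. The sketch below models the idea only; Envoy’s real StatName and symbol-table machinery (in C++) differs.

```python
# Sketch of why interned stat names beat repeated stringification:
# intern each name into a symbol once, then update counters by the
# symbol, so the hot path never rebuilds strings (or takes the lock
# that guards stringification).

class SymbolTable:
    def __init__(self):
        self._symbols = {}

    def intern(self, name):
        # One-time cost per distinct name; hot paths reuse the symbol.
        return self._symbols.setdefault(name, len(self._symbols))

table = SymbolTable()
counters = {}

def bump(symbol):
    counters[symbol] = counters.get(symbol, 0) + 1

sym = table.intern("cluster.recs.upstream_rq_total")  # intern once, up front
for _ in range(3):
    bump(sym)                                         # no string work here

print(counters[sym])  # 3
```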

7. ODCDS and Slow Control Planes: A Recipe for Failure ⏳

The Problem: While ODCDS is fantastic for dynamic configuration, it can lead to request failures if the control plane is slow. Envoys that haven’t yet received their cluster configuration simply cannot serve traffic.

The Evidence: Netflix observed CDS and RDS latencies in the hundreds of seconds, and LDS timeouts that left proxies with no configuration at all (“configless proxies”, effectively bricks).

The Challenge: Building and maintaining a robust XDS server is critical for the success of dynamic configuration systems like ODCDS.

The Takeaway: Envoy is Powerful, But Requires Mastery 🎯

Netflix’s journey with Envoy highlights that while the tool is incredibly capable, its effective implementation requires deep understanding and careful tuning. From connection management and high fanout to resource optimization and control plane resilience, each challenge overcome has yielded significant improvements in performance and reliability.

So, the next time you’re wrestling with your distributed systems, remember Netflix’s lessons. Embrace the power of Envoy, but be prepared to dive deep, optimize relentlessly, and learn from every hurdle.

Kudos to Kevin Bichu for sharing these invaluable insights! And if you’re as fascinated by this as we are, be sure to check out the Envoy maintainers booth! Happy coding! 👨‍💻✨
