Taming the BPF LRU: A Journey to Eliminate TCP Resets in Kubernetes 🚀
Ever experienced the dreaded TCP resets in your Kubernetes cluster, especially after adopting eBPF for network performance? You’re not alone! In this post, we’ll dive deep into a real-world scenario where a seemingly small issue with eBPF’s Least Recently Used (LRU) maps led to significant network instability. But fear not, because we’ll also cover how to diagnose the problem, fix it, and harden your eBPF network setup afterwards.
The Problem: A Silent Killer of Connections 💔
Our journey began after adopting eBPF with Cilium to replace kube-proxy, aiming to boost network performance. While this transition brought improvements, it also introduced an alarming increase in network failures, specifically TCP resets.
Symptoms of the Trouble:
- Increased Application Errors: Applications running within the cluster started reporting more errors.
- Outbound Traffic Woes: The issue exclusively affected outbound traffic to external services.
- Node-Specific Failures: Affected nodes experienced a staggering 10% failure rate in outgoing requests.
Diagnosis: Unmasking the Culprit with Precision 🕵️‍♂️
The first step in tackling any complex problem is accurate diagnosis.
Traditional methods like tcpdump can show what’s happening, but pinpointing
where in the code it’s going wrong can be a labyrinth.
The tcpdump Clue: Unexpected NAT 💡
Initial tcpdump analysis revealed a crucial detail: a connection would be
established normally, but then, unexpectedly, Cilium would re-NAT it to a
different source port. From the remote peer's point of view, the next segment
arrived from a 4-tuple with no established connection, so it answered with a
TCP reset, as per standard TCP behavior. The port change also made the issue
difficult to track within a single connection stream.
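To make the reset mechanism concrete, here is a minimal sketch of a remote peer that tracks connections by 4-tuple. The `Peer` class and the addresses are purely illustrative assumptions, and real TCP has many more states, but the core behavior (a non-SYN segment for an unknown connection draws a RST) is the same.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FourTuple:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class Peer:
    """Toy remote endpoint that tracks established connections by 4-tuple."""

    def __init__(self):
        self.established = set()

    def receive(self, conn, flags):
        if "SYN" in flags:
            # New connection attempt: remember the 4-tuple.
            self.established.add(conn)
            return "SYN-ACK"
        if conn in self.established:
            return "ACK"
        # Mid-stream segment for a connection the peer has never seen.
        return "RST"


peer = Peer()
# Same client and server, but the NAT source port changes mid-connection.
before = FourTuple("203.0.113.10", 40001, "198.51.100.7", 443)
after = FourTuple("203.0.113.10", 40002, "198.51.100.7", 443)

replies = [
    peer.receive(before, {"SYN"}),
    peer.receive(before, {"ACK"}),
    peer.receive(after, {"ACK"}),  # re-NATted segment
]
print(replies)  # ['SYN-ACK', 'ACK', 'RST']
```

The third segment belongs to the same application stream, yet the peer can only see the new source port, which is why the capture looks like two unrelated flows.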
The Power of bpftrace: Tracing to the Code 👨‍💻
This is where bpftrace (or bpw as it’s affectionately called) shines.
Leveraging kprobes, bpftrace allows you to inspect packet flow at the exact
code path without needing to modify, rebuild, or restart your eBPF programs or
the node itself. This dramatically narrows down the scope of investigation.
Key bpftrace Techniques:
- Trace TC: Inspecting the Cilium datapath.
- Filter BPF helpers: Identifying which eBPF code is calling specific BPF helpers, especially those involved in packet manipulation.
- Packet Filters: Mimicking tcpdump’s filtering capabilities directly within bpftrace.
By comparing bpftrace outputs in normal and unexpected states, we observed a
modification of the checksum by eBPF code in the unexpected case, indicating
packet manipulation – precisely the incorrect NAT we suspected. Further tracing
pointed to the tail_handle_nat_world_ipv4 function as the source of the
erroneous NAT.
Understanding the Root Cause: The LRU Eviction Algorithm 🎯
Now that we knew where the problem was occurring, the next step was to
understand why. The tail_handle_nat_world_ipv4 function’s logic is
straightforward: if a NAT entry doesn’t exist, it allocates a new port; if it
exists, it reuses it. The issue arose when a NAT entry was deleted during an
active communication.
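The lookup-or-allocate logic described above can be sketched in a few lines. This is a simplified model, not Cilium's code: `SnatTable`, the sequential port allocator, and the flow tuple are assumptions made for illustration.

```python
class SnatTable:
    """Toy source-NAT table: reuse an existing mapping, else allocate a port."""

    def __init__(self, first_port=40000):
        self.entries = {}            # flow -> translated source port
        self.next_port = first_port  # hypothetical sequential allocator

    def translate(self, flow):
        port = self.entries.get(flow)
        if port is None:
            # No entry: allocate a new port and remember it.
            port = self.next_port
            self.next_port += 1
            self.entries[flow] = port
        return port


nat = SnatTable()
flow = ("10.0.1.5", 51234, "198.51.100.7", 443)

first = nat.translate(flow)   # entry missing, port allocated
second = nat.translate(flow)  # entry present, port reused
del nat.entries[flow]         # entry deleted while the connection is active
third = nat.translate(flow)   # entry missing again, a different port allocated

print(first, second, third)  # 40000 40000 40001
```

Nothing in `translate` is wrong on its own. The port only changes because the entry disappeared between two packets of the same connection.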
The investigation revealed that the only entity deleting these entries was the eviction algorithm of the BPF LRU (Least Recently Used) maps that Cilium uses. The core concept of this algorithm is performance-driven approximation. To save CPU cycles, it doesn’t track the exact recency of every entry. Instead, it uses a simplified approach to decide which entries to evict.
The Critical Flaw: This approximation meant that entries belonging to active connections could be evicted, leading to the incorrect NAT and subsequent TCP resets. In essence, the LRU’s eagerness to free up resources was causing network instability.
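The following toy model shows how giving up a single global recency order can evict an entry that is still in use. It is deliberately not the kernel's algorithm: the sharding scheme, class names, and key names are assumptions chosen only to contrast an exact LRU with an approximate one.

```python
from collections import OrderedDict


class ExactLru:
    """Strict LRU: always evicts the globally least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def touch(self, key):
        self.items.move_to_end(key)

    def insert(self, key):
        evicted = None
        if len(self.items) >= self.capacity:
            evicted, _ = self.items.popitem(last=False)
        self.items[key] = True
        return evicted


class ShardedLru:
    """Approximate LRU: each shard only knows the recency of its own keys."""

    def __init__(self, shards, per_shard):
        self.shards = [ExactLru(per_shard) for _ in range(shards)]
        self.home = {}  # key -> index of the shard holding it

    def insert(self, key, shard):
        evicted = self.shards[shard].insert(key)
        self.home[key] = shard
        if evicted is not None:
            del self.home[evicted]
        return evicted

    def touch(self, key):
        self.shards[self.home[key]].touch(key)


exact = ExactLru(capacity=4)
sharded = ShardedLru(shards=2, per_shard=2)

# Two idle entries are inserted first, then two active ones.
for key, shard in [("idle-1", 1), ("idle-2", 1), ("active-1", 0), ("active-2", 0)]:
    exact.insert(key)
    sharded.insert(key, shard)

# The active connections keep sending traffic.
for key in ("active-1", "active-2"):
    exact.touch(key)
    sharded.touch(key)

# A new flow arrives and lands in shard 0, which is full.
exact_victim = exact.insert("new")
sharded_victim = sharded.insert("new", shard=0)

print(exact_victim)    # idle-1
print(sharded_victim)  # active-1
```

The exact LRU pays for a globally ordered structure on every access. The sharded variant avoids that cost, but it sacrifices an active entry while idle ones survive elsewhere.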
The Solution: Rebuilding Connections with Symmetry 🛠️
The challenge was to fix the LRU eviction problem without sacrificing performance. Two primary strategies emerged:
- Increase Map Size: Simply raising the size of the LRU maps reduces the frequency of evictions. While effective, it doesn’t eliminate the problem entirely; evictions can still occur and cause issues.
- Symmetric Entry Recovery: This was the more robust solution. The key idea is symmetry. Original NAT entries and their corresponding “reverse entries” hold symmetrical information. If an original entry is evicted, it can be completely rebuilt using the reverse entry. Crucially, this recovery happens within the existing reverse NAT function, incurring almost no additional CPU overhead.
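The symmetry idea can be illustrated with two dictionaries standing in for the original and reverse NAT entries. This is a sketch under stated assumptions: `NatMaps`, the key layouts, and the sequential port allocator are hypothetical and do not mirror Cilium's actual map structures or patch.

```python
class NatMaps:
    """Toy pair of NAT maps whose entries mirror each other."""

    def __init__(self, node_ip, first_port=40000):
        self.node_ip = node_ip
        self.next_port = first_port
        self.forward = {}  # (pod_ip, pod_port, dst_ip, dst_port) -> (node_ip, nat_port)
        self.reverse = {}  # (dst_ip, dst_port, node_ip, nat_port) -> (pod_ip, pod_port)

    def egress(self, pod_ip, pod_port, dst_ip, dst_port):
        """Outbound packet: reuse the forward entry or allocate a new port."""
        key = (pod_ip, pod_port, dst_ip, dst_port)
        if key not in self.forward:
            nat_port = self.next_port
            self.next_port += 1
            self.forward[key] = (self.node_ip, nat_port)
            self.reverse[(dst_ip, dst_port, self.node_ip, nat_port)] = (pod_ip, pod_port)
        return self.forward[key]

    def ingress(self, src_ip, src_port, dst_ip, dst_port):
        """Reply packet: reverse-translate, rebuilding the forward entry if lost."""
        origin = self.reverse.get((src_ip, src_port, dst_ip, dst_port))
        if origin is None:
            return None
        fkey = (origin[0], origin[1], src_ip, src_port)
        if fkey not in self.forward:
            # The reverse entry holds everything needed to restore the original.
            self.forward[fkey] = (dst_ip, dst_port)
        return origin


nat = NatMaps(node_ip="203.0.113.10")
pod = ("10.0.1.5", 51234)
server = ("198.51.100.7", 443)

first = nat.egress(*pod, *server)          # allocates port 40000
del nat.forward[(*pod, *server)]           # forward entry evicted mid-connection
origin = nat.ingress(*server, *first)      # reply arrives and triggers recovery
again = nat.egress(*pod, *server)          # same port as before, no re-NAT

print(first, origin, again)
```

In this sketch the recovery rides along with a lookup that the reply path performs anyway, which matches the point above about the extra work being small. It does depend on a reply arriving before the next outbound packet, and on the reverse entry itself still being present.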
Quantifying the Fix: A Dramatic Drop in Failures 📊
Testing with payload sizes ranging from 1KB to 8KB showed remarkable results:
- Before Patch: An average failure rate of approximately 4.5%.
- After Patch: The failure rate plummeted to an astonishing 0.004% on average, translating to only about 4 failures per 100,000 requests – effectively zero!
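As a quick sanity check on these figures, the percentages can be converted to failures per 100,000 requests. The variable names below are illustrative; the only inputs are the two averages quoted above.

```python
REQUESTS = 100_000

before_pct = 4.5    # average failure rate before the patch, in percent
after_pct = 0.004   # average failure rate after the patch, in percent

failures_before = round(REQUESTS * before_pct / 100)
failures_after = round(REQUESTS * after_pct / 100)
reduction = failures_before / failures_after

print(failures_before, failures_after, reduction)  # 4500 4 1125.0
```

In other words, the patch cut failures by roughly three orders of magnitude in this test.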
Beyond the Fix: Enhancements and Continued Vigilance ✨
While the symmetric recovery algorithm significantly mitigated the TCP reset issue, a lingering problem remained. Cilium uses multiple eBPF maps, and while NAT entries could be recovered, other maps, like connection tracking maps, were harder to restore.
The kubectl Plugin: A New Frontier 🦾
To address this, a custom kubectl plugin was developed. This plugin helps identify nodes where eviction is in progress. If a cluster has many such nodes and is also seeing TCP resets, the recommendation is to increase the map size using Cilium’s dynamic map sizing option. This provides an additional layer of resilience.
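The decision rule described above might look roughly like the sketch below. Everything here is hypothetical: the post does not say how the plugin detects eviction, so this version assumes a map that is nearly full is under eviction pressure, and the `NodeStats` fields and the 95% threshold are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    name: str
    nat_entries: int    # current number of entries in the NAT map
    nat_capacity: int   # configured maximum number of entries
    tcp_resets: int     # resets observed on this node in the sample window


def pressured_nodes(stats, fill_threshold=0.95):
    """Nodes whose NAT map is close enough to full that eviction is likely."""
    return [s for s in stats if s.nat_entries >= fill_threshold * s.nat_capacity]


def recommend(stats, fill_threshold=0.95):
    pressured = pressured_nodes(stats, fill_threshold)
    if pressured and any(s.tcp_resets > 0 for s in pressured):
        names = ", ".join(s.name for s in pressured)
        return f"increase map size (eviction pressure on: {names})"
    return "no action"


cluster = [
    NodeStats("node-a", nat_entries=65000, nat_capacity=65536, tcp_resets=42),
    NodeStats("node-b", nat_entries=12000, nat_capacity=65536, tcp_resets=0),
]

print(recommend(cluster))  # increase map size (eviction pressure on: node-a)
```

Requiring both signals keeps the rule from recommending larger maps for nodes that are merely busy but healthy.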
Key Takeaways for Your Network 🌐
This deep dive into taming the BPF LRU offers invaluable lessons for anyone working with eBPF in Kubernetes:
- Diagnose with Precision: bpftrace is your best friend for tracing issues directly to the code path without disruptive changes.
- Understand the Algorithms: Be aware of the tradeoffs made by eBPF features for performance. Approximation algorithms can have unintended consequences.
- Embrace Symmetry: Leverage symmetrical data structures for robust recovery mechanisms.
- Monitor and Enhance: Even after a fix, continuous monitoring and tools like custom plugins can provide an extra layer of safety and performance tuning.
The journey to a stable eBPF-powered network can be challenging, but with the right tools and a deep understanding of the underlying mechanisms, you can overcome even the most complex issues. A big thank you to the Cilium community for their invaluable support in this endeavor!