Beyond Quotas: How Tesco Mastered Dynamic Rate Limiting with Envoy 🚀
Hey tech enthusiasts! Ever felt the sting of a rate limit kicking in when you least expect it, even when you know you’re not going overboard? Well, the brilliant minds at Tesco, the UK’s retail giant, have been wrestling with this very challenge. They’ve built a massive, distributed API gateway using Envoy and shared their journey from the complexities of quota-based rate limiting to a more dynamic, percentage-driven solution. Let’s dive into how they achieved this, making their API ecosystem robust and user-friendly! 👨💻
The Challenge: Rate Limiting at Scale 🌍
Tesco operates at an immense scale, serving around 800 APIs that cater to public internet users, internal services, and a staggering 100,000 checkouts and devices in their stores. With such a diverse and high-volume traffic landscape, rate limiting isn’t just a nice-to-have; it’s a critical necessity.
- Protecting Services: It prevents individual services from being overwhelmed by sudden traffic surges.
- Fairness for Clients: It ensures that one “noisy neighbor” client doesn’t hog resources and impact others.
The team’s goal was ambitious: global rate limiting that was both incredibly fast and minimally impactful on response times. This led them to explore RLQS (Rate Limit Quota Service), a protocol within Envoy designed for this purpose.
How RLQS Works (Theoretically) 💡
RLQS allows each Envoy instance to report its observed usage to a central server. This server then calculates and distributes quotas back to the individual Envoys, enabling them to enforce limits locally without constant remote calls. Sounds promising, right?
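The report-and-allocate loop can be sketched in a few lines. This is a toy reconstruction of the idea, not the actual RLQS protocol or Envoy code; the function name and the proportional-split policy are illustrative assumptions:

```python
# Toy sketch of an RLQS-style central allocator (hypothetical names, not
# the real Envoy RLQS proto): each proxy reports its observed usage, and
# the server splits the global limit into per-proxy quotas.

def allocate_quotas(usage_reports: dict[str, int], global_limit: int) -> dict[str, int]:
    """Split the global limit proportionally to each proxy's reported usage."""
    total = sum(usage_reports.values())
    if total == 0:
        # No traffic observed yet: share the limit evenly.
        share = global_limit // len(usage_reports)
        return {proxy: share for proxy in usage_reports}
    return {
        proxy: int(global_limit * used / total)
        for proxy, used in usage_reports.items()
    }

quotas = allocate_quotas({"envoy-a": 600, "envoy-b": 200, "envoy-c": 200}, 500)
# Each Envoy then enforces its own quota locally, with no remote call per request.
```

The appeal is clear: the per-request hot path stays entirely local, and only the periodic report/allocate exchange crosses the network.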
The Reality Check: Dynamic Traffic Shifts 🔄
The real-world complexity hit when Tesco considered how dynamic their traffic patterns truly are:
- Rolling Deployments: As new Envoy instances come online, traffic shifts, and older ones are drained.
- Cellular Architectures: Tesco uses multiple “cells” (full stack copies per region) for deployments. Traffic is deliberately shifted between these cells during updates, leading to significant movement.
- Organic Load Balancing: Even with stable connections, traffic naturally ebbs and flows between different Envoy instances.
The “Lag” Problem: False Positives 🚩
This dynamic shifting created a critical issue with the RLQS approach: lag. There’s an inevitable delay between a change in traffic usage, its reporting to the central server, and the redistribution of updated quotas. During this lag:
- If usage changes are small and there’s ample buffer, things might be fine.
- However, for larger shifts or when already near quota limits, some Envoys might start rejecting requests.
- These are false positives: requests rejected even though, globally, the system is within its configured limits.
Tesco couldn’t accept these false positives, especially when they directly impacted end-users. While reducing lag is an option, it becomes increasingly difficult and expensive, eventually hitting physical limitations like network latency between regions.
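A tiny worked example makes the false-positive mechanism concrete. The numbers below are illustrative, not Tesco's: quotas were computed for an even 50/50 split, then traffic shifts 80/20 before fresh quotas arrive.

```python
# Toy illustration of the lag problem: stale quotas meet shifted traffic.
GLOBAL_LIMIT = 1000
stale_quotas = {"envoy-a": 500, "envoy-b": 500}     # computed for the old 50/50 split
shifted_traffic = {"envoy-a": 800, "envoy-b": 200}  # after, say, a cell failover

# Requests rejected because one proxy exceeds its (now stale) local quota.
false_positives = sum(
    max(0, shifted_traffic[p] - stale_quotas[p]) for p in stale_quotas
)
total = sum(shifted_traffic.values())
# envoy-a rejects 300 requests even though global traffic (1000) is
# exactly at the configured limit, not over it.
```

Until the next report/allocate cycle completes, envoy-a keeps rejecting traffic that the system as a whole could serve.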
The “Bursting” Dilemma 💥
Another consideration was making the system more tolerant to lag. Could Envoys temporarily burst above their quotas? This presented its own set of challenges:
- Complexity: Implementing and tuning such a feature could be highly complex.
- Balancing Act: Too little tolerance means continued false positives; too much tolerance renders rate limiting ineffective.
A Paradigm Shift: Percentage-Based Rate Limiting ✨
Faced with these hurdles, Tesco decided to move away from strict quotas and embrace a more global, percentage-based approach. Jay from Tesco explains:
“Rather than adopting the RLQS approach where we are calculating an individual quota for each envoy, we adopted a percentage-based approach.”
How the Percentage Approach Works 🎯
Instead of assigning fixed quotas to each Envoy, Tesco now:
- Aggregates Usage: All usage reports are collected globally.
- Calculates a Global Percentage: A single percentage is determined based on the overall traffic volume relative to the configured rate limit.
- Distributes the Percentage: This same percentage is communicated to all Envoys.
The Magic: Each Envoy is instructed to allow through that specific percentage of the traffic it’s currently receiving.
- Happy Path: If traffic is within limits, the percentage is 100%.
- Rate Limiting Scenario: If traffic is running at double the configured limit, Envoys are told to allow only 50% of their incoming traffic.
This elegantly solves the traffic shifting problem and eliminates false positives. It no longer matters where the traffic is going; each Envoy simply enforces the same proportion of allowed traffic. There’s no delay waiting for new quotas as traffic shifts, because the decision is instantaneous and globally applied.
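The core calculation fits in a few lines. This is a sketch of the idea as described in the talk, not Tesco's actual implementation; the probabilistic admission in `should_allow` is one plausible way to enforce a percentage locally:

```python
import random

# Sketch of the percentage-based scheme: one global percentage derived
# from aggregate usage, applied identically by every Envoy to whatever
# traffic it happens to be receiving.

def global_allow_percentage(total_observed_rps: float, limit_rps: float) -> float:
    """Fraction of traffic every proxy should admit."""
    if total_observed_rps <= limit_rps:
        return 1.0                        # happy path: let everything through
    return limit_rps / total_observed_rps  # e.g. double the limit -> 0.5

def should_allow(allow_pct: float) -> bool:
    """Local, instantaneous per-request decision: admit this fraction."""
    return random.random() < allow_pct
```

Because every Envoy applies the same fraction to its own local traffic, the globally admitted volume converges to the limit regardless of how requests are distributed across instances.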
The Technical Implementation: A Custom Protocol & Sidecar 🛠️
The RLQS protocol, however, doesn’t natively support percentage-based values. So, Tesco built their own solution:
- Custom Protocol: They developed a new protocol to communicate these percentage values.
- Sidecar Pattern: To integrate this with Envoy, they adopted the sidecar pattern. This sidecar:
- Translates their custom protocol into something Envoy’s RLS (Rate Limit Service) can understand.
- Acts as a local gatekeeper, making rapid allow/deny decisions.
- Is physically located very close to the Envoy (often in the same pod) for near-instantaneous decision-making.
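The sidecar's role can be sketched as a small stateful component: it caches the latest global percentage pushed over the custom protocol, and answers each of Envoy's rate-limit checks in-process. This is a hypothetical structure under my own naming, not Tesco's code or the actual RLS wire format:

```python
import random
import threading

class PercentageSidecar:
    """Sketch of a sidecar gatekeeper (hypothetical): holds the latest
    global allow-percentage and makes local allow/deny decisions."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._allow_pct = 1.0  # start fully open until told otherwise

    def on_percentage_update(self, pct: float) -> None:
        """Called when the central server pushes a new global percentage."""
        with self._lock:
            self._allow_pct = max(0.0, min(1.0, pct))

    def check(self) -> str:
        """Answer a rate-limit check locally: 'OK' or 'OVER_LIMIT'."""
        with self._lock:
            pct = self._allow_pct
        return "OK" if random.random() < pct else "OVER_LIMIT"

sidecar = PercentageSidecar()
sidecar.on_percentage_update(0.5)  # global traffic is double the limit
```

Updates arrive asynchronously over the custom protocol, while the per-request `check` path never leaves the pod, which is what keeps decision latency near zero.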
Benefits of the Sidecar Approach 🦾
- Future-Proofing: Allows for future updates to their protocol without modifying core Envoy.
- Reduced Data Volume: By calculating one system-wide percentage instead of individual quotas, they reduced the amount of data being transmitted.
The Results: Zero False Positives! 🎉
The impact of this shift has been profound. The graph shared showed a clear spike in correctly denied requests during a period of high traffic, but crucially, the lines on either side remained flat, indicating zero false positives.
Tesco can now confidently shift traffic across their vast ecosystem, knowing that their rate limiting is effective, dynamic, and doesn’t penalize users unnecessarily.
Key Takeaways 🔑
- Rate limiting with fixed quotas is hard, especially in dynamic, large-scale environments.
- Percentage-based rate limiting offers a more flexible and robust solution for handling traffic shifts.
- The sidecar pattern is a powerful tool for integrating custom logic with existing service meshes like Envoy.
- Tesco’s innovative approach demonstrates how to overcome protocol limitations and achieve a truly scalable and user-friendly rate limiting system.
This is a fantastic example of how a deep understanding of system dynamics can lead to elegant and highly effective technical solutions. Kudos to the Tesco team for sharing their journey!
If you’re dealing with similar challenges, this percentage-based approach is definitely worth considering. Happy coding! 🚀