🚗 Driving Autonomy: How BYD and Alibaba Cloud Scale Data Pipelines with Argo Workflows

In the fast lane of autonomous driving, data is the fuel that powers innovation. But what happens when your data engine starts to sputter under the weight of one petabyte of information per day?

At a recent tech summit, Zhang Bao (Engineering Leader at BYD) and Shuangkun Tian (Alibaba Cloud & Argo Workflows Maintainer) shared how they revolutionized automatic annotation pipelines. By migrating from Airflow to Argo Workflows, they didn’t just fix a bottleneck—they accelerated their entire development lifecycle.


🏗️ The Challenge: Processing a Petabyte a Day

BYD faces a monumental task: processing at least 1PB of data daily to generate training sets for their autonomous driving models. Each data clip originates from multiple sensors, requiring diverse computing resources and high-throughput orchestration.

The Airflow Bottleneck 🛑

Previously, the team relied on Airflow for data filtering and processing. However, as the scale exploded, they hit several walls:

  • Scalability Limits: Airflow struggled to keep task state in sync at this scale. A task might finish in 1 minute, but the scheduler could take 10 minutes to acknowledge it, leaving resources idle.
  • Mutable DAGs: Airflow lacked native GitOps support. The team needed immutable, versioned pipelines so they could trace exactly which pipeline version produced a specific training set (see the sketch after this list).
  • Resource Inefficiency: Managing a mix of CPUs and GPUs across massive datasets led to high costs and low utilization.
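Argo addresses the versioning point with declarative, Git-stored pipeline definitions. As a minimal sketch (illustrative names, not BYD's actual pipeline), a WorkflowTemplate stamped with a version label makes that traceability concrete: every run, and every training set it produces, points back to one immutable revision in Git.

```yaml
# Illustrative only: a pipeline definition kept in Git and applied by Argo CD.
# The version label ties every run back to an exact, immutable pipeline revision.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: annotation-pipeline                     # hypothetical name
  labels:
    pipeline.example.com/version: "1.4.2"       # stamped by CI at release time
spec:
  entrypoint: annotate
  templates:
    - name: annotate
      container:
        image: registry.example.com/annotator:1.4.2  # image pinned to the same version
        command: [python, annotate.py]
```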

🚀 Shifting Gears to Argo Workflows

To overcome these hurdles, BYD and Alibaba Cloud built a sophisticated, multi-cluster system powered by Argo Workflows and Argo CD.

🌐 Multi-Cluster Orchestration

Because a single Kubernetes cluster cannot handle BYD’s massive daily throughput, the team built a hierarchical, multi-cluster management layer.

  • They deployed identical Argo Workflows instances across multiple Kubernetes clusters.
  • Argo CD and Helm manage these clusters, ensuring consistency and enabling easy rollbacks (sketched below).
  • A centralized dashboard provides a unified view of the entire global operation.
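The talk didn’t share BYD’s manifests, but the standard pattern for this looks roughly like the following: an Argo CD ApplicationSet that stamps out one identical, Helm-managed Argo Workflows installation per registered cluster. The chart version and names here are illustrative assumptions, not the team’s actual values.

```yaml
# A minimal sketch: one Application per cluster registered with Argo CD,
# all pinned to the same Helm chart version for consistency and rollback.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: argo-workflows-fleet          # hypothetical name
  namespace: argocd
spec:
  generators:
    - clusters: {}                    # generates one Application per registered cluster
  template:
    metadata:
      name: 'argo-workflows-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://argoproj.github.io/argo-helm
        chart: argo-workflows
        targetRevision: 0.41.0        # illustrative pin; rollback = revert this in Git
      destination:
        server: '{{server}}'
        namespace: argo
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
```

Rolling back a bad release then reduces to reverting targetRevision in Git; Argo CD reconciles every cluster back to the desired state.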

🛠️ The Tech Stack

  • Alibaba Cloud PPU: Custom GPU chips for high-demand AI workloads.
  • Elastic Compute Service (ECS): A mix of standard and elastic instances to handle bursty CPU tasks cost-effectively.
  • Ray Integration: For specialized GPU tasks, they integrated Ray clusters within the Argo pipeline. While Ray handles the heavy lifting of distributed computing, Argo acts as the supervisor, managing the lifecycle, retries, and scaling of the Ray jobs.
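One common way to wire this up, assuming the KubeRay operator is installed in each cluster, is an Argo resource template that creates a RayJob and watches its status: Argo owns the lifecycle and retries, Ray runs the distributed GPU work. This is a hedged sketch with illustrative images and names, not BYD’s production spec.

```yaml
# Sketch: Argo supervises a KubeRay RayJob (requires the KubeRay operator).
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: gpu-annotation-
spec:
  entrypoint: run-ray-job
  templates:
    - name: run-ray-job
      retryStrategy:
        limit: 2                      # Argo re-submits the Ray job on failure
      resource:
        action: create
        successCondition: status.jobStatus == SUCCEEDED
        failureCondition: status.jobStatus == FAILED
        manifest: |
          apiVersion: ray.io/v1
          kind: RayJob
          metadata:
            generateName: annotate-clip-
          spec:
            entrypoint: python /app/annotate.py   # hypothetical entrypoint
            rayClusterSpec:
              headGroupSpec:
                rayStartParams: {}
                template:
                  spec:
                    containers:
                      - name: ray-head
                        image: rayproject/ray:2.9.0
              workerGroupSpecs:
                - groupName: gpu-workers
                  replicas: 2
                  rayStartParams: {}
                  template:
                    spec:
                      containers:
                        - name: ray-worker
                          image: rayproject/ray:2.9.0
                          resources:
                            limits:
                              nvidia.com/gpu: 1
```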

⚙️ Engineering for Ultra-Scale Stability

Moving to Argo was only the first step. To support tens of thousands of concurrent tasks, Shuangkun Tian and the Alibaba team implemented deep optimizations within the Argo ecosystem.

1. Concurrency and Quota Management 🚦

To prevent system exhaustion, they utilized:

  • Namespace-level Concurrency: Different teams share a total quota but have individual limits.
  • Semaphores: These cap the number of pending pods, reducing pressure on the Kubernetes scheduler.
  • Priority-Based Resource Borrowing: High-priority tasks can “borrow” resources from lower-priority ones during peak hours, ensuring critical stability without starving smaller tasks.
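These building blocks are part of Argo’s public API. A minimal sketch, assuming a ConfigMap-backed semaphore per team (names and limits are illustrative): the ConfigMap value caps how many workflows may hold the semaphore at once, while spec.priority lets urgent work jump the controller’s queue.

```yaml
# Illustrative quota ConfigMap: at most 50 concurrent workflows for this team.
apiVersion: v1
kind: ConfigMap
metadata:
  name: team-quotas                   # hypothetical name
data:
  annotation-team: "50"
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: annotate-
spec:
  priority: 100                       # higher value = handled ahead of lower-priority work
  synchronization:
    semaphore:
      configMapKeyRef:
        name: team-quotas
        key: annotation-team
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [echo, "processing clip"]
```

Namespace-wide caps come from the controller configuration (namespaceParallelism), shown in the next section’s sketch.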

2. High-Performance Controller Optimizations ⚡

The team re-engineered how Argo interacts with the Kubernetes API:

  • Smart Caching: They designed a new cache to reduce update conflicts and prevent “double-creation” of pods.
  • Reduced API Pressure: By optimizing patch requests and batch-cleaning task results (the WorkflowTaskResult objects Argo creates per task), they reduced control-plane CPU utilization by 50%.
  • Event Offloading: They moved time-consuming actions (like listing pods for deletion) out of the main event handler to prevent Controller Out-of-Memory (OOM) errors.
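The deep fixes above live inside the controller itself, but some of the pressure-relief valves are plain configuration. A hedged sketch of the kind of tuning involved, via the workflow-controller-configmap (values are illustrative, not the ones from the talk):

```yaml
# Illustrative tuning of Argo's controller ConfigMap for high-throughput clusters.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  parallelism: "20000"                # cluster-wide cap on concurrently running workflows
  namespaceParallelism: "2000"        # per-namespace cap, so one team can't exhaust the cluster
  resourceRateLimit: |
    limit: 50                         # requests/sec the controller may send to the API server
    burst: 100
```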

📊 The Results: Faster, Cheaper, Stronger

The transformation delivered staggering improvements in performance and operational efficiency:

  • 11x Faster Execution: Data processing pipelines now run eleven times faster than the previous Airflow-based setup.
  • 30% Cost Reduction: Smarter resource scheduling and elastic scaling significantly lowered infrastructure spend.
  • Massive Throughput: The system now supports a queue capacity of 200,000 tasks and handles 20,000 to 40,000 concurrent workflows.
  • Ultra-Low Latency: High-priority tasks experience queue delays as low as 50 milliseconds.
  • 99% Success Rate: The enhanced stability ensures almost every workflow reaches completion.

🤝 Giving Back to the Community

This journey wasn’t just about internal gains. As a maintainer of Argo Workflows, Shuangkun Tian ensured that these “battle-tested” fixes made their way back to the open-source community. Alibaba Cloud contributed multiple Pull Requests focused on resolving informer cache bottlenecks and controller OOM issues under massive scale.


💡 Final Thoughts

The collaboration between BYD and Alibaba Cloud proves that with the right orchestration strategy, even the most daunting data challenges become manageable. By combining the robust supervision of Argo with the computational power of Ray, they have built a blueprint for the future of autonomous driving infrastructure.

Questions from the Audience:

  • How do you handle high-priority task response times?
    • Answer: Through priority-based resource borrowing and queue optimizations, high-priority tasks see queue delays as low as 50 milliseconds, so they are processed almost instantly.
  • Can I use this for non-automotive tasks?
    • Answer: Absolutely. These optimizations for Argo Workflows are now part of the open-source project, benefiting anyone running ultra-large-scale Kubernetes workloads.

Ready to scale your pipelines? Explore the Argo Workflows project and start building your own high-throughput data engine! 🦾🌐🎯
