Unleashing the Power of Data: How Argo Workflows is Revolutionizing Autonomous Driving Pipelines 🚀

The road to truly autonomous vehicles is paved with mountains of data. From sensor readings and video feeds to intricate training models and simulation environments, processing this sheer volume of information efficiently is a colossal challenge. GAC Group, a trailblazer in China’s automotive industry, teamed up with Alibaba Cloud to tackle this head-on, transforming their autonomous driving data pipelines with the power of Argo Workflows.

If you’re involved in AI, machine learning, or any data-intensive field, get ready to dive deep into a story of overcoming scale, complexity, and the relentless pursuit of agility. 🧠✨

The Bottlenecked Highway: GAC Group’s Data Processing Pains 🚧

Before the transformation, GAC Group’s data pipeline, while functional, was starting to feel the strain. Imagine trying to funnel terabytes of data daily and petabytes in batch jobs through a system that wasn’t built for such gargantuan appetites. The key pain points were clear:

  • The Scale Monster: Processing TBs of data daily and PBs in batch jobs demanded a system that could not only handle the load but also scale effortlessly.
  • The Complexity Labyrinth: Intricate Directed Acyclic Graphs (DAGs) involving diverse data types and a mix of CPU and GPU workloads created a tangled web of management challenges.
  • The Agility Hurdle: The breakneck pace of autonomous driving model development meant the pipeline needed to be super responsive to new requirements.
  • Storage Congestion: Scaling up workers horizontally meant a massive surge of data downloads from shared object storage, often overwhelming the storage layer.
  • Underutilized GPUs: Smaller perception and labeling models weren’t fully leveraging the power of expensive GPUs, leading to inefficient resource allocation.
  • Heterogeneous Workload Woes: Tightly coupling CPU and GPU components within a single pod caused suboptimal scheduling and poor GPU utilization, especially when CPU resources were elastic but GPU resources were not.
  • Multi-Tenancy Mayhem: Different teams (perception, planning, data, simulation) with vastly different needs required a unified platform for resource sharing, while ensuring critical tasks got the dedicated power they deserved.

The Symphony of Solutions: Argo Workflows Takes the Stage 🎶

Enter Argo Workflows, the chosen orchestration engine. Its native Kubernetes integration, robust scaling capabilities, declarative DAG support, intuitive visualization, and comprehensive logging made it the perfect candidate for this ambitious reconstruction. The solution was a multi-pronged attack, optimizing both the pipeline and the underlying infrastructure.
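
To ground the discussion, here is a minimal sketch of what a mixed CPU/GPU pipeline looks like as an Argo Workflows DAG. All names, images, and resource figures below are illustrative assumptions, not GAC Group's actual manifests:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sensor-pipeline-   # hypothetical pipeline name
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: decode            # CPU-bound preprocessing stage
            template: cpu-step
          - name: auto-label        # GPU-bound inference stage
            template: gpu-step
            dependencies: [decode]  # declarative DAG edge
    - name: cpu-step
      container:
        image: example.com/decode:latest   # placeholder image
        resources:
          requests:
            cpu: "4"
    - name: gpu-step
      container:
        image: example.com/label:latest    # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1
```

Because each task is its own pod, the scheduler can place CPU and GPU steps on different node pools, which is the decoupling theme that recurs throughout this case study.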

Pipeline-Level Brilliance (GAC Group’s Engineering Magic) 🛠️

GAC Group’s engineering teams implemented clever optimizations within their pipelines:

  • Streamlined Multi-Stage Pipelines: By passing results directly between stages, they eliminated intermediate data collection bottlenecks, creating a smoother flow.
  • Processing Time Optimization: They meticulously identified and optimized time-consuming steps, like object motion interpolation, by considering factors such as object count and record duration.
  • Computational Performance Boosts: Techniques like vectorization, Numba JIT compilation, and caching were employed to significantly accelerate processing speeds.
  • Decoupling Heterogeneous Workloads: Splitting simulation engine and algorithm components into separate pods allowed for placement on specialized CPU and GPU nodes, dramatically improving resource utilization.
  • Model Quantization for Inference: This technique was applied to increase batch sizes for inference on GPUs, boosting throughput and ensuring those GPUs were working overtime (in a good way!).
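
As a rough illustration of why quantization enables larger inference batches, here is a minimal sketch of symmetric per-tensor int8 weight quantization. The scheme and function names are assumptions for illustration, not GAC Group's actual implementation; the point is that int8 weights occupy a quarter of the memory of float32, freeing GPU memory for bigger batches:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: w ~= scale * q, q in int8."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from int8 values."""
    return q.astype(np.float32) * scale

# float32 weights take 4 bytes each; int8 takes 1, so the same GPU
# memory budget fits roughly 4x the weights (or a larger batch).
w = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q.nbytes / w.nbytes)  # 0.25, i.e. a 4x memory reduction
```

Rounding error per element is bounded by half a quantization step, which is why small perception and labeling models typically tolerate this with little accuracy loss.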

Infrastructure-Level Powerhouse (Alibaba Cloud’s Expertise) ☁️

Alibaba Cloud brought its robust infrastructure solutions to the table:

  • Concurrency Control Mastery:
    • Global Concurrency Limits: This prevented cluster crashes by managing overlapping workloads effectively.
    • Namespace-Level Isolation: Dedicated concurrency resources for different services ensured that one team’s heavy lifting didn’t impact another’s.
    • Parameter Adjustment: Fine-tuning concurrency for specific large-scale and high-priority tasks ensured optimal performance.
  • Event-Driven Automation with Argo Events: 📡
    • Seamless integration with Argo Workflows automated tasks like data mining, auto-labeling, and inference. These were triggered by events from object storage and Kafka, drastically reducing manual effort and boosting reliability.
  • Hybrid Scheduling Pipeline: A smart approach to resource allocation:
    • CPU-Sensitive Tasks: Leveraged Alibaba Cloud’s serverless ECS for cost-effective and rapid processing.
    • GPU/Lightweight Tasks (Inference): Employed GPU sharing with projects like Koordinator and HAMi to deploy multiple inference tasks on a single high-performance GPU, maximizing overall resource utilization.
    • GPU-Intensive Tasks (Simulation): Reserved CPU resources and utilized topology-aware scheduling to place tasks on nodes with optimal PCIe or NVLink connections, ensuring peak GPU performance.
  • Storage Acceleration with Fluid:
    • Pre-caching large datasets (images, data) before running large-scale tasks was a game-changer. This unlocked an astounding 200 GB/s of bandwidth, slashing execution times, especially for GPU-based tasks that no longer spent precious minutes pulling images.
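
The concurrency controls above map directly onto Argo Workflows' built-in synchronization feature, where a ConfigMap defines semaphore limits per namespace or task class. A hedged sketch, with all names and limits invented for illustration:

```yaml
# ConfigMap holding semaphore limits (names and values are illustrative)
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-semaphores
  namespace: perception
data:
  batch-jobs: "10"   # at most 10 such workflows run at once
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: auto-labeling-
  namespace: perception
spec:
  entrypoint: main
  # Workflow-level semaphore: queues the whole workflow until a slot frees up
  synchronization:
    semaphore:
      configMapKeyRef:
        name: workflow-semaphores
        key: batch-jobs
  templates:
    - name: main
      container:
        image: example.com/labeler:latest   # placeholder image
```

Keeping the limits in per-namespace ConfigMaps is one way to get the namespace-level isolation described above: each team's semaphores are tuned independently, so one team's burst cannot starve another's.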

The Quantifiable Victory: Measurable Wins 🏆

The collaborative effort between GAC Group and Alibaba Cloud yielded some truly impressive results:

  • Unprecedented Scale: The data pipeline now effortlessly handles millions of data bags per day, consuming thousands of CPU cores.
  • Storage Superhighway: Achieved an incredible hundreds of gigabits per second of storage bandwidth.
  • GPU Utilization Surge: A significant 40% increase in GPU utilization means more power from their hardware.
  • Agility Unleashed: Support for multiple teams to quickly deploy and iterate on their pipelines, fostering innovation.
  • Cost Efficiency Gains: A remarkable 17% reduction in operational costs, proving that efficiency and performance can go hand-in-hand.

The Road Ahead: A Blueprint for Success ✨

This case study is a powerful testament to the capabilities of Argo Workflows in orchestrating the complex and demanding data pipelines essential for autonomous driving. Its flexibility, efficiency, and ability to decouple workloads, combined with intelligent infrastructure optimizations, empower teams to achieve unparalleled scale, performance, and agility.

For anyone navigating the intricate world of AI and big data, this is more than just a success story; it’s a blueprint for tackling emerging industrial challenges with cloud-native innovation. The future of autonomous driving is being built, one optimized data pipeline at a time! 🤖🚗💨
