Beyond Single Clusters: Mastering Multi-Cluster Releases with Progressive Delivery 🚀

Are you tired of the release day jitters, especially when you’re juggling multiple Kubernetes clusters? You’re not alone! As Ryan Wu from TIS Cloud Music and Zhuang Zhang from Huawei shared, Kubernetes deployments quickly evolve from a “cluster problem” to a “coordination problem.” This means orchestrating releases across regions, clusters, and environments becomes the real challenge.

Today, we’re diving deep into how they transformed their approach, moving from cluster-centric thinking to a scalable, progressive delivery strategy across their vast multi-cluster infrastructure.

The Scale of the Challenge: Thousands of Applications, Countless Clusters 🌐

Since adopting GitOps and Argo CD in 2020, TIS Cloud Music has seen an explosion in their application deployments. We’re talking thousands of Argo applications running across multi-region, multi-zone, and multi-cluster environments, and this scale is only growing! This rapid expansion brought critical challenges to the forefront:

  • Safe and Efficient Releases: How do you ensure deployments are smooth and quick across so many environments?
  • Disaster Recovery: What happens when a data center goes down? How do you keep services running?
  • Graceful Kubernetes Upgrades: How do you upgrade your underlying Kubernetes infrastructure without causing downtime?
  • Scalability: What’s the plan when new regions or clusters are added to the mix?

Embracing Progressive Delivery: From Cluster to Global Orchestration 💡

The answer to safe and efficient releases lies in progressive delivery. This isn’t just about deploying a new version; it’s about gradual release combined with continuous verification. Think canary releases and blue-green deployments, where traffic is incrementally shifted to new versions while crucial metrics are constantly monitored. Decisions to promote or roll back are data-driven.

While tools like Argo CD offer excellent cluster-scope progressive delivery, the modern reality demands more. Applications are spread across multiple clusters for high availability and disaster recovery. This means progressive delivery needs to evolve from a cluster-level rollout to global release orchestration.

Argo CD’s progressive syncs feature within ApplicationSets is a game-changer here. It enables staged synchronization of applications, allowing for phased rollouts (e.g., dev first, then staging, then production). This automates cluster-by-cluster releases, checking application health at each stage before moving on.
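As a rough, hedged sketch (the cluster labels, repo URL, and application names are illustrative, and progressive syncs are an alpha feature that must be enabled on the ApplicationSet controller), a wave-based rollout might be expressed like this:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: demo-appset
spec:
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        # Wave 1: dev clusters sync first.
        - matchExpressions:
            - key: env
              operator: In
              values: [dev]
        # Wave 2: staging, once dev is healthy.
        - matchExpressions:
            - key: env
              operator: In
              values: [staging]
        # Wave 3: production last.
        - matchExpressions:
            - key: env
              operator: In
              values: [prod]
  generators:
    - clusters:
        selector:
          matchLabels:
            team: demo
  template:
    metadata:
      name: 'demo-{{name}}'
      labels:
        env: '{{metadata.labels.env}}'
    spec:
      project: default
      source:
        repoURL: https://example.com/demo.git
        path: manifests
        targetRevision: HEAD
      destination:
        server: '{{server}}'
```

The `rollingSync` steps match on the generated Applications’ labels, so each wave only begins after the previous wave’s Applications report healthy.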

The Emerging Challenges of Multi-Cluster Releases 🚧

Despite these advancements, a few pain points remained:

  1. Application Exposure: Managing and understanding deployments across numerous clusters can become complex and impact performance.
  2. Cluster-Aware Rollout Rules: Developers shouldn’t need to know the intricate details of each cluster’s infrastructure or its specific rollout and rollback rules. Forcing them to carry that knowledge hinders efficiency.
  3. Pause/Resume Inefficiency: Manually pausing and resuming releases across multiple clusters is a significant drain on efficiency.
  4. Cluster Maintenance & Failover: How do you handle new cluster additions or existing clusters becoming unhealthy? Developers shouldn’t be burdened with tracking which cluster is rolling out, paused, or needs resuming.

The goal? To abstract away cluster-level complexities, allowing developers to focus on release waves: Is wave one complete? Can we move to wave two? Has this batch finished?

Introducing Karmada: A New Model for Cluster-Abstracted Wave Releases 🌊

To tackle these challenges, a new framework was needed: the cluster-abstracted wave release model. This is where Karmada steps in.

Karmada is a CNCF project focused on multi-cluster management, fully compatible with native Kubernetes APIs. This means it plays beautifully with existing Kubernetes tools. Integrating Karmada with Argo CD makes managing a hundred clusters as straightforward as managing one.

How Karmada Works: The Core Components 🛠️

Karmada simplifies multi-cluster management with three key user-facing APIs:

  • Resource Template: This mirrors the native Kubernetes API, so you don’t need to alter your YAML files when migrating from a single cluster to Karmada.
  • Propagation Policy: This defines how resources are distributed across clusters.
  • Override Policy: This allows for different settings for different clusters.

When you deploy a template with a propagation policy, Karmada creates “resource bindings” that specify which clusters to use and how many replicas to run in each. It then applies the override policy and syncs the objects to your member clusters.
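As a hedged sketch of how these pieces fit together (Karmada is the CNCF project’s spelling; names, cluster identifiers, and weights below are illustrative), a resource template plus a PropagationPolicy might look like:

```yaml
# Resource template: a plain Kubernetes Deployment, unchanged from
# what you would apply to a single cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 6
  selector:
    matchLabels: {app: demo-app}
  template:
    metadata:
      labels: {app: demo-app}
    spec:
      containers:
        - name: demo-app
          image: registry.example.com/demo-app:v1
---
# PropagationPolicy: distribute the Deployment across two member
# clusters, splitting the 6 replicas 2:1 by static weight.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: demo-app-propagation
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: demo-app
  placement:
    clusterAffinity:
      clusterNames: [member1, member2]
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Weighted
      weightPreference:
        staticWeightList:
          - targetCluster: {clusterNames: [member1]}
            weight: 2
          - targetCluster: {clusterNames: [member2]}
            weight: 1
```

With this policy, the resulting resource bindings would place roughly 4 replicas on member1 and 2 on member2.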

Smart Scheduling and Status Aggregation 📊

Karmada goes beyond simple synchronization:

  • Resource Template Interpreter: This acts as Karmada’s “eyes,” allowing it to understand custom resources, even non-standard ones.
  • Replica Operation: Karmada tracks desired application scale, replica counts, and resource requirements, enabling smart scheduling decisions based on actual needs.
  • Aggregated Status: Karmada pulls status data from all member clusters into a single view, providing Argo CD with a global perspective of resource status.
  • Health Interpretation: Custom health checks define application health, ensuring that observed and desired states match and all replicas are available.
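A custom health check can be taught to Karmada declaratively. The following is a minimal sketch only (the target kind and the Lua condition are illustrative assumptions, not the speakers’ actual interpreter):

```yaml
# Teach Karmada how to judge the health of a custom resource
# (here, hypothetically, an Argo Rollout) via a Lua script.
apiVersion: config.karmada.io/v1alpha1
kind: ResourceInterpreterCustomization
metadata:
  name: rollout-health
spec:
  target:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
  customizations:
    healthInterpretation:
      luaScript: |
        function InterpretHealth(observedObj)
          -- Healthy only when the observed state matches the
          -- desired state and all replicas are available.
          if observedObj.status == nil then
            return false
          end
          return observedObj.status.availableReplicas == observedObj.spec.replicas
        end
```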

Built-in Resilience: Cluster Affinity, Anti-Affinity, and Failover 🛡️

Karmada is designed for resilience:

  • Cluster Affinity/Anti-Affinity: It can schedule applications based on affinity rules, distributing them across clusters intelligently.
  • Automatic Failover: If a cluster becomes unhealthy, Karmada detects it and automatically migrates applications to healthy clusters, minimizing manual intervention and ensuring high availability.
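These resilience features are configured on the propagation policy. A hedged sketch, assuming illustrative names and timeouts:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: demo-app-resilient
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: demo-app
  placement:
    # Anti-affinity style spreading: require the workload to span
    # at least two distinct clusters for availability.
    spreadConstraints:
      - spreadByField: cluster
        minGroups: 2
  # Migrate the application away if its cluster stays unhealthy
  # beyond the toleration window.
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 300
      purgeMode: Graciously
```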

The Magic of Argo CD + Karmada: A Unified Release Orchestration 🤝

When Argo CD and Karmada unite, the results are powerful:

  1. Centralized Rollout Orchestration: Argo CD manages the application definition, while Karmada handles the execution of the rollout across member clusters.
  2. Wave-Based Progressive Rollout: A GitOps repository defines the rollout strategy, including propagation and override policies. Argo CD deploys this to the Karmada control plane.
  3. ReplicaSet Management: The Argo Rollouts controller watches the Rollout resource in Karmada and creates ReplicaSets. Karmada then uses the propagation policy to sync these to member clusters.
  4. Cluster-Abstracted Releases: Developers see release waves, while Karmada manages cluster placement and maintenance, embodying a clean separation of duties.

Powering Progressive Delivery with Karmada Features:

  • Replica Dividing: Distribute replicas across clusters for high availability, with more resources allocated to clusters that can handle them.
  • Customized Configurations: Use override policies for region-specific configurations, like different image URLs.
  • Batched Releases: Karmada enables phased rollouts: step B can only start after step A is ready. This is achieved by pausing and resuming scheduling for specific clusters, enabling region-level canary releases. For example, a rollout can resume for cluster one, and only once its replicas are ready does it proceed to cluster two.
  • Automatic Failover: During or after a rollout, if a cluster fails, Karmada automatically reschedules applications to healthy clusters, ensuring SLOs are met and providing continuous maintenance.
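The region-specific configuration mentioned above is expressed with an OverridePolicy. A minimal sketch, assuming hypothetical cluster and registry names:

```yaml
# Rewrite the image registry for one region's cluster while every
# other cluster keeps the default image URL from the template.
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  name: demo-app-region-images
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: demo-app
  overrideRules:
    - targetCluster:
        clusterNames: [member-eu]
      overriders:
        imageOverrider:
          # Replace only the registry component of the image,
          # leaving repository and tag untouched.
          - component: Registry
            operator: replace
            value: registry-eu.example.com
```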

The Demo: A Recorded Journey Through Argo CD + Karmada 🎬

The presentation included a recorded demo showcasing the power of this integration. The demo illustrated:

  • Registering Karmada as a cluster within Argo CD.
  • Creating a new application with replica counts, override policies (for environment variables), and propagation policies (static weight for member clusters).
  • Initiating synchronization, leading to the creation of replica sets and resource bindings.
  • Observing the distribution of pods across member clusters.
  • Performing a new release by changing the image tag, demonstrating how a new replica set is created and scaled, with the old version scaled down.
  • Resuming the rollout process in stages, scaling up replicas across clusters until the final desired state is achieved.
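The first demo step, registering the Karmada control plane with Argo CD, can also be done declaratively. A sketch, assuming an illustrative endpoint (credentials are placeholders, not values from the demo):

```yaml
# Register the Karmada control plane (its karmada-apiserver
# endpoint) as an Argo CD cluster via a labeled Secret, the
# declarative equivalent of `argocd cluster add`.
apiVersion: v1
kind: Secret
metadata:
  name: karmada-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: karmada
  server: https://karmada-apiserver.example.com:5443
  config: |
    {
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64-encoded CA>",
        "certData": "<base64-encoded client cert>",
        "keyData": "<base64-encoded client key>"
      }
    }
```

From Argo CD’s point of view, Karmada then looks like a single destination cluster, even though it fans resources out to every member cluster behind it.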

Journey in Open Source and Future Contributions 🌱

The team’s primary contribution to Karmada has been building the ReplicaSet resource interpreter. Their future plans include continuously improving the platform based on real-world needs and further refining the synergy between Karmada and Argo CD. They aim to keep giving back to Argo CD, Karmada, and the broader open-source community.

Audience Q&A: Traffic Management and Override Policies 🗣️

A key question from the audience revolved around traffic management between clusters and how rollouts are handled if they fail. The speakers explained:

  • They use ZooKeeper for service registration.
  • Traffic shifting is managed at the gateway level, not through ingress.
  • When pods are ready, they register with ZooKeeper, and the gateway handles traffic routing.

Another question clarified the purpose of the override policy. It was explained that while the propagation policy defines how resources are distributed, the override policy allows for specific configurations for different clusters, such as using distinct image URLs for different regions.

This session provided a compelling look at how to scale progressive delivery beyond single clusters, offering a robust and efficient solution for managing complex, multi-cluster application deployments.
