Get ready to dive into the future of deployments! 🚀 We recently had the incredible opportunity to join Anastasiia Gubska from Chase UK, Julia Furst Morgado from Dash0, and Joe Fuller from Buoyant, as they illuminated the path to self-healing progressive delivery with Argo Rollouts and Linkerd. This isn’t just about faster deployments; it’s about making them smarter, safer, and self-sufficient.

The Modern Deployment Paradox: Faster Isn’t Always Safer 😱

Modern engineering teams are on a mission: deploy small changes more often to move faster. Sounds great, right? But here’s the catch: the faster we move, the more outages we experience. This is the paradox of modern engineering.

Anastasiia kicked us off by highlighting a stark reality: research shows a staggering 80% of outages are caused by very small changes. Think about that – on average, every second deployment leads to some form of service degradation or outage! Systems are incredibly complex, and this complexity makes outages more regular.

Julia shared a chilling example from July 2024: CrowdStrike pushed a routine config update that wasn’t rolled out as a canary. The result? Over 8 million Windows systems crashed into a boot loop. Hospitals switched to manual workflows, banks stalled, and Delta lost more than $500 million. Imagine the chaos! A canary deployment, pushing to just 1% of systems, could have contained that blast radius.

SRE teams are often caught in a reactive loop, struggling to monitor every dashboard for silent errors. They might not even know there’s a problem until 10 to 15 minutes after an outage, when a page or notification finally arrives. It’s like finding a needle in a haystack, but the haystack is constantly changing.

Beyond Application Metrics: Why We Need a Deeper Look 💡

So, if changes cause outages, how can we deploy more safely? Julia emphasized that while Argo Rollouts is fantastic for gradual, progressive delivery with canary deployments, it still needs robust observability signals to truly understand what’s happening and act automatically.

The challenge? We often rely on application-level metrics, but these aren’t always enough.

  • An application might return an HTTP 200 (success!), but the user experience could be painfully slow.
  • Lower-level issues like TCP resets or TLS failures can happen below the application layer, completely unbeknownst to the app itself.
  • Many large organizations use diverse languages and frameworks. Some teams might instrument their applications well, others barely at all, leading to inconsistent and unreliable data.

If we can’t fully trust the application layer, what can we trust? The answer, as our experts revealed, is to observe the traffic layer directly at the network level, independently of the application.

The Self-Healing Loop: Argo Rollouts + Linkerd 🤖

Achieving truly self-healing delivery requires two key components working in harmony:

  1. Traffic Control: A way to introduce changes gradually.
  2. Traffic Observation: A way to watch what’s happening in real-time.

Most teams already embrace GitOps, where developers push changes to Git and Argo CD syncs them to the cluster. That core workflow stays the same. The crucial difference lies in the resource being synced: instead of a standard Kubernetes Deployment, we define an Argo Rollout. This custom resource provides:

  • Step-based traffic splitting
  • Promotion gates
  • The ability to attach automated analysis during the rollout.

Instead of sending 100% of traffic to a new version immediately, you can start with just 5% or 10%, creating a “canary” version alongside your stable one.
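To make this concrete, here’s a minimal sketch of what such a Rollout might look like. The names, image, and step weights are illustrative rather than taken from the talk, and the SMI TrafficSplit routing shown is one of the integrations Argo Rollouts offers for Linkerd (newer setups may use the Gateway API plugin instead):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: faces
spec:
  replicas: 4
  selector:
    matchLabels:
      app: faces
  template:
    metadata:
      labels:
        app: faces
    spec:
      containers:
        - name: faces
          image: registry.example.com/faces:v2   # hypothetical image for the new version
          ports:
            - containerPort: 8080
  strategy:
    canary:
      canaryService: faces-canary        # Service that receives canary traffic
      stableService: faces-stable        # Service that keeps serving stable traffic
      trafficRouting:
        smi: {}                          # let Linkerd shift weights via an SMI TrafficSplit
      steps:
        - setWeight: 10                  # start by sending only 10% of traffic to the canary
        - pause: { duration: 60s }       # give the analysis time to collect metrics
        - setWeight: 30
        - pause: { duration: 60s }
        - setWeight: 50
        - pause: { duration: 60s }
```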

Now, for the observation part: that’s where Linkerd shines. When you annotate a namespace for Linkerd injection, a mutating webhook automatically injects a sidecar proxy into every pod. The magic? No code changes, no SDKs required! Every request passes through this proxy, allowing Linkerd to observe L7 traffic and expose all of it as Prometheus metrics.
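Enabling that injection is typically a single annotation on the namespace. A minimal sketch (the namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: faces
  annotations:
    linkerd.io/inject: enabled   # Linkerd's mutating webhook injects the proxy sidecar into every pod created here
```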

So, the moment your canary starts receiving traffic, Prometheus is already scraping these critical production signals. Argo Rollouts then uses an Analysis Template – a set of automated checks that run PromQL queries against Prometheus on a schedule.

If the canary’s success rate stays above, say, 95%, the rollout advances, sending more traffic. But if it drops below 95% twice, Argo Rollouts automatically stops and rolls back. The system has healed itself before an SRE is paged at 2 AM!
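Here’s a hedged sketch of what such an AnalysisTemplate could look like, assuming Linkerd’s proxy metrics (such as response_total) are scraped into a Prometheus instance at the address shown; the template name, interval, and address are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service                         # name of the canary workload to evaluate
  metrics:
    - name: success-rate
      interval: 20s                         # re-run the query on a schedule during the rollout
      failureLimit: 2                       # two failed measurements abort the rollout and trigger a rollback
      successCondition: result[0] >= 0.95   # require at least a 95% success rate
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # assumed Prometheus endpoint
          query: |
            sum(rate(response_total{deployment="{{args.service}}", classification="success", direction="inbound"}[1m]))
            /
            sum(rate(response_total{deployment="{{args.service}}", direction="inbound"}[1m]))
```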

How the Loop Works: Delivery, Observation, Decision ⚙️

This self-healing process forms a powerful closed feedback loop with three distinct layers:

  1. Delivery Control Layer: Argo CD syncs the Rollout resource, and Argo Rollouts initiates the canary deployment.
  2. Observation Layer: Linkerd proxies emit metrics, and Prometheus scrapes them. Crucially, no application instrumentation is needed here.
  3. Decision Layer: The Analysis Template queries these metrics, making real-time decisions to either promote the canary or roll it back.

When these three layers work together, the system catches problems, contains the blast radius, and rolls back automatically. This is the essence of self-healing progressive delivery.
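To wire the decision layer into the delivery layer, the Rollout references the AnalysisTemplate directly. A sketch of the relevant excerpt, reusing the hypothetical success-rate template and faces names from above:

```yaml
  # excerpt from the Rollout spec shown earlier
  strategy:
    canary:
      analysis:                        # background analysis runs while the canary steps advance
        templates:
          - templateName: success-rate
        args:
          - name: service
            value: faces               # substituted into the template's {{args.service}} placeholder
      steps:
        - setWeight: 10
        - pause: { duration: 60s }
```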

The Power of the Service Mesh: Seeing What Apps Miss 👀

Joe emphasized the service mesh’s role with a great analogy: “Your brain doesn’t ask your organs to periodically report their status. The nervous system continuously observes signals across the body.” Similarly, Linkerd sits directly on the network path, intercepting all traffic as a transparent proxy.

This means Linkerd observes what’s already happening on the wire. An application might report no errors, but Linkerd can detect:

  • Timeouts
  • Connection failures
  • Other connection-level issues that the application itself is simply unaware of.

Because Linkerd observes traffic at the network layer, it captures crucial signals like retries and connection failures that applications often won’t report, providing invaluable context for incident resolution.

AI Ops: The Future Beyond Rollbacks? 🤖🔮

Anastasiia posed a thought-provoking question: in a world moving towards AI Ops, could we go beyond simple rollbacks? Could AI find and fix a fault, reverting a deployment without human intervention?

Julia believes that while AI holds immense promise, rollbacks are still the answer for many teams today because they are fast and deterministic. When something is on fire at 2 AM, you need your systems back in a healthy state as quickly as possible.

The next step for AI, Julia envisions, is for an AI agent to have access to the environment, see the signals, and try to fix the issue first, rather than immediately rolling back. If it can’t fix it, then it performs the rollback. This would lead to more options and less conservative deployment strategies. However, the critical takeaway remains: AI still needs strong observability signals. Without a clear view of the environment, an AI agent is effectively blind.

Golden Metrics for a Resilient System ✨

So, what signals should we truly pay attention to? Joe recommended starting with a small number of golden metrics:

  • Latency: If your P99 latency jumps from 50 milliseconds to 2 seconds, even if the application reports no errors, there’s clearly a problem. This is a prime canary signal.
  • mTLS and zero trust: Signals like authentication failures or certificate errors can indicate issues that the application might not report.
  • TCP connection resets and retries: These low-level network issues are crucial indicators of instability.

Implementing these analysis templates is surprisingly simple, often just a YAML snippet. Joe showed how easy it is to define a success rate condition (e.g., above 95%) and a failure limit (e.g., two consecutive failures) to trigger an automatic rollback.
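As one illustration of the latency signal, a P99 check can be expressed as another AnalysisTemplate querying Linkerd’s response_latency_ms_bucket histogram. The threshold, interval, and Prometheus address below are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p99-latency
spec:
  args:
    - name: service
  metrics:
    - name: p99-latency
      interval: 20s
      failureLimit: 2
      successCondition: result[0] < 500   # P99 latency must stay below 500 ms
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # assumed Prometheus endpoint
          query: |
            histogram_quantile(0.99,
              sum(rate(response_latency_ms_bucket{deployment="{{args.service}}", direction="inbound"}[1m])) by (le))
```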

Live Demo: Self-Healing in Action! 🎯

Joe then brought it all to life with a fantastic demo! He showcased a simple “faces” app where each tile represented a healthy connection (smiley face). He triggered a deployment of version two, which was intentionally unhealthy.

As Argo Rollouts gradually routed traffic (30%, then 50%, then 70%) to the new version, we saw angry faces appear on the screen, indicating failures. The analysis stage kicked in, querying Prometheus metrics about the success rate. After a brief pause (set to 20 seconds for visibility, but configurable), Argo Rollouts detected that the success rate had dropped below the 95% threshold.

The result? Argo Rollouts automatically reverted the load balancing configuration, routing 100% of traffic back to the stable version one. Immediately, all the angry faces disappeared, replaced by happy smiley faces! As Joe quipped, “I went for a coffee break and when I returned I didn’t come back to a bunch of teams shouting at me and pages and alerts!”

Summing It Up & Moving Forward 🌐

Modern systems are complex, but we can avoid many failures by using automation to create a closed feedback loop. You don’t yet need complex AI to build a self-healing system. By combining:

  • Linkerd for transparent observation
  • Prometheus (or your preferred observability tool) for metric storage
  • Argo Rollouts for intelligent action

…you can create systems that heal themselves before your SRE team receives any pages.

Merge Forward: Building a More Inclusive Cloud Native Community 🤝

Julia then shared exciting news about Merge Forward, a brand-new CNCF community-backed program. This initiative aims to welcome and support underrepresented groups in the cloud native space, including:

  • Deaf and hard of hearing
  • Blind and visually impaired
  • Neurodiversity
  • Women in cloud

It’s a safe space to learn, contribute to open source, and find mentors and allies.

Anastasiia, a member of the Deaf and Hard of Hearing working group, highlighted the need for more allies and mentors to support and welcome deaf and hard of hearing individuals into the industry. She invited everyone to check out their resources via a QR code and join their Slack channel.

A special treat for attendees: the Deaf and Hard of Hearing group prepared a sign language crash course, teaching signs for “Linkerd,” “OpenTelemetry,” “Kubernetes,” and basic conversational phrases. They also organized an escape room event for Wednesday evening!

The insights shared by Anastasiia, Julia, and Joe offer a clear, actionable path to building more resilient, self-healing systems. The demo was a perfect illustration of how these powerful tools work together to bring peace of mind to engineering teams.

Ready to try it yourself? Joe’s demo resources are available for download here!

A huge thank you to our brilliant panelists for sharing their expertise and inspiring us all to build a better, more stable future for deployments!
