Hello everyone, and welcome to a crucial discussion from the Conf42 Site Reliability Engineering conference 2026! I’m your host, diving deep into insights shared by Uma Mukkara, Product Lead for Resilience Testing at Harness and co-creator of the CNCF-hosted chaos engineering tool, LitmusChaos. Today, we’re unpacking a truth that’s becoming increasingly evident in our fast-paced tech world: resilience testing isn’t just important; it’s non-negotiable in the software development lifecycle (SDLC).

Let’s explore why.

💥 The Inevitable Truth: Outages Are Coming

No matter how brilliantly we design our systems, how meticulously we plan for resilience, or how top-tier our cloud service providers are, outages are an inevitable part of the digital landscape. We see them time and again, impacting even the most robust infrastructures.

Consider these real-world examples:

  • Physical destruction, like a drone strike that knocked AWS data centers partially offline.
  • Software misconfigurations or policy issues on platforms like Azure, leading to service interruptions.
  • Regional or multi-AZ infrastructure losses.
  • Even a surge of user retries during a minor hiccup can cascade into an outage lasting more than six hours, as clients trying to reconnect inadvertently overload the system.
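
The retry-driven outage in the last example is often softened on the client side by spacing retries out. The talk does not prescribe a specific technique; the following is a minimal sketch of one common option, exponential backoff with "full jitter". The function name `backoff_delays` and its default values are illustrative assumptions.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized wait (in seconds) per retry attempt.

    Hypothetical helper: `base` and `cap` are illustrative defaults,
    not values taken from the talk.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # Exponential ceiling: base, 2*base, 4*base, ... limited by `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so that clients
        # which failed at the same moment do not all retry together.
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    # A seeded generator makes the run repeatable for a demo.
    for i, delay in enumerate(backoff_delays(6, rng=random.Random(42))):
        print(f"retry {i + 1}: wait {delay:.2f}s")
```

A resilience test can then check the other side of the bargain: that the service recovers when clients behave this way, and how it degrades when they do not.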

The message is clear: businesses must be resilient against such disruptions. But designing for resilience isn’t enough; we need to verify it proactively and with informed decisions. Relying on a one-time design validation is a risky gamble because failure types constantly evolve. Hope is simply not a good strategy for resilience.

🤔 What Is Resilience, Anyway?

Before we dive deeper into testing, let’s nail down what resilience truly means in this context. It’s not just about preventing crashes. A system that avoids crashing but simply locks up when it encounters an issue isn’t truly resilient.

Instead, resilience is defined by the grace with which systems handle failures and recover to an active state. This includes:

  • Generic failures
  • Load conditions (when systems are under stress)
  • Disasters

How smoothly your system navigates these challenges and returns to full functionality defines its resilience.

💸 The High Cost of “Hope”: Understanding Resilience Debt

A common problem arises when teams mistakenly believe resilience is “someone else’s problem.” They might think, “It’s a data issue,” or “Outages happen in production, let the ops team deal with it.” This mindset creates significant risk.

Neglecting continuous resilience testing is akin to accruing technical debt. Just as technical debt leaves known issues that can hurt you later, resilience debt leaves your systems vulnerable. The more resilience debt you accumulate, the greater the potential business impact when failures inevitably strike. Given the dynamic nature of modern, cloud-native, distributed systems, failure is almost a mathematical certainty.

The solution? A strategy of continuous resilience testing, with results verified and documented for every release. This means integrating resilience testing into your delivery process and SDLC.
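
One way to picture "verified with every release" is a gate step in the delivery pipeline that reads the latest resilience test results and blocks the release when coverage or scores fall short. The sketch below is illustrative only: `ResilienceResult`, `release_gate`, the 0-100 score scale, and the threshold of 80 are all assumptions, not part of any particular product's API.

```python
from dataclasses import dataclass


@dataclass
class ResilienceResult:
    name: str     # hypothetical test name, e.g. "checkout-pod-kill"
    kind: str     # "chaos", "load", or "dr"
    score: float  # assumed scale: 0-100, higher is better


def release_gate(results, min_score=80.0,
                 required_kinds=("chaos", "load", "dr")):
    """Return (ok, reasons) for a candidate release.

    The gate fails if any required kind of test did not run, or if
    any test scored below `min_score`.
    """
    reasons = []
    seen = {r.kind for r in results}
    for kind in required_kinds:
        if kind not in seen:
            reasons.append(f"no {kind} test ran for this release")
    for r in results:
        if r.score < min_score:
            reasons.append(
                f"{r.name} scored {r.score:.0f}, below {min_score:.0f}")
    return (not reasons, reasons)


if __name__ == "__main__":
    ok, reasons = release_gate([
        ResilienceResult("checkout-pod-kill", "chaos", 92),
        ResilienceResult("peak-traffic", "load", 75),
    ])
    print("release allowed" if ok else "release blocked")
    for reason in reasons:
        print(" -", reason)
```

Keeping the reasons alongside the pass/fail result also gives you the documentation trail for each release.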

🎯 Beyond Basic Testing: A Holistic Approach to Resilience

To ensure your business services are truly resilient, you must test them against system failures, load conditions, and disasters. Focusing on just one area leaves you exposed.

Resilience testing isn’t just about functional or limited load testing. It demands a holistic approach that integrates:

  • Full-fledged load testing
  • Chaos engineering
  • Disaster Recovery (DR) testing

These aren’t independent silos; they’re interconnected parts of a comprehensive strategy. Over eight to ten quarters, consistent resilience testing can bring an organization to a significantly higher level of maturity: test coverage grows, teams get invaluable feedback, and the potential business impact shrinks.

🤝 Integrated Testing: Chaos, Load, and DR – Better Together!

For risks like resource saturation or bottlenecks, you might use either chaos testing or load testing. Both can uncover these issues, but the intensity of the test and the specific insights gained can differ. The key is to view these as integrated testing types, not isolated efforts.
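
A back-of-the-envelope capacity model shows why the two test types see different things. This is a deliberately simplified sketch: the `utilization` function, the replica count, and the per-replica throughput are made-up numbers for illustration, and real services rarely scale this linearly.

```python
def utilization(load_rps, replicas, per_replica_rps, replicas_killed=0):
    """Fraction of healthy capacity in use; above 1.0 means saturated."""
    healthy = replicas - replicas_killed
    if healthy <= 0:
        return float("inf")  # nothing left to serve traffic
    return load_rps / (healthy * per_replica_rps)


if __name__ == "__main__":
    # Assumed setup: 4 replicas, each able to handle 100 requests/second.
    scenarios = [
        ("load test only (peak traffic)", 350, 0),
        ("chaos test only (normal traffic, 1 replica killed)", 200, 1),
        ("chaos + load (peak traffic, 1 replica killed)", 350, 1),
    ]
    for label, load, killed in scenarios:
        u = utilization(load, replicas=4, per_replica_rps=100,
                        replicas_killed=killed)
        status = "SATURATED" if u > 1.0 else "ok"
        print(f"{label}: {u:.0%} of capacity ({status})")
```

In this toy setup each test passes when run alone, and the saturation only shows up when the fault is injected at peak load.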

Integrated resilience testing uncovers more advanced risk patterns than individual tests. For example:

  • Combining chaos testing with load testing pushes systems to their absolute limits, exposing critical failure points under stress.
  • Reusing chaos tests in DR scenarios allows for more frequent, efficient, and comprehensive DR tests, helping you verify RPO/RTO targets more effectively.
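
To make the RPO/RTO check concrete, here is a minimal sketch of how the timestamps from a DR drill could be compared against targets. The function name `check_dr_drill`, the timestamps, and the targets are hypothetical; in practice the timestamps would come from your backup system and monitoring.

```python
from datetime import datetime, timedelta


def check_dr_drill(last_good_backup, failure_at, service_restored_at,
                   rpo_target, rto_target):
    """Compare the RPO and RTO achieved in a drill against targets."""
    # Data written between the last good backup and the failure is lost.
    achieved_rpo = failure_at - last_good_backup
    # Time from the failure until the service is serving traffic again.
    achieved_rto = service_restored_at - failure_at
    return {
        "achieved_rpo": achieved_rpo,
        "achieved_rto": achieved_rto,
        "rpo_met": achieved_rpo <= rpo_target,
        "rto_met": achieved_rto <= rto_target,
    }


if __name__ == "__main__":
    # Illustrative drill: all values below are made up.
    result = check_dr_drill(
        last_good_backup=datetime(2026, 3, 1, 9, 50),
        failure_at=datetime(2026, 3, 1, 10, 0),
        service_restored_at=datetime(2026, 3, 1, 10, 45),
        rpo_target=timedelta(minutes=15),
        rto_target=timedelta(minutes=30),
    )
    for key, value in result.items():
        print(f"{key}: {value}")
```

Running a check like this after every chaos-driven DR drill turns RPO/RTO from a design-time promise into a number that is measured regularly.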

This integration provides opportunities to uncover more risks at more frequent intervals. It also highlights the need for collaboration. Often, load testing and chaos testing teams might share infrastructure, leading to bottlenecks. Organizations have a huge opportunity to improve efficiency by:

  • Collaborating on infrastructure.
  • Reusing tests across different purposes.

Think about it:

  • Chaos testing can happen in dev and production environments.
  • Load testing is crucial in QA, but also valuable for benchmarking in dev and for SREs to ensure compliance with load numbers (especially for SaaS).
  • DR testing isn’t just for SREs; developers can contribute by validating infrastructure as code, and platform engineering teams can coordinate pre-production DR validation.

Bringing personas and environments together, and reusing tests, significantly enhances overall resilience.

🤖 AI’s Role in Smarter Resilience Testing

As organizations increasingly leverage AI to enhance operations, its role in resilience testing is becoming clear. AI can help us achieve the “best tests run at the right times.”

Imagine feeding your AI agent a wealth of proprietary system data: knowledge bases, wiki pages, incident data, CI/CD pipelines, and infrastructure details. This AI agent could then provide recommendations for:

  • The most effective load tests.
  • Optimal chaos test scenarios.
  • The best DR test cases.

Furthermore, AI can analyze runtime events to suggest which tests to run at a given moment, ensuring efficiency and relevance. Some products are already moving in this exciting direction!

🛠️ Meet the Tool: Harness Resilience Testing

Recognizing the need for a unified approach, Harness has evolved its Chaos Engineering product into Harness Resilience Testing. This powerful tool integrates native capabilities for load testing and DR testing alongside its existing chaos engineering features.

Harness Resilience Testing offers a one-stop shop for all your resilience testing needs. You can:

  • Create and run load tests, either independently or combined with chaos tests.
  • Build DR workflows that leverage existing chaos tests.

This single pane of glass allows you to detect risks from chaos, load, and DR testing in one place, giving you a comprehensive view of your system’s vulnerabilities and trends. You can then act on this data to reduce risks throughout your SDLC.

A free version of the Harness Resilience Testing module is available for you to explore at harness.io.

✅ The Bottom Line: Resilience is Non-Negotiable!

To summarize, the core message is undeniable: resilience testing is not negotiable. The business impact of unchecked resilience debt will only grow.

To tackle this head-on:

  1. Embrace a holistic view of resilience testing early in your SDLC.
  2. Bring all personas and teams together to collaborate and reuse resources (tests and environments).
  3. Create a joint action plan to continuously improve your overall resilience testing strategy.
  4. Leverage your AI efforts to create and run more efficient resilience tests at optimal times.
  5. Explore tools like Harness Resilience Testing that provide end-to-end capabilities for integrated resilience verification.

Thank you for joining this vital discussion. Let’s build more resilient systems, together!
