Migrations Gone Wild? How Feature Flags Turn Risk into Controlled Experiments 🚀

Hey tech enthusiasts! Ever felt the sheer terror of a big, risky migration project? You know, the kind where you hold your breath and pray everything goes smoothly? Well, what if I told you there’s a way to transform these high-stakes events into calculated, controlled experiments? Ryan Vila, the Harness Feature Management Experimentation Advisory Director, recently shared some brilliant insights on how feature flags can be your secret weapon for smoother cloud migrations and re-platforming efforts. Let’s dive in!

The Perils of Traditional Migrations 🚩

Ryan kicked things off by highlighting the common pitfalls that make migrations such a headache:

  • The Big Bang Cutover: Imagine deploying a massive change all at once, hoping for the best. If even one small piece breaks, your entire system can crumble. This approach couples multiple complex changes (like UI updates, pipeline overhauls, and new services) into a single, high-risk event.
  • No Escape Hatch (Rollback Path): What happens when things go south? If you can’t pause or easily step back during a migration, teams face immense friction and uncertainty. The inability to stop or recover creates significant tension.
  • Uncontrolled Blast Radius: When a problem hits, everyone is affected. Without the ability to limit who sees the change, a single bug can impact 100% of your users.

These failure modes can lead to significant disruption, but there’s a better way!

Enter the Strangler Fig Pattern & Feature Flags 🪴

Ryan introduced us to the Strangler Fig Pattern, a term Martin Fowler coined back in 2004. The core idea is to run a new service in parallel with the old one and gradually redirect traffic from old to new. This controlled redirection allows for:

  • Traffic Bifurcation: Splitting user traffic between the old and new systems.
  • Measurement Opportunities: Teams can measure the differences between the two streams, both qualitatively and quantitatively.
  • Informed Decisions: This data helps teams understand the efficacy and performance of the new service, paving the way to retire the old one and reduce technical debt.
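
To make the routing idea concrete, here is a minimal sketch of a flag-gated strangler facade. Everything in it is hypothetical: `legacy_service`, `new_service`, the `allowlist`, and the counter are illustrative stand-ins, not the Harness SDK or anything Ryan showed.

```python
from collections import Counter
from typing import Callable

def legacy_service(payload: str) -> str:
    # Stand-in for the old system that is being strangled.
    return f"legacy:{payload}"

def new_service(payload: str) -> str:
    # Stand-in for the replacement running in parallel.
    return f"new:{payload}"

def make_router(use_new_path: Callable[[str], bool]):
    """Build a facade that sends each request down exactly one path.

    `use_new_path` plays the role of the feature flag (an assumption:
    in practice this would be a call into your flagging system).
    """
    served = Counter()  # per-path request counts, so the two streams can be compared

    def route(user_id: str, payload: str) -> str:
        if use_new_path(user_id):
            served["new"] += 1
            return new_service(payload)
        served["legacy"] += 1
        return legacy_service(payload)

    return route, served

# Hypothetical flag state: only allowlisted users reach the new service.
allowlist = {"internal-1"}
route, served = make_router(lambda user_id: user_id in allowlist)
```

Because callers only ever talk to `route`, retiring the old service later means deleting one branch rather than touching every call site.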

Gradual Replacement: From Canary to Controlled Rollouts 📈

So, what does a gradual replacement actually look like? We’re all familiar with concepts like:

  • Canary Releases: Rolling out changes to a small subset of users first.
  • Percentage-Based Releases: Gradually increasing the percentage of traffic hitting the new system.
  • Incremental Rollouts: Step-by-step deployment of new capabilities.
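
A percentage-based release is commonly implemented by hashing each user into a stable bucket. The sketch below is one way to do that, not how any particular vendor does it; the function names and the `checkout-v2` flag key used in the note after the code are made up for illustration.

```python
import hashlib

BUCKETS = 10_000  # bucket resolution of 0.01%

def bucket(flag_key: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, BUCKETS) for a given flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS

def in_rollout(flag_key: str, user_id: str, percent: float) -> bool:
    """True if the user falls inside the first `percent` percent of buckets."""
    return bucket(flag_key, user_id) < percent * (BUCKETS // 100)
```

Hashing on the flag key plus the user ID keeps assignments sticky: a user who saw the new path at 1% of `checkout-v2` still sees it at 10%, and raising the percentage only ever adds users.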

Ryan emphasized starting small:

  • Internal Users First: Expose changes to your own team for initial validation.
  • Test Accounts & Synthetic Traffic: Leverage dedicated test accounts or simulated user behavior.
  • Trusted QA Partners: Engage with key customers who can provide qualitative feedback.

Defining “Small”: Minimizing the Blast Radius 🎯

What constitutes a “small” release? It varies! It could be:

  • 1% of Traffic: A highly incremental approach.
  • A Specific Region or Zip Code: Targeting a localized segment.

The key is to define “small” from what you know about your customers: pick the smallest blast radius that still gives you enough traffic to measure impact, especially negative impact.
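
Attribute-based targeting is one way to express “small” in code. This is a rough sketch under assumed names: the `User` fields, the `example.com` internal domain, and the `us-west` pilot region are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class User:
    user_id: str
    email: str
    region: str

def sees_new_path(
    user: User,
    internal_domain: str = "example.com",        # hypothetical company domain
    pilot_regions: frozenset = frozenset({"us-west"}),  # hypothetical pilot region
) -> bool:
    """Decide whether a user is inside the deliberately small first audience."""
    # Internal users go first, wherever they are located.
    if user.email.endswith("@" + internal_domain):
        return True
    # Otherwise only users in the pilot region are exposed.
    return user.region in pilot_regions
```

Widening the audience then becomes a configuration change (add a region) rather than a code change.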

Measure, Measure, Measure! 📏

Crucially, measurement must happen at every single step. What works at 5% might not work at 6%. Ryan advised teams to:

  • Measure Frequently: Don’t wait until the end to see what’s happening.
  • Think Like a Load Test: Treat the entire rollout from 1% to 100% as a test of scalability and stability.
  • Pause or Revert Smartly: If instability arises, you don’t always need to kill the release. You might scale back, learn more, and then ramp up again. This “scaling back” is a crucial part of a robust rollback strategy.
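
The “scale back instead of kill” idea can be sketched as a tiny ramp controller. The ramp steps and the 1% error threshold below are assumptions chosen for illustration; pick values that fit your own service.

```python
RAMP = [1, 5, 10, 25, 50, 100]  # hypothetical rollout steps, in percent

def next_percent(current: int, error_rate: float, threshold: float = 0.01) -> int:
    """Return the next rollout percentage after measuring the current step.

    Healthy step: advance one position on the ramp.
    Unhealthy step: scale back one position instead of jumping to zero.
    """
    i = RAMP.index(current)  # raises ValueError if `current` is not a ramp step
    if error_rate > threshold:
        return RAMP[max(i - 1, 0)]
    return RAMP[min(i + 1, len(RAMP) - 1)]
```

For instance, a clean measurement at 10% moves the rollout to 25%, while a noisy one drops it back to 5% so the team can investigate with less exposure.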

What to Watch at 100%? Beyond the Lift-and-Shift 📊

When you reach general availability (100%), what should you be looking at?

  • Performance Metrics: Compare your new service against the old. Are you seeing reduced latency, fewer errors, or increased throughput? Aim for improvements, not just parity.
  • Error Budget: How much of your error budget is being consumed? Is it appropriate for a 100% rollout?
  • Cost Efficiency: Is the new service more cost-effective? Work with your FinOps teams to measure the financial impact.
  • Cohort Behavior: Don’t just look at aggregate data. Dive deep into specific user segments (e.g., iOS vs. Android users, specific regions) to uncover hidden problems. Aggregate data can mask critical issues affecting particular customer groups.
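
Here is a small sketch of why cohort breakdowns matter. The event data is invented purely for illustration: a modest overall error rate hides a cohort that is failing badly.

```python
from collections import defaultdict

def error_rates_by_cohort(events):
    """Compute the error rate per cohort from (cohort, is_error) pairs."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for cohort, is_error in events:
        totals[cohort] += 1
        errors[cohort] += int(is_error)
    return {cohort: errors[cohort] / totals[cohort] for cohort in totals}

# Made-up traffic: 900 healthy iOS requests, 100 Android requests with 20 failures.
events = (
    [("ios", False)] * 900
    + [("android", False)] * 80
    + [("android", True)] * 20
)

overall = sum(is_error for _, is_error in events) / len(events)
by_cohort = error_rates_by_cohort(events)
```

In this invented data set the aggregate error rate is 2%, yet one in five Android requests fails while iOS is clean.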

Measuring vs. Monitoring: The Critical Difference 🎛️

Many teams monitor their systems, but Ryan drew a distinction between monitoring and measuring during a migration. Monitoring is essential for observability, but in a production environment with lots of “noise,” a 2% error rate might go unnoticed.

However, when you measure from the point of change during a migration, the picture looks very different. For example, if the new code path serves only 5% of traffic and produces all of those errors, that same 2% overall error rate is really a 40% error rate on the new path (2% ÷ 5%). This amplified signal is a clear indicator that action is needed, triggering alerts and enabling faster remediation.
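
The arithmetic behind that amplification is easy to sketch. The traffic share used in the note below is an assumption for illustration (the new path receives 5% of requests and the old path contributes a known, here zero, error rate); the function name is made up.

```python
def new_path_error_rate(
    overall_rate: float,
    new_share: float,
    old_rate: float = 0.0,
) -> float:
    """Back out the new path's error rate from the blended overall rate.

    Uses: overall = new_share * new_rate + (1 - new_share) * old_rate
    """
    if not 0 < new_share <= 1:
        raise ValueError("new_share must be in (0, 1]")
    return (overall_rate - (1 - new_share) * old_rate) / new_share
```

With a 2% overall rate and a 5% traffic share, `new_path_error_rate(0.02, 0.05)` works out to roughly 0.40, which is the 40% figure from the talk. If the old path already had a 1% error rate of its own, the same inputs would attribute about 21% to the new path.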

A Real-World Success Story: Split’s Data Pipeline Overhaul 💡

Ryan shared a compelling example from Split (now part of Harness):

  • The Challenge: Split’s B2B platform processes petabytes of data weekly. A single batch job, processing requests sequentially, couldn’t keep up with customer growth, leading to skyrocketing costs and quality issues.
  • The Solution: They built a new streaming pipeline that scales both horizontally and vertically and processes queued messages in parallel rather than one at a time. They used a feature flag to route traffic from the old pipeline to the new one.
  • The Results:
    • 80% Cost Reduction!
    • Processing time slashed from 10 hours to 15 minutes for the same payload.
    • Migration completed in 3 months with zero customer impact.
    • Improved pipeline quality, giving customers better observability.

Cautionary Tales: Avoiding Flagged Risks ⚠️

While feature flags are powerful, Ryan warned about potential pitfalls:

  • Stale Flags: Flags should have a lifecycle. Implement governance to detect and mitigate stale flags that can cause unexpected behavior or harm if accidentally toggled or removed.
  • Untested Fallback Mechanisms: Flags often have multiple states (on, off, default). Test every single fallback scenario.
  • Environment Drift: Ensure flag configurations in staging match production. If you use flags in a disaster recovery (DR) environment, make sure they either mirror production or sit in an “off” state.
  • Ownership Ambiguity: Clearly define who owns a flag and is responsible for its behavior. Ambiguity leads to inaction.

Remember: A flag is a change. Test flags as code, and clean them up when they’re no longer needed.
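
A few of these pitfalls lend themselves to automated checks. The sketch below is a hypothetical in-house registry, not a real flagging SDK: the `Flag` fields, the 90-day staleness cutoff, and the `"off"` safe default are all assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass(frozen=True)
class Flag:
    key: str
    owner: Optional[str]
    last_changed: date
    default: str = "off"  # treatment served when no explicit state exists

def evaluate(flags: dict, states: dict, key: str) -> str:
    """Return the flag's treatment, falling back to a safe value.

    Unknown flag -> "off"; known flag without a stored state -> its default.
    """
    flag = flags.get(key)
    if flag is None:
        return "off"
    return states.get(key, flag.default)

def audit(flags: dict, today: date, max_age_days: int = 90) -> dict:
    """List governance problems per flag: staleness and missing ownership."""
    findings = {}
    for flag in flags.values():
        problems = []
        if today - flag.last_changed > timedelta(days=max_age_days):
            problems.append("stale")
        if not flag.owner:
            problems.append("no owner")
        if problems:
            findings[flag.key] = problems
    return findings
```

Running `audit` in CI and asserting on every branch of `evaluate` is one lightweight way to treat flags as code.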

Your Migration Checklist: Key Questions to Ask Yourself 🤔

As you embark on your next migration, consider these critical questions:

  1. Gating Your New Path: How are you controlling access to the new migration path? Is every element of your service that is changing behind a control?
  2. Problem Resolution: If a problem occurs, what’s the immediate action? Do you revert to the old service, or scale back the new one? Which flag facilitates this?
  3. Automated Kill Switches: Can you automate the “kill switch” functionality within your flags for rapid response?
  4. Rollback SLA: What’s your Service Level Agreement for recovering from a problem and returning to a known good state?
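
An automated kill switch can be as simple as a flag that turns itself off when the recent error rate crosses a limit. This is a minimal sketch with assumed thresholds and a made-up class name; a real setup would more likely wire alerts from your monitoring stack into your flagging tool.

```python
from collections import deque

class AutoKillSwitch:
    """Flag wrapper that disables the new path when recent errors spike."""

    def __init__(self, max_error_rate: float = 0.05, window: int = 100, min_samples: int = 20):
        self.enabled = True
        self._max_error_rate = max_error_rate
        self._min_samples = min_samples
        self._results = deque(maxlen=window)  # sliding window of True (ok) / False (error)

    def record(self, ok: bool) -> None:
        """Record one request outcome and trip the switch if needed."""
        self._results.append(ok)
        if len(self._results) < self._min_samples:
            return  # too little data to judge
        error_rate = self._results.count(False) / len(self._results)
        if error_rate > self._max_error_rate:
            # Stays off until a human re-enables it; later successes do not reset it.
            self.enabled = False
```

Callers check `switch.enabled` before taking the new path, so tripping the switch sends all traffic back to the old service without a deploy.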

By embracing feature flags and a measurement-first mindset, you can transform your cloud migrations from high-risk gambles into controlled, successful experiments. Happy migrating! ✨
