Hey there, tech enthusiasts! 👋 Ever wondered how we can make our software deployments smarter, safer, and lightning-fast in this age of AI? Carlos Sanchez from Adobe and Kevin Dubois from IBM recently took us on an exhilarating journey, revealing how Argo Rollouts and AI agents are teaming up to revolutionize progressive delivery. Get ready to dive into the future of deployments!


🤖 Is AI a Revolution or Just Another Tool? The Great Debate Kicks Off!

Carlos and Kevin kicked things off with a fascinating poll, asking the audience: Is AI going to radically change the way we do things, or is it just another tool? A resounding majority believed in AI’s revolutionary power, with only a brave few seeing it as merely an incremental tool or a passing fad. This set the perfect stage for exploring how AI is already transforming a critical area of software development: deployment.


💡 Learning from Chaos: The CrowdStrike Incident & Progressive Delivery

Remember the infamous CrowdStrike incident of July 2024, when a faulty update crashed millions of Windows machines around the world? Carlos recounted how a single change broke everything, highlighting the need for a way to contain and recover from bad changes. The post-mortem analysis pointed to a crucial solution: deployment rings. This concept, essentially a form of canary deployment or progressive delivery, means deploying changes to a small segment of users first, assessing the impact, and only then moving on to the next ring.

This is precisely where Argo Rollouts shines! It empowers you to implement progressive delivery strategies, automatically shifting traffic to new versions and rolling back if issues arise. Think of it as your deployment safety net, catching problems before they impact all your users.
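To make the ring idea concrete, here is a minimal Python sketch of a ring-by-ring rollout that stops at the first unhealthy ring. The ring names, the `is_healthy` callback, and the function itself are hypothetical and only illustrate the concept; this is not how Argo Rollouts is implemented or configured.

```python
from typing import Callable, List, Tuple


def progressive_rollout(
    rings: List[str], is_healthy: Callable[[str], bool]
) -> Tuple[str, List[str]]:
    """Deploy ring by ring and stop at the first unhealthy ring.

    Returns the final status and the rings that were exposed to the change.
    """
    exposed: List[str] = []
    for ring in rings:
        exposed.append(ring)        # ship the new version to this ring
        if not is_healthy(ring):    # assess the impact before moving on
            return "rolled_back", exposed
    return "promoted", exposed


# Hypothetical rings, ordered from smallest to largest audience.
rings = ["internal", "1% of users", "10% of users", "everyone"]

all_good = progressive_rollout(rings, lambda ring: True)
stopped_early = progressive_rollout(rings, lambda ring: ring != "10% of users")
```

In the second call the problem shows up in the third ring, so the last ring ("everyone") never receives the bad version. That is the whole point of limiting the blast radius.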


🚀 Beyond Metrics: Why AI is the Game Changer for Analysis

Traditionally, tools like Prometheus have been excellent for defining metrics-based rollouts. You set up queries, define success criteria (e.g., a 95% success rate, no HTTP 500 errors), and Argo Rollouts acts accordingly. However, Carlos and Kevin pointed out a crucial limitation: Prometheus only measures what you already know can go wrong.
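A metrics-based gate of this kind boils down to a threshold check. The sketch below shows the idea in plain Python, assuming the input is a list of HTTP status codes; the function names and the 0.95 default are illustrative and do not reflect how Prometheus or Argo Rollouts express the query.

```python
from typing import List, Optional


def success_rate(status_codes: List[int]) -> Optional[float]:
    """Fraction of responses that are not 5xx, or None if there is no data."""
    if not status_codes:
        return None
    ok = sum(1 for code in status_codes if code < 500)
    return ok / len(status_codes)


def metric_verdict(status_codes: List[int], threshold: float = 0.95) -> str:
    """Return 'pass', 'fail', or 'inconclusive' for a canary's responses."""
    rate = success_rate(status_codes)
    if rate is None:
        return "inconclusive"   # no traffic reached the canary yet
    return "pass" if rate >= threshold else "fail"


healthy_canary = metric_verdict([200] * 96 + [500] * 4)    # 96% success
broken_canary = metric_verdict([200] * 90 + [503] * 10)    # 90% success
no_traffic = metric_verdict([])
```

Note what the check cannot see: a canary that returns HTTP 200 with wrong data, or one that is slowly leaking memory, still passes.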

This is where AI enters the scene as a true disruptor! AI can not only analyze known metrics but also help you figure out what’s wrong that you didn’t even consider beforehand. Imagine an AI agent scrutinizing your deployments, identifying subtle issues that escape your predefined metrics. This shifts the paradigm from reactive problem-solving to proactive, intelligent detection.


🛠️ Integrating AI: The Metric AI Plugin for Argo Rollouts

The team developed a Metric AI plugin for Argo Rollouts, now available in Argoproj Labs (the argoproj-labs organization on GitHub). This plugin integrates AI into your deployment pipeline:

  1. Installation: You can install it via the Argo Rollouts Manager or by building your own Docker image.
  2. Configuration: In your Argo Rollouts analysis template, you simply reference a new template that handles the AI magic.
  3. Agent Communication: The plugin uses the A2A (Agent2Agent) protocol to talk to your AI agent (which can run in your Kubernetes cluster).
  4. Contextual Prompts: You can send extra prompts to the agent, making it application-specific. For a Java app, you might ask it to look for Java-related issues; for a Python app, Python-specific concerns. This allows one agent to serve multiple applications.

This agent isn’t just looking at a single metric; it leverages its “knowledge of the world” to determine if a rollout is good or bad. It can run kubectl commands, fetch logs from stable and canary pods, analyze metrics, and even detect issues like crash-looping pods.
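The contextual prompts from step 4 could be assembled along these lines. This is a hedged sketch: the prompt wording, the `EXTRA_PROMPTS` table, and `build_prompt` are all made up for illustration and are not the plugin's actual configuration format.

```python
# Shared instruction sent for every application (hypothetical wording).
BASE_PROMPT = (
    "Compare the stable and canary pods and decide whether the canary is healthy."
)

# Application-specific hints, keyed by runtime (hypothetical wording).
EXTRA_PROMPTS = {
    "java": "Pay attention to Java-specific problems such as exceptions in stack traces.",
    "python": "Pay attention to Python-specific problems such as tracebacks and import errors.",
}


def build_prompt(app_name: str, runtime: str) -> str:
    """Combine the shared prompt with the extra hint for this runtime, if any."""
    extra = EXTRA_PROMPTS.get(runtime, "")
    parts = [f"Application: {app_name}.", BASE_PROMPT, extra]
    return " ".join(part for part in parts if part)


java_prompt = build_prompt("orders", "java")
unknown_runtime_prompt = build_prompt("gateway", "go")
```

Because only the extra hint changes per application, a single agent can serve a Java service and a Python service without being redeployed.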


🧠 A Symphony of Agents: The AI-Powered Remediation Workflow

Carlos and Kevin envision an even more automated flow, featuring a collaborative team of AI agents:

  • Diagnostic Agent: This agent acts as the primary investigator, looking at pods, logs, metrics, and any other relevant data points.
  • Analysis Agent: It evaluates the diagnostic agent’s findings, making an initial assessment of the canary’s health.
  • Scoring Agent: This agent provides a quality score for the analysis, ensuring accuracy and even prompting for re-analysis if needed (a crucial feedback loop!).
  • Remediation Agent: If the agents decide something is wrong, this is where the magic truly happens! It can automatically:
    • Create a GitHub issue for your project.
    • Even create a pull request (PR) with a proposed code fix!

This entire process can run within Kubernetes, potentially using local models, offering incredible flexibility and automation for your deployment pipeline.
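The hand-off between the four agents, including the scoring feedback loop, can be sketched as a small pipeline. Everything here is an assumption made for illustration: the agents are stub functions rather than LLM calls, and the `Verdict` type, the 0.7 score threshold, and the retry limit are invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Verdict:
    healthy: bool
    reason: str


def run_pipeline(
    diagnose: Callable[[], Dict[str, str]],
    analyze: Callable[[Dict[str, str]], Verdict],
    score: Callable[[Dict[str, str], Verdict], float],
    remediate: Callable[[Verdict], None],
    min_score: float = 0.7,      # assumed quality bar for an analysis
    max_attempts: int = 3,       # assumed cap on re-analysis
) -> Verdict:
    findings = diagnose()                      # diagnostic agent gathers data
    verdict = analyze(findings)                # analysis agent assesses the canary
    attempts = 1
    # Scoring agent: request a re-analysis while the quality score is too low.
    while score(findings, verdict) < min_score and attempts < max_attempts:
        verdict = analyze(findings)
        attempts += 1
    if not verdict.healthy:
        remediate(verdict)                     # remediation agent acts on failures
    return verdict


# Stub agents standing in for real LLM-backed ones.
calls = {"analyze": 0}
scores = iter([0.4, 0.9])          # first analysis scores poorly, second is accepted
actions: List[str] = []


def diagnose() -> Dict[str, str]:
    return {"canary_logs": "java.lang.NullPointerException"}


def analyze(findings: Dict[str, str]) -> Verdict:
    calls["analyze"] += 1
    broken = "NullPointerException" in findings["canary_logs"]
    return Verdict(not broken, "NPE in canary logs" if broken else "no errors found")


def score(findings: Dict[str, str], verdict: Verdict) -> float:
    return next(scores)


def remediate(verdict: Verdict) -> None:
    actions.append(f"open issue: {verdict.reason}")


verdict = run_pipeline(diagnose, analyze, score, remediate)
```

In this run the first analysis gets a low score, so the pipeline asks for a second one before handing the failed verdict to the remediation stub.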


🎬 Demo Time! Witnessing AI in Action

Kevin then walked us through a live demo, showcasing the plugin in action with an application deployed with 10 replicas.

  1. The Setup: A canary deployment strategy was configured: 10% of pods received the new version, followed by a 10-second wait to gather data, and then a 40-second analysis period by the AI agents.
  2. Null Pointer Exception (NPE): Kevin intentionally introduced a null pointer exception. The AI agent quickly detected the issue, returned a failure, and triggered an automatic rollback. Asynchronously, the remediation agent created Pull Request #84 on GitHub, even attempting to propose a code fix! While the model used (a small Qwen model running on an OpenShift cluster) wasn’t perfect, it correctly identified the NPE and offered a starting point for remediation.
  3. Memory Leak: Next, a more subtle memory leak was introduced. This time, the AI agents identified resource exhaustion and potential OOM kills. Recognizing the complexity of the issue, the remediation agent intelligently created a detailed GitHub issue instead of a PR, providing a clear description for human intervention or a more advanced LLM.
  4. Successful Rollout: Finally, a stable version was deployed. The AI agents ran their analysis, found no issues, and the rollout completed successfully, updating all pods.
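For a feel of the numbers in the demo setup, here is a rough sketch of the arithmetic behind it. The step names, the round-up rule for canary pods, and the helper functions are assumptions for illustration; they are not the Argo Rollouts API or its exact scheduling logic.

```python
from typing import List, Tuple


def canary_pods(replicas: int, weight_percent: int) -> int:
    """Number of pods running the new version, rounding up (assumed rule)."""
    return (replicas * weight_percent + 99) // 100


def seconds_before_promotion(steps: List[Tuple[str, int]]) -> int:
    """Total time spent in pause and analysis steps before full promotion."""
    return sum(value for name, value in steps if name in ("pause", "analysis"))


# The demo's strategy, modelled as (step name, value) pairs.
demo_steps = [
    ("set_weight", 10),   # 10% of pods get the new version
    ("pause", 10),        # wait 10 seconds to gather data
    ("analysis", 40),     # AI agents analyze for 40 seconds
]

pods_on_new_version = canary_pods(replicas=10, weight_percent=10)
wait_seconds = seconds_before_promotion(demo_steps)
```

With 10 replicas, a single pod carries the new version while the agents look at it, and under these assumptions a healthy release is promoted after roughly 50 seconds.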

The agentic system itself was developed in Java, leveraging Quarkus and LangChain4j for impressive performance. Kevin noted that early versions took several minutes for analysis, but optimizing with parallel agents and the right framework brought it down to seconds.
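The gain from running agents in parallel is easy to illustrate. The sketch below uses Python threads and `time.sleep` as a stand-in for slow model or cluster calls; the real system is written in Java with Quarkus and LangChain4j, so this is only an analogy with made-up task names, not the authors' code.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fetch_logs() -> str:
    time.sleep(0.1)   # stand-in for a slow call
    return "logs checked"


def fetch_metrics() -> str:
    time.sleep(0.1)
    return "metrics checked"


def inspect_pods() -> str:
    time.sleep(0.1)
    return "pods checked"


def run_sequential(tasks: List[Callable[[], str]]) -> List[str]:
    """Run each task one after another; total time is the sum of all tasks."""
    return [task() for task in tasks]


def run_parallel(tasks: List[Callable[[], str]]) -> List[str]:
    """Run all tasks at once; total time is roughly that of the slowest task."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]   # keeps task order


tasks = [fetch_logs, fetch_metrics, inspect_pods]
sequential_results = run_sequential(tasks)
parallel_results = run_parallel(tasks)
```

Both functions return the same results in the same order; the parallel version simply overlaps the waiting, which matters when each step is a slow model call.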


🎯 The Future of Deployments: Key Takeaways

Carlos and Kevin left us with powerful insights:

  • Increased Importance of Safety Nets: As AI helps us produce code faster (potentially 10 times faster!), projects like Argo Rollouts become even more critical. They provide the necessary “safety net” to limit the blast radius of changes through canaries and feature flags.
  • AI for Scale: Humans simply cannot keep up with analyzing logs and metrics for every deployment. AI agents automate the mundane, allowing humans to focus on complex, high-value tasks. AI can handle 80-90% of normal, simple checks.
  • Intelligent Automation: AI agents are the future. They not only detect issues but can also initiate intelligent remediation steps, from creating issues to proposing code fixes, drastically reducing resolution times.

This integration of Argo Rollouts with AI agents isn’t just an improvement; it’s a fundamental shift towards more resilient, efficient, and intelligent software delivery pipelines. The days of wasting human time sifting through endless logs are numbered!
