🚀 The SRE Survival Guide in the Age of GenAI: Navigating the 100x Code Explosion
The world of software engineering is moving at a breakneck pace. As of late 2025, the landscape has shifted permanently. Microsoft reports that 30% of their codebase is now written by Generative AI (GenAI), while Anthropic reports a staggering 90%. In mid-2025, the founder of Cursor noted that users accepted 1 billion lines of code every single day.
For Site Reliability Engineers (SREs), this isn’t just a trend—it’s a fundamental shift in the physics of production stability. Sylvain Kalache, Lead of AI Labs at Rootly and a veteran SRE, breaks down what this means for the future of reliability.
📈 The New Math of Incidents
To understand the impact of GenAI, we must look at the fundamental incident formula. While $\lambda$ (the baseline failure rate) remains a constant, two variables are changing rapidly:
$$\text{Incident Rate} \approx \lambda \times C \times P$$
- C (Change): The frequency of changes to your system.
- P (Probability): The chance that any single change introduces a failure.
GenAI empowers developers to do more. Research across 5,000 engineers at Microsoft and Accenture shows that those using GenAI produce 15% more commits and complete 25% more tasks.
Because $C$ is increasing drastically, the statistical likelihood of incidents is skyrocketing. Data from Rootly confirms this: the average number of incidents per customer has tripled since 2023. 📈
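The formula's arithmetic can be sketched in a few lines of Python. All numbers here are illustrative, with the baseline rate $\lambda$ treated as a constant multiplier:

```python
# Toy model of the incident formula: incidents scale with change volume
# (C) and per-change failure probability (P), with the baseline rate
# lambda held constant. All numbers are illustrative.

def expected_incidents(lam: float, changes: int, p_fail: float) -> float:
    """Expected incidents per period under the C x P model."""
    return lam * changes * p_fail

# If GenAI triples the volume of changes while P stays flat,
# change-driven incidents triple too...
before = expected_incidents(lam=1.0, changes=100, p_fail=0.02)
after = expected_incidents(lam=1.0, changes=300, p_fail=0.02)

# ...but halving P claws most of that back.
mitigated = expected_incidents(lam=1.0, changes=300, p_fail=0.01)
```

This is the whole argument in miniature: the SRE cannot stop `changes` from growing, so the leverage is in driving `p_fail` down.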
⚠️ The Rise of “AI Slop” and New Failure Modes
AI doesn’t just write code faster; it introduces unique failure patterns. Sylvain identifies several ways AI is currently “screwing up” in production:
- Logic Replication: AI agents often write unit tests that pass perfectly because they simply map the broken logic of the code they just generated.
- Hallucinated Infrastructure: In one instance, a Vercel agent hallucinated a GitHub repo ID and deployed a completely random repository into a customer’s environment. 😲
- Slopsquatting (a hallucination-driven twist on typosquatting): Attackers now publish packages with names that LLMs are likely to hallucinate. One researcher created a fake package called huggingface-cli, which received 30,000 downloads in three months from companies unknowingly pulling in the hallucinated dependency.
- The “Cat” Distraction: LLMs are sensitive to irrelevant context. Research shows that adding a random sentence (like “interesting fact: cats sleep most of their lives”) to a prompt can double an LLM’s error rate. 🐱
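One practical guard against hallucinated dependencies is to vet requirements against an allowlist before installing anything. A minimal sketch, assuming a hand-maintained allowlist (the package names here are illustrative):

```python
# Minimal pre-install guard against hallucinated dependencies: refuse
# any package that is not on a vetted allowlist. The allowlist contents
# are illustrative; a real one would be curated per organization.

VETTED_PACKAGES = {"requests", "numpy", "huggingface-hub"}

def check_requirements(requirements: list[str]) -> list[str]:
    """Return the package names that are NOT vetted (flag for review)."""
    suspects = []
    for line in requirements:
        name = line.split("==")[0].strip().lower()
        if name and name not in VETTED_PACKAGES:
            suspects.append(name)
    return suspects

# "huggingface-cli" is the hallucinated name from the research above;
# the real package is "huggingface-hub".
flagged = check_requirements(["requests==2.32.0", "huggingface-cli"])
```

The point is not the specific mechanism but the posture: an agent's `pip install` suggestions are untrusted input and deserve the same review as any other supply-chain decision.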
👥 The Changing Human Landscape
As AI takes over more coding tasks, the SRE’s support network is changing:
- The “I Didn’t Write This” Problem: When an incident occurs and you “git blame” a developer, the answer is increasingly: “I don’t know how it works; I just prompted my agent to do it.”
- The Death of the Specialist: GenAI turns practitioners into generalists. Finding that deep PostgreSQL or networking expert to help during a Sev-0 incident is becoming harder.
- Management Pressure: Managers expect SREs to “do more” because they have AI, even as the volume of incidents grows.
🛠️ Fighting Fire with Fire: AI for SREs
If you can’t beat the 100x code explosion, you must join it. Sylvain argues that SREs must leverage GenAI to keep $P$ (the probability of failure) low. 🛡️
1. Automated Compliance & Testing
Meta recently shared an experiment with its Automated Compliance Hardening (ACH) tool, which uses LLMs for mutation-guided test generation. By generating 9,000 mutants and 500 privacy test cases, the system produced tests that engineers accepted over 70% of the time. This level of rigor was previously impossible at scale.
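The idea behind mutation testing is easy to show in miniature: deliberately break the code under test and check whether the test suite "kills" the mutant. Real systems generate mutants with LLMs; this toy hard-codes one for clarity:

```python
# Toy mutation-testing illustration: a test suite has teeth only if it
# fails (kills) a deliberately broken mutant of the code under test.

def is_adult(age: int) -> bool:
    return age >= 18          # original logic

def is_adult_mutant(age: int) -> bool:
    return age > 18           # mutant: >= weakened to >

def suite_passes(fn) -> bool:
    """Run the test suite against a given implementation."""
    try:
        assert fn(17) is False
        assert fn(18) is True   # boundary case: this kills the mutant
        return True
    except AssertionError:
        return False

# The original passes; the mutant is killed, so the suite is doing work.
original_ok = suite_passes(is_adult)
mutant_killed = not suite_passes(is_adult_mutant)
```

A mutant that survives every test reveals a gap in coverage, which is exactly the signal Meta used to decide which AI-generated tests were worth keeping.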
2. The AI SRE Agent
New tools are moving beyond simple chat. MCP (Model Context Protocol) allows AI agents to pull data directly from Datadog, Rootly, and Grafana.
- Rootly’s AI SRE: This agentic tool automatically pulls telemetry, reviews previous postmortems, and performs root cause analysis for Sev-2 and Sev-3 incidents without human intervention. 🤖
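The shape of such an agent can be sketched without any real integrations. Everything below is a hypothetical stand-in, not an actual Rootly, Datadog, or MCP API; the point is the loop of gathering signals and ranking candidate causes:

```python
# Hedged sketch of an AI SRE agent loop: gather telemetry and past
# postmortems, then keep only the signals strong enough to cite in a
# draft root-cause analysis. All functions and data are illustrative.

from dataclasses import dataclass

@dataclass
class Signal:
    source: str      # e.g. "metrics", "postmortems", "logs"
    detail: str
    score: float     # how strongly this signal points at a cause

def gather_signals(incident_id: str) -> list[Signal]:
    """Stand-in for MCP-style pulls from monitoring and incident tools."""
    return [
        Signal("metrics", "p99 latency spike after 14:02 deploy", 0.9),
        Signal("postmortems", "similar spike in INC-812 traced to cache config", 0.6),
        Signal("logs", "unrelated cron warning", 0.1),
    ]

def draft_root_cause(incident_id: str, threshold: float = 0.5) -> list[str]:
    """Rank signals by strength and drop the weak ones."""
    signals = sorted(gather_signals(incident_id), key=lambda s: -s.score)
    return [f"[{s.source}] {s.detail}" for s in signals if s.score >= threshold]
```

In production the gathering step would call real tools over MCP and the ranking would be done by a model, but the structure of the loop is the same.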
3. Eliminating Toil
AI is exceptionally good at the parts of the job SREs dislike:
- Postmortems: AI can aggregate timelines and data points from multiple sources to draft reports.
- Internal Tools: AI can instantly “beautify” clunky internal dashboards, adding professional UIs to functional but ugly tools. 🎨
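The timeline-aggregation step of an AI-drafted postmortem reduces to merging events from several tools into one chronological record. A minimal sketch with illustrative event data:

```python
# Merge incident events from several sources (alerts, deploys, chat)
# into one chronologically sorted timeline draft. Data is illustrative.

from datetime import datetime

def merge_timeline(*sources: list[tuple[str, str]]) -> list[str]:
    """Each source is a list of (ISO timestamp, description) pairs;
    return one chronologically sorted list of formatted lines."""
    events = [e for source in sources for e in source]
    events.sort(key=lambda e: datetime.fromisoformat(e[0]))
    return [f"{ts} - {desc}" for ts, desc in events]

alerts = [("2025-11-03T14:05:00", "PagerDuty alert: checkout 5xx spike")]
deploys = [("2025-11-03T14:02:00", "Deploy #4812 rolled out to prod")]
chat = [("2025-11-03T14:09:00", "On-call acknowledges in #incident-42")]

timeline = merge_timeline(alerts, deploys, chat)
```

The mechanical merge is the toil; the model's job is the narrative layered on top of it.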
🎯 The Bottom Line: Focus on the Fundamentals
GenAI and agentic coding are not going anywhere, but neither is the need for reliability. As an SRE, you cannot control $C$ (the volume of code), but you are the guardian of $P$ (the probability of failure).
We must apply the Blameless Culture to LLMs. If an AI breaks production, it isn’t the AI’s fault—it’s a sign that our system lacked the necessary guardrails to catch the issue. 🏗️
The fundamentals of SRE—observability, testing, and resilience—matter more now than ever before. Embrace the tools, automate the toil, and focus your human expertise on the most complex challenges.
Q&A Highlights 💬
Audience: I’m skeptical of non-deterministic systems in production. How can I trust an AI?
Sylvain Kalache: This sounds exactly like the gatekeepers who didn’t want interns or junior engineers touching production years ago. AI makes mistakes just like humans do, but at 100x or 1,000x the speed. The answer isn’t to ban it, but to build better guardrails and a more resilient system that can handle that pace.
Audience: What should I do if I feel overwhelmed by the pace of change?
Sylvain Kalache: Come back to the formula. You can’t stop the world from shipping code faster, but your job—keeping the probability of failure low—is more important now than it has ever been in the history of tech. 🌐✨