🚀 From Over-provisioned to Optimized: Automating FinOps with GitOps
In the world of Kubernetes, over-provisioning isn’t just a mistake—it is a structural reality. We often treat it as a “safety tax” to ensure our applications run 24/7, but that tax is becoming increasingly expensive. At a recent talk, Hrittik Roy (Platform Advocate at Loft Labs and CNCF Ambassador) and Kunal Das shared a provocative truth: your massive cloud bill isn’t an architecture problem; it is an automation problem.
Here is how you can stop throwing money at the cloud and start using GitOps to close the loop on FinOps.
📉 The Uncomfortable Truth in Numbers
We often blame expensive GPUs or high-cost regions for our cloud bills, but the data tells a different story. According to recent surveys:
- 70% of organizations identify over-provisioning as the primary cause of Kubernetes overspend.
- 10% is the average CPU utilization across clusters with 50 or more CPUs. This means companies are effectively paying for ten CPUs while only using one.
- 40% to 60% of capacity is wasted, often as a rational response by developers who lack the tools to know what is safe to reduce.
The root cause? Kubernetes has zero native feedback loops between what you request and what you actually use. The scheduler sees your request, allocates the space, and the cloud provider bills you for it—regardless of whether your pod sits idle at 12% utilization for months.
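That missing feedback loop can at least be measured with standard metrics. As a minimal sketch, assuming Prometheus scrapes cAdvisor and kube-state-metrics, a recording rule can track how much of the requested CPU each namespace actually consumes:

```yaml
# Illustrative PrometheusRule: records per-namespace CPU utilization
# relative to requests (e.g. 0.10 = 10% of what you pay for is used).
# Assumes cAdvisor and kube-state-metrics metrics are being scraped.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: finops-utilization
  namespace: monitoring
spec:
  groups:
    - name: finops.rules
      rules:
        - record: namespace:cpu_request_utilization:ratio
          expr: |
            sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
            /
            sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
```

A ratio that sits near 0.1 for weeks is exactly the idle-but-billed pattern described above.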
🛠️ The “Closed Loop” Challenge
While tools like Prometheus, cAdvisor, and Metrics Server expose the data, and Kubecost provides visibility, there is a massive gap in implementation. Currently:
- Observability tools (Kubecost, OpenCost) give recommendations but don’t implement them.
- Autoscalers (VPA, HPA) lack context and can cause evictions without approval flows.
- Cloud billing tools suffer from low adoption and a lack of granular awareness.
Only 18% of organizations have reached the “optimization” stage of FinOps. The rest are stuck in the “crawl” or “walk” phases, researching tools rather than implementing changes.
🦾 The Solution: The Argo Suite for FinOps
To bridge the gap between recommendation and implementation, Hrittik and Kunal propose using the Argo ecosystem to create a self-healing, cost-aware infrastructure.
1. Argo Workflows: The Orchestration Engine ⚙️
Think of cost optimization as a Directed Acyclic Graph (DAG). Argo Workflows can:
- Collect metrics from OpenCost and Prometheus.
- Analyze and merge data to create right-sizing recommendations.
- Use CronWorkflows to schedule weekly policy checks.
- Automatically open a GitHub PR with updated resource values.
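A minimal sketch of that DAG as a CronWorkflow (the images and scripts are hypothetical placeholders, not a published tool) might look like:

```yaml
# Illustrative Argo CronWorkflow: weekly right-sizing pipeline as a DAG.
# Container images and scripts are placeholders for your own tooling.
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: weekly-rightsizing
spec:
  schedule: "0 6 * * 1"          # every Monday at 06:00
  workflowSpec:
    entrypoint: rightsize
    templates:
      - name: rightsize
        dag:
          tasks:
            - name: collect-metrics     # pull usage from Prometheus/OpenCost
              template: collect
            - name: recommend           # merge data into new requests/limits
              template: recommend
              dependencies: [collect-metrics]
            - name: open-pr             # commit updated values, open a GitHub PR
              template: open-pr
              dependencies: [recommend]
      - name: collect
        container:
          image: ghcr.io/example/finops-collector:latest   # hypothetical image
          command: ["./collect.sh"]
      - name: recommend
        container:
          image: ghcr.io/example/finops-recommender:latest # hypothetical image
          command: ["./recommend.sh"]
      - name: open-pr
        container:
          image: ghcr.io/example/finops-pr-bot:latest      # hypothetical image
          command: ["./open_pr.sh"]
```

The PR step is what closes the loop: recommendations land as reviewable diffs rather than dashboards.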
2. Argo Events: Reactive Management ⚡
While workflows handle scheduled tasks, Argo Events handles real-time triggers:
- New Deployments: Trigger right-sizing based on historical data before the first billing cycle hits.
- Budget Alerts: Identify top spenders immediately when a threshold is breached.
- Namespace Creation: Proactively implement resource quotas and limit ranges.
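As one sketch of the namespace-creation case (resource names and limit values are illustrative), a resource EventSource can watch for new namespaces while a Sensor applies a default LimitRange:

```yaml
# Illustrative Argo Events setup: react to namespace creation by creating
# a default LimitRange. Names and values are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: namespace-watcher
spec:
  resource:
    ns-created:
      namespace: ""              # cluster-scoped watch
      group: ""
      version: v1
      resource: namespaces
      eventTypes: [ADD]
---
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: default-guardrails
spec:
  dependencies:
    - name: ns-created
      eventSourceName: namespace-watcher
      eventName: ns-created
  triggers:
    - template:
        name: apply-limitrange
        k8s:
          operation: create
          source:
            resource:
              apiVersion: v1
              kind: LimitRange
              metadata:
                name: default-limits
              spec:
                limits:
                  - type: Container
                    defaultRequest:
                      cpu: 100m
                      memory: 128Mi
                    default:
                      cpu: 500m
                      memory: 512Mi
```

In practice the trigger also needs a parameter mapping the new namespace’s name from the event payload into `metadata.namespace`; that wiring is omitted here for brevity.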
3. Argo CD & Rollouts: The Safety Net 🛡️
Changing resource limits in production is scary. Argo CD provides the delivery mechanism with self-healing and sync waves, while Argo Rollouts adds a layer of progressive delivery.
- If you reduce CPU from 1000m to 400m, Argo Rollouts can monitor P99 latency and success rates.
- If performance degrades, the system automatically rolls back the change.
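The latency gate above can be expressed as an AnalysisTemplate; the Prometheus address, metric names, and threshold below are assumptions for illustration:

```yaml
# Illustrative Argo Rollouts analysis: gate a CPU-request reduction on
# P99 latency. Address, metric names, and the 500 ms threshold are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-check
spec:
  metrics:
    - name: p99-latency
      interval: 1m
      failureLimit: 2                     # two bad readings abort and roll back
      successCondition: result[0] < 0.5   # keep P99 under 500 ms
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket{app="my-app"}[5m])))
# Referenced from the Rollout's canary strategy, e.g.:
#   strategy:
#     canary:
#       steps:
#         - setWeight: 50
#         - analysis:
#             templates:
#               - templateName: latency-check
```

If the analysis fails while the reduced requests are rolling out, Argo Rollouts aborts the update and restores the previous values automatically.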
⚖️ Trade-offs and Impact
Implementing a GitOps-driven FinOps strategy isn’t without its hurdles.
The Upside:
- 20% to 30% savings without any performance degradation.
- Up to 70% savings when combined with cloud provider commitments and orphan storage cleanup.
- Declarative & Auditable: Every cost change has a PR, a justification, and an approval history.
The Challenges:
- Complexity: Managing multiple CRDs across namespaces can increase the load on the Kubernetes API server.
- Steep Learning Curve: Teams must master workflow templates and event buses.
- Organizational Shift: FinOps is a systems problem, not a people problem. It requires moving from manual “nagging” to automated feedback loops.
🗺️ A Practical Roadmap to Automated FinOps
Ready to start? Follow this graduated approach:
- Visibility First: Deploy Kubecost or OpenCost to establish a baseline.
- Collect & Alert: Set up an Argo CronWorkflow to gather data and send recommendations to Slack or GitHub.
- Automate Non-Prod: Enable automatic PR creation and merging for development and staging environments.
- Guardrails for Prod: Implement PR-based approval gates and Argo Rollouts for production changes.
- Infra Scaling: Integrate tools like Karpenter and vCluster to ensure your underlying nodes scale as efficiently as your pods.
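For the “Automate Non-Prod” step, the delivery side can be a sketch like this Argo CD Application (repository URL and paths are hypothetical), so that merged right-sizing PRs roll out without manual action:

```yaml
# Illustrative Argo CD Application: auto-sync a staging environment so merged
# right-sizing PRs are applied automatically. Repo URL and paths are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: staging-workloads
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-config.git  # hypothetical repo
    targetRevision: main
    path: environments/staging
  destination:
    server: https://kubernetes.default.svc
    namespace: staging
  syncPolicy:
    automated:
      prune: true
      selfHeal: true      # revert out-of-band drift back to what Git declares
```

Production would use the same Application shape but without automated sync, keeping the PR-approval gate as the only path to change.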
💬 Closing Thoughts
As Hrittik Roy emphasized, engineers should focus on value provisioning, not infra provisioning. By using the Argo suite to close the loop between usage and requests, you transform FinOps from a tedious manual chore into a high-performance automated system.
“Start small, build toward complete automation, and let GitOps do the heavy lifting.” ✨
Note: This post is based on a presentation by Kunal Das and Hrittik Roy. Special thanks to the community for the factual data and surveys that backed this session.