Mastering Argo Workflows: Essential Dos and Don’ts for Scalability, Security, and Sanity 🚀
Argo Workflows is a powerhouse for orchestrating complex processes within your Kubernetes cluster. But like any powerful tool, it comes with its own set of nuances. Becky Pauley from Tailscale and Tim Collins, an Argo maintainer from Pipekit, recently shared invaluable insights into navigating the common pitfalls and unlocking the true potential of Argo Workflows. This post distills their expert advice into actionable dos and don’ts, covering everything from fundamental misunderstandings to advanced scaling and security strategies.
1. Rethinking Argo: It’s Not Just an App, It’s a Kubernetes Orchestrator 💡
A common misconception is viewing Argo Workflows as just another application running in Kubernetes. The reality is far more profound.
- The Core Truth: Argo Workflows is fundamentally a Kubernetes operator designed to orchestrate Kubernetes pods by deeply interacting with the Kubernetes API.
- Why This Matters: This distinction is crucial. Thinking of Argo as an app will lead you down the wrong path. Instead, embrace its role as an orchestrator that leverages Kubernetes’ core functionalities.
- The Takeaway: If your primary goal is to orchestrate things outside of Kubernetes (like VMs or mobile devices), Argo Workflows might not be the ideal tool. Stick to its strengths within the Kubernetes ecosystem.
- For Non-Kubernetes Folks: If you’re new to Kubernetes, understanding its fundamentals is key to Argo’s success. Buddy up with someone who knows Kubernetes well, or invest time in learning it yourself. This upfront effort will save you immense headaches and costs down the line.
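Because Argo is an operator, every workflow you submit is itself a Kubernetes custom resource that the controller reconciles. A minimal sketch (the image and command are illustrative):

```yaml
# A Workflow is a Kubernetes custom resource, reconciled by the
# workflow controller just like any other operator-managed object.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-   # the API server appends a random suffix
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [echo, "hello from a Kubernetes CR"]
```

You can submit this with `argo submit` or plain `kubectl create -f`, which underlines the point: Argo's job is orchestrating pods through the Kubernetes API, not running as a standalone app.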
2. Observability is Your Crystal Ball: Seeing Problems Before They Explode 🔮
Running many orchestrated pods can strain your cluster. Without proper observability, you’re flying blind.
- The Goal: Gain enough visibility to identify issues before they become catastrophes and to tune your cluster and Argo configurations to your specific workloads.
- Key Questions to Answer:
- Is my Argo Workflow Controller healthy?
- Is my Kubernetes API healthy?
- Is the Kubernetes API rate-limiting requests from the Argo Controller or the Scheduler?
- How are my workflow steps (pods) performing? Are they starting promptly? Are they resource-constrained? Are they scheduled correctly?
- Are there delays in image pulling?
- How are dependent tools like object storage and databases performing?
- Actionable Steps:
- Metrics: Implement observability tooling to answer the above questions. Set up alerting and potential remediation for undesirable states.
- Logs: Collect and aggregate logs from both workflow steps (pod logs) and the workflow controller.
- Log Archiving DO NOT: Do not use the built-in Argo Workflows log archiving tool. It’s officially documented as “pretty rubbish,” failing to collect essential logs (e.g., from the controller or sidecars) and being difficult to parse at scale.
- Log Archiving DO: Use a dedicated log archiving solution like DataDog or Loki. Integrate custom links in the Argo Workflows UI to allow users to directly access relevant logs.
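Custom links live in the workflow-controller ConfigMap. A sketch assuming a Grafana/Loki backend (the URLs are placeholders for whatever log tooling you run):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  links: |
    # Rendered as buttons in the Argo Workflows UI; ${...} variables
    # are substituted from the workflow/pod metadata.
    - name: Pod Logs
      scope: pod-logs
      url: https://grafana.example.com/explore?pod=${metadata.name}&namespace=${metadata.namespace}
    - name: Workflow Logs
      scope: workflow
      url: https://grafana.example.com/explore?workflow=${metadata.name}
```

This gives users one-click access from a workflow or pod straight to the aggregated logs, instead of relying on the built-in archive.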
3. Resource Management: The Unsung Hero of Workflow Performance 💾
Ignoring the CPU and memory needs of your Kubernetes pods is a surefire way to encounter issues like pending pods, out-of-memory killings, and prolonged workflow execution times.
- The Impact: Poor resource management leads to slowdowns, unreliability, and increased costs.
- Understanding Requests and Limits:
- Requests: The baseline CPU and memory a container needs. These are used for scheduling decisions.
- Limits: The maximum CPU and memory a pod can consume, preventing it from destabilizing the cluster.
- What to Resource:
- Argo Workflow Controller
- Argo Server
- Individual workflow steps (including sidecar, init, and wait containers)
- Setting Resources:
- Defaults: The Helm chart and official manifests for the controller and server do not set default requests/limits because needs vary wildly.
- Observability is Key: Use observability tools to understand average and peak CPU/memory utilization.
- VPA (Vertical Pod Autoscaler): Can be useful in recommendation mode for initial estimates, but it’s less effective for highly spiky usage patterns.
- Workflow-Wide Defaults: Identify the average CPU/memory needs of most workflow steps and set these as defaults in the workflow controller config map. This can be done for main, wait, and init containers.
- Individual Overrides: For specific steps requiring different resources, set values on an individual task level to override defaults.
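Workflow-wide defaults can be set in the controller ConfigMap; the numbers below are illustrative starting points, not recommendations (tune them from your own metrics):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  # Default resources applied to every step's main container
  mainContainer: |
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        memory: 256Mi
  # Default resources for the injected init and wait containers
  executor: |
    resources:
      requests:
        cpu: 50m
        memory: 64Mi
```

Individual steps that need more (or less) simply declare their own `container.resources` in the workflow template, which overrides these defaults.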
4. Tidy Up Your Cluster: Keeping Argo Lean and Mean 🧹
A cluttered cluster leads to instability. Every workflow you create is a Kubernetes custom resource that can accumulate rapidly.
- The Problem: Unmanaged workflows and their associated resources can build up, consuming valuable cluster resources and impacting performance.
- Key Strategies:
- TTL Strategy: Configure workflows to delete themselves automatically upon completion.
- Workflow Archiving: If you must retain workflow records (e.g., for regulatory compliance), don't keep them in your cluster. Enable workflow archiving to a database instead.
- Garbage Collection: Utilize garbage collection to clean up completed pods and artifacts.
- Stay Updated: Regularly update Argo Workflows and Kubernetes to the latest stable versions. Tools like Renovate can automate this process.
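The TTL and pod garbage-collection strategies above can be declared per workflow (or as `workflowDefaults` in the controller ConfigMap); the durations here are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: tidy-example-
spec:
  entrypoint: main
  # Delete the Workflow CR itself after completion
  ttlStrategy:
    secondsAfterCompletion: 3600   # keep failures around for 1h of debugging
    secondsAfterSuccess: 600       # successful runs can go sooner
  # Clean up completed step pods promptly
  podGC:
    strategy: OnPodCompletion
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [echo, done]
```

Setting these as controller-level defaults is usually the safer option, since it catches workflows whose authors forget to clean up after themselves.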
5. Security First: Locking Down Your Argo Deployment 🔒
Argo Workflows, by its nature, requires elevated privileges to create other workloads. Securing it is paramount.
- Authentication:
- Modes: Server (service account), Client (bearer token), SSO.
- Recommendation: Use SSO for robust authentication. It delegates to your identity provider, enforces MFA, and enables SSO RBAC for granular permission control based on OIDC groups.
- Exposing Argo Server:
- DO NOT: Expose Argo Server directly to the public internet without strong authentication and access controls.
- DO: Use a VPN or a secure connectivity tool.
- Example with Tailscale: Use the Tailscale Kubernetes operator to create an Ingress with the Tailscale ingress class. This ensures only authenticated users on your Tailnet can access Argo Server.
- Permissions (RBAC):
- DO NOT: Grant broadly privileged access to your cluster or Argo Server to all engineers.
- DO: Implement a least privilege approach.
- Use namespaced installations for human users.
- Configure RBAC permissions carefully.
- Grant workflow steps only the access they need to perform their tasks.
- Avoid using the default service account for workflow steps.
6. Cluster-Wide Optimization: Smart Scaling and Cost Savings 💰
Thinking beyond Argo Workflows and considering your Kubernetes cluster as a whole unlocks significant cost savings and performance improvements.
- Cluster Autoscaling:
- DO: Leverage cluster autoscaling to dynamically adjust the number of nodes based on workload demand. This turns off unused nodes, saving compute costs.
- ARM64 CPUs:
- DO: Consider using ARM64 CPUs instead of AMD64 (x86-64). ARM is generally more efficient and cost-effective, with often better server availability. Argo Workflows runs well on ARM.
- Impact: Customers have reported savings upwards of 70% by moving to ARM and spot instances.
- Node Selectors:
- DO: Use node selectors to ensure specific workflow steps run on the appropriate type of node. This is especially relevant for on-prem clusters.
7. Scaling to Massive Heights: When Workflows Meet the API Limit 📈
The question “How many workflows can I run?” is common but impossible to answer definitively due to varying workflows and clusters.
- The Bottleneck: At massive scale, the Kubernetes API server is often the first component to falter, not the Argo Workflow Controller itself. This can be due to a full etcd database or API rate limits.
- Observability is Crucial: You need observability to pinpoint the root cause.
- Scaling the Control Plane:
- On-Prem: Increase CPU for the control plane and enlarge the etcd database.
- Cloud Providers: Your control plane is abstracted. You may need to contact your cloud provider’s support to request increased capacity, though this can be very expensive.
- Multi-Cluster Strategy: Often, it’s more cost-effective to deploy multiple clusters than to significantly scale a single control plane. The downside is managing multiple Argo Server URLs.
- Workflow Controller Configuration for Scale:
- Starting Point: For a generic 10,000 concurrent workflows, use the configuration numbers shared in the talk as a starting point and tune based on observability.
- Re-queue Time: The default is 10 seconds. For high concurrency, increase this to 30-60 seconds to reduce API load. For lower loads, decreasing it can speed up workflows.
- Workflow Controller Replicas:
- DO: Set only one workflow controller replica. Controllers use leader election, so only one is active. Additional replicas consume resources and add unnecessary API queries.
- Benefit: If the single controller pod is deleted, its replacement usually comes online faster than a secondary controller can win leader election. Ensure the controller has a high PriorityClass (the default in supported installations).
- Argo Server Replicas:
- DO: Run multiple replicas of the Argo Server. There’s no leader election here, ensuring UI/API availability.
- Consider HPA: Use a Horizontal Pod Autoscaler (HPA) to scale Argo Server pods dynamically based on demand.
- Workflow CR Management: At scale, the Argo Server querying many workflow CRs can also strain the Kubernetes API. This reinforces the importance of deleting or archiving completed workflows.
8. Beyond YAML: Embracing SDKs for Workflow Development 👨‍💻
For those who prefer coding in familiar languages, Argo Workflows offers SDKs.
- The Benefit: Write your workflows in Golang, Python, TypeScript, or other languages, producing fully functional workflows without the YAML complexity.
- Availability: Some SDKs are auto-generated, while others are community-maintained.
The Payoff: Real-World Savings and Reliability ✨
Implementing these best practices isn’t just about avoiding problems; it’s about achieving tangible results:
- Reduced Runtime: One customer saw their single workflow runtime drop from 6 hours to 30 minutes.
- Increased Reliability: Another achieved 99.9% reliability in workflow runs, reducing random failures from 30% to near zero.
- Significant Cost Reduction: Companies have directly reduced cloud compute costs by two-thirds by understanding the interplay between Argo Workflows and Kubernetes.
The Key Takeaway for Everyone:
Success with Argo Workflows hinges on a deep understanding and respect for Kubernetes. Invest time in planning, cluster tuning, and continuous learning. By keeping Kubernetes at the core of your Argo strategy, you unlock unparalleled scalability, security, and efficiency.
(Slides and examples are available on GitHub at the link provided in the original presentation.)