From Chaos to Cloud-Native Nirvana: How IBM Research Tamed Bare Metal OpenShift with GitOps and Policy Magic ✨

Ever felt like you’re wrestling a wild beast when managing bare metal OpenShift clusters? You’re not alone! IBM Research faced a similar challenge, drowning in manual configurations and user requests for their high-performance computing environments. But they didn’t just survive; they thrived, transforming their chaotic setup into a streamlined, automated powerhouse using a brilliant trifecta: Argo CD, Kyverno, and Kueue. 🚀

This isn’t just about tech; it’s about understanding people. The researchers in the accelerated discovery field – a hotbed of AI, hybrid cloud, and quantum computing – weren’t Kubernetes gurus. They were brilliant minds focused on breakthroughs, not YAML files. Their workflows, often characterized by bursty behavior and a need for interactive access, clashed head-on with the complexities of Kubernetes. The result? Wasted resources, scheduling nightmares, and an overwhelmed administrative team.

Let’s dive into how IBM Research turned this chaos into control, making their OpenShift clusters a dream for researchers and admins alike.

The Researcher’s Plight: When HPC Meets Kubernetes 🤯

Imagine a world where deadlines like NeurIPS and ICML loom large. Researchers need to crunch massive datasets, often leading to intense, CPU-heavy jobs. The problem? These CPU behemoths were hogging resources on GPU-enabled nodes, leaving other users in the lurch.

Compounding this, researchers accustomed to the freedom of full node access in traditional HPC environments found workarounds. They’d spin up “sleep infinity” pods, essentially keeping GPU resources tied up even when not actively in use. This led to the infamous interactive GPU pod misuse, a massive drain on valuable hardware. 💰

And let’s not forget the admin side. The people managing these clusters were passionate about research, not necessarily seasoned platform engineers. Manual setups, combined with a high volume of non-expert users, meant they were constantly bogged down with “requests for even simple changes.” This was a critical bottleneck for scientific progress.

The Grand Vision: Automation, Policy, and Fairness 🎯

The IBM Research team set out with a clear, three-pronged strategy:

  1. Empowerment & Efficiency: Shift namespace management to the users and automate away those tedious, manual deployments.
  2. Resource Optimization: Implement smart policies to keep CPU jobs off GPU nodes and stop the misuse of interactive GPU pods.
  3. Equitable Access: Introduce a queuing system that ensures everyone gets a fair shot at those precious GPU resources.

Argo CD: The GitOps Backbone of Automation 💾

The secret sauce for their automation journey? Argo CD, seamlessly integrated via OpenShift GitOps. This move brought them the power of GitOps, treating infrastructure as code and ensuring consistency and auditability.

  • Kustomize for Simplicity: They harnessed Kustomize to keep Kubernetes configurations as plain, template-free YAML. This made development and maintenance a breeze for everyone.
  • Cascading Changes with Shared Bases: A shared project definition with sensible defaults for roles, group memberships, and quotas became the foundation. Any change here would cascade beautifully across all user-created namespaces.
  • Application Sets for Automated Magic: To banish manual Argo CD application creation, they deployed an application set. This clever generator scans user project directories and automatically spins up Argo CD applications. Plus, it boasts self-healing to combat drift and pruning to clean up resources when they’re deleted from Git (see the repo layout and ApplicationSet sketches after this list).
  • User-Driven Project Creation: Researchers could now create their own projects by simply adding a folder and a pre-defined file. Minimal effort, maximum impact.
  • Zero Support Requests? Yes, Please! This GitOps approach, coupled with user empowerment, led to a remarkable outcome: “zeroed support requests on Slack.” Now that’s music to an admin’s ears! 🎶
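
To make the pattern concrete, here is a minimal sketch of how such a repository could be laid out. Every name below is an illustrative assumption (the repo URL, the projects/ hierarchy, the folder names); the talk doesn’t publish IBM Research’s actual manifests.

```yaml
# Hypothetical repository layout: one shared base, one folder per project.
#
#   projects/
#     base/                  # shared defaults: roles, group memberships, quotas
#     team-a/
#       kustomization.yaml
#       project.yaml         # the pre-defined file researchers copy in
#
# projects/team-a/kustomization.yaml -- plain, template-free YAML:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: team-a
resources:
- project.yaml   # this project's namespace-specific pieces
- ../base        # inherits the shared roles, memberships, and quotas
```

Because every project references ../base, editing the base once cascades the change into all user-created namespaces on the next sync. The ApplicationSet that turns folders into Argo CD applications might then look like this (again a sketch, with placeholder URLs):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: user-projects
  namespace: openshift-gitops
spec:
  generators:
  - git:
      repoURL: https://git.example.com/research/cluster-config.git
      revision: main
      directories:
      - path: projects/*     # one Application per user project folder
      - path: projects/base
        exclude: true        # the shared base is not a project itself
  template:
    metadata:
      name: "{{path.basename}}"
    spec:
      project: default
      source:
        repoURL: https://git.example.com/research/cluster-config.git
        targetRevision: main
        path: "{{path}}"
      destination:
        server: https://kubernetes.default.svc
        namespace: "{{path.basename}}"
      syncPolicy:
        automated:
          selfHeal: true     # revert out-of-band drift back to Git
          prune: true        # clean up resources deleted from Git
        syncOptions:
        - CreateNamespace=true
```

Adding a folder under projects/ becomes the entire onboarding flow: the generator spots it, creates the Application, and pruning tears everything down again if the folder is later removed from Git.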

Kyverno: The Policy Enforcer for a Smarter Cluster 🦾

To tackle the thorny issues of CPU sprawl and interactive GPU misuse, Kyverno stepped in as the policy enforcement engine. Their focus was on transparent, explicit policies with built-in escape hatches.

  • No More CPU Sprawl on GPU Nodes: Kyverno inspects incoming jobs. If a workload doesn’t actually request any GPUs in its resource limits, Kyverno swoops in and adds a node anti-affinity rule that keeps it off GPU nodes. This ensures that CPU-only jobs stay far away from valuable GPU real estate (hedged sketches of these policies follow this list).
  • Blocking Interactive GPU Pod Abuse: The “sleep infinity” and “tail -f /dev/null” patterns were put to an end. Kyverno blocks exec commands into pods based on namespace labels, rendering these placeholder pods useless for their intended misuse.
    • Admin Exemptions & Configurable Exceptions: Legitimate administrative tasks are safe! Cluster admins can still exec. Plus, a ConfigMap allows for namespace exemptions, offering crucial flexibility.
    • Clear Communication: When an action is blocked, users receive clear messages explaining why and guiding them on how to request exceptions.
  • Audit and Refine for Perfection: Recognizing that policies can sometimes be too strict, they initially deployed them to a small group of power users. Kyverno’s audit mode logged every exec command. This revealed unintended consequences, like blocking essential tools such as tar and rsync, which tunnel over the very same exec subresource (oc rsync and kubectl cp, for example)!
    • Policy Tweaks: The policies were refined to include an allow list for commands like tar and rsync, and the exec denial is now conditional on pods actually requesting GPUs.
  • Labeling for Proactive Management: A brilliant bonus policy automatically labels newly created resources with the user’s identifier. This makes identifying and managing deprecated storage classes a breeze, allowing for proactive outreach.
  • User Ignorance is Bliss: The best part? Users are largely unaware of these policies because the changes are made transparently. This also provided valuable insights into which workloads could be refactored into cloud-native patterns and which truly needed interactive GPU access.
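
The talk doesn’t publish the actual policy YAML, but a hedged sketch of the anti-affinity mutation could look like the following. The nvidia.com/gpu.present node label is an assumption (it’s what NVIDIA’s GPU operator exposes via node feature discovery), and the policy names are invented for illustration.

```yaml
# Sketch: keep CPU-only pods off GPU nodes by injecting node (anti-)affinity.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: cpu-jobs-avoid-gpu-nodes
spec:
  rules:
  - name: add-gpu-node-anti-affinity
    match:
      any:
      - resources:
          kinds:
          - Pod
    preconditions:
      all:
      # Count containers that set a GPU limit; mutate only when there are none.
      - key: "{{ length(request.object.spec.containers[?resources.limits.\"nvidia.com/gpu\"]) }}"
        operator: Equals
        value: 0
    mutate:
      patchStrategicMerge:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: nvidia.com/gpu.present   # assumed GPU-operator label
                    operator: NotIn
                    values: ["true"]
```

Blocking interactive exec works on the exec subresource itself, which reaches the admission webhook as a CONNECT on PodExecOptions. A sketch that folds in the audit-driven refinements (cluster-admin exemption, tar/rsync allow list, GPU-only scope), with a hypothetical namespace label:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: block-interactive-gpu-exec
spec:
  validationFailureAction: Enforce
  background: false
  rules:
  - name: deny-exec-into-gpu-pods
    match:
      any:
      - resources:
          kinds:
          - PodExecOptions              # kubectl/oc exec arrives as this kind
          namespaceSelector:
            matchLabels:
              example.com/exec-blocked: "true"   # hypothetical namespace label
    exclude:
      any:
      - clusterRoles:
        - cluster-admin                 # legitimate admin tasks stay possible
    context:
    # Look up the target pod so the denial only applies to GPU-requesting pods.
    - name: targetPod
      apiCall:
        urlPath: "/api/v1/namespaces/{{request.namespace}}/pods/{{request.name}}"
    preconditions:
      all:
      - key: "{{ request.operation }}"
        operator: Equals
        value: CONNECT
      - key: "{{ length(targetPod.spec.containers[?resources.limits.\"nvidia.com/gpu\"]) }}"
        operator: GreaterThan
        value: 0
    validate:
      message: >-
        Interactive exec into GPU pods is blocked in this namespace;
        contact the cluster admins to request an exception.
      deny:
        conditions:
          all:
          # Allow the tools the audit phase showed were breaking.
          - key: "{{ request.object.command[0] || '' }}"
            operator: AnyNotIn
            value: ["tar", "rsync"]
```

The ConfigMap-driven namespace exemptions could be layered in as an extra precondition; they’re omitted here to keep the sketch short. The bonus labeling policy is simpler still, a one-rule mutation (note that real usernames may need sanitizing into valid label values):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: label-resource-creator
spec:
  rules:
  - name: add-created-by
    match:
      any:
      - resources:
          kinds:
          - PersistentVolumeClaim   # e.g. to trace deprecated storage classes
    mutate:
      patchStrategicMerge:
        metadata:
          labels:
            created-by: "{{ request.userInfo.username }}"
```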

Kueue: Fair Play for GPU Resources 🤝

The final piece of the puzzle was Kueue, bringing an HPC-like queuing system to ensure fair and efficient GPU access.

  • HPC-like Queuing for Non-Time-Sensitive Workloads: Most research workloads aren’t urgent. Kueue queues these jobs, preventing the cluster from being overwhelmed by a flood of seemingly urgent pods.
  • Resource Flavors for Specific GPUs: Kueue allows defining resource flavors based on node labels. This means you can create specific queues, like an “H100 queue,” for users who need GPUs with particular CUDA capabilities or memory.
  • Generic GPU Queue for Maximum Utilization: Within the same cohort as specific queues, a “generic GPU queue” can be established. This queue can borrow resources from other queues, leading to faster turnaround times for users who don’t need specialized GPUs. This is perfect for testing smaller models or utilizing older hardware (see the sketch after this list).
  • Fairness and Choice: Kueue guarantees fair resource access by queuing workloads and empowers users to choose between specific GPUs or a generic queue, optimizing both resource utilization and user experience.
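
As a hedged sketch against Kueue’s v1beta1 API (queue names and quota numbers are invented, and the gpu.product label values assume NVIDIA’s GPU operator), the flavor-plus-cohort setup could look like this:

```yaml
# Flavors map queues onto node labels.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: h100
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
---
# The "H100 queue" for workloads needing specific CUDA capability/memory.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: h100-queue
spec:
  cohort: gpu-cohort             # members of a cohort can lend idle quota
  namespaceSelector: {}          # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: h100
      resources:
      - name: "cpu"
        nominalQuota: 512
      - name: "memory"
        nominalQuota: 4Ti
      - name: "nvidia.com/gpu"
        nominalQuota: 32
---
# The generic GPU queue: older hardware, but it may borrow idle H100 quota.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: generic-gpu-queue
spec:
  cohort: gpu-cohort
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: a100
      resources:
      - name: "cpu"
        nominalQuota: 512
      - name: "memory"
        nominalQuota: 4Ti
      - name: "nvidia.com/gpu"
        nominalQuota: 16
        borrowingLimit: 32       # how much idle cohort quota it may borrow
---
# Each namespace gets LocalQueues; a job opts in with a single label:
# kueue.x-k8s.io/queue-name: generic-gpu
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: generic-gpu
  namespace: team-a
spec:
  clusterQueue: generic-gpu-queue
```

The cohort is what makes the “borrowing” work: idle H100 capacity can absorb generic-queue jobs, while the nominal quotas still anchor each queue’s fair share.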

The Empathetic Approach: Key Takeaways 💡

The transformation at IBM Research is a masterclass in intelligent cluster management. The benefits are crystal clear:

  • Admin Headaches Vanish: Automation and user empowerment have “zeroed” support requests and eliminated manual deployments.
  • Visibility and Auditability Reign Supreme: GitOps ensures everything is visible, auditable, and traceable, while self-healing keeps the cluster in its desired state.
  • Resource Utilization Soars: Policies automatically fix misconfigurations and problematic behaviors, drastically reducing resource contention.
  • Namespace Changes Happen Instantly: Users can now propose and deploy changes in a flash, breaking down administrative bottlenecks.
  • GPU Access is Fair for All: Queuing systems ensure equitable distribution of valuable GPU resources.

The overarching message from this presentation is profound: be empathetic. Understanding your users’ needs and making the system as intuitive and easy as possible for them is the key to successful adoption and efficient operation. It’s about building systems that empower, not overwhelm. And that, my friends, is the future of cloud-native excellence. ✨
