The Python Software Foundation’s Observability Revolution: Building a Self-Hosted Stack for Community Power 🚀

Hey tech enthusiasts! Jacob Coffee, Director of Engineering at the Python Software Foundation (PSF), recently shared a fascinating journey of how the PSF built its own self-hosted observability stack using Grafana, Loki, Mimir, and Alloy. This isn’t just about fancy dashboards; it’s a story of empowering a massive global community, managing incredible scale, and championing open-source principles. Let’s dive in!

The Sheer Scale of Python’s Ecosystem 🌐

First, let’s grasp the monumental scale the PSF operates at. PyPI, the Python Package Index, is where most developers get their packages. It handles an astonishing 5 billion requests daily and hosts over 750,000 packages, tens of millions of files, and over a million users. To put that in perspective, Jacob noted that PyPI now sees more requests per second than Google does searches per second! 🤯
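As a rough sanity check on that daily figure, here is how it converts to an average per-second rate. This is a minimal sketch that assumes requests are spread evenly across the day, which real traffic is not, so actual peaks will be higher than the average.

```python
# Convert the quoted daily request volume into an average per-second rate.
# Assumption: requests are spread evenly over 24 hours (real traffic has peaks).
REQUESTS_PER_DAY = 5_000_000_000  # figure quoted in the talk
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400 seconds

average_rps = REQUESTS_PER_DAY / SECONDS_PER_DAY
print(f"{average_rps:,.0f} requests per second on average")
```

That works out to roughly 58,000 requests per second, averaged over a full day.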

Beyond PyPI, the PSF also manages critical community resources like:

  • Python.org: The official website.
  • Python Docs: Essential learning resources.
  • Build Bots: Crucial for Python runtime development.
  • Benchmark Tooling: For performance analysis.

Each of these represents a different codebase and tech stack, all running on volunteer-driven infrastructure that’s over 20 years old. This polyglot environment, from Django and Mailman to Golang and TypeScript, is a testament to the evolving nature of open-source projects.

The Community’s Growing Needs: Beyond Internal Tools 🤝

The PSF’s mission extends to supporting a vast array of community projects. This includes working groups, PyLadies chapters (promoting women in tech), PyCons worldwide, and packaging teams. These groups deploy their applications on PSF infrastructure, and critically, they need to know when things break.

Jacob highlights a key challenge: while the PSF had excellent observability tools like Datadog and Sentry (generously donated through open-source programs), these solutions were primarily for the core engineering team. They couldn’t be extended to the hundreds, if not thousands, of community members who relied on the infrastructure.

The Gap: What worked for a small engineering team didn’t translate to the needs of a global community. When issues arose at 2:00 AM Pacific time, community members in Europe or Asia had to message Jacob or his colleagues on Slack, hoping they were awake, or file tickets and wait, losing valuable debugging context.

The “Why Not Us?” Moment: Sovereignty and Sustainability 🛡️

This led to two critical concerns:

  1. Community Empowerment: The PSF wanted its community to self-serve their observability needs without paying for their own tooling or depending on the limited seats of donated enterprise tools.
  2. Vendor Lock-in & Sustainability: Every piece of infrastructure was donated. What if a sponsor’s board decided to cut budgets? The PSF faced a potential “big bill” if those donations were pulled. This highlighted the need for control and long-term sustainability.

The Solution: Build their own self-hosted observability stack.

Building the Self-Hosted Stack: Grafana, Loki, Mimir, Alloy 🛠️

The PSF needed a solution that provided:

  • Self-hosted Logs: Accessible for community members.
  • Metrics with Decent Retention: To track performance over time.
  • Shareable Dashboards: For easy visualization.
  • Multi-tenant Access: To ensure users only see their own data.
  • Single Collection Agent: To simplify configuration.

The chosen stack, Grafana, Loki, Mimir, and Alloy, checked all these boxes and more:

  • Open Source: Aligns perfectly with the PSF’s mission.
  • Seamless Integration: Designed to work together efficiently.
  • Active Development: Backed by a company actively shipping features to the open-source editions, not just security updates.
  • Incremental Growth: The team could start small, beginning with logs and then adding metrics, with plans for distributed tracing using Tempo.

The Infrastructure: Kubernetes Clusters and Cabotage 🧊

The PSF operates two main Kubernetes clusters:

  1. PSF Cluster: Hosts python.org, PyCon, and community projects. It uses Alloy for collection, Loki for logs, Mimir for metrics, and Grafana for dashboards. Mimir is backed by MinIO for S3-compatible object storage.
  2. PyPI Cluster: Handles the massive scale of PyPI. It uses the same monitoring stack but with independent storage due to PyPI’s larger blast radius and different security requirements.

Both clusters run on Cabotage, the PSF’s open-source platform as a service. Cabotage manages Kubernetes deployments with Vault for automatic mTLS, Consul for service discovery, and BuildKit for container builds. This ensures that the monitoring stack is automatically wired up and comes online with any new cluster, simplifying deployment and management.

The Data Flow: Alloy, Loki, and Mimir in Action 🌊

  • Metrics: Alloy runs as a DaemonSet, collecting data from nodes. It scrapes metrics from cAdvisor (around 40 metrics per node) and Traefik (for requests, CPU, latencies). Cardinality control is crucial here, especially when running on donated cloud credits.
  • Logs: Traefik access logs are parsed, extracting structured labels like status code and service name. This structured labeling at collection time is paramount for fast querying. App logs from community projects are also collected and tagged with Kubernetes metadata (namespace, pod name).
  • Loki: Stores logs, indexing metadata but not full content, making it cost-effective. It’s backed by MinIO. The PSF currently has a seven-day retention for infrastructure logs and 48 hours for community tenants, with ongoing experiments to increase this based on security researcher needs and cost considerations. LogQL queries are efficient thanks to the structured labels.
  • Mimir: Handles metrics storage. The Mimir Query Engine (MQE) is noted for reducing peak memory usage by up to 92%, which is a significant win for resource-constrained environments. Pre-computing latency histograms also speeds up dashboard loading.
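To illustrate the idea of structured labeling at collection time, here is a minimal Python sketch. It is not the PSF’s Alloy configuration. It assumes Traefik writes access logs in its JSON format (where `DownstreamStatus` and `ServiceName` are standard field names), and the label names `status` and `service`, the allowlist, and the sample record are all hypothetical.

```python
import json

# Hypothetical allowlist: only low-cardinality fields are promoted to labels.
# Keys are Traefik JSON access-log fields; values are assumed label names.
LABEL_FIELDS = {"DownstreamStatus": "status", "ServiceName": "service"}


def extract_labels(raw_line: str) -> tuple[dict[str, str], str]:
    """Split one JSON access-log line into (labels, untouched log line)."""
    record = json.loads(raw_line)
    labels = {
        label: str(record[field])
        for field, label in LABEL_FIELDS.items()
        if field in record
    }
    return labels, raw_line


# Made-up sample record for illustration only.
sample = json.dumps({
    "DownstreamStatus": 500,
    "ServiceName": "example-app@kubernetes",
    "RequestPath": "/signup",
    "Duration": 1234567,
})
labels, line = extract_labels(sample)
print(labels)
```

High-cardinality fields such as the request path stay in the log line itself instead of becoming labels. This keeps the index small and is the same cardinality concern the metrics bullet raises.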

The Game Changer: Community Self-Service and Empowerment ✨

The real magic happened when Grafana provided multi-tenant access. Each Kubernetes namespace maps to a Loki tenant, meaning:

  • PyCon organizers can see only PyCon data.
  • PyLadies can see only PyLadies data.
  • No one sees what they shouldn’t.

This was a massive win for security and for empowering the community. A PyLadies organizer, facing a 500 error, could log into Grafana, find their service dashboard, identify the misconfigured environment variable in the Loki logs, fix it, and redeploy – all without involving the infra team.
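The namespace-to-tenant mapping and the log lookup in that scenario can be sketched as follows. This is a hedged illustration, not the PSF’s setup: it assumes the tenant ID equals the namespace name, that logs carry a `namespace` label, and that Loki is reachable at a made-up address. The `X-Scope-OrgID` header and the `/loki/api/v1/query_range` endpoint are part of Loki’s HTTP API. In practice the Grafana data source would supply the tenant, not the end user.

```python
from urllib.parse import urlencode

LOKI_URL = "http://loki.example.internal:3100"  # hypothetical address


def tenant_for_namespace(namespace: str) -> str:
    """Assumed 1:1 mapping: the Kubernetes namespace name is the tenant ID."""
    return namespace


def build_log_query(namespace: str, logql: str, limit: int = 100) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for a Loki range query scoped to one tenant."""
    params = urlencode({"query": logql, "limit": limit})
    url = f"{LOKI_URL}/loki/api/v1/query_range?{params}"
    # Loki reads the tenant from this header when multi-tenancy is enabled.
    headers = {"X-Scope-OrgID": tenant_for_namespace(namespace)}
    return url, headers


# Look for error lines in one project's logs (the label name is an assumption).
url, headers = build_log_query("pyladies", '{namespace="pyladies"} |= "error"')
print(headers)
print(url)
```

The returned URL and headers can be passed to any HTTP client. Because the tenant travels with every request, a query scoped to one project cannot return another project’s logs.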

The Transformation: The PSF went from being the bottleneck to being the platform. This shift scales far better than a few engineers answering Slack messages at midnight.

The Impact: Data-Driven Decisions and Sustainable Funding 📊

The self-hosted observability stack provided invaluable data that transformed conversations with sponsors and within the PSF’s board:

  • Concrete Numbers: Instead of “we think this is expensive,” they can now say, “This is exactly how expensive it is,” backed by real traffic numbers, request volumes, and metrics.
  • Defending Resources: The data allows the PSF to clearly demonstrate the value of their infrastructure and the impact of donations, making a strong case for continued support.
  • Transparency: The PSF co-signed the “Open Infrastructure is Not Free” statement, and their new data provides the concrete measurements to back up this crucial message. Sponsors can see exactly what their donations are enabling.

The total market rate for the PSF’s infrastructure, supported by generous in-kind donations from companies like Fastly and AWS, easily runs into the tens of millions of dollars annually. That figure is now quantifiable and defensible.

The Road Ahead: Tempo, More Projects, and Public Insights 🛣️

The PSF’s roadmap includes:

  • Distributed Tracing with Tempo: To complete the LGTM (Loki, Grafana, Tempo, Mimir) stack and pinpoint slow operations.
  • Onboarding More Projects: Expanding the infrastructure to support even more community initiatives.
  • Public Dashboards: Offering public-facing insights into the health and capacity of the entire Python ecosystem, going beyond simple uptime status.

Key Takeaways for Your Own Stack 💡

Jacob’s insights offer powerful lessons for any organization:

  1. Own Your Observability: Don’t build on things that can be taken away. Host with open source.
  2. Share with Your Community: If people run things on your infrastructure, let them see their own logs and data.
  3. Be the Platform, Not the Bottleneck: Empower users to self-serve.
  4. Build the Evidence: Sponsors and boards respond to data and numbers, not just “vibes.” Dashboards provide that crucial evidence.

Join the Mission! 📢

The PSF is actively hiring! They’re looking for an Infrastructure Engineer and a Software Engineer to work on PyPI. If you want to make an enormous impact on the Python ecosystem, check out python.org/jobs.

And if you’re heading to PyCon US in Long Beach, May 13th-19th, Jacob will be there – perhaps with even better dashboards!

This journey by the Python Software Foundation is a shining example of how embracing open-source observability can not only solve internal challenges but also fundamentally empower and scale a global community. It’s a powerful reminder that when we build for the community, everyone benefits.
