Beyond Plugins: Building a Self-Healing, Dynamic Service Catalog 🚀
In the world of internal developer portals, Backstage is often hailed as the “one place to find every service, API, and piece of infrastructure.” For many teams, the initial launch is a triumph. Developers register services, YAML files are written, and leadership is impressed.
But then comes Month Six.
As Jenil Jain (Harness) and Debabrata Panigrahi (Parseable Inc.) shared in their recent talk, the excitement of a new catalog often fades into the frustration of stale data. When reorgs happen, repos move, and ownership shifts, the static catalog becomes a graveyard of 404 links and outdated information.
Here is how they moved beyond basic plugins to build a living, dynamic enrichment platform that keeps the catalog fresh automatically. 💡
📉 The Crisis of Trust: Why Static Catalogs Fail
The speakers identified a recurring pattern: data is perfect on Day One but decays silently. This decay hits four key personas differently:
- The On-Call Engineer: At 2 AM, they need to know who owns a failing service. If the catalog says Team Alpha—a team that dissolved last quarter—they waste 20 minutes digging through GitHub history while the incident escalates.
- The Engineering Manager: During quarterly reviews, they can’t tell which of their 200 services are actually well-maintained. It becomes a “spreadsheet exercise” based on best guesses.
- The Platform Engineer: Their credibility is tied to data they don’t control. If teams don’t trust the data, they don’t trust the platform.
- The New Hire: By week two, they stop using the catalog because the links are broken. They go back to asking every question in Slack.
The realization? You cannot force humans to update YAML files. The catalog must update itself. 🤖
🛠️ Building the Enrichment Platform
The team initially tried manual pings, Slack bots, and Python scripts, but these methods failed to scale. They eventually hit a wall with Backstage Catalog Processors because third-party API dependencies (like GitHub) caused constant rate-limiting and timeouts.
To solve this, they designed a dedicated Enrichment Platform using a Base Connector architecture.
The “What” and “How” of Enrichment 🔗
Instead of handling every API case-by-case, they built a base connector that manages the “nitty-gritty” details like caching and rate limits. Developers simply declare:
- What data to enrich (e.g., entity kind).
- How to enrich it (which API endpoints to hit, such as Sentry, Parseable, or DataDog).
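The talk did not show the exact interface, but the base-connector pattern it describes might be sketched like this, where the base class owns caching and subclasses declare only the "what" and the "how" (all names here are illustrative assumptions):

```typescript
// Hypothetical sketch of the base-connector pattern: the base class handles
// the "nitty-gritty" (here, a 60-second cache); subclasses declare what to
// enrich and how to fetch it. Names are illustrative, not from the talk.

type Annotations = Record<string, string>;

interface Entity {
  kind: string;
  metadata: { name: string; annotations?: Annotations };
}

abstract class BaseConnector {
  private cache = new Map<string, { value: Annotations; expires: number }>();

  // "What": which entities this connector applies to.
  abstract appliesTo(entity: Entity): boolean;

  // "How": which API to hit and what annotations to produce.
  protected abstract fetch(entity: Entity): Promise<Annotations>;

  async enrich(entity: Entity): Promise<Annotations> {
    const key = `${entity.kind}/${entity.metadata.name}`;
    const hit = this.cache.get(key);
    if (hit && hit.expires > Date.now()) return hit.value; // serve from cache
    const value = await this.fetch(entity);
    this.cache.set(key, { value, expires: Date.now() + 60_000 }); // 60s TTL
    return value;
  }
}

class SentryConnector extends BaseConnector {
  appliesTo(entity: Entity): boolean {
    return entity.kind === 'Component';
  }
  protected async fetch(entity: Entity): Promise<Annotations> {
    // Real code would call the Sentry API here.
    return { 'enrichment.io/sentry-issues': '3' };
  }
}
```

A new connector then only implements `appliesTo` and `fetch`; caching and (in the real platform) rate limiting come for free from the base class.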
This data is then stored directly in the catalog as annotations under the enrichment.io namespace. Why annotations? Because every part of Backstage (Search, Scaffolder, and TechDocs) already understands them. 💾
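Concretely, an enriched entity's catalog-info.yaml might end up looking something like this sketch (the specific annotation keys are illustrative assumptions; the talk only names the enrichment.io namespace):

```yaml
# Hypothetical catalog-info.yaml after enrichment. Only the enrichment.io
# namespace comes from the talk; the individual keys are examples.
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-service
  annotations:
    enrichment.io/health-score: "72"
    enrichment.io/last-enriched: "2024-06-01T02:00:00Z"
    enrichment.io/sentry-issues: "3"
spec:
  type: service
  owner: team-payments
```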
🌡️ The Health Score: Measuring What Matters
The most impactful feature is the Health Score—a single number from 0 to 100 computed from four weighted dimensions:
- Data Freshness: Is the enrichment data current?
- Maintenance: Are people actively committing and reviewing PRs?
- Coverage: How many connectors are providing data for this entity?
- Incidents: Are there active PagerDuty alerts?
This score transforms the catalog from a directory into a management tool. Managers can instantly filter for services with a score below 40 to identify abandoned or unmaintained assets. 🎯
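The talk did not disclose the exact weights, but a weighted score over those four dimensions might be computed like this sketch (the weights and 0-100 sub-scores are assumptions):

```typescript
// Hypothetical health-score computation. The four dimensions come from the
// talk; the weights and the 0-100 sub-scores are illustrative assumptions.

interface HealthDimensions {
  freshness: number;   // 0-100: how current the enrichment data is
  maintenance: number; // 0-100: commit and PR-review activity
  coverage: number;    // 0-100: share of connectors reporting data
  incidents: number;   // 0-100: 100 = no active PagerDuty alerts
}

const WEIGHTS: Record<keyof HealthDimensions, number> = {
  freshness: 0.3,
  maintenance: 0.3,
  coverage: 0.2,
  incidents: 0.2,
};

function healthScore(d: HealthDimensions): number {
  const score = (Object.keys(WEIGHTS) as (keyof HealthDimensions)[])
    .reduce((sum, k) => sum + WEIGHTS[k] * d[k], 0);
  return Math.round(Math.min(100, Math.max(0, score)));
}
```

The "filter for abandoned services" workflow then reduces to something like `entities.filter(e => healthScore(dimensionsOf(e)) < 40)`.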
🤖 Adding the AI Layer: Proactive Insights
Debabrata and Jenil didn’t stop at scores. They integrated an AI layer (pluggable with GPT or Claude) to provide:
- Plain English Summaries: Instead of interpreting graphs, users read: “This service has high activity but declining health due to unresolved incidents.”
- Anomaly Alerts: If a score drops 30 points in 48 hours, the AI flags it—catching silent degradations over weekends.
- Impact Prediction: Using BFS (Breadth-First Search) traversal across service maps, the system predicts the “blast radius” of a failure, identifying exactly how many downstream services are at risk. 🌐
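The blast-radius traversal itself is standard BFS; a minimal sketch over a "who depends on me" adjacency map (data structure assumed, not shown in the talk) might look like:

```typescript
// Hypothetical blast-radius calculation: BFS over the dependency graph to
// find every downstream service transitively affected by a failure.

type ServiceMap = Map<string, string[]>; // service -> services that depend on it

function blastRadius(graph: ServiceMap, failing: string): Set<string> {
  const affected = new Set<string>();
  const queue = [failing];
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const dependent of graph.get(current) ?? []) {
      if (!affected.has(dependent)) {
        affected.add(dependent); // newly reached downstream service
        queue.push(dependent);
      }
    }
  }
  return affected;
}
```

The size of the returned set is the "how many downstream services are at risk" number surfaced to users.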
🏗️ A Resilient Architecture
The system consists of six core plugins that handle everything from the backend engine to the frontend dashboards. To ensure the platform didn’t crash when external APIs went down, they implemented:
- Circuit Breakers: If a connector fails 5 times, it pauses for 5 minutes to prevent cascading failures.
- Token Bucket Rate Limiting: This ensures they don’t burn through GitHub or Jira API quotas.
- In-Memory Caching: A 60-second TTL prevents the backend from being hammered during catalog refreshes.
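The circuit-breaker policy (5 failures, 5-minute pause) can be sketched as follows; the numbers come from the talk, while the implementation details are assumptions:

```typescript
// Sketch of the circuit-breaker policy described in the talk: after 5
// consecutive failures, the connector is paused for 5 minutes so a flaky
// third-party API cannot cascade into the whole platform.

class CircuitBreaker {
  private failures = 0;
  private openUntil = 0;

  constructor(
    private readonly maxFailures = 5,
    private readonly pauseMs = 5 * 60_000,
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (Date.now() < this.openUntil) {
      throw new Error('circuit open: connector paused');
    }
    try {
      const result = await fn();
      this.failures = 0; // a success closes the circuit again
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.maxFailures) {
        this.openUntil = Date.now() + this.pauseMs; // pause the connector
        this.failures = 0;
      }
      throw err;
    }
  }
}
```

Each connector call is wrapped in `breaker.call(() => connector.enrich(entity))`, with the token-bucket rate limiter and the 60-second cache sitting in front of the same call path.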
The Power of Extension Points 🔌
The team emphasized designing for extension. Adding a new tool like Jira takes only 15 lines of code. You register the connector in a backend module, and you’re done. No coordination with the central platform team is required.
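The talk did not show the registration code, but the "15 lines" claim suggests something like the following sketch, where the platform exposes a registry as its extension point (all names hypothetical):

```typescript
// Hypothetical extension-point sketch: the platform owns a connector
// registry; a team's backend module just registers its connector with it.

interface Connector {
  name: string;
  appliesTo(kind: string): boolean;
  enrich(entityName: string): Promise<Record<string, string>>;
}

class ConnectorRegistry {
  private connectors: Connector[] = [];
  register(c: Connector): void {
    this.connectors.push(c);
  }
  matching(kind: string): Connector[] {
    return this.connectors.filter((c) => c.appliesTo(kind));
  }
}

// A team adds Jira support without touching the central platform:
const registry = new ConnectorRegistry();
registry.register({
  name: 'jira',
  appliesTo: (kind) => kind === 'Component',
  enrich: async () => ({ 'enrichment.io/jira-open-tickets': '4' }),
});
```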
🚀 Supercharging the Scaffolder
The enriched data also powers Scaffolder Actions. They built five custom actions to enforce quality:
- Deployment Gates: If a health score is less than 60, the system blocks new deployments to production.
- Staleness Checks: It warns users if they are deploying artifacts that haven’t been updated in over 6 hours.
- Force Refresh: A “manual trigger” that runs the enrichment pipeline immediately before a critical workflow. 🛠️
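The gate and staleness rules above are simple threshold checks at heart; a minimal sketch of the decision logic (thresholds from the talk, shapes assumed) might be:

```typescript
// Hypothetical decision logic behind the deployment-gate and staleness
// actions. The thresholds (60, 6 hours) come from the talk; the function
// shapes are illustrative assumptions.

interface GateResult {
  allowed: boolean;
  reason?: string;
}

function deploymentGate(healthScore: number, threshold = 60): GateResult {
  if (healthScore < threshold) {
    return {
      allowed: false,
      reason: `health score ${healthScore} is below the required ${threshold}`,
    };
  }
  return { allowed: true };
}

function isStale(builtAt: Date, now: Date = new Date()): boolean {
  const sixHoursMs = 6 * 60 * 60 * 1000; // 6-hour staleness window
  return now.getTime() - builtAt.getTime() > sixHoursMs;
}
```

In the real platform these checks would run inside custom Scaffolder actions, failing or warning the workflow before anything reaches production.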
🎓 Key Takeaways and Future Roadmap
The journey from a static to a dynamic catalog taught the team three vital lessons:
- The Catalog is a Platform, not a Database: It is meant to be extended, not just filled.
- Freshness is the Real Problem: Stale data is worse than no data.
- Automate the Data, Not the Humans: Philosophically, if you ask humans for manual toil, you have already lost. 💡
What’s next? The team plans to move toward proactive enrichment, where a low health score automatically triggers a Jira ticket. They are also working on Historical Trends to monitor long-term service health and eventually open-sourcing the entire system for the Backstage community. 🦾
❓ Q&A Highlights
Audience Question: How do you handle the initial “Day Zero” services when GitHub integration times out?
Debabrata Panigrahi: We moved away from standard processors to this enrichment platform precisely to handle those timeouts with circuit breakers and caching. By separating the “pulling” of data from the “registration” of the service, we ensure the UI remains responsive even when GitHub is struggling.
Audience Question: Is the AI layer mandatory?
Jenil Jain: Not at all. The AI layer is pluggable. You can start with the basic connectors and health scores, then add the LLM provider of your choice later to get those human-readable summaries and anomaly detections.
Speakers:
- Debabrata Panigrahi, Parseable Inc.
- Jenil Jain, Harness
Are you ready to make your Backstage catalog self-healing? Start by automating your data, not your developers! 👨‍💻✨