Unlocking Deeper Insights: Supercharging Backstage with OpenTelemetry 🚀

Hey tech enthusiasts! 👋 After a much-needed lunch break, we’re diving back into the exciting world of developer platforms. Today, we’re joined by Ekansh Gupta and Shivay Lamba, who are here to illuminate how you can supercharge your Backstage instance with the power of OpenTelemetry. Get ready to explore how to go beyond the surface and truly understand what’s happening under the hood of your complex infrastructure!

Why Backstage? The Unified Developer Experience 💡

For those new to the scene, Backstage is a game-changer. It acts as a single pane of glass for all your infrastructure needs. In the realm of cloud-native applications, where you’re juggling multiple services and infrastructure layers daily, this unified dashboard is invaluable. It helps you untangle the web of microservices, infrastructure pipelines, and Terraform scripts, preventing confusion and boosting efficiency.

Backstage provides a beautiful UI where you can visualize:

  • Infrastructure Layers: Get a clear overview of your entire setup.
  • Services: Monitor the health and status of your applications.
  • Logs: Access crucial logging information in one place.
  • Metrics: Understand key performance indicators at a glance.

The platform has a backend built on Node.js and a rich plugin system, which allows integration with popular tools like Argo CD and GitHub Workflows. Your SRE and DevOps teams can use these plugins to build and interact with your infrastructure, and the best part? You can design your own custom plugins to tailor Backstage to your exact needs! This improves the developer experience and makes it easier for stakeholders to understand the overall health of your infrastructure without jumping between different tools.

The Limits of “Out-of-the-Box” Observability ⚠️

While Backstage is fantastic, when you’re managing a large and growing ecosystem of products, its native observability features can reach their limits. You get basic insights like:

  • Whether a service is running or has gone down.
  • Basic logs for service status.
  • High-level metrics like who ran a particular service.

These are great for a quick overview, but what happens when a service goes down? How do you connect that failure to a specific action in your infrastructure? What was the root cause? Backstage’s built-in telemetry doesn’t always provide these deeper metrics.

Furthermore, what about the observability of Backstage itself? Why did Backstage go down? What happened when you connected a new plugin? Why is your UI showing a 503 error? These critical questions often go unanswered with the default offerings. As you build more custom plugins, the challenge of gathering in-depth telemetry data only compounds.

Enter OpenTelemetry: The Observability Powerhouse 🌟

This is where OpenTelemetry shines! Often cited as the second most active project in the CNCF (just behind Kubernetes!), OpenTelemetry is a vendor-neutral standard for generating, collecting, and exporting telemetry, so every service reports in the same format regardless of which backend stores the data. It provides a standardized way to manage various telemetry data types, including:

  • Metrics: Quantifiable measurements of your system’s performance.
  • Traces: Visualizing the end-to-end journey of a request through your services.
  • Logs: Detailed records of events and errors.
  • Events: Capturing significant occurrences within your applications.
  • Profiling: Understanding resource utilization (CPU, GPU) to identify performance bottlenecks.
  • Exceptions: Tracking and analyzing errors that occur.

OpenTelemetry offers SDKs with a consistent API across multiple languages, making it easier to instrument your applications and gain deep insights into what’s happening behind the scenes. More recently, the project has been adding profiling as a new signal and exploring eBPF-based instrumentation, further expanding its capabilities.

The Synergy: Backstage + OpenTelemetry = Unbeatable Insights 🤝

When you combine Backstage’s incredible platform for managing and visualizing your infrastructure with OpenTelemetry’s comprehensive telemetry capabilities, you create a powerful synergy.

  • Backstage gives you the what: the high-level overview of applications, infrastructure, and deployments.
  • OpenTelemetry provides the why and how: the detailed traces, metrics, and logs that explain failures, performance issues, and the intricate workings of your system.

This powerful combination allows you to bake deep observability directly into your Backstage dashboard, offering a full-fledged overview of not just your platform, but also the intricate details happening beneath the surface across multiple layers.

Practical Implementation: Instrumenting Your Backstage Backend 🛠️

Integrating OpenTelemetry with your Backstage instance involves instrumenting your backend. This can be achieved by:

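  1. Adding the OpenTelemetry Node.js Auto-instrumentation Package: Installing @opentelemetry/auto-instrumentations-node in the backend lets common libraries (HTTP, Express, database clients) emit traces and metrics without code changes.
  2. Creating an instrumentation.js File: This file configures the OpenTelemetry SDK and specifies the trace and metric exporters. You can direct this data to a collector or to other backends like BindPlane.
  3. Updating package.json: Ensure the backend’s start script loads instrumentation.js before any other module, for example with Node’s --require flag, so the libraries are patched before they are first imported.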
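
The steps above can be sketched as a single file. This is a minimal sketch, not the presenters’ exact configuration. It assumes the packages @opentelemetry/sdk-node, @opentelemetry/auto-instrumentations-node, @opentelemetry/exporter-trace-otlp-http, @opentelemetry/exporter-metrics-otlp-http, and @opentelemetry/sdk-metrics are installed in the backend package, and that a collector accepts OTLP over HTTP on localhost:4318. The service name and both URLs are placeholders to adapt.

```javascript
// instrumentation.js (sketch; the service name and endpoint URLs are assumptions)
const { NodeSDK } = require('@opentelemetry/sdk-node');
const {
  getNodeAutoInstrumentations,
} = require('@opentelemetry/auto-instrumentations-node');
const {
  OTLPTraceExporter,
} = require('@opentelemetry/exporter-trace-otlp-http');
const {
  OTLPMetricExporter,
} = require('@opentelemetry/exporter-metrics-otlp-http');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');

const sdk = new NodeSDK({
  // Appears as the service.name resource attribute on every span and metric.
  serviceName: 'backstage-backend',
  // Send traces to a collector listening for OTLP over HTTP.
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',
  }),
  // Push metrics to the same collector at a regular interval.
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://localhost:4318/v1/metrics',
    }),
  }),
  // Patch supported libraries (http, express, and others) automatically.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

The file has to load before the rest of the backend. One way to do that is Node’s --require flag, for example node --require ./instrumentation.js followed by your usual start arguments; the exact start script depends on how your Backstage app is packaged.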

While Backstage provides some basic metrics (catalog count, scaffolder task counts), this is often insufficient for deep troubleshooting. By instrumenting with OpenTelemetry, you can see detailed traces for your Backstage backend, including resource attributes like the owner, runtime (Node.js), service name, and SDK version.

Demo Insights: Unraveling a Complex Scenario 🕵️‍♂️

The presenters walked through a compelling demo showcasing a typical Backstage setup with external plugins for Argo CD, GitHub Workflows, and SonarQube, along with custom plugins. The scenario highlighted a common challenge: a service failure that wasn’t immediately obvious.

The demo illustrated:

  • Setting up public URLs: Using tools like ngrok to make local services accessible.
  • Configuring Backstage: Updating configuration files and instrumentation to route telemetry data.
  • Triggering a workflow: Initiating a process through Backstage that involves multiple services.
  • Observing failures: Initially, the Argo CD overview showed an error.
  • Leveraging Grafana for Traces: Querying traces in Grafana based on service names and HTTP URLs to pinpoint the failure.
  • Identifying the Root Cause: Through trace analysis, the team discovered the plugin backend wasn’t running, leading to a server-side error.
  • Resolving the Issue: Starting the plugin backend and observing the immediate resolution in Backstage and Argo CD.
  • Proactive Alerting: The ability to create alerts based on the collected metrics and traces to notify teams of potential issues before they impact users.

A key takeaway from the demo is that, for external plugins, the quality of observability depends on whether the plugin author has added OpenTelemetry instrumentation. If not, you’ll mostly see the duration of the call into the plugin, not what happened inside it. This emphasizes the importance of instrumenting your own custom plugins as well.
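
As an illustration of what instrumenting a custom plugin can look like, here is a minimal sketch of a helper that wraps a piece of plugin work in a span. It was not shown in the talk. The helper only relies on the Tracer and Span methods from @opentelemetry/api (startActiveSpan, setAttribute, recordException, setStatus, end). The names withSpan and createRecordingTracer are hypothetical, and the in-memory tracer exists only so the sketch runs without the SDK installed.

```javascript
// Numeric value of SpanStatusCode.ERROR in @opentelemetry/api.
const STATUS_ERROR = 2;

// Runs `work` inside a span so the plugin's internals appear in the trace.
// `tracer` is anything with the OpenTelemetry Tracer shape.
// Handles both synchronous functions and functions that return a promise.
function withSpan(tracer, name, attributes, work) {
  return tracer.startActiveSpan(name, (span) => {
    for (const [key, value] of Object.entries(attributes)) {
      span.setAttribute(key, value);
    }
    const succeed = (result) => {
      span.end();
      return result;
    };
    const fail = (err) => {
      // Attach the error to the span, mark it failed, then rethrow.
      span.recordException(err);
      span.setStatus({ code: STATUS_ERROR, message: err.message });
      span.end();
      throw err;
    };
    let result;
    try {
      result = work();
    } catch (err) {
      return fail(err);
    }
    if (result && typeof result.then === 'function') {
      return result.then(succeed, fail);
    }
    return succeed(result);
  });
}

// Hypothetical in-memory tracer, used only to try the helper without the SDK.
function createRecordingTracer() {
  const finished = [];
  const tracer = {
    startActiveSpan(name, fn) {
      const record = { name, attributes: {}, status: { code: 0 }, exceptions: [], ended: false };
      const span = {
        setAttribute(key, value) { record.attributes[key] = value; return span; },
        setStatus(status) { record.status = status; return span; },
        recordException(err) { record.exceptions.push(err.message); },
        end() { record.ended = true; finished.push(record); },
      };
      return fn(span);
    },
  };
  return { tracer, finished };
}

// In a real plugin (assumes @opentelemetry/api is installed; fetchStatus is hypothetical):
//   const { trace } = require('@opentelemetry/api');
//   const tracer = trace.getTracer('my-custom-plugin');
//   await withSpan(tracer, 'my-plugin.fetch-status', { 'app.name': 'demo' }, () => fetchStatus());
```

With a wrapper like this around each meaningful operation, a failure inside the plugin shows up as its own span with an exception attached, instead of as an unexplained slow or failed call from Backstage.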

The Power of Correlation and Deeper Debugging ✨

The demo showed how OpenTelemetry lets you correlate multiple traces and spans, providing a clear picture of what’s happening across your distributed system. You can identify:

  • Which specific endpoint is failing (e.g., a /random route).
  • The duration of each operation.
  • Whether an endpoint was even hit.
  • The exact error messages and their context.

This level of detail is crucial for debugging complex systems. By analyzing traces in tools like Grafana, you can move from “something is wrong” to “this specific component failed at this precise moment due to this reason.”
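
To make that concrete, here is a small sketch of the kind of reasoning a trace viewer does for you. The span records are hand-written and simplified: the field names, routes, and service names are illustrative assumptions, not an actual Grafana or OTLP export. They mimic the demo, where the call to the plugin backend failed because nothing was listening.

```javascript
// Simplified span records (illustrative shape, not a real export format).
// Trace t1 is the failing request; trace t2 is the same request after the fix.
const spans = [
  { traceId: 't1', spanId: 'a', parentSpanId: null, service: 'backstage', route: '/api/demo/overview', startMs: 0, endMs: 140, error: 'HTTP 500' },
  { traceId: 't1', spanId: 'b', parentSpanId: 'a', service: 'backstage', route: '/random', startMs: 5, endMs: 135, error: 'connect ECONNREFUSED' },
  { traceId: 't2', spanId: 'c', parentSpanId: null, service: 'backstage', route: '/api/demo/overview', startMs: 0, endMs: 40, error: null },
  { traceId: 't2', spanId: 'd', parentSpanId: 'c', service: 'backstage', route: '/random', startMs: 5, endMs: 35, error: null },
  { traceId: 't2', spanId: 'e', parentSpanId: 'd', service: 'plugin-backend', route: '/random', startMs: 10, endMs: 30, error: null },
];

// Duration of a single operation.
function durationMs(span) {
  return span.endMs - span.startMs;
}

// Failed spans that have no failed child: the deepest point where things broke.
function rootCauseSpans(allSpans, traceId) {
  const failed = allSpans.filter((s) => s.traceId === traceId && s.error !== null);
  const parentsOfFailures = new Set(failed.map((s) => s.parentSpanId));
  return failed.filter((s) => !parentsOfFailures.has(s.spanId));
}

// Whether a given service ever recorded a span for a route within a trace.
function wasHit(allSpans, traceId, service, route) {
  return allSpans.some(
    (s) => s.traceId === traceId && s.service === service && s.route === route,
  );
}

// In t1 the outgoing /random call is the root cause, and plugin-backend
// recorded no span at all, which means the endpoint was never reached.
const failedCalls = rootCauseSpans(spans, 't1');
const reachedPlugin = wasHit(spans, 't1', 'plugin-backend', '/random');
console.log(failedCalls.map((s) => `${s.route}: ${s.error} (${durationMs(s)} ms)`), reachedPlugin);
```

The top-level span only reports a server error; following the parent and child links down to the failing call, and noticing that the downstream service never recorded a span, is what turns “something is wrong” into “the plugin backend was not running.”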

Conclusion: Empowering Your Developer Platform 🌐

In essence, while Backstage provides an exceptional framework for managing your developer infrastructure, OpenTelemetry unlocks the deep insights needed to truly understand and troubleshoot it. By integrating these two powerful technologies, you equip your teams with the ability to proactively identify and resolve issues, optimize performance, and build more resilient and efficient systems.

The ability to instrument your Backstage instance and its plugins with OpenTelemetry empowers you to gain invaluable visibility, making your developer platform not just a portal for information, but a robust tool for deep operational intelligence.

Thank you, Ekansh and Shivay, for this insightful session! If you have further questions, feel free to connect with them. Happy instrumenting! 👨‍💻✨
