Unlocking CI/CD Secrets: Deep Dive into Observability with OpenTelemetry and Argo 🚀
Ever feel like your CI/CD pipelines are operating in a black box? You push code, a workflow spins up, and then… poof… it’s either deployed successfully or it fails, leaving you scratching your head about why. If this sounds familiar, then get ready, because we’re about to pull back the curtain! This session dives deep into how to achieve end-to-end observability for your CI/CD processes, specifically focusing on the powerful duo of Argo CD and Argo Workflows, all powered by the magic of OpenTelemetry. ✨
The core challenge? Gaining deeper insights into your deployment and workflow pipelines. We’re talking about understanding exactly where errors creep in, pinpointing those sneaky latency bottlenecks, and ultimately, ensuring your entire system is humming along perfectly.
The Observability Backbone: OpenTelemetry 🌐
At the heart of this transformation is OpenTelemetry. If you’re not yet familiar, this open-source, vendor-neutral framework is rapidly becoming the de facto standard for unifying observability signals. Think of it as a universal translator for your system’s health: it takes metrics, events, logs, and traces and carries them in a single, coherent format, the OpenTelemetry Protocol (OTLP). Its meteoric rise within the CNCF ecosystem, where it is second only to Kubernetes itself in project activity, speaks volumes about its importance and widespread adoption. OpenTelemetry isn’t just about theory; it provides SDKs for these signals across many languages and has even introduced profiling instrumentation to give you even more granular insights.
Bringing Argo CD and Argo Workflows into Focus 🎯
Now, let’s see how this powerful observability tool plays with our favorite Argo tools:
Observing Argo CD 🛠️
Argo CD is a treasure trove of metrics, with its application controller, application set controller, API server, repo server, and commit server all exposing valuable data. However, getting this data into the OpenTelemetry Protocol (OTLP) format isn’t always a direct path. The presenters highlighted a clever approach:
- A Prometheus collector acts as the initial scraper, pulling metrics from specific Argo CD endpoints (ports 8082, 8083, and 8084 for application controller, API server, and repo server metrics respectively).
- This scraped data is then forwarded to an OpenTelemetry collector, which standardizes it for your observability backend.
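To make the scraping step concrete, here is a minimal sketch of what a Prometheus receiver section for the collector might look like, built as a Python dictionary and printed as JSON for inspection. It assumes the conventional Argo CD metrics ports (8082, 8083, 8084); the service names, the argocd namespace, and the 30s interval are illustrative assumptions, not values taken from the talk.

```python
import json

# Assumed Argo CD metrics endpoints. Service names and namespace are
# illustrative; adjust them to match your own installation.
ARGOCD_TARGETS = {
    "argocd-application-controller": "argocd-metrics.argocd.svc:8082",
    "argocd-server": "argocd-server-metrics.argocd.svc:8083",
    "argocd-repo-server": "argocd-repo-server.argocd.svc:8084",
}


def build_scrape_config(targets, scrape_interval="30s"):
    """Build a collector config fragment with one scrape job per target."""
    scrape_configs = [
        {
            "job_name": job,
            "scrape_interval": scrape_interval,  # assumed value
            "metrics_path": "/metrics",
            "static_configs": [{"targets": [address]}],
        }
        for job, address in targets.items()
    ]
    return {
        "receivers": {
            "prometheus": {"config": {"scrape_configs": scrape_configs}}
        }
    }


config = build_scrape_config(ARGOCD_TARGETS)
print(json.dumps(config, indent=2))
```

The structure mirrors the nesting a collector configuration file would use, so it can be translated to YAML by hand once the targets are confirmed.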
Observing Argo Workflows ⚙️
Argo Workflows offers a more streamlined experience for metrics, with data flowing directly in OTLP format. However, tracing requires a bit more finesse:
- Since Argo Workflows doesn’t natively send traces via an OpenTelemetry agent, the OTEL CLI steps in as our hero! 🦸
- This handy tool lets you wrap the commands in your bash scripts and send the resulting traces, in OTLP format, to your OpenTelemetry collector.
- The result? You can visualize the execution of your Directed Acyclic Graphs (DAGs) as a beautiful, intuitive waterfall model, just like you’d see with traditional trace visualization.
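The piece that makes the waterfall possible is trace context propagation: every step of a DAG must carry the same trace ID while getting its own span ID. The sketch below illustrates that idea with the W3C traceparent format (version, trace ID, span ID, flags). It is a standalone illustration, not how the OTEL CLI is implemented, and the helper names are hypothetical.

```python
import os
import re

# W3C traceparent: version "00", 32 hex trace ID, 16 hex span ID, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_root_traceparent():
    """Create a traceparent for a root span (hypothetical helper)."""
    trace_id = os.urandom(16).hex()
    span_id = os.urandom(8).hex()
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled


def child_traceparent(parent):
    """Derive a child context: same trace ID, fresh span ID."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{os.urandom(8).hex()}-{flags}"


# One root span for the workflow, one child span per DAG step.
root = new_root_traceparent()
step_a = child_traceparent(root)
step_b = child_traceparent(root)
print(root)
print(step_a)
print(step_b)
```

In a workflow, the root value would typically be handed to each step through an environment variable or parameter so that every step reports into the same trace.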
The Grand Vision: Deeper Pipeline Insights 💡
The ultimate goals of this observability push are clear and impactful:
- Comprehensive Tracing Support: Implementing robust tracing for both Argo CD and Argo Workflows.
- DORA Metrics for CI/CD: Generating critical metrics that truly reflect the performance and health of your CI/CD pipelines.
- Unveiling Pipeline Secrets: Gaining a profound understanding of how your deployment and workflow pipelines perform, swiftly identifying the root causes of errors, pinpointing exactly where latency is introduced, and confidently confirming successful operations.
Architecture & Implementation: The OpenTelemetry Collector in Action 🏗️
The proposed architecture elegantly centers around the OpenTelemetry Collector, acting as the central nervous system for your observability data:
- Argo Workflows (both its controller and individual pods) diligently send their traces and metrics to this collector.
- Argo CD contributes its metrics to the same central point.
- The OpenTelemetry Collector then acts as a smart router, forwarding this rich data to your chosen trace store and metric store.
- For visualization and analysis, platforms like SigNoz shine, bringing your data to life.
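The routing role described above can be sketched as a collector configuration with OTLP receivers on the default gRPC and HTTP ports (4317 and 4318) and one pipeline per signal. The sketch builds the structure in Python and prints it as JSON; the backend endpoint is a placeholder, and the batch processor and single OTLP exporter are assumptions rather than details confirmed by the presenters.

```python
import json


def build_collector_config(backend_endpoint):
    """Build a minimal collector config: OTLP in, OTLP out, two pipelines."""
    pipeline = {
        "receivers": ["otlp"],
        "processors": ["batch"],
        "exporters": ["otlp"],
    }
    return {
        "receivers": {
            "otlp": {
                "protocols": {
                    "grpc": {"endpoint": "0.0.0.0:4317"},
                    "http": {"endpoint": "0.0.0.0:4318"},
                }
            }
        },
        "processors": {"batch": {}},
        # Placeholder endpoint: replace with your trace and metric store.
        "exporters": {"otlp": {"endpoint": backend_endpoint}},
        "service": {
            "pipelines": {
                "traces": dict(pipeline),
                "metrics": dict(pipeline),
            }
        },
    }


collector_config = build_collector_config("backend.example.com:4317")
print(json.dumps(collector_config, indent=2))
```

Keeping traces and metrics in separate pipelines leaves room to send each signal to a different store later without touching the receivers.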
The implementation details are just as fascinating:
- For Argo Workflows tracing, the OTEL CLI is your go-to, utilizing commands like otel cli span tp-print for root spans, otel cli span start/stop for child spans with essential trace ID propagation, and a simple export command to send it all off.
- When it comes to metrics, a Prometheus scraper is configured with a specific scrape interval and metric endpoint, diligently collecting data from those designated Argo CD ports.
A Glimpse into the Future: Demo and Visualization 📊
The presentation offered a compelling live demo, showcasing the power of this setup:
- An OpenTelemetry Collector was deployed within a Kubernetes cluster (Minikube), configured with receivers for gRPC (port 4317) and HTTP (port 4318), all neatly pointing to SigNoz for data ingestion.
- The magic happened when the workflow DAG execution was visualized as a stunning waterfall model in SigNoz. Each span was adorned with valuable attributes like namespace and service name, providing instant context.
- Key metrics such as “workers busy count,” “workflow condition,” and “queue depth” were displayed on the SigNoz dashboard. It was particularly insightful to see the queue depth drop to zero when no workflows were active, and to observe the queue latency in action.
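For readers who have not seen a trace waterfall before, the sketch below renders one in plain text: each span becomes a bar whose offset and length reflect when it started and how long it ran. The span names and timings are invented for illustration and do not come from the demo.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    start_ms: float
    end_ms: float
    depth: int = 0  # nesting level: 0 for the workflow, 1 for its steps


def render_waterfall(spans, width=40):
    """Return one text line per span, scaled to `width` characters."""
    t0 = min(s.start_ms for s in spans)
    t1 = max(s.end_ms for s in spans)
    total = max(t1 - t0, 1e-9)  # avoid dividing by zero for instant traces
    label_width = max(len("  " * s.depth + s.name) for s in spans)
    lines = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        offset = int((s.start_ms - t0) / total * width)
        length = max(1, int((s.end_ms - s.start_ms) / total * width))
        label = ("  " * s.depth + s.name).ljust(label_width)
        lines.append(f"{label} |{' ' * offset}{'#' * length}")
    return lines


# Invented example: a workflow with three sequential DAG steps.
demo_spans = [
    Span("workflow", 0, 400),
    Span("step-a", 0, 100, depth=1),
    Span("step-b", 100, 300, depth=1),
    Span("step-c", 300, 400, depth=1),
]
waterfall = render_waterfall(demo_spans)
print("\n".join(waterfall))
```

Each step's bar begins where the previous one ends, which is the visual cue that makes sequential and parallel DAG branches easy to tell apart.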
Extending Observability to Your CI Pipelines 💻
The vision doesn’t stop at Argo. The presentation extended this powerful observability to your CI pipelines, with a specific demonstration using GitLab CI. The goal here is crystal clear: to expose metrics from your CI pipelines, allowing you to proactively identify performance bottlenecks or pinpoint the exact moment of failure.
A simplified architecture illustrated how data from GitLab CI, alongside your Argo Workflow data, flows seamlessly through the OpenTelemetry collector to your observability backend. While the direct link between a specific CI commit and a workflow trace might not be immediately obvious in the visualization, the presenters emphasized that attributes are the key to populating this crucial information and connecting the dots.
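To illustrate how attributes can connect the dots, here is a small sketch that groups spans from different systems by a shared commit attribute. The attribute key ci.commit.sha and the span dictionaries are hypothetical; a real setup would use whatever key your pipelines agree on.

```python
from collections import defaultdict

# Hypothetical attribute key carried by both CI and workflow spans.
COMMIT_KEY = "ci.commit.sha"


def group_by_commit(spans):
    """Map each commit SHA to the names of spans that carry it."""
    grouped = defaultdict(list)
    for span in spans:
        sha = span.get("attributes", {}).get(COMMIT_KEY)
        if sha is not None:
            grouped[sha].append(span["name"])
    return dict(grouped)


# Invented spans: two share a commit, one has no commit attribute.
example_spans = [
    {"name": "gitlab-ci: build", "attributes": {COMMIT_KEY: "abc123"}},
    {"name": "argo-workflow: deploy", "attributes": {COMMIT_KEY: "abc123"}},
    {"name": "argo-workflow: nightly", "attributes": {}},
]
linked = group_by_commit(example_spans)
print(linked)
```

Spans without the attribute simply fall out of the grouping, which is why setting it consistently at both the CI and workflow stages matters.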
Challenges and Tradeoffs: Navigating the Nuances 🤔
As with any powerful technology, there are a few challenges and tradeoffs to be aware of:
- Argo CD Metric Extraction Complexity: As mentioned, Argo CD’s metrics aren’t natively in OTLP format. This necessitates the Prometheus collector intermediary, which, while effective, adds a layer of complexity to the setup.
- Argo Workflows Tracing Implementation: Injecting trace data for Argo Workflows requires explicit use of the OTEL CLI. This might involve making minor adjustments to your workflow definitions or scripts, but the payoff in visibility is immense.
- Instantaneous Execution Impact: In demos, extremely fast-executing workflows or DAGs can sometimes result in spans appearing at the exact same millisecond (0.00 ms). While this shows efficiency, it can make flame graph visualizations less informative due to the lack of temporal separation. The presenters acknowledged this and suggested that introducing small, artificial delays can significantly improve visualization clarity in such scenarios.
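The artificial-delay suggestion can be sketched as a wrapper that pads very fast steps up to a minimum duration before the span is closed. This is an illustrative approach with hypothetical names, suited to demos rather than production pipelines, where padding would distort real latency data.

```python
import time


def timed_step(name, work, min_duration_s=0.0):
    """Run `work`, padding with sleep so the span lasts at least min_duration_s."""
    start = time.perf_counter()
    work()
    elapsed = time.perf_counter() - start
    if elapsed < min_duration_s:
        # Artificial delay: only here to make the span visible in a demo.
        time.sleep(min_duration_s - elapsed)
    end = time.perf_counter()
    return {"name": name, "duration_ms": (end - start) * 1000.0}


fast = timed_step("noop", lambda: None)
padded = timed_step("noop-padded", lambda: None, min_duration_s=0.05)
print(f"{fast['name']}: {fast['duration_ms']:.2f} ms")
print(f"{padded['name']}: {padded['duration_ms']:.2f} ms")
```

The unpadded step finishes in a tiny fraction of a millisecond, while the padded one occupies a visible slice of the timeline.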
The Future is Observable! 🚀
This segment powerfully advocates for adopting OpenTelemetry to gain comprehensive visibility across your entire CI/CD lifecycle – from the initial code commit all the way through to the successful execution of your workflows. By embracing these tools, you’re not just reacting to problems; you’re empowering yourself to proactively detect and resolve issues and continuously optimize performance.
The presenters also highlighted an exciting ongoing effort within the OpenTelemetry community (a dedicated SIG) to standardize semantic conventions specifically for CI/CD, with a keen focus on Argo CD. They strongly encourage everyone to participate in this initiative and help shape the future of CI/CD observability!
Want to see this in action? The team invited attendees to visit the SigNoz solution showcase booth (1372) to meet the experts and dive deeper into their offerings.
In conclusion, the message is clear: the era of black-box CI/CD is over. With OpenTelemetry, Argo CD, and Argo Workflows, you have the power to illuminate every step, understand every process, and build more resilient, efficient, and observable software delivery pipelines.