Pyroscope 2.0: Unleashing the Power of Continuous Profiling for Your Applications!
Hey tech enthusiasts! We’re thrilled to dive deep into Pyroscope 2.0, the latest major release that’s set to revolutionize how you understand and optimize your applications. Yesterday marked the official launch, and even a quick 2.0.1 fix has already landed, showing the team’s dedication!
Join us, Christian Simon and Alberto Soto from the Pyroscope team, as we explore the “why” behind continuous profiling, dissect the lessons learned from v1, and unveil the exciting new features and architecture of v2. Get ready for a journey that promises to make your applications faster, more efficient, and significantly easier to manage!
What Exactly is Profiling, and Why Go Continuous?
Profiling, at its core, is about gathering insights into how your code runs and the resources it consumes, all the way down to the line number. Traditionally, profiling happened in development environments, often with significant overhead that could impact performance. That is still useful, but we all know how real-world production data can throw curveballs that our tests and benchmarks never predict.
This is where continuous profiling shines! It operates in production, minimizing overhead to avoid impacting your application’s cost or latency. By continuously observing your system, even at a low sampling rate, it captures those crucial, often-missed events. This becomes incredibly powerful during incidents. Imagine jumping back in time to see precisely how your application’s CPU usage or memory allocations shifted, pinpointing the exact moment a problem began.
While metrics and logs provide vital alerts and high-level overviews, and traces help identify services in distress, profiling takes you deeper. It allows you to zoom into a specific service, down to the line of code, and understand exactly how resource consumption changed. This can confirm your suspicions, accelerate debugging, and get your systems back online faster.
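To make the "jump back in time" idea concrete, here is a minimal Python sketch that compares two aggregated CPU profiles, one from before an incident and one from during it, and ranks functions by how much their share of CPU samples grew. The folded-stack representation (semicolon-separated frames mapped to sample counts), the function names, and the sample numbers are all assumptions made for illustration; this is not Pyroscope's API or storage format.

```python
from collections import Counter


def self_samples(folded):
    """Sum samples per leaf frame, i.e. the function that was actually on-CPU."""
    totals = Counter()
    for stack, count in folded.items():
        leaf = stack.split(";")[-1]
        totals[leaf] += count
    return totals


def biggest_changes(before, after, top=3):
    """Rank functions by the increase in their share of all CPU samples."""
    b, a = self_samples(before), self_samples(after)
    b_total = sum(b.values()) or 1
    a_total = sum(a.values()) or 1
    deltas = {fn: a[fn] / a_total - b[fn] / b_total for fn in set(a) | set(b)}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Hypothetical profiles: stack -> number of CPU samples.
before = {
    "main;handle;parse_json": 40,
    "main;handle;query_db": 50,
    "main;gc": 10,
}
during = {
    "main;handle;parse_json": 45,
    "main;handle;query_db": 50,
    "main;handle;compress": 95,
    "main;gc": 10,
}

for fn, delta in biggest_changes(before, during):
    print(f"{fn}: {delta:+.1%} of CPU samples")
```

In this made-up data, `compress` did not appear at all before the incident and accounts for almost half of the samples during it, which is exactly the kind of shift a diff between two time ranges is meant to surface.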
Key Use Cases for Continuous Profiling
Continuous profiling isn’t just for emergencies; it’s a proactive tool for optimization:
- Reducing Latency: In today’s microservice-driven world, latency between services adds up. While tracing gives you wall-clock times, profiling reveals what you can optimize. It points out areas for caching and identifies bottlenecks in your hot paths, directly contributing to a snappier user experience.
- Cost Optimization & Resource Efficiency: Is your service slow? The first instinct might be to scale up. But over time, this can lead to ballooning costs. Profiling helps you identify inefficient resource usage: those low-hanging fruit opportunities to optimize. It can even highlight changes over time, like outdated constants that are no longer suitable for current data formats, allowing you to adapt and save.
Lessons Learned: The Challenges of Pyroscope v1
Our journey to Pyroscope 2.0 was paved with valuable lessons from v1. The v1 architecture, largely based on the Cortex architecture, presented several challenges:
- Ingesters in the Middle: In v1, ingesters sat at the heart of both the write and read paths. This created a bottleneck: heavy query traffic could easily overwhelm ingesters, impacting write performance and availability.
- Scaling Pains & Cost: Scaling ingesters to handle both write and read traffic was complex and expensive. They needed to be large enough for memory and potentially disk usage, which left them over-provisioned 99% of the time just to handle the 1% of demanding queries. This lack of elasticity meant scaling up or down could take hours, not minutes.
- Write Amplification & Data Redundancy: Every profile was replicated three times (replication factor of three). Furthermore, the sharding strategy, based on a simple series hash, meant that profiles with similar labels but different values could end up in different ingesters. This led to significant duplication of symbolic data (stack traces, function names) across all ingesters, drastically increasing storage and read amplification.
- Operational Complexity: The lack of elasticity and the intricate dependencies made operating and scaling Pyroscope v1 a significant challenge.
Enter Pyroscope 2.0: A Reimagined Architecture!
With v1’s lessons in mind, we embarked on a major refactoring to build Pyroscope 2.0, focusing on increasing availability, decoupling read and write paths, and optimizing storage and cost.
The core of the v2 architecture revolves around a fundamental shift: storing all profiling data in object storage buckets. This decision liberates us from the constraints of ingesters holding data in memory and on disk.
Here’s how it works:
- Write Path:
  - Profiles arrive at distributors.
  - Distributors map profiles to segment writers (the new write-path-only components).
  - Segment writers upload the profile data to object storage.
  - Crucially, a new component called the meta store registers the fact that the data has reached the object store.
  - Only after the data is safely in object storage and registered in the meta store do we acknowledge the upload to the client. This ensures data durability.
- Read Path:
  - The query front end interacts with the meta store to determine where the relevant data resides in the object store.
  - Query backends are then spun up, on-demand and stateless, to fetch and process the data from the object store. These can be scaled rapidly to match query load.
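The two paths can be sketched as a small in-memory model. This is only an illustration of the ordering described above (upload, register, then acknowledge on the write side; meta store lookup, then fetch on the read side). Every class, method, and key layout here is hypothetical and does not mirror Pyroscope's actual code, and the modulo-based mapping from profiles to segment writers is a simplification we chose for brevity.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class ObjectStore:
    """Stand-in for a cloud bucket: object key -> bytes."""
    objects: dict = field(default_factory=dict)

    def put(self, key, data):
        self.objects[key] = data

    def get(self, key):
        return self.objects[key]


@dataclass
class MetaStore:
    """Records which objects hold data for which service."""
    index: dict = field(default_factory=dict)

    def register(self, service, key):
        self.index.setdefault(service, []).append(key)

    def lookup(self, service):
        return list(self.index.get(service, []))


class SegmentWriter:
    """Write-path-only component: upload, register, and only then acknowledge."""

    def __init__(self, name, store, meta):
        self.name, self.store, self.meta, self.seq = name, store, meta, 0

    def write(self, service, profile):
        self.seq += 1
        key = f"segments/{self.name}/{service}/{self.seq:06d}"
        self.store.put(key, profile)      # 1. data reaches object storage
        self.meta.register(service, key)  # 2. meta store learns where it is
        return True                       # 3. ack to the client


class Distributor:
    """Maps each incoming profile to a segment writer (simplified: hash modulo)."""

    def __init__(self, writers):
        self.writers = writers

    def push(self, service, profile):
        idx = zlib.crc32(service.encode()) % len(self.writers)
        return self.writers[idx].write(service, profile)


def query(service, meta, store):
    """Query front end: ask the meta store for locations, then let a stateless
    backend (here just a list comprehension) read the objects."""
    keys = meta.lookup(service)
    return [store.get(key) for key in keys]


store, meta = ObjectStore(), MetaStore()
writers = [SegmentWriter(f"writer-{i}", store, meta) for i in range(2)]
distributor = Distributor(writers)

acked = distributor.push("checkout", b"profile-1")
distributor.push("checkout", b"profile-2")
results = query("checkout", meta, store)
print(acked, results)
```

Note that the read side never touches a segment writer: it only needs the meta store and the bucket, which is what lets the query backends stay stateless.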
This new architecture brings several key advantages:
- Stateless Read Components: Query backends are stateless and can be spun up in seconds, allowing us to react quickly to changing query patterns without the overhead of attached disks.
- Decoupled Paths: The read and write paths are now independent, with the meta store acting as the primary dependency. This significantly improves availability.
- Object Storage Benefits: We leverage the scalability, low cost, and high redundancy of cloud object storage, offloading the burden of scaling and maintenance to our cloud provider.
- Optimized Sharding: Instead of hashing individual series as in v1, we now shard by service. This ensures that data for a single service is ideally stored only once, dramatically reducing symbol data duplication. For mega-scale services, we can still shard further to manage query performance, but this is now a conditional optimization, not a default.
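To see why the sharding key matters for symbol duplication, here is a small Python simulation. It assumes a simplified placement model (a CRC32 hash modulo the number of ingesters, with replicas on the following slots) rather than the token ring a Cortex-style system actually uses, and the ingester count, label names, and pod names are made up. The point is only the contrast: when every label contributes to the hash, one service's series scatter across many ingesters, each of which needs its own copy of that service's symbols; when only the service name decides placement, they land in one place.

```python
import zlib

INGESTERS = 12          # hypothetical cluster size
REPLICATION_FACTOR = 3  # v1 wrote every profile three times


def stable_hash(text):
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return zlib.crc32(text.encode())


def v1_placement(labels):
    """Series-hash sharding: the full label set picks the first ingester,
    and the replicas go to the next ones."""
    key = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    first = stable_hash(key) % INGESTERS
    return {(first + i) % INGESTERS for i in range(REPLICATION_FACTOR)}


def v2_placement(labels, shards=INGESTERS):
    """Service sharding: only the service name decides where data goes."""
    return {stable_hash(labels["service"]) % shards}


# Fifty pods of the same service: same code, same symbols, different labels.
series = [{"service": "checkout", "pod": f"checkout-{i}"} for i in range(50)]

v1_holders = set().union(*(v1_placement(s) for s in series))
v2_holders = set().union(*(v2_placement(s) for s in series))

print(f"v1: symbols for 'checkout' end up on {len(v1_holders)} ingesters")
print(f"v2: symbols for 'checkout' end up on {len(v2_holders)} shard(s)")
```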
The Impact: Storage Reduction and Cost Savings
The architectural changes in v2 lead to remarkable storage reductions:
- Sample Data: In v1, sample data was stored three times. In v2, it’s stored only once, a reduction of about two-thirds (roughly 67%).
- Symbol Data: This is where the biggest gains are seen. In v1, symbol data was replicated across many ingesters. With v2’s service-based sharding, we’ve seen reductions of around 95% in our production rollout! In theory, symbol data could shrink to as little as 2% of its v1 size, but a 95% reduction is what we generally observe.
This storage reduction directly translates to lower costs and less data to read during queries, further improving performance.
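The arithmetic behind these figures is simple enough to check in a few lines of Python. The helper names are ours, and the "implied copies" calculation is just a reading of the numbers above, not a measurement: it answers the question of how many times the same data would have to be stored for a given reduction to result from keeping a single copy.

```python
def reduction(copies_before, copies_after=1):
    """Fraction of storage saved when going from N copies to M copies."""
    return 1 - copies_after / copies_before


def implied_copies(observed_reduction):
    """How many copies existed before, if one copy remains after the reduction."""
    return 1 / (1 - observed_reduction)


sample_saving = reduction(3)  # replication factor three -> one copy
print(f"sample data saved: {sample_saving:.1%}")

print(f"a 95% reduction implies about {implied_copies(0.95):.0f} copies in v1")
print(f"shrinking to 2% of the original implies about {implied_copies(0.98):.0f} copies")
```

Read this way, a 95% reduction means each service's symbols were effectively stored about twenty times in v1, which is consistent with series from the same service being scattered across many replicated ingesters.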
Exciting New Features in Pyroscope 2.0
Pyroscope 2.0 isn’t just a re-architecture; it’s packed with new capabilities, many of which are already in use at Grafana Cloud:
- Recording Rules: Process profiling data and export it as metrics. This is incredibly useful for targeting specific function names, especially when developing libraries.
- Profile Exemplars: Allow deeper inspection of individual profiles (shown in the demo).
- Heatmap Panel: Visualizes request breakdown based on CPU usage, highlighting high-traffic spans.
- Profiles to Trace Flow: Seamlessly jump from profiling data to trace information, providing a comprehensive view of request behavior.
Live Demo: Seeing Pyroscope 2.0 in Action!
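As a rough mental model for recording rules, the Python sketch below turns a profile into a single metric sample by summing the samples of every stack that passes through a target function. The `Rule` type, the metric name, the folded-stack input, and the output shape are all hypothetical; the actual rule format and export mechanism in Pyroscope are not described here, so treat this purely as an illustration of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric_name: str    # name of the exported metric (hypothetical)
    function_name: str  # function whose cost we want to track


def evaluate(rule, folded_profile):
    """Sum the samples of all stacks that contain the target function."""
    total = sum(
        count
        for stack, count in folded_profile.items()
        if rule.function_name in stack.split(";")
    )
    return {
        "metric": rule.metric_name,
        "labels": {"function": rule.function_name},
        "value": total,
    }


# Hypothetical CPU profile: stack -> sample count.
profile = {
    "main;handle;json.Marshal": 30,
    "main;handle;db.Query": 50,
    "main;flush;json.Marshal;reflect.Value": 20,
}

rule = Rule("profiles_cpu_samples_json_marshal", "json.Marshal")
sample = evaluate(rule, profile)
print(sample)
```

A library author could track a metric like this over time to see how much CPU their function costs across every service that uses it, without asking each team to open a flame graph.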
Let’s take a look at what makes Pyroscope 2.0 so powerful:
- Profile Drilldown App: We start by looking at our own deployment’s query backend. The flame graph provides an aggregated view of CPU usage over the last three hours, breaking it down by code components.
- Individual Profile Inspection: A game-changer in v2 is the ability to inspect particular profiles. We can now isolate a single pod (e.g., pod set DMM MSS) and see its specific profile, making it much easier to diagnose individual performance issues without extensive zooming and filtering. This is a real time-saver!
- Heatmap Panel Showcase: The heatmap panel visualizes span requests based on CPU usage. Red areas indicate a high number of requests within a specific CPU time bucket (e.g., 10 ms to 500 ms). We can identify top requests and then drill down into their flame graphs to understand where CPU time is spent.
- Profiles to Traces Integration: We can now pull data from Tempo (our tracing backend) based on profiling information. This provides crucial context about where a request originates and how it traverses the system. For instance, 10 seconds spent on the CPU during a request might translate to a 17-second total wait time for the query, highlighting a significant optimization opportunity.
Why You Should Embrace Continuous Profiling with Pyroscope 2.0!
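For readers wondering what sits behind a heatmap like the one in the demo, the following Python sketch counts spans per cell, where a cell is a time window combined with a CPU-time bucket. The bucket boundaries, the one-minute window, and the span data are assumptions chosen for illustration, and the real panel may bucket differently. A "red" cell is simply one with a high count.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bucket boundaries for CPU time per span, in milliseconds.
BOUNDS_MS = [10, 50, 100, 500, 1000]


def bucket_label(cpu_ms):
    """Return the half-open bucket [low, high) that contains cpu_ms."""
    i = bisect_right(BOUNDS_MS, cpu_ms)
    low = 0 if i == 0 else BOUNDS_MS[i - 1]
    if i == len(BOUNDS_MS):
        return f"{low}ms+"
    return f"{low}-{BOUNDS_MS[i]}ms"


def heatmap(spans, window_s=60):
    """Count spans per (window start, CPU-time bucket) cell."""
    cells = Counter()
    for start_s, cpu_ms in spans:
        window = int(start_s // window_s) * window_s
        cells[(window, bucket_label(cpu_ms))] += 1
    return cells


# Hypothetical spans: (start time in seconds, CPU time in milliseconds).
spans = [(5, 12), (20, 250), (42, 300), (70, 8), (95, 480), (110, 2000)]

cells = heatmap(spans)
for (window, bucket), count in sorted(cells.items()):
    print(f"t={window:>3}s  {bucket:<10} {count} span(s)")
```

From a hot cell, the next step in the demo is to pick one of its spans and open the flame graph for that request.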
So, should you adopt continuous profiling? And if you’re running Pyroscope, should you upgrade to v2? Absolutely!
- For New Adopters: Start with Pyroscope 2.0 immediately. It empowers you to:
  - Reduce Downtime: Quickly find root causes and resolve issues.
  - Improve Latency: Optimize your services for better performance.
  - Enhance Resource Efficiency: Reduce costs by using resources more effectively.
- For Existing Users: Upgrade to Pyroscope 2.0. It’s:
  - Cheaper: Significant cost savings due to architectural improvements.
  - Easier to Run: Simplified operations and increased elasticity.
  - More Robust: A more stable and scalable platform.
All the features demonstrated today are available in the open-source project! If you prefer to avoid the operational overhead, consider Grafana Cloud, which offers a free tier and a fully managed Pyroscope experience.
We’re incredibly excited about the future of continuous profiling with Pyroscope 2.0 and can’t wait to see how you leverage it to build better, faster, and more cost-effective applications!
Additional Resources:
- Join our live community call tomorrow morning at 10:30 AM.
- Visit us at the Ask The Expert booth right after the call.