Presenters

Source

Level Up Your Distributed Systems: Why Fly.io is Betting on Speed with corrosion and CRDTs! 🚀

Hey tech enthusiasts! Ever found yourself staring at a loading spinner, wishing your global application data could just… be there? You’re not alone! The world of distributed systems is constantly evolving, balancing the need for data consistency with the ever-present demand for blazing-fast performance.

Recently, Sini Pinchala, InfoQ’s lead editor for the AIML and data engineering community, sat down with Samtochi, a brilliant software engineer from Fly.io, to dive deep into these very challenges and unveil some exciting advancements. Their discussion revolved around Fly.io’s innovative open-source distributed system, corrosion, and its fascinating reliance on eventual consistency and Conflict-Free Replicated Data Types (CRDTs). Get ready to have your mind expanded! ✨


🚀 Chapter 1: The Need for Speed – Why Eventual Consistency Wins for Most Apps

Imagine a world where every piece of data in your global application has to be perfectly identical across every single server, all the time. Sounds great for banks, right? But for the vast majority of internet applications – think social media feeds, gaming leaderboards, or collaborative docs – this “strong consistency” comes at a steep price: latency.

Samtochi from Fly.io champions a different philosophy: for many internet applications, speed over strong consistency is the game-changer. He argues that demanding strong consistency across all nodes introduces significant latency, making a “relaxed” system far more efficient. Fly.io’s corrosion is built on this very principle, embracing eventual consistency to deliver lightning-fast application deployments and data access. While acknowledging the absolute necessity of strong consistency for mission-critical applications like banking, he asserts that a vast majority of internet applications function effectively with eventual consistency. It’s about choosing the right tool for the job! 💡


💡 Chapter 2: Navigating the Tradeoffs – Stale Data & Conflicts

Of course, embracing eventual consistency isn’t without its hurdles. The primary challenge involves achieving rapid data availability across geographically distributed nodes while mitigating the inherent risks of stale data and conflicts. The tradeoff? Accepting potential temporary data inconsistencies to gain significant speed advantages.

  • Impact of Stale Data: Imagine routing a request to a server that, unbeknownst to you, has already been deleted! Stale data can lead to suboptimal decisions, wasted resources, and frustrated users.
  • Mitigation for Stale Data: corrosion has clever strategies:
    • Edge servers reroute requests to active workers if a target machine is unavailable.
    • Requests are routed to the closest server possessing the most up-to-date state, allowing that server to respond with corrections if necessary.
  • Impact of Conflicts: What happens when multiple sources try to write to the same data concurrently? Conflicts!
  • Mitigation for Conflicts: corrosion tackles this head-on:
    • Data Structuring: Each server “owns” specific rows or partitions within a shared SQL table, minimizing actual write conflicts from the start.
    • CRDTs (Conflict-Free Replicated Data Types): This is where the magic truly happens! corrosion leverages these specialized data structures for robust, automatic conflict resolution.

🧬 Chapter 3: CRDTs to the Rescue – Conflict-Free Replication Explained

So, what exactly are CRDTs, and why are they so crucial for corrosion?

CRDTs are specialized data structures that enable independent replicas to accept writes and updates without direct communication. The truly remarkable part is that they guarantee convergence to the same state regardless of the order in which changes are received. This is a game-changer for eventually consistent systems!

Samtochi broke down the two main types:

  • State-based CRDTs: These exchange full states between replicas.
  • Operation-based CRDTs: These exchange only the operations (e.g., “add 1,” “insert ‘X’”).

He gave some fantastic examples:

  • A grow-only counter (think Instagram likes 👍) uses a simple merge function that takes the maximum value. If one server has 100 likes and another 105, the merged state is 105. No conflict, just growth!
  • Other CRDTs include grow-only sets, add-wins sets, remove-wins sets, and observed removals sets.
  • For tie-breaking in conflict resolution, last writer wins CRDTs track timestamps, sometimes requiring advanced mechanisms like Lamport’s clocks in distributed environments to ensure a consistent ordering across different nodes.
  • Think of Git as a real-world parallel for application-level conflict resolution. It merges non-conflicting changes automatically and defers complex conflicts to the user or application. CRDTs bring this kind of intelligent merging to your data structures!

However, it’s important to note the limitations of CRDTs. As Samtochi highlighted, they can introduce metadata overhead, are generally unsuitable for order-dependent data (like a strict transaction log), and in complex scenarios, still require application-specific conflict resolution logic. They’re powerful, but not a silver bullet for every data problem.


🛠️ Chapter 4: Under the Hood – corrosion’s Tech Stack & Replication Magic

corrosion is Fly.io’s open-source distributed system designed to replicate SQL data across a global cluster, built from the ground up with eventual consistency and CRDTs. It’s truly a marvel of modern distributed systems engineering!

  • Database Foundation: corrosion builds on top of the incredibly robust and lightweight SQLite, using it as its underlying database.

  • CRDT Implementation: Specifically, corrosion leverages the CR-SQL SQLite extension for its CRDT capabilities. CR-SQL creates “clock tables” (shadow tables) to store metadata for each column within a row, tracking information like the original node ID, update counts, and deletion/recreation counts. Its sophisticated merge logic prioritizes columns with more updates, then those with deletions/insertions, then tie-breaks on value, and finally on the node ID. This granular, column-level tracking is key to its effectiveness!

  • Technology Stack:

    • Language: Rust! This choice is no surprise, given Rust’s reputation for speed, memory safety, and concurrency. The project name, corrosion, is a playful nod to its Rust foundation!
    • Asynchronous Runtime: Tokio, a popular Rust async runtime, and its powerful utilities like channels, provide the backbone for concurrent operations.
    • Database Interaction: Rust SQLite handles seamless interaction with the underlying SQLite databases.
    • HTTP API: Hyper and Axum are used for exposing a robust HTTP API, allowing applications to easily read and write SQL data.
    • Networking Protocol: Instead of traditional TCP, corrosion opts for the modern QUIC protocol for its advantages in multiplexing, reduced latency, and better congestion control.
  • Data Replication Protocols: How does corrosion actually spread data so fast?

    • Gossip protocol: Changes are broadcast to local nodes first, then exponentially spread across the cluster. It’s like a whisper network for data!
    • Periodic Syncs: Nodes periodically contact each other to exchange states, identify missing versions using version numbers, and request “gaps” to ensure eventual consistency.
    • corrosion also intelligently favors more recent states during data processing and requests the most recent data first during sync protocols.

⚡ Chapter 5: Blazing Fast Replication – The Numbers Don’t Lie

So, how fast is “fast” with corrosion? The numbers are genuinely impressive:

  • For local nodes, corrosion achieves data replication in mere milliseconds. That’s almost instant!
  • For the global cluster, corrosion boasts a P99 (99th percentile) replication time of approximately 1 to 2 seconds. This means 99% of your data changes are replicated globally within a couple of seconds, a huge win for distributed applications!

This is a significant improvement over Fly.io’s previous system, which used Consul (a strongly consistent system employing the Raft consensus algorithm). While Raft is excellent for strong consistency, it couldn’t meet the speed requirements for Fly.io’s use cases, leading to the development of corrosion.


🌐 Chapter 6: Beyond corrosion – The Future of Local-First & Collaborative Tech

The discussion didn’t stop at corrosion itself. Sini and Samtochi explored broader implications, particularly the concept of “local-first software.” Samtochi connected this paradigm directly to CRDTs, explaining how they enable applications to be used offline with collaborative capabilities, seamlessly syncing data when network access becomes available. Imagine working on a document on a plane, and as soon as you land, it automatically merges with your colleagues’ changes!

He also expressed significant interest in CRDTs’ application in complex editing software, such as text editors, where granular tracking and application-level conflict resolution become crucial for a smooth, collaborative user experience.


Wrapping Up: A New Era for Distributed Systems 🎯

Fly.io’s corrosion is a testament to the power of thoughtful engineering and a clear understanding of application needs. By prioritizing speed and intelligently leveraging eventual consistency with CRDTs (specifically CR-SQL on SQLite), corrosion offers a compelling blueprint for building fast, resilient, and globally distributed internet applications.

This approach isn’t just about making things faster; it’s about enabling a whole new class of applications that can operate with incredible agility and responsiveness, pushing the boundaries of what’s possible in a connected world. If you’re building distributed systems, corrosion and the principles behind it offer a treasure trove of insights. Go check out the project and see how it might inspire your next big idea! 👨‍💻✨

Appendix