Dropbox runs hundreds of services, written in different languages, which exchange millions of requests per second. At the core of our Service Oriented Architecture is Courier, our gRPC-based Remote Procedure Call (RPC) framework. While developing Courier, we learned a lot about extending gRPC, optimizing performance for scale, and providing a bridge from our legacy RPC system.
Note: this post shows code generation examples in Python and Go. We also support Rust and Java.
The road to gRPC
Courier is not Dropbox’s first RPC framework. Even before we started to break our Python monolith into services in earnest,
We introduced Cape in a previous post. In a nutshell, Cape is a framework for enabling real-time asynchronous event processing at large scale with strong guarantees. It has been over a year since the system was launched. Today Cape is a critical component of Dropbox infrastructure, operating with both high performance and reliability at very large scale. Here are a few key metrics. Cape is:
- running on thousands of servers across the continent
- subscribing to over 30 different event domains at a rate of 30K/s
- processing jobs of various sizes at a rate of 150K/s
- delivering 95% of events under 1 second after they are created.
Dropbox invests heavily in our security program. We have lots of teams dedicated to securing Dropbox, each working on exciting things. Some recent examples covered on our tech blog include:
- Our Product Security team rolled out support for WebAuthn to boost user adoption of two-step verification and upleveled our industry-leading public bug bounty program
- Because security is everyone’s responsibility, our Security Culture team helps our employees make consistently secure and informed decisions that protect Dropbox, our users, and our employees
- Our Detection and Response Team (DART) implemented extensive instrumentation throughout our infrastructure to catch any indications of compromise.
Dropbox stores petabytes of metadata to support user-facing features and to power our production infrastructure. The primary system we use to store this metadata is named Edgestore and is described in a previous blog post, (Re)Introducing Edgestore. In simple terms, Edgestore is a service and abstraction over thousands of MySQL nodes that provides users with strongly consistent, transactional reads and writes at low latency.
Edgestore hides details of physical sharding from the application layer to allow developers to scale out their metadata storage needs without thinking about complexities of data placement and distribution.
One of the greatest challenges associated with maintaining a complex desktop application like Dropbox is that with hundreds of millions of installs, even the smallest bugs can end up affecting a very large number of users. Bugs will inevitably strike, and while most of them allow the application to recover, some cause the application to terminate. These terminations, or “crashes,” are highly disruptive events: when Dropbox stops, synchronization stops. To ensure uninterrupted sync for our users, we automatically detect and report all crashes and take steps to restart our application when they occur.
In 2016, faced with our impending transition to Python 3,
What is the JS Guild?
The JS Guild is a grassroots initiative at Dropbox to improve our frontend engineering by fostering community, culture, and code quality. The group strives to teach frontend best practices to generalists and to help strong frontend engineers leverage and grow their domain knowledge.