By Mihai Plesa, DevOps Manager at Brave
One might think that here at Brave we only deal with a few desktop and mobile applications, but there’s an ever-increasing number of backend services supporting our browser, private ads, and creator rewards ecosystem. These include browser and extension updates, lists for ad block and verified creators, background and Sponsored Images, ad catalogs, rewards transactions, private CDNs and proxies, infrastructure for Brave News and Brave Search, and many more. Let’s dig into some of the challenges we encountered when scaling the delivery of our browser, the principal vessel carrying features to our users.
As a hands-on DevOps manager at Brave, I lead a global team of highly motivated engineers who not only keep things running smoothly and build new infrastructure, but also take pride in automating our build and release processes.
When I first joined Brave in 2018, our browser was still based on Muon (our Electron fork), but we switched to Chromium shortly after. The first aim was to stop doing manual and fragile builds on developers’ machines. Manual builds had also proved to be a bottleneck, especially when trying to deliver on-demand updates on 3 channels: Release, Beta and Development (we had no Nightly channel at the time).
Within a few months we managed to code platform-specific pipelines in our continuous integration system and move away from ad-hoc, freestyle execution. This allowed us to automate most of the manual steps and make outcomes more predictable. One of the big challenges was getting code signing right, and renewing certificates while keeping them safe. We also moved from Azure to Amazon Web Services for Windows builds and from MacStadium to our own Apple machines for macOS builds.
Meanwhile, I came up with a proof of concept for a pull request builder that would produce and store desktop browser artifacts at every code change. Initially no tests were executed and the whole workflow took hours, but once the foundation for automatic builds was laid, it was time to take care of continuous integration and allow developers to get early feedback across multiple platforms.
Around March 2019, we added a Nightly channel, which would get built multiple times a day but be pushed to our users 5-7 times a week. This put more focus on getting faster builds, and we tried many things: various types of caching (Git, ccache, sccache, RAM disks), running tests in parallel, reusing workspaces, reducing logging verbosity and more. To make builds cheaper, we benchmarked various cloud instance types and tried builds in Docker. Reproducible builds and cross-compiling were also of interest (for example, building Windows binaries on Linux).
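To illustrate why compiler caches such as ccache and sccache help with repeated builds, here is a minimal sketch of the general idea: hash everything that influences the compiler output and reuse the stored object when the hash matches. It is not how either tool is implemented, and the names `cache_key`, `CompileCache` and `compile_fn` are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List


def cache_key(compiler_id: str, flags: List[str], preprocessed_source: bytes) -> str:
    """Hash the inputs that determine the compiler output (illustrative only)."""
    h = hashlib.sha256()
    for part in (compiler_id.encode(), "\0".join(flags).encode(), preprocessed_source):
        # Length-prefix each part so that different splits cannot collide.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


class CompileCache:
    """In-memory stand-in for a compiler cache; real tools persist to disk or a server."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def compile(self, compiler_id: str, flags: List[str], source: bytes,
                compile_fn: Callable[[bytes], bytes]) -> bytes:
        key = cache_key(compiler_id, flags, source)
        if key in self._store:
            self.hits += 1           # same inputs seen before: skip the compiler
            return self._store[key]
        self.misses += 1             # new inputs: run the (slow) compiler once
        obj = compile_fn(source)
        self._store[key] = obj
        return obj
```

Changing a flag or the compiler version changes the key, so a stale object is never reused; the price is a cache miss whenever any input differs.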
Our browser’s Android versions moved to the same code as our desktop platforms in March 2020, thus targeting 8 different architectures (all built in parallel). Note that Android was based on Chromium even before the desktop versions, but in a separate source control repository.
iOS followed suit with automation and is currently built as a framework that uses the core browser libraries. It might be fully built on top of Chromium at some point in the future.
December 2020 brought another challenge: adding ARM64 builds for the new Apple M1 CPUs. This required building for both architectures and creating universal binaries, doubling the number of macOS builds to be done.
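As a rough illustration of the merge step, Apple's `lipo` tool can combine single-architecture binaries into one universal binary. The sketch below only assembles the command line; the helper name and file paths are hypothetical, and this is not a description of our actual pipeline.

```python
from typing import List


def lipo_create_command(thin_binaries: List[str], universal_output: str) -> List[str]:
    """Build a `lipo -create` invocation that merges per-architecture binaries."""
    if len(thin_binaries) < 2:
        raise ValueError("a universal binary needs at least two architecture slices")
    return ["lipo", "-create", *thin_binaries, "-output", universal_output]


# Hypothetical paths: one slice per architecture, built separately.
cmd = lipo_create_command(
    ["out/x86_64/browser_binary", "out/arm64/browser_binary"],
    "out/universal/browser_binary",
)
```

On a macOS machine, the resulting list could be passed to `subprocess.run(cmd, check=True)`. Each slice still has to be compiled on its own, which is where the doubling of build work comes from.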
The big promise of reducing build times came when researching Google’s Goma and various remote execution backends for it. Google Cloud RBE was in alpha and had a waitlist, and other out-of-the-box solutions didn’t seem to exist. Commercial variants appeared later but weren’t necessarily tailored to building something as complex as a browser (especially if not using Bazel as a build system).
We were approached by EngFlow and started working together on a trial for speeding up our Android and Linux builds. That was easily achieved, and compilation time was reduced by a factor of 8 compared to developer machines or cloud VMs. This allowed us to finally have a pooled build cluster and a cache to share across developers, wherever they are in the world, even on a modest laptop or commodity hardware (which saw the biggest improvement). We’re currently integrating this for macOS, and Windows is next. For more details, please see EngFlow’s case study.
The need for individual developers to have recent and expensive hardware to do builds is going away (though there’s a fallback to local compilation if that proves to be faster or if the network is slow). It’s also good to have an elastic and scalable system in the cloud rather than individual physical nodes that are not shared for everyone’s benefit.
We’re currently doing hundreds of builds per week, and around 10 of them go to the public. Release (stable) channel builds go out on demand after the QA team finishes its testing. They happen quite often, after security fixes are available or when new features have to be shipped. In 2020 we had over 50 releases, though some weeks had none and others had 3. Every 3 weeks, we promote what’s on Nightly to the Development and Beta channels and from there to the Release channel.
A totally different challenge was scaling how we serve browser updates to our users. When I joined, we had an out-of-date Omaha server deployment from a now-defunct company named Crystalnix. Features were outdated, deployments were manual, and we only had one environment, production, which needed restarting once in a while, causing outages. After some cleanup, test fixing, and Docker automation, it was ready to have new features added.
We got in touch with Omaha Consulting and started a fruitful collaboration, initially working on security patches and reliability, and later on delta (differential) updates. This needed both client and server-side work, as we are using forks of Google’s Omaha client for Windows and the Sparkle client for macOS. We can now say that this has reduced our AWS bill while our users get updated faster and with less data transfer (some delta updates are even 100x smaller than the full installers).
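The server-side decision behind delta updates can be sketched roughly as follows: offer a delta when one exists for the exact version the client is running, and fall back to the full installer otherwise. This is a simplified illustration, not the Omaha or Sparkle protocol; the `Package` type, the `choose_package` helper, the version numbers and the sizes are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Package:
    to_version: str
    size_bytes: int
    from_version: Optional[str] = None  # None marks a full installer


def choose_package(current_version: str, latest_version: str,
                   packages: List[Package]) -> Optional[Package]:
    """Pick the smallest package that takes this client to the latest version."""
    if current_version == latest_version:
        return None  # already up to date, nothing to send
    candidates = [
        p for p in packages
        if p.to_version == latest_version
        and p.from_version in (None, current_version)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.size_bytes)


# Made-up sizes, chosen only to mirror the 100x ratio mentioned above.
packages = [
    Package(to_version="1.21.0", size_bytes=100_000_000),
    Package(to_version="1.21.0", size_bytes=1_000_000, from_version="1.20.0"),
]
recent_client = choose_package("1.20.0", "1.21.0", packages)  # gets the delta
older_client = choose_package("1.19.0", "1.21.0", packages)   # gets the full installer
```

A delta only applies cleanly on top of the exact version it was generated from, which is why clients on older versions still need the full installer.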
Our user base has grown by more than 10x over the past few years, recently passing 32 million monthly active users. Add the increasing number of platforms, channels, and architectures to the mix, and our build and release capacity has had to scale 100x. Now yours truly and the team (Harry, Linh and Wojciech) are working on the 1000x (join us at brave.com/careers).