Privacy-Preserving Product Analytics (P3A)
At Brave, we want our browser not only to provide the best protection against the surveillance economy, but to be the very best way to experience the web. We rely extensively on community feedback to make sure that the product provides the most vital features and is as reliable as it can possibly be. Sometimes, however, this simply is not enough to make sure we are providing the best experience to as many users as possible. Many people simply don’t have time to provide feedback, and there are many questions left unanswered. Do people make it through onboarding, or do we need to make it shorter? Are people using Brave Rewards? Are people using sync and if so, on how many devices? How many people still need to download important browser updates?
In an ordinary software company, these questions would be answered by using one of dozens of third-party analytics services. But the way such services operate means that Brave users could be individually identified and tracked by a third party, and in some cases their behavior would be aggregated with other tracked behavior from the ad/tracker ecosystem for the benefit of that third party alone. None of this would be remotely acceptable to Brave given our commitment to user privacy.
We believe that completely private product analytics are the most effective way for us to make Brave the best it can be — by providing us with insights into how the various features of the product are actually being used, so we can shape the product to better match the needs of our users. As always, our code is open source and available for third-party audits and verification.
Privacy is our first value. We really, genuinely don’t want to know anything about you individually, or to know anything that could be used to track you. That means that we need to approach product analytics very differently from most other companies. We’ve built a completely private system which we’re calling Privacy-Preserving Product Analytics, or P3A for short. This project goes well beyond industry norms and GDPR requirements when it comes to privacy preservation. Here are the mechanics:
- P3A doesn’t collect any personal information. Nothing that could identify you, and nothing sensitive like your browser history, search queries, etc.
- Every so often, in the background, the browser sends reports containing simple, non-identifying information on product feature usage. These are essentially automatically-delivered answers to specific questions defined by Brave.
- All the “questions” we ask of the browser (the measurements collected) will be posted publicly in human-readable form. You can find the current list here.
- You can turn P3A off at any time in the “Privacy and Security” section of the browser preferences.
- All the P3A code will be open source (as is all our code except anti-fraud server-side code) — you can always check that your browser is only sharing the specific things we promise.
How P3A Works
Our work on P3A is split into two initial phases. In the first phase, we will use a simple protocol that sends a single answer to a single question, one at a time. In the second phase, we will follow up with a more complex protocol that incorporates technologies such as oblivious shuffling and secure enclaves to support more complex questions while retaining the strict privacy goals of phase 1. Our objective is to keep it impossible for us to associate any particular data with any particular user, no matter how much analysis we perform on the data collected.
The first phase of P3A will collect “answers” to a set of 18 specific multiple-choice questions. These answers provide straightforward usage metrics, such as how many tabs people have open, or what fraction of people have turned on Brave Rewards. For example:
Question: Number of open tabs
At some randomized time after you open your browser during the week, the browser counts the number of open tabs and picks the corresponding answer from a predefined list. This multiple-choice answer style is the first privacy safeguard. None of the questions have exact, detailed answers: only a small number of predetermined options are available. This helps ensure that no device ever has a unique or distinctive answer to any question.
Roughly once an hour, the encoder prepares to send out that one answer, which looks a little like “Question: 7, Answer: 3”. The exact send time is obscured somewhat by adding a random delay of 0-5 minutes. The answer is combined with information about the version of Brave it comes from, which consists of:
- Distribution channel (nightly/dev/beta/release)
- Week the browser was installed (only sent within 90 days of installation)
- Country (removed for countries with fewer than 6000 installs per week)
- Referral code which indicates (broadly) what category of link brought you to the Brave website when you downloaded Brave. This is only sent within 90 days of installation, and only for referrers which we’re sure are big enough not to have a privacy impact. You can find a detailed description of our referral codes here.
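The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Brave's actual encoder: the bucket boundaries, field names, function names, and the `large_referrers` allow-list are all hypothetical, and "within 90 days" is read here as "at most 90 days since installation". Only the 0-5 minute delay, the 90-day window, and the 6000-installs-per-week country threshold come from the description above.

```python
import bisect
import random

# Hypothetical bucket upper bounds for "Number of open tabs".
# They define five answers: 0-1, 2-5, 6-10, 11-50, 51+ (indices 0..4).
TAB_BUCKET_UPPER_BOUNDS = [1, 5, 10, 50]

MAX_DELAY_SECONDS = 5 * 60          # random delay of 0-5 minutes
INSTALL_INFO_WINDOW_DAYS = 90       # install week and referral code are time-limited
MIN_COUNTRY_WEEKLY_INSTALLS = 6000  # smaller countries are removed


def bucket_answer(value, upper_bounds):
    """Map an exact count to the index of its multiple-choice answer."""
    return bisect.bisect_left(upper_bounds, value)


def build_report(question_id, answer, channel, days_since_install,
                 install_week, country, country_weekly_installs,
                 referral_code, large_referrers):
    """Combine one answer with the coarse version information."""
    report = {"question": question_id, "answer": answer, "channel": channel}
    if country_weekly_installs >= MIN_COUNTRY_WEEKLY_INSTALLS:
        report["country"] = country
    if days_since_install <= INSTALL_INFO_WINDOW_DAYS:
        report["install_week"] = install_week
        # Only referrers assumed large enough to be non-identifying are sent.
        if referral_code in large_referrers:
            report["referral_code"] = referral_code
    return report


def send_delay_seconds(rng=random):
    """Pick the random delay that obscures the exact send time."""
    return rng.uniform(0, MAX_DELAY_SECONDS)


# Example: 11 open tabs on a release build installed 120 days ago.
answer = bucket_answer(11, TAB_BUCKET_UPPER_BOUNDS)
report = build_report(7, answer, "release", 120, "2019-W31",
                      "FR", 25000, "ABC123", {"ABC123"})
print(report)
```

In this example the exact tab count never leaves the function that buckets it, and the install week and referral code are dropped because the hypothetical install is older than 90 days.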
This combined information — the answer and the version information — is finally sent to Brave’s content delivery network (CDN), operated by Fastly. When an answer reaches the edge of the Fastly CDN, it’s stripped of the IP address and precise timing information.
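As a rough illustration of the stripping step, the sketch below drops the client address and keeps only a coarse receive date. It is not Fastly configuration or Brave's code: the request shape, the `strip_at_edge` name, and the choice of day-level granularity are assumptions made for this example, since the text only says that the IP address and precise timing information are removed.

```python
from datetime import datetime, timezone


def strip_at_edge(request):
    """Forward only the payload plus a coarse (day-level) receive date."""
    received_at = request["received_at"]
    return {
        # Assumption: "precise timing" is coarsened to the UTC calendar day.
        "received_date": received_at.astimezone(timezone.utc).date().isoformat(),
        "payload": dict(request["payload"]),
    }


incoming = {
    "client_ip": "203.0.113.7",  # documentation-range address, dropped below
    "received_at": datetime(2019, 8, 14, 9, 41, 27, tzinfo=timezone.utc),
    "payload": {"question": 7, "answer": 3, "channel": "release"},
}
forwarded = strip_at_edge(incoming)
print(forwarded)
```

Building the forwarded record from an explicit allow-list of fields, rather than deleting fields from the incoming request, means anything not named is dropped by default.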
This system is designed so that we, Brave, are unable to associate any particular response with any other, so we do not have sufficient information to link together any particular user’s “answers”. Instead, each response is an independent data point.
Phase One provides a substantial baseline for protecting privacy while getting useful insight into how people use Brave. It’s a straightforward design in which each answer is a standalone data point. We can count how many people completed onboarding, and we can count how many people imported their bookmarks, but we don’t know how many of the people who completed onboarding also imported bookmarks. Phase Two is all about letting us answer these sorts of combined questions while ensuring that we still don’t have the ability to identify any particular user.
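To make the Phase One limitation concrete, here is a small sketch with made-up reports. The question names and the `tally` helper are hypothetical. Because each report carries no identifier shared with any other report, per-question totals are easy to compute, but there is nothing to join on to learn how many of the same people gave both answers.

```python
from collections import Counter

# Made-up standalone reports; note the absence of any user or session ID.
reports = [
    {"question": "onboarding_completed", "answer": "yes"},
    {"question": "bookmarks_imported", "answer": "yes"},
    {"question": "onboarding_completed", "answer": "yes"},
    {"question": "bookmarks_imported", "answer": "no"},
]


def tally(reports):
    """Count how often each (question, answer) pair was reported."""
    return Counter((r["question"], r["answer"]) for r in reports)


counts = tally(reports)
print(counts[("onboarding_completed", "yes")])
print(counts[("bookmarks_imported", "yes")])
```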
We’ve developed a protocol based on the PROCHLO design. This approach involves sending combined answers to an Intel SGX secure enclave. Using a secure enclave means that even we don’t have the ability to see what those raw answers are. The software running on that enclave combines individual answers into batches, and filters out any batches which are too small and therefore have the potential to be distinctive or unique. Because of the limitations of secure enclaves, this batching is based on oblivious shuffling — a cryptographic technique which ensures that other software running on the same machine (i.e., us) can’t work out which inputs correspond to which outputs.
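The batching and filtering idea can be sketched as follows. This is a plain-Python illustration only: it does not run in an enclave, the ordinary shuffle used here is a stand-in and is not an oblivious shuffle, and the `min_batch_size` threshold and report format are hypothetical rather than taken from Brave's protocol.

```python
import random
from collections import Counter


def shuffle_and_threshold(reports, min_batch_size):
    """Shuffle combined answers, batch identical ones, drop small batches."""
    shuffled = list(reports)
    # Stand-in for oblivious shuffling: breaks the link to arrival order only.
    random.SystemRandom().shuffle(shuffled)
    batches = Counter(shuffled)
    # Batches below the threshold could be distinctive, so they are discarded.
    return {combo: n for combo, n in batches.items() if n >= min_batch_size}


# Each combined answer is a tuple so that identical answers batch together.
combined = (
    [("onboarding:yes", "import:yes")] * 3
    + [("onboarding:yes", "import:no")] * 1
)
released = shuffle_and_threshold(combined, min_batch_size=2)
print(released)
```

With the hypothetical threshold of 2, only the combination reported three times is released, and the single distinctive report is filtered out.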
This work is well underway and we expect to have it ready for previews within the next few months. It’s a more complex approach than Phase One, though, and we want to make sure we’ve got things right. When we get closer to release, we’ll publish a deep dive into the technical details, and of course all the source code will be available on GitHub.
These dates are tentative and subject to change in the development process:
- August 2019 – Phase 1 of the P3A implementation is released into the nightly channel on the desktop browser. As always, we are eager to receive user feedback from our early adopters as we refine and improve P3A.
- September 2019 – Phase 1 of P3A enters beta. Phase 2 is merged down to nightly.
- October 2019 – Phase 1 of P3A enters general release. Phase 2 enters beta.
- October – November 2019 – Phase 2 of P3A enters general release.
- After November 2019 – As part of continual product improvement, we will gradually collect new measurements. All such measurements will be continually available for review here. GitHub issues for these measurements will be given the label “feature/new_metric” and will be available for public comment.
Log-level records are automatically deleted from our servers within 7 days. Note that these log-level records will not contain IP addresses or exact timing information. Our completely anonymous summaries of the data are intended to be kept indefinitely.
Most of the software you use includes some sort of product analytics, or usage data collection, as does every major browser. And for good reason — knowing which features are resonating and which need work is an important part of making software that’s a pleasure to use. We’ve been cautious about building analytics because we knew we had to get it exactly right. Some other browsers collect thousands of measurements along with a substantial amount of information about what you’ve searched for and which sites you visited. None of the commercial analytics products we’ve seen come anywhere close to our privacy standards. Building this ourselves took a lot longer than using an existing system, but we think that’s time well spent. We hope you agree.