Brave Private Content Delivery Network

Dec 10, 2020 | Community, New Features, Security & Privacy

This post describes work done by Senior Product Security Engineer François Marier and Senior Devops Engineer Ben Kero. Many thanks to Senior Privacy Researcher Pete Snyder (@pes10k) and Tom Lowenthal for their help in the design of this system, and to Matteo Varvello for the thorough performance testing.

Brave is a company where privacy isn’t just a feature; it’s a requirement. This is perhaps most obvious in the Brave Browser, where we block trackers, prevent fingerprinting, and include a privacy-preserving, opt-in, user-first ad system, but Brave’s focus on privacy goes far beyond the browser.

For example, though Brave aggressively limits how frequently the browser connects back to Brave servers, there are nevertheless cases where the browser needs to fetch updates or assets from our servers. While we follow data minimization best practices, we aspire to go further. Promising to do the right thing is necessary but not sufficient: our goal is to make it impossible for Brave to harm user privacy.

For instance, when it comes to our users’ IP addresses, we currently filter them out of requests at the CDN level for most of our services so that we can’t accidentally include them in logs. For a new type of service, however, we wanted to take the next step: making sure that the content delivery network we use couldn’t be reconfigured to log IP addresses even if we wanted to.

Here is a description of the first phase of our work in this direction: a privacy-preserving content delivery network for Brave services.

Motivation

New Brave features prompted the need for stronger protection of client IP addresses. The most recent one was a new service called Brave Today, which provides a personalized news feed on the new tab page.

Serving the news feed itself can be done using a traditional content delivery network because we send the same feed to all users and decide locally in the browser which items to show. The part that required a private CDN was the use of the images contained in the feeds. If these images were downloaded on-demand as a user scrolled through the feed, anybody looking at the user’s image requests would be able to determine which news articles were displayed in the feed. This in turn would leak information about the local machine learning model used to personalize the feed, and indirectly about the browsing history that was used to train that model.

Design

The traditional approach for handling latency-sensitive services such as Brave Today is to serve and cache content on a content delivery network. Under such a design, however, the organization running the CDN (and terminating TLS encryption) would see both a user’s requests and IP address. Our use case requires that we keep these two elements separate. Therefore, we decided to improve on the traditional CDN approach by adding a TCP load balancer in front of the CDN.

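This is what the complete solution looks like: the browser connects to a TCP load balancer, which forwards the still-encrypted traffic to the CDN; the CDN terminates TLS and serves content from a private S3 bucket.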

Under this model, the load balancer vendor doesn’t see the contents of the requests or the responses because all they can see is encrypted TCP traffic on port 443, which can be decrypted only by the CDN. In addition, while the CDN vendor can see the contents of the HTTP traffic once it has been decrypted, they can’t see the user’s true IP address and instead get one of the load balancer’s IP addresses.

To ensure that all requests go through both the load balancer and the CDN, the CDN accepts connections only from IP addresses that belong to the load balancer. Likewise, the S3 bucket is private and requires an access key held by the CDN.
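The allowlist check described above can be sketched in a few lines. This is only an illustration of the idea, not the CDN vendor’s actual configuration: the `LOAD_BALANCER_RANGES` values are hypothetical placeholders taken from the reserved documentation address ranges, and `is_from_load_balancer` is a name invented for this example.

```python
import ipaddress

# Hypothetical load balancer egress ranges. These are reserved
# documentation addresses, not the real vendor's ranges.
LOAD_BALANCER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_from_load_balancer(peer_ip: str) -> bool:
    """Return True if the connecting peer belongs to an allowed range."""
    addr = ipaddress.ip_address(peer_ip)
    # Membership tests across IP versions simply evaluate to False.
    return any(addr in network for network in LOAD_BALANCER_RANGES)
```

A CDN edge applying a rule like this would refuse any connection for which `is_from_load_balancer` returns `False`, so that clients cannot bypass the load balancer and expose their IP address to the CDN.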

A crucial part of this design is the use of two different vendors/competitors in order to reduce the risk of these vendors colluding to deanonymize our users.

Additional privacy protections

While the TCP load balancer doesn’t have the ability to decrypt the traffic that transits through it on its way to the CDN, it can observe the size of the requests and responses, which could hint at what is being requested by a particular user. We have no reason to believe that our vendors would try to infer our users’ browsing history in this way, but we decided to implement an additional protection layer.

Padding

Ideally, all requests to this service should be the same size so that the size of a request cannot be used to guess the file being requested.

For example, a service which requests images like these:

  • /article/12/image/3927.png
  • /article/8/image/148.jpg

could be modified to pad the request IDs like this:

  • /article/0012/image/03927.png
  • /article/0008/image/00148.jpg
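As a minimal sketch of this kind of request padding, the helper below zero-pads both IDs to fixed widths. The widths (4 and 5 digits) are taken from the example paths above; the function name `padded_image_path` and the error handling are assumptions made for illustration.

```python
ARTICLE_WIDTH = 4  # digits, matching the padded example paths above
IMAGE_WIDTH = 5


def padded_image_path(article_id: int, image_id: int, ext: str) -> str:
    """Build an image path whose length does not depend on the IDs."""
    if not 0 <= article_id < 10 ** ARTICLE_WIDTH:
        raise ValueError("article_id does not fit in the padded width")
    if not 0 <= image_id < 10 ** IMAGE_WIDTH:
        raise ValueError("image_id does not fit in the padded width")
    return (
        f"/article/{article_id:0{ARTICLE_WIDTH}d}"
        f"/image/{image_id:0{IMAGE_WIDTH}d}.{ext}"
    )
```

Note that the file extension also contributes to the length: `png` and `jpg` are both three characters, but a mix of `jpg` and `jpeg` would reintroduce a size difference.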

On the response side, it may be too wasteful to pad all files to the same size, but we can at least aim to limit the number of different possible response sizes. This means that each application using this private CDN needs to figure out what the average response looks like and then pick a small number of standard sizes so that the responses will be uniformly distributed across all of these.

In order to make it easy for our applications to do response padding correctly, we created a very simple padding scheme along with a reference library and a common implementation in the browser.
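To make the idea concrete, here is one possible way to pad responses into a small set of standard sizes. It is a sketch under stated assumptions and should not be read as the padding scheme referred to above: the bucket sizes, the 4-byte big-endian length prefix, the zero-byte filler, and the names `pad` and `unpad` are all hypothetical choices made for this example.

```python
import struct

# Hypothetical standard response sizes, in bytes. A real application would
# pick these from the size distribution of its own files.
BUCKET_SIZES = [4096, 16384, 65536]
PREFIX_LEN = 4  # bytes used to store the original payload length


def pad(payload: bytes) -> bytes:
    """Prefix the payload with its length and fill up to the next bucket."""
    needed = PREFIX_LEN + len(payload)
    for size in BUCKET_SIZES:
        if needed <= size:
            filler = b"\x00" * (size - needed)
            return struct.pack(">I", len(payload)) + payload + filler
    raise ValueError("payload too large for the largest bucket")


def unpad(blob: bytes) -> bytes:
    """Recover the original payload from a padded blob."""
    if len(blob) < PREFIX_LEN:
        raise ValueError("blob too short to contain a length prefix")
    (length,) = struct.unpack(">I", blob[:PREFIX_LEN])
    if length > len(blob) - PREFIX_LEN:
        raise ValueError("length prefix exceeds blob size")
    return blob[PREFIX_LEN:PREFIX_LEN + length]
```

With this layout, an observer who only sees ciphertext sizes can tell which of the three buckets a response fell into, but not the exact size of the file inside it.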

Of course, no matter how accurate our size-padding is, it would all be in vain if we didn’t disable any kind of response compression (typically gzip or deflate) on the CDN side.
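The short experiment below illustrates why compression would undo the padding. It pads two stand-in payloads of very different sizes to the same length and then compresses them with zlib (the algorithm behind deflate and gzip). The payloads are pseudo-random bytes standing in for image data, and the padded size and helper names are assumptions made for the example.

```python
import random
import zlib

PADDED_SIZE = 4096  # hypothetical standard response size


def fake_image(num_bytes: int, seed: int) -> bytes:
    """Deterministic pseudo-random bytes standing in for image data."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(num_bytes))


def pad_to(payload: bytes, size: int = PADDED_SIZE) -> bytes:
    """Append zero bytes until the payload reaches the target size."""
    if len(payload) > size:
        raise ValueError("payload larger than the padded size")
    return payload + b"\x00" * (size - len(payload))


small = pad_to(fake_image(200, seed=1))
large = pad_to(fake_image(3000, seed=2))

# Uncompressed, both responses have the same size on the wire.
uncompressed_sizes = (len(small), len(large))

# Compressed, the zero filler shrinks to almost nothing, so the sizes
# once again track the size of the original payloads.
compressed_sizes = (len(zlib.compress(small)), len(zlib.compress(large)))
```

Because the filler is highly compressible, the compressed sizes differ substantially even though the padded sizes are identical, which is exactly the signal the padding was meant to hide.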

Omitting request headers

While the CDN never sees the user’s IP address, there are other HTTP headers that could be used to fingerprint the user to some degree depending on the uniqueness of the values in these headers.

This is why every application using our private CDN is asked to remove the following headers from their requests:

  • Accept-Language
  • Cookie
  • DNT
  • Referer
  • User-Agent
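A client-side filter for the headers listed above could look like the following sketch. The header list comes from this post; the function name and the plain-dictionary representation of headers are assumptions for illustration, and this is not the browser’s actual implementation.

```python
from typing import Dict, Mapping

# Headers from the list above, lowercased because HTTP header names
# are case-insensitive.
STRIPPED_HEADERS = frozenset(
    {"accept-language", "cookie", "dnt", "referer", "user-agent"}
)


def strip_identifying_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of the headers without the identifying ones."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in STRIPPED_HEADERS
    }
```

For example, passing `{"Accept": "image/*", "User-Agent": "example", "COOKIE": "a=b"}` returns only the `Accept` header.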

What about Brave?

So far we’ve covered how we can prevent our infrastructure vendors from knowing both the contents of a user’s request and the user’s IP address. However, one could ask: what about Brave, which has access to both vendor dashboards?

That’s a good question. In order to be able to correlate requests based on time, we would need to either:

  • have access to logs on both of these systems, or
  • attach additional information to the requests as they exit the TCP load balancer.

The first approach is easy to neutralize because the load balancer vendor configured our account to disable access to the logging facilities they offer.

In terms of attaching additional information to the requests, we aren’t able to add any HTTP headers because the load balancer doesn’t terminate TLS, but we could configure it to enable the proxy protocol in order to inject the original client IP address in all outgoing requests.

Fortunately, our current CDN provider doesn’t actually offer the ability to parse and use such incoming proxy information, but because this technical limitation could disappear in the future, we decided to add a contractual protection. As part of our enterprise agreement with the load balancer vendor, we requested that our service be subject to the following additional terms:

The Service will include [TCP load balancer], and Customer agrees to use [TCP load balancer] only with proxy protocol disabled. Customer understands that they are prohibited from accessing Client IPs and in connection with the Service, [Vendor] will not provide access to [logging facilities], even at Customer’s request.

Trust, but verify

Having a good design is important, but unless users can verify our claims independently, we are relying entirely on their trust. As much as we consider ourselves privileged to have earned our users’ trust, we aim to be as transparent as possible when it comes to privacy.

For a start, anybody can verify the claims that we make about client-side processing since the Brave browser is open source.

Secondly, users can verify that their browser is connecting to an IP address that belongs to the load balancer vendor by looking through the browser traffic using a local proxy such as mitmproxy, or by simply checking what IP address the pcdn.brave.com hostname resolves to.
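The DNS check can be scripted with Python’s standard library. This is a minimal sketch: the function name `resolved_addresses` and the injectable `resolver` parameter (useful for testing without network access) are choices made for this example.

```python
import socket
from typing import Callable, List


def resolved_addresses(
    hostname: str,
    port: int = 443,
    resolver: Callable = socket.getaddrinfo,
) -> List[str]:
    """Return the sorted, de-duplicated IP addresses a hostname resolves to."""
    infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr), and the
    # first element of sockaddr is the IP address as a string.
    return sorted({info[4][0] for info in infos})
```

Calling `resolved_addresses("pcdn.brave.com")` on a machine with network access returns the addresses to compare against the ranges the load balancer vendor publishes for its own network.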

Finally, to verify that the first vendor is forwarding requests to another CDN and that this other CDN is the one terminating TLS, you can compare the response headers you get on https://pcdn.brave.com/ to those on a site served directly by the first vendor, such as https://haveibeenpwned.com/.

If you see something, say something

The whole point of developing this new system is to enable services that enrich the experience Brave provides without compromising on the privacy properties our users expect. As we evolve these features and add new ones, we want you to be confident that we are doing the right things. So, as always, should you discover that any part of our system is not operating as intended, we encourage you to reach out via our security bug bounty program.
