Cookiecrumbler: enhancing online privacy by automating cookie notice detection

By the Brave Privacy Team

This is the 33rd post in an ongoing series describing new privacy features in Brave. This post describes work done by Anton Lazarev (Sr. Research Engineer) and Moritz Schafhuber (Staff DevOps Engineer), and was written by Shivan Kaul Sahib (VP, Privacy and Security).

We’re excited to announce two major updates in Brave’s ongoing work to improve cookie consent notice blocking. First, we’ve open sourced Cookiecrumbler, a tool that automatically detects cookie consent notices on websites by using open-source LLMs and even suggests fixes for them. Second, we’re now publishing GitHub issues with the results of our Cookiecrumbler crawls across popular websites, inviting the broader adblocking community to help refine and triage these findings.

Brave blocks cookie consent notices by default. These banners are both annoying and harmful to user privacy (given how websites often implement them). Worse, as we mentioned in our blog post announcing cookie consent notice blocking, researchers have found that many consent systems still track people, even when users reject all cookies. These notices are especially useless in Brave, since we already block third-party tracking scripts and pixels (which protects you, for instance, when your health plan provider accidentally shares your data with Google Ads).

Blocking cookie consent notices offers clear benefits for privacy and cuts down on annoyances, but it also carries risks. Overly broad or incorrect blocking can break essential website functionality, causing anything from broken checkout flows to layout problems. We’ve encountered many issues (broken scrolling, blank pages) when a cookie consent notice block is applied indiscriminately. To reduce their maintenance burden, adblock list maintainers tend to rely on as few rules as possible for the widest coverage. However, cookie consent notice implementations can vary significantly from one site to another, making Web-wide generic rules a frequent source of breakage. In response, we’ve spent the past few months cleaning up community-maintained cookie consent notice blocking lists, removing generic rules that repeatedly cause problems.

Enter Cookiecrumbler

Cookiecrumbler lets us scale site-specific cookie consent notice blocking across the Web. It uses open-source LLMs to automate detection, which lets us identify cookie consent notices across all sorts of site-specific variations, including non-English notices. We chose an LLM-based approach to automate the task of identifying notices for a few reasons, among them:

  1. It’s easy to perform (as cookie consent notices tend to use similar text)
  2. It’s repetitive (so automating it helps with the sheer scale of the Web)
  3. It’s low-risk (human reviewers can remove false positives upon visual inspection)
  4. It’s cheap, as every crawl costs on the order of cents (while we experimented with bigger and more expensive models, smaller models also offered great results after some tweaking)

This kind of LLM-based automation frees up human reviewers to do the more nuanced work of actually blocking the notice. As mentioned before, poorly applied blocking rules can break core site elements. That’s why we retained human review as part of the process: filter list maintainers and community members help confirm and tweak Cookiecrumbler suggestions before they go live. We’re also working on improving the quality of blocking rule suggestions offered by Cookiecrumbler.

Cookiecrumbler also supports different locations and languages. Cookie consent notices can look very different depending on a visitor’s geographic region or language settings, so Cookiecrumbler can load a website from various geographical vantage points and see it as a user in that region would.

Cookie notice in Thai

Cookiecrumbler runs completely on Brave’s backend, and uses publicly available lists of top websites for detection purposes. We’re exploring how we can build Cookiecrumbler into the browser to bring smart cookie notice detection closer to the user. But, like everything we do at Brave, this will only happen after a full privacy review, so that we can ensure Cookiecrumbler works with user privacy and choice in mind.

How Cookiecrumbler works

Website list creation

First, Brave creates a custom version of the Tranco list that’s tailored to different regions (since cookie consent notices are often region-specific). A prioritized list of websites per region helps us focus on the most popular websites for users in multiple countries.
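
As a rough illustration of per-region prioritization, the sketch below takes Tranco-style `rank,domain` rows and keeps the highest-ranked domains for each region. The `REGION_TLDS` mapping, the `regional_lists` function, and the country-code TLD heuristic are all assumptions made for this example; this post doesn’t describe how Brave actually tailors its lists.

```python
import csv
import io

# Hypothetical mapping from region code to country-code TLDs. This is an
# assumption for illustration, not Brave's actual tailoring method.
REGION_TLDS = {"de": (".de",), "th": (".th",), "fr": (".fr",)}


def regional_lists(tranco_csv: str, top_n: int = 100) -> dict[str, list[str]]:
    """Return up to top_n domains per region, ordered by global rank."""
    rows = [(int(r[0]), r[1]) for r in csv.reader(io.StringIO(tranco_csv)) if r]
    rows.sort()  # lowest rank number first, i.e. most popular first
    result: dict[str, list[str]] = {region: [] for region in REGION_TLDS}
    for _, domain in rows:
        for region, tlds in REGION_TLDS.items():
            # str.endswith accepts a tuple of suffixes
            if domain.endswith(tlds) and len(result[region]) < top_n:
                result[region].append(domain)
    return result
```

A real pipeline would likely need a better popularity signal per region than the TLD alone, since many sites popular in a country live on generic domains such as `.com`.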

Automated crawling

Next, a crawling script running on our CI servers takes these lists of websites and calls the Cookiecrumbler API for each website and region.
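
The crawl loop could look something like the sketch below. The `crawl` function, the `check_site` callable, and the shape of the returned dictionaries are hypothetical; the real Cookiecrumbler API is not described in this post, so the network call is left to the caller.

```python
from typing import Callable, Iterable


def crawl(
    sites_by_region: dict[str, Iterable[str]],
    check_site: Callable[[str, str], dict],
) -> list[dict]:
    """Call check_site(domain, region) for every pair and collect the results.

    check_site stands in for one request to the detection API (an assumption).
    """
    results = []
    for region, domains in sites_by_region.items():
        for domain in domains:
            try:
                response = check_site(domain, region)
            except Exception as exc:
                # One failing site should not stop the whole crawl.
                response = {"error": str(exc)}
            results.append({"domain": domain, "region": region, **response})
    return results
```

Passing the API call in as a function keeps the loop easy to test with a stub, and makes it straightforward to add retries or rate limiting around the real call later.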

Cookiecrumbler

Once the API call is made, the Cookiecrumbler process starts:

  1. The Cookiecrumbler API receives the request from the crawling script running in CI, and uses Puppeteer to launch a headless browser that loads the website. We use proxies to load the website from the desired region.
  2. We identify and gather candidate HTML elements from the page that might be cookie consent notices.
  3. These elements get passed to the LLM, which identifies whether they’re cookie consent notices. It also suggests a fix, if applicable.
  4. Cookiecrumbler sends this response back to the crawling script.
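
Steps 2 and 3 above can be sketched roughly as follows. This is a simplified illustration, not Cookiecrumbler’s implementation: it parses static HTML with Python’s standard `html.parser` instead of a headless browser, the `HINTS` keywords and `CandidateCollector` class are assumed heuristics, and `classify` is a placeholder for the LLM call.

```python
from html.parser import HTMLParser
from typing import Callable, Optional

HINTS = ("cookie", "consent", "gdpr")  # assumed keyword heuristics
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}  # tags with no end tag


class CandidateCollector(HTMLParser):
    """Collect the text of elements whose id or class mentions a hint word."""

    def __init__(self) -> None:
        super().__init__()
        self.candidates: list = []
        self._depth = 0  # nesting depth inside the current candidate
        self._current: Optional[dict] = None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1  # nested element inside a candidate
            return
        attr_map = dict(attrs)
        label = f"{attr_map.get('id') or ''} {attr_map.get('class') or ''}".lower()
        if any(hint in label for hint in HINTS):
            self._current = {"tag": tag, "id": attr_map.get("id"), "text": []}
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:  # the candidate element itself just closed
            text = " ".join(self._current["text"])
            self.candidates.append({**self._current, "text": text})
            self._current = None

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self._current["text"].append(data.strip())


def detect_notices(html: str, classify: Callable[[str], bool]) -> list:
    """Return the candidates that the classifier (an LLM stand-in) accepts."""
    collector = CandidateCollector()
    collector.feed(html)
    collector.close()
    return [c for c in collector.candidates if classify(c["text"])]
```

Narrowing the page down to a few candidate elements before calling the model keeps the prompt small, which fits the post’s point that each crawl costs on the order of cents.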

Publishing crawl results

After a crawl is done, we take the list of websites for which Cookiecrumbler identified a cookie notice (along with the region for that cookie notice), and publish this list to a new GitHub repository.
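
One way to turn crawl results into issue drafts is sketched below. The `build_issues` function, the `has_notice` field, and the issue title and body wording are assumptions for illustration; the post doesn’t specify the format of the published issues, and actually filing them would go through whatever GitHub tooling the pipeline uses.

```python
from collections import defaultdict


def build_issues(results: list) -> list:
    """Group detections by domain and return one issue draft per domain."""
    regions_by_domain = defaultdict(set)
    for item in results:
        if item.get("has_notice"):  # hypothetical field name
            regions_by_domain[item["domain"]].add(item["region"])
    issues = []
    for domain in sorted(regions_by_domain):
        regions = ", ".join(sorted(regions_by_domain[domain]))
        issues.append({
            "title": f"Cookie notice detected on {domain}",
            "body": f"Regions where the notice was seen: {regions}",
        })
    return issues
```

Grouping by domain keeps one issue per website even when the same notice shows up from several vantage points.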

Cookie notice GitHub Issues

Privacy without site breakage

Brave already offers robust and privacy-first cookie consent notice blocking by default. Cookiecrumbler extends that with an automated, large-scale process to detect new and changing banners. In combination with our work on removing brittle generic rules, we’re able to use Cookiecrumbler to ship a cookie consent notice blocking experience that doesn’t cause frequent website breakage.

We first unveiled Cookiecrumbler (under a different name) at the Ad Filtering Dev Summit last year. Since then, we’ve made significant progress in lowering false positives, adding support for multiple languages, and ensuring coverage across multiple geographical vantage points. We’ve already seen the benefits of Cookiecrumbler, with fewer reports of notice blocking-related breakage and higher user retention and growth overall. By detecting and addressing potential breakage points proactively, we’re delivering on Brave’s commitment to a privacy-first and user-first Web.
