Updates from Brave Research

Dr. Ben Livshits, Chief Scientist

Brave Research is a highly dynamic team of researchers and developers whose goal is to push the envelope when it comes to some of the more adventurous aspects and needs of the Brave browser and the underlying ecosystem. Most people at Brave Research hold a PhD in computer science, although pretty much everyone on the team is very practical and writes code on a daily basis. We think about the mission of Brave Research as two-fold. First, we are the “special forces” that kick into action whenever there are problems that are both relatively unexplored and of great value to Brave. Second, we create research output in the form of papers, which we aim to publish at top-tier, highly competitive conference venues. We also closely engage with academia: PhD-level interns and university professors work closely with Brave researchers, with the goal of advancing the state of the art, especially when it comes to privacy, security, machine learning on the edge, and decentralization.

Our work covers a broad range of topics, including machine learning, data privacy and security, performance, and cryptography, to name only a few. A lot of our work tends to be pragmatic and even opportunistic in nature, driven by real-life problems that the browser throws at us. The flow of hard problems is virtually guaranteed given the complexity of the code base, as well as the number of previously-unsolved problems that Brave tries to address.

In this blog post, we cover some of the recent work that has come out in the form of both blog posts and conference publications. One of the goals is to showcase the diversity of topics that we work on, as well as to connect the work we do to some of the features that are appearing in the browser. More details on the research publications can be found here: /research/

Product-driven Innovation

We start with two examples of projects that have been driven by clear product needs: one in the case of the BAT ads ecosystem, and the other for the browser itself.

Themis — Towards Decentralizing the Brave Advertising Ecosystem

We recently shared a series of blog posts that focus on decentralizing the Brave ads ecosystem, with a creative use of cryptography and blockchain tech to provide integrity of ad-related accounting, so that advertisers and browser users can verify it. This work uses a combination of interesting cryptographic primitives, including partially-homomorphic encryption and zero-knowledge proofs, together with more traditional techniques such as distributed key generation, to design and implement a protocol we call Themis. The complete version can be found in our pre-print on arXiv.
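For intuition about the partially-homomorphic building block, here is a toy sketch in Python of textbook Paillier encryption, which lets anyone add encrypted counters (e.g., per-ad view counts) without decrypting them. This is our own illustration, not Themis's actual protocol: the real system uses different primitives and parameters, and the fixed small primes below are wildly insecure.

```python
import math
import random

def paillier_keygen(p=2_147_483_647, q=2_147_483_629):
    # Small fixed primes for illustration only -- NOT secure.
    n = p * q
    return {"n": n, "n2": n * n, "lam": math.lcm(p - 1, q - 1)}

def encrypt(key, m):
    n, n2 = key["n"], key["n2"]
    r = random.randrange(1, n)  # blinding factor, assumed coprime to n
    # With generator g = n + 1, g^m mod n^2 simplifies to 1 + m*n.
    return (1 + m * n) * pow(r, n, n2) % n2

def decrypt(key, c):
    n, n2, lam = key["n"], key["n2"], key["lam"]
    # L(x) = (x - 1) // n, applied to c^lam mod n^2.
    l = (pow(c, lam, n2) - 1) // n
    return l * pow(lam, -1, n) % n
```

The useful property: multiplying two ciphertexts yields an encryption of the sum of the plaintexts, so an untrusted aggregator can tally ad events it cannot read.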

SpeedReader 

The SpeedReader paper was published in WWW’19. The goal is to provide an improved reader mode for the browser, which results in faster rendering and a lighter network footprint. After a lot of work by researchers, developers, and designers, SpeedReader is now shipping in the Beta version of Brave. You can find more information about SpeedReader in another one of our blog posts, and try it for yourself in the browser by clicking the icon next to the URL bar. Although it has taken us a while to move SpeedReader from research to product, this is one of the most obvious recent examples of such work, along with some of the innovation around the machine learning models behind Brave Ads.

Academic Publications

Below we cover some of the most prominent recent academic publications, which have appeared or will appear at top academic venues.

IEEE S&P 2021: Detecting Filter List Evasion With Event-Loop-Turn Granularity JavaScript Signatures (paper)

One of the main ways Brave improves Web privacy is by blocking trackers. One way Brave discovers ads and trackers is by using crowdsourced and expert-curated filter lists, such as EasyList, EasyPrivacy and uBlock Origin’s lists. While useful, these lists are easily circumvented through trivial countermeasures: moving resources to new servers, changing the names of files or URL path attributes, inlining code into pages, or combining tracking code with benign JavaScript. While these countermeasures are well known, the privacy community has lacked a useful, web-scale defense.

This work addresses these problems in content blocking through the following contributions: First, we implement a novel system to build per-event-loop-turn signatures of JavaScript code by instrumenting the Blink and V8 runtimes. Second, we apply these signatures to measure filter list evasion, by using EasyList and EasyPrivacy as ground truth and finding other code that behaves identically. We build approximately 2 million signatures of privacy-and-security behaviors from 11,212 unique scripts blocked by filter lists, and find 3,589 more unique scripts including the same harmful code, affecting 12.48% of websites measured. Third, we taxonomize common filter list evasion techniques. Finally, we present defenses: filter list additions where possible, and a proposed signature-based system for other cases.
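The signature idea can be pictured with a short sketch. This is our own simplification, not the paper's instrumented-runtime implementation: hash the ordered sequence of privacy-relevant API calls a script makes during one event-loop turn, so the same code yields the same signature no matter which URL it is served from or what the file is renamed to.

```python
import hashlib

def turn_signature(calls):
    """Signature of one event-loop turn: a hash over the ordered
    sequence of (api, argument) pairs the script invoked."""
    h = hashlib.sha256()
    for api, arg in calls:
        h.update(repr((api, arg)).encode())
    return h.hexdigest()
```

A tracker moved to a new server still performs the same sequence of calls, so its signature matches a known-blocked script even though URL-based rules miss it.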

As part of this project, we also shared the implementation of our signature-generation system, the dataset from applying our system to the Alexa 100K, and 586 AdBlock Plus compatible filter list rules to block instances of currently blocked code being moved to new URLs.

SIGMETRICS 2020: Who Filters the Filters: Understanding the Growth, Usefulness and Efficiency of Crowdsourced Ad Blocking (paper)

Ad and tracking blocking extensions are popular tools for improving web performance, privacy and aesthetics. Content blocking extensions generally rely on filter lists to decide whether a web request is associated with tracking or advertising, and so should be blocked. Millions of web users rely on filter lists to protect their privacy and improve their browsing experience. Despite their importance, the growth and health of filter lists are poorly understood. Filter lists are maintained by a small number of contributors who use undocumented heuristics and intuitions to determine what rules should be included. Lists quickly accumulate rules, and rules are rarely removed. As a result, users’ browsing experiences are degraded as the number of stale, dead or otherwise not useful rules increasingly dwarfs the number of useful rules, with no attenuating benefit. An accumulation of “dead weight” rules also makes it difficult to apply filter lists on resource-limited mobile devices.

This paper improves the understanding of crowdsourced filter lists by studying EasyList, the most popular filter list. We measure how EasyList affects web browsing by applying EasyList to a sample of 10,000 websites. We find that 90.16% of the resource blocking rules in EasyList provide no benefit to users in common browsing scenarios. Finally, we propose optimizations for popular ad-blocking tools that (i) allow EasyList to be applied on performance-constrained mobile devices and (ii) improve desktop performance by 62.5%, while preserving over 99% of blocking coverage. We expect these optimizations to be most useful for users in non-English locales, who rely on supplemental filter lists for effective blocking and protections.
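The core optimization can be sketched as follows; this is an illustrative simplification we wrote for this post, not the paper's tooling. Record which rules ever match during a measurement crawl, then ship only those, keeping nearly all blocking coverage at a fraction of the matching cost.

```python
def prune_filter_list(rules, hits):
    """Keep only rules that matched at least once during measurement.

    rules: list of filter rule strings
    hits: dict mapping rule -> number of times it matched in the crawl
    """
    return [rule for rule in rules if hits.get(rule, 0) > 0]
```

On a list where ~90% of rules never fire, this style of pruning shrinks the rule set by an order of magnitude while leaving common-case blocking behavior unchanged.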

MLSys 2020: Privacy-Preserving Bandits (paper)

Some of the reasons we explore privacy-focused machine learning techniques revolve around improving the quality of client-side ad matching. Contextual bandit algorithms (CBAs) often rely on personal data to provide recommendations. Centralized CBA agents utilize potentially sensitive data from recent interactions to provide personalization to end-users. Keeping the sensitive data locally, by running a local agent on the user’s device, protects the user’s privacy; however, the agent takes longer to produce useful recommendations, as it does not leverage feedback from other users.

This paper proposes a technique we call Privacy-Preserving Bandits (P2B): a system that updates local agents by collecting feedback from other local agents in a differentially-private manner. Comparisons of our proposed approach with non-private and fully-private (local) systems show competitive performance on both synthetic benchmarks and real-world data. Specifically, we observed only a decrease of 2.6% and 3.6% in multi-label classification accuracy, and a CTR increase of 0.0025 in online advertising, for a privacy budget ε ≈ 0.693. These results suggest P2B is an effective approach to challenges arising in on-device privacy-preserving personalization.
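For intuition about that privacy budget: ε = ln 2 ≈ 0.693 corresponds to classic randomized response in which the true bit is reported with probability e^ε / (1 + e^ε) = 2/3. The sketch below is our illustration of that mechanism only; P2B's actual pipeline encodes context-arm statistics before any noisy reporting.

```python
import math
import random

def randomized_response(bit, epsilon):
    """Report the true bit with probability e^eps / (1 + e^eps),
    otherwise report its flip. Satisfies eps-differential privacy."""
    p_true = math.exp(epsilon) / (1 + math.exp(epsilon))
    return bit if random.random() < p_true else 1 - bit
```

Each user's report is plausibly deniable, yet an aggregator averaging many reports can debias them and recover population-level feedback for updating shared agents.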

CHI 2020: Evaluating the End-User Experience of Private Browsing Mode (paper)

We started this work in order to better understand how browser users interpret private mode guarantees and what that means for Brave, in terms of better communicating what these modes do and do not do. In this paper, we investigate why users of private browsing mode misunderstand the benefits and limitations of private browsing. 

We design and conduct a three-part study: (1) an analytic evaluation of the user interface of private mode in different browsers; (2) a qualitative user study to explore user mental models of private browsing; (3) a participatory design study to investigate why existing browser disclosures, the in-browser explanations of private mode, do not communicate the actual protection of private mode. We find that the user interface of private mode in different browsers violates well-established design guidelines and heuristics.

Further, most participants had incorrect mental models of private browsing, influencing their understanding and usage of private mode. We also find existing browser disclosures did not explain the primary security goal of private mode. Drawing from the results of our study, we extract a set of recommendations to improve the design of disclosures.

IEEE S&P 2020: AdGraph: A Graph-Based Approach to Ad and Tracker Blocking (paper)

User demand for blocking advertising and tracking online is large and growing. Existing tools, both deployed and described in research, have proven useful, but lack either the completeness or robustness needed for a general solution. Existing detection approaches generally focus on only one aspect of advertising or tracking (e.g. URL patterns, code structure), making existing approaches susceptible to evasion.

In this work we present AdGraph, a novel graph-based machine learning approach for detecting advertising and tracking resources on the web. AdGraph differs from existing approaches by building a graph representation of the HTML structure, network requests, and JavaScript behavior of a webpage, and using this unique representation to train a classifier for identifying advertising and tracking resources. Because AdGraph considers many aspects of the context a network request takes place in, it is less susceptible to the single-factor evasion techniques that flummox existing approaches.
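A toy rendering of the idea, with hypothetical event and node names (AdGraph's real graph and feature set are far richer): record which actor on the page caused which element, request, or script, build a graph from those events, and extract structural features for each request node that a classifier can consume.

```python
def build_page_graph(events):
    """Build an adjacency-list page graph from (actor, action, target)
    tuples recorded while the page loads. Node names are illustrative,
    e.g. "script:ad.js", "dom:div#banner", "req:<url>"."""
    graph = {}
    for actor, _action, target in events:
        graph.setdefault(actor, []).append(target)
    return graph

def request_features(graph, node):
    """Illustrative structural features for one request node."""
    parents = [src for src, targets in graph.items() if node in targets]
    return {
        "in_degree": len(parents),
        "out_degree": len(graph.get(node, [])),
        "created_by_script": int(any(p.startswith("script:") for p in parents)),
    }
```

Because features like "this request was issued by a script that was itself injected into the DOM" capture context rather than URL strings, renaming a resource does not change its classification.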

We evaluate AdGraph on the Alexa top-10K websites, and find that it is highly accurate, able to replicate the labels of human-generated filter lists with 95.33% accuracy, and can even identify many mistakes in filter lists. We implement AdGraph as a modification to Chromium. AdGraph adds only minor overhead to page loading and execution, and is actually faster than stock Chromium on 42% of websites and AdBlock Plus on 78% of websites. 

Overall, we conclude that AdGraph is both accurate enough and performant enough for online use, breaking comparable or fewer websites than popular filter list based approaches. AdGraph has by now become a foundation for some of our other projects that focus on better ad blocking, and for projects that focus on crawling the web and better understanding the results of the crawl.

Usenix ATC 2020: Percival: Making In-Browser Perceptual Ad Blocking Practical With Deep Learning (paper)

Several techniques have been proposed to block ads, mostly based on filter lists and manually-written rules. While a typical ad blocker relies on manually-curated block lists, these inevitably get out-of-date, thus compromising the ultimate utility of this ad blocking approach. In this paper we present PERCIVAL, a browser-embedded, lightweight, deep learning-powered ad blocker.

PERCIVAL embeds itself within the browser’s image rendering pipeline, which makes it possible to intercept every image obtained during page execution and to perform blocking based on applying machine learning for image classification to flag potential ads. Our implementation inside both Chromium and Brave browsers shows a relatively minor rendering performance overhead of 4.55%, demonstrating the feasibility of deploying traditionally heavy models (i.e. deep neural networks) inside the critical path of the rendering engine of a browser. 
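The decision made inside that hook can be pictured with a short sketch. Everything below is a stand-in we wrote for illustration: the real PERCIVAL runs a deep neural network on the decoded pixel data, whereas this toy "classifier" merely flags dimensions matching common IAB ad-slot sizes.

```python
# Common IAB display-ad dimensions, used only by the toy classifier below.
IAB_SIZES = {(728, 90), (300, 250), (160, 600), (320, 50), (970, 250)}

def toy_ad_classifier(width, height):
    """Stand-in for a CNN: returns an 'ad probability' for an image."""
    return 1.0 if (width, height) in IAB_SIZES else 0.0

def should_block(width, height, classifier=toy_ad_classifier, threshold=0.5):
    """Decision made inside a (hypothetical) image-decode hook:
    suppress rendering when the classifier flags a likely ad."""
    return classifier(width, height) >= threshold
```

The point of the architecture is the placement, not this particular heuristic: because the check sits in the rendering pipeline, every image is examined regardless of where it was served from.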

We show that our image-based ad blocker can replicate EasyList rules with an accuracy of 96.76%. To show the versatility of PERCIVAL’s approach, we present case studies demonstrating that PERCIVAL 1) does surprisingly well on ads in languages other than English, and 2) performs well on blocking first-party Facebook ads, which have presented issues for other ad blockers. PERCIVAL proves that image-based perceptual ad blocking is an attractive complement to today’s dominant approach of block lists.

WWW 2020: Keeping Out the Masses: Understanding the Popularity and Implications of Internet Paywalls (paper)

Because of our focus on compensating web publishers, we tried to understand how paywalls work online. The most common content funding method, online advertising, is rife with well-known performance and privacy harms, and an intractable subject-agent conflict: many users do not want to see advertisements, depriving web sites of needed funding. 

Because of these negative aspects of advertisement-based funding, paywalls are an increasingly popular alternative for websites. This shift to a “pay-for-access” web has potentially huge implications for the web and society. Instead of a system where information (nominally) flows freely, paywalls create a web where high quality information is available to fewer and fewer people, leaving the rest of the web’s users with less information, which may also be less accurate and of lower quality. Despite the potential significance of a move from an “advertising-but-open” web to a “paywalled” web, we find this issue understudied. This work addresses this gap in our understanding by measuring how widely paywalls have been adopted, what kinds of sites use paywalls, and the distribution of policies enforced by paywalls.

A partial list of our findings includes: (i) paywall use has increased, and at an increasing rate (2× more paywalls every 6 months), (ii) paywall adoption differs by country (e.g., 18.75% in the US, 12.69% in Australia), (iii) paywall deployment significantly changes how users interact with the site (e.g., higher bounce rates, fewer incoming links), (iv) the median cost of annual paywall access is 108 USD per site, and (v) paywalls are in general trivial to circumvent. Finally, we present the design of a novel, automated system for detecting whether a site uses a paywall, through the combination of runtime browser instrumentation and repeated programmatic interactions with the site. We intend this classifier to augment future, longitudinal measurements of paywall use and behavior.
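The repeated-interaction idea behind such a detector can be sketched as follows; this is our simplification with a made-up threshold, while the real system drives an instrumented browser. Visit a site several times and flag a metered paywall when the amount of served content collapses after some visit.

```python
def detect_metered_paywall(page_lengths, drop_ratio=0.5):
    """Given the content length observed on each successive visit to the
    same article, return the visit index at which content collapsed
    (suggesting a metered paywall kicked in), or None if it never did.

    drop_ratio is a hypothetical tuning knob, not a measured constant."""
    baseline = page_lengths[0]
    for visit, length in enumerate(page_lengths[1:], start=1):
        if length < baseline * drop_ratio:
            return visit
    return None
```

A site serving ~9 KB of article text on the first three visits and ~2 KB on the fourth would be flagged at visit 4, matching the "N free articles" policies common among metered paywalls.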

WWW 2020: Filter List Generation for Underserved Regions (paper)

Filter lists play a large and growing role in protecting and assisting web users. The vast majority of popular filter lists are crowd-sourced, where a large number of people manually label undesirable web resources (e.g. ads, trackers, paywall libraries), so that they can be blocked by browsers and extensions. Because only a small percentage of web users participate in the generation of filter lists, a crowd-sourcing strategy works well for blocking either uncommon resources that appear on “popular” websites, or resources that appear on a large number of “unpopular” websites.

A crowd-sourcing strategy will perform poorly for parts of the web with small “crowds”, such as regions of the web serving languages with (relatively) few speakers. This work addresses this problem through the combination of two novel techniques: (i) deep browser instrumentation that allows for the accurate generation of request chains, in a way that is robust in situations that confuse existing measurement techniques, and (ii) an ad classifier that uniquely combines perceptual and page-context features to remain accurate across multiple languages. We apply our unique two-step filter list generation pipeline to three regions of the web that currently have poorly maintained filter lists: Sri Lanka, Hungary, and Albania. 
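The final step of such a pipeline, emitting AdBlock Plus-compatible rules for URLs the classifier flagged as ads, might look like this minimal sketch (illustrative only; the lists in the paper are built from full request chains rather than bare URLs):

```python
from urllib.parse import urlparse

def urls_to_filter_rules(ad_urls):
    """Turn classified ad URLs into simple AdBlock Plus-style rules.

    "||host/path" anchors the match at the domain boundary, so the rule
    fires regardless of scheme or subdomain-free prefix."""
    rules = []
    for url in ad_urls:
        parsed = urlparse(url)
        rules.append(f"||{parsed.netloc}{parsed.path}")
    return rules
```

Rules produced this way drop straight into existing blockers, which is what lets automatically-generated lists complement, rather than replace, the crowd-sourced ones.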

We generate new filter lists that complement existing filter lists. Our complementary lists block an additional 3,349 ad and ad-related resources (1,771 unique) when applied to 6,475 pages targeting these three regions. We hope that this work can be part of an increased effort at ensuring that the security, privacy, and performance benefits of web resource blocking can be shared with all users, and not only those in dominant linguistic or economic regions.

Conclusions

This blog post is a brief summary of some of the work that has come out of Brave Research in the last several months. The focus is deliberately broad, ranging from cryptography, to machine learning, to privacy, to techniques that improve the quality of browser-based tracker and ad blocking, to improving web standards. We wanted to highlight some projects that are now shipping in the product (such as SpeedReader) and benefiting millions of users, as well as those that will take a while longer to find their natural home within the Brave ecosystem, while creating resonance and raising Brave’s prestige in academic circles.
