The Mounting Cost of Stale Ad Blocking Rules
This blog post describes ongoing work conducted at Brave by Antoine Vastel, Peter Snyder, and Ben Livshits. It is the second in a series of research-oriented posts that share both present investigations and future vision. We are constantly looking to improve and automate the accuracy and speed of ad blocking built into Brave, and our previous post outlined a machine learning approach to ad blocking.
Abstract: Brave’s research team studied EasyList, a popular filter list used to protect users’ privacy and security by blocking URLs associated with online advertising, tracking and malware. Brave found that a large percentage (> 90%) of EasyList appears to provide little benefit for common browsing cases, due to its large size and accumulation of stale (rarely used or even expired) rules. Brave also calculated that removing stale rules from EasyList reduced the median cost of filtering a URL by 63.4% and reduced the cost of filtering in the entire dataset by 84.4%, which represents substantial time savings.
In the future, Brave plans to apply these findings to improve the experience of its users and to reduce performance costs.
Note: This research was done with the Brave browser and the performance implications may apply only to Brave.
Brave uses a variety of techniques to protect users’ privacy and security. One such technique is using resource filter lists, or lists of rules that instruct Brave to block URLs associated with online advertising, tracking and malware. Brave draws on popular lists like EasyList and EasyPrivacy, in addition to our own curated set of rules.
Filter lists have proven useful in protecting web users, but they come with drawbacks. Most significantly, these lists are large, and growing larger. This would be great, if each rule proved helpful in blocking unwanted resources. However, applying rules isn’t free: each rule takes time and energy to apply, slowing down the browsing experience and consuming battery life. This suggests a trade-off between utility and breadth, and that we can improve the browsing experience by applying rules only where the expected benefit (number of blocked resources) is sufficient to justify the cost (time and energy spent applying each rule on the web).
We expect that filter lists pick up large numbers of rules that are immediately useful but whose usefulness decreases over time, as advertisers and trackers change URLs, evade trackers, or go out of business. Prior research has found accumulations of no-longer useful rules in other crowdsourced security rule lists, and we expect to find similar results in browser filter lists.
We further expect that the ratio of stale rules to useful rules increases over time, since it’s easier to measure the benefit of adding a new rule than to measure the cost of removing an old one.
This post describes how we tested these expectations by measuring EasyList (the most popular filter list), and how Brave plans to use these findings to improve the experience for our users.
What Is in EasyList?
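EasyList contains three kinds of rules:
- network rules, which instruct the client to block requests for URLs matching a pattern,
- element rules, which instruct the client to hide page elements, and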
- exception rules, which instruct the client to fetch resources matching the pattern, even if the resource would be blocked by some prior rule.
EasyList is very large, consisting of approximately 70k rules at the time of this writing. About 35k are network rules, 30k are element rules, and 5k are exception rules.
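As a rough illustration of how these categories can be told apart, the sketch below classifies lines of an Adblock Plus style list by their syntax. It is a simplified heuristic, not the parser Brave uses: it assumes exception rules start with "@@", element rules contain a "##"-style separator, and lines starting with "!" or "[" are comments or headers. The function names and the sample rules are made up for this example.

```python
from collections import Counter

def classify_rule(line: str) -> str:
    """Return a coarse category for one line of an Adblock Plus style list."""
    rule = line.strip()
    if not rule or rule.startswith("!") or rule.startswith("["):
        return "comment"    # blank lines, comments, and the list header
    if "##" in rule or "#@#" in rule or "#?#" in rule:
        return "element"    # cosmetic rules that hide page elements
    if rule.startswith("@@"):
        return "exception"  # allow a request another rule would block
    return "network"        # everything else is treated as a URL pattern

def count_rules(lines):
    """Count how many lines fall into each category."""
    return Counter(classify_rule(line) for line in lines)

# Hypothetical sample rules, written only to exercise each branch.
sample = [
    "[Adblock Plus 2.0]",
    "! Title: sample list",
    "||ads.example.com^",
    "/banner/*/img^$image",
    "example.com##.ad-banner",
    "@@||example.com/ads.js$script",
]
print(count_rules(sample))
```

Running the same counter over a downloaded copy of EasyList would give a breakdown comparable to the approximate figures above, though the exact numbers depend on the list version and on how edge cases are classified.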
Growth of EasyList Over Time
EasyList is large, and has been growing since its start. We measured the number of rules added to, and removed from, EasyList, by the EasyList maintainers, since 2013. As the above graph shows, EasyList has year-on-year added more rules than it removed, suggesting an increasing cost of applying EasyList for filtering.
For the most part, the graph shows a constant and smooth trend upwards in both the number of new rules added and the number of rules removed. The number of new rules consistently outpaces the number of rules removed.
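One simple way to count additions and removals between two versions of a list is a set difference over its rules. The sketch below assumes each snapshot is available as a list of rule strings; the function name is hypothetical, and this is not necessarily how the measurements in this post were taken.

```python
def rule_churn(old_rules, new_rules):
    """Return (added, removed) counts between two snapshots of a rule list."""
    old, new = set(old_rules), set(new_rules)
    added = new - old      # rules present only in the newer snapshot
    removed = old - new    # rules present only in the older snapshot
    return len(added), len(removed)

# Hypothetical snapshots: two rules added, one removed.
print(rule_churn(["a", "b"], ["b", "c", "d"]))
```

Because the comparison is on exact strings, a rule that is edited in place counts as one removal plus one addition.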
There are two exceptions to this trend though, one in early 2013, and another in late 2015. In early 2013, Fanboy’s list (another popular filter list) was merged with EasyList, causing a sharp change in EasyList’s makeup. The sharp change in mid-2015 appears to have been the result of a short term error or re-organization on the part of the list maintainers, since there were a large number of changes, but no significant variation in rules in the list.
Measurement Differences With Brave
The measurements in this post were all performed using Brave’s ad blocking library. Brave uses its own highly optimized, C++-based ad-blocking library to make sure resources are filtered as quickly as possible. Brave’s filtering library supports most, but not all, of the rule formats that appear in EasyList.
For a variety of performance reasons, Brave does not support 100% of the AdBlockPlus format (the format EasyList is written in). For this reason, the results in this post will differ from the same measurements performed with other filtering tools. We expect these differences would be small and not affect the general findings (that there are a large number of rare-to-never used rules in EasyList, with non-trivial performance implications), but mention the details here for completeness.
The most significant difference is that Brave’s adblock library does not currently support any cosmetic / element filtering, so we exclude those rules from further consideration. Brave’s adblock library also does not implement several AdBlockPlus filter options, and ignores those directives. In most cases this is not significant, as these unsupported options, such as “elemhide” and “websocket”, appear in less than 0.01% of EasyList rules. The one exception is the “popup” option, which appears in 1,666 rules, and is the only option in 657 rules.
EasyList Applied to the Web
The growth of EasyList isn’t in-and-of-itself concerning, as long as newly added rules are beneficial to users. If, though, EasyList’s size reflects an accumulation of expired or rarely used rules, then there is a lot of wasted computation, and a lot of wasted time, happening on users’ machines.
To determine which is the case, we applied EasyList to both the Alexa 5k, a curated list of the 5,000 most popular sites on the web, and a random sampling of 5,000 sites from the Alexa 1,000,000 (ensuring no duplicate sites). Our measurement proceeded in several steps:
- Use Selenium and the DevTools Protocol to record every URL requested when rendering and executing a website.
- Add additional automation to randomly select three distinct same-domain URLs from anchor tags on a page.
- Use the above automation to visit the homepage of each site, and a maximum of three child pages, and record all URLs requested for images, script files, and other web resources.
- Determine which of those URLs would be blocked by the version of EasyList fetched on that day, using Brave’s optimized ad-block implementation.
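The child-page selection step above can be sketched with the Python standard library alone. This is an illustration rather than the crawler's actual code: it treats "same-domain" as an exact hostname match, which is an assumption, and the class and function names are hypothetical.

```python
import random
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class AnchorCollector(HTMLParser):
    """Collect the href attribute of every anchor tag in a document."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def pick_child_pages(page_url, html, k=3, rng=None):
    """Pick up to k distinct same-host URLs linked from the page."""
    rng = rng or random.Random()
    collector = AnchorCollector()
    collector.feed(html)
    host = urlparse(page_url).hostname
    candidates = set()
    for href in collector.hrefs:
        # Resolve relative links and drop fragments so duplicates collapse.
        absolute = urljoin(page_url, href).split("#", 1)[0]
        parsed = urlparse(absolute)
        if (parsed.scheme in ("http", "https")
                and parsed.hostname == host
                and absolute != page_url):
            candidates.add(absolute)
    ordered = sorted(candidates)  # stable order before sampling
    return rng.sample(ordered, min(k, len(ordered)))
```

Passing a seeded `random.Random` as `rng` makes a crawl repeatable, which helps when comparing results across days.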
The results presented in this post describe the above steps applied to EasyList and the Alexa listings as of Saturday, July 13th, 2018. All measurements were performed through AWS Lambda. We’ve provided the code for the Lambda function on GitHub.
Approximately 20% of the domains we requested either did not reply, or replied with error codes. We attribute this to anti-crawling techniques being applied to the well-known AWS IPs we crawled from.
As a result, we successfully crawled 8,085 domains, and 30,280 individual pages. We found that the vast majority of EasyList rules are not used when browsing popular websites; only 3,268 of the 39,198 network and exception rules (~8%) were used during our crawls (these measurements exclude element rules).
We also found that the rules in EasyList were not equally useful, even when only considering the rules that were used at least once. For example, we found that only 201 rules accounted for 90% of blocking activity. In fact, 99.5% of rules were used 10 times or less on the ~30k pages we visited.
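A figure such as "N rules account for 90% of blocking activity" can be derived from per-rule hit counts by sorting the rules from most to least used and accumulating hits until the target share is reached. The sketch below assumes the hit counts are available as a mapping from rule to number of matches; the function name and sample data are hypothetical.

```python
from collections import Counter

def rules_for_coverage(hit_counts, target=0.90):
    """Return the smallest number of top rules whose hits reach the target share."""
    total = sum(hit_counts.values())
    if total == 0:
        return 0
    covered = 0
    ranked = Counter(hit_counts).most_common()  # most-used rules first
    for n, (_, hits) in enumerate(ranked, start=1):
        covered += hits
        if covered >= target * total:
            return n
    return len(ranked)

# Hypothetical hit counts: the two most-used rules cover 95% of all hits.
hits = {"r1": 70, "r2": 25, "r3": 3, "r4": 2, "r5": 0}
print(rules_for_coverage(hits))
```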
Finally, we also measured what kinds of resources are blocked by EasyList. As the above graph shows, images were most frequently blocked by EasyList, followed by script and iframe requests. This difference in blocking distribution matters, because different requests can have significant follow-on impacts to the browsing experience. Blocking an image request might save the user some network use, while blocking a stylesheet or an iframe might save the users additional sub-resource fetches.
Costs of Applying EasyList
After we found that a large percentage (> 90%) of EasyList appears to provide little benefit for common browsing cases, we measured the cost imposed by the long tail of infrequently used rules.
Browsers (and browser extensions) need to be pessimistic when applying filter rules to a URL request; clients cannot allow the request to occur and then retroactively revoke it, since the “cost” would already be borne. Browsers must “pause” every time a request is made, apply all of the network filters to the URL, and only then possibly continue with the request. As a result, even small time overheads caused by URL filtering can have a substantial cumulative effect on the browsing experience.
To measure this cost, we applied all 39,198 network rules in EasyList to 200,000 of the URLs fetched during the crawl described above (4,032,693 URLs, 2,105,674 distinct, from 49,249 domains). We repeated this test 5 times; on average it took 0.26ms per URL, for a total time of 51.2 seconds. Note that we conducted these measurements using Brave’s optimized C++ ad-block implementation, and expect that applying EasyList in other tools would take longer, though we did not test this.
We then repeated these measurements using only the subsection of EasyList that matched URLs encountered during our crawl, to approximate the cost of the unused rules in EasyList. The difference was substantial: total filtering time fell by a factor of roughly 6.4. Removing the rarely used and unused rules from EasyList reduced the median cost of filtering a URL by 63.4%, from 0.063ms to 0.023ms, and reduced the cost of filtering in the entire dataset by 84.4%, from 51.2 seconds to 8.0 seconds.
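The percentages above follow directly from the before and after timings. The short check below recomputes them from the rounded figures quoted in this post, so the results can differ from the reported values in the last digit; the helper name is hypothetical.

```python
def percent_reduction(before, after):
    """Percentage by which a cost dropped going from `before` to `after`."""
    return (before - after) / before * 100.0

# Total time over the 200,000-URL sample, in seconds.
total_reduction = percent_reduction(51.2, 8.0)
# Median per-URL time, in milliseconds (rounded inputs).
median_reduction = percent_reduction(0.063, 0.023)
# Ratio of total time before pruning to total time after pruning.
speedup = 51.2 / 8.0

print(f"total: {total_reduction:.1f}%  median: {median_reduction:.1f}%  speedup: {speedup:.1f}x")
```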
Crowdsourced URL filter lists are a useful tool for protecting the privacy and security of browser users. Over time, though, these lists can build up stale rules: rules which provide little to no benefit to users, while each imposing a small performance cost. And while the per-rule cost is small, the cumulative cost of the large quantity of such rules in a list the size of EasyList can become substantial.
Our findings suggest that users would benefit from a regular pruning of such rule lists. Automated crawls, such as the one described in this post, are just one such way rule lists could be kept tidy.
Brave has several plans for these findings: first, we may remove rarely-used rules from the versions of EasyList we serve to clients. Second, we are considering several optimizations on the static blocking model, such as applying popular rules in a blocking manner, applying unpopular rules after URLs have been fetched, and moving any rules from the second set that “hit” into the former set. And third, we’re expanding our measurement strategy to include other popular filter lists, including EasyPrivacy.
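To make the second idea concrete, here is a minimal sketch of a two-tier filter: popular rules are checked before a request is allowed, unpopular rules are checked after the fetch, and any unpopular rule that hits is promoted to the popular set. This is only one possible reading of the plan under consideration, not Brave's implementation. The class name, method names, and the `matches` callable are all hypothetical, and a naive substring test stands in for real rule matching.

```python
class TwoTierFilter:
    """Illustrative two-tier rule set with promotion on hit."""

    def __init__(self, popular, unpopular, matches):
        self.popular = list(popular)      # checked before the request proceeds
        self.unpopular = list(unpopular)  # checked only after the fetch
        self.matches = matches            # hypothetical callable: (rule, url) -> bool

    def should_block(self, url):
        """Blocking path: consult only the popular rules."""
        return any(self.matches(rule, url) for rule in self.popular)

    def review_after_fetch(self, url):
        """Deferred path: find unpopular rules that hit and promote them."""
        hits = [rule for rule in self.unpopular if self.matches(rule, url)]
        for rule in hits:
            self.unpopular.remove(rule)
            self.popular.append(rule)
        return hits

def substring_match(rule, url):
    """Stand-in matcher; real filter rules need a proper pattern engine."""
    return rule in url

f = TwoTierFilter(["ads."], ["tracker."], substring_match)
first = f.should_block("https://tracker.example.com/p")       # not blocked yet
promoted = f.review_after_fetch("https://tracker.example.com/p")
second = f.should_block("https://tracker.example.com/p")      # blocked after promotion
```

The trade-off in this design is that the first request matching an unpopular rule goes through unblocked; only later requests benefit from the promoted rule.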