Towards Realistic and Reproducible Web Crawl Measurements

Jordan Jueckstock (North Carolina State University), Shaown Sarker (North Carolina State University), Peter Snyder (Brave Software), Aidan Beggs (North Carolina State University), Panagiotis Papadopoulos (Telefonica Research), Matteo Varvello (Nokia Bell Labs), Ben Livshits (Brave Software, Imperial College London), Alexandros Kapravelos (North Carolina State University) | Measurements

Accurate web measurement is critical for understanding and improving security and privacy online. Implicit in these measurements is the assumption that automated crawls generalize to the experiences of typical web users, despite significant anecdotal evidence to the contrary: the web appears to behave differently when approached from well-known measurement endpoints, or with well-known measurement and automation frameworks, for reasons that include DDoS detection, bot detection, and the hiding of malicious behavior.

This work improves the state of web privacy and security by investigating how privacy and security measurements change when using typical web measurement tools, compared to measurement configurations intentionally designed to match “real” web users. We build a web measurement framework encompassing network endpoints and browser configurations ranging from off-the-shelf defaults commonly used in research studies to configurations more representative of typical web users, and we note the effect of realism factors on security- and privacy-relevant measurements when applied to the Tranco top 25k web domains.

We find that web privacy and security measurements are significantly affected by measurement vantage point and browser configuration, and conclude that unless researchers carefully consider if and how their web measurement tools match real-world users, the research community is likely systematically missing important signals. For example, we find that browser configuration alone can cause shifts in 19% of known ad and tracking domains encountered, and similarly affects the loading frequency of up to 10% of distinct families of JavaScript code units executed. We also find that the choice of network measurement points has similar, though less dramatic, effects on privacy and security measurements. To aid measurement replicability and future web research, we share our dataset and precise measurement configurations.
