Towards Realistic and Reproducible Web Crawl Measurements
Jordan Jueckstock (North Carolina State University), Shaown Sarker (North Carolina State University), Peter Snyder (Brave Software), Aidan Beggs (North Carolina State University), Panagiotis Papadopoulos (Telefonica Research), Matteo Varvello (Nokia Bell Labs), Ben Livshits (Brave Software, Imperial College London), Alexandros Kapravelos (North Carolina State University) | Measurements
Accurate web measurement is critical for understanding and improving security and privacy online. Implicit in these measurements is the assumption that automated crawls generalize to the experiences of typical web users, despite significant anecdotal evidence to the contrary. Anecdotal evidence suggests that the web behaves differently when approached from well-known measurement endpoints, or with well-known measurement and automation frameworks, for reasons ranging from DDOS detection, hiding malicious behavior, or bot detection.
This work improves the state of web privacy and security by investigating how, and in what ways, privacy and security measurements change when using typical web measurement tools, compared to measurement configurations intentionally designed to match “real” web users. We build a web measurement framework encompassing network endpoints and browser configurations ranging from off-the-shelf defaults commonly used in research studies to configurations more representative of typical web users, and we note the effect of realism factors on security and privacy relevant measurements when applied to the Tranco top 25k web domains.
Ready for a better Internet?
Brave’s easy-to-use browser blocks ads by default, making the Web cleaner, faster, and safer for people all over the world.Download Brave