Towards Realistic and Reproducible Web Crawl Measurements
Jordan Jueckstock (North Carolina State University), Shaown Sarker (North Carolina State University), Peter Snyder (Brave Software), Aidan Beggs (North Carolina State University), Panagiotis Papadopoulos (Telefonica Research), Matteo Varvello (Nokia Bell Labs), Ben Livshits (Brave Software, Imperial College London), Alexandros Kapravelos (North Carolina State University) | Measurements
Accurate web measurement is critical for understanding and improving security and privacy online. Implicit in these measurements is the assumption that automated crawls generalize to the experiences of typical web users, despite significant anecdotal evidence to the contrary. Anecdotal evidence suggests that the web behaves differently when approached from well-known measurement endpoints, or with well-known measurement and automation frameworks, for reasons ranging from DDOS detection, hiding malicious behavior, or bot detection.
This work improves the state of web privacy and security by investigating how, and in what ways, privacy and security measurements change when using typical web measurement tools, compared to measurement configurations intentionally designed to match “real” web users. We build a web measurement framework encompassing network endpoints and browser configurations ranging from off-the-shelf defaults commonly used in research studies to configurations more representative of typical web users, and we note the effect of realism factors on security and privacy relevant measurements when applied to the Tranco top 25k web domains.
We find that web privacy and security measurements are significantly affected by measurement vantage point and browser configuration, and conclude that unless researchers carefully consider if and how their web measurement tools match real world users, the research community is likely systematically missing important signals. For example, we find that browser configuration alone can cause shifts in 19% of known ad and tracking domains encountered, and similarly affects the loading frequency of up to 10% of distinct families of JavaScript code units executed. We also find that choice of measurement network points have similar, though less dramatic, effects on privacy and security measurements. To aid the measurement replicability, and to aid future web research, we share our dataset and precise measurement configurations.