Web scraping
What is Web scraping?
Web scraping is the process of collecting publicly accessible content from a website and saving it in a database, file, or spreadsheet for later analysis. It is usually done with bots (or Web crawlers) in combination with a control interface. Bots and crawlers are software designed to visit multiple websites, or pages within a website, and collect data about the contents of those sites and pages. A Web scraper takes the data gathered by crawling the targeted websites and processes it to extract the specific information desired. Web scraping can also be a manual process, but automated tools are far more efficient and more common.
Web crawling is the term used to describe moving from site to site, link to link, to find content on the Web and index various pages on websites. Web scraping differs from Web crawling, as scraping refers to the act of collecting and filtering the contents of each site or page. Scraping may focus on specific data, such as the price of certain items on Amazon, or references to certain people or companies on news outlets. Web crawling and Web scraping can be used together or independently.
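To make the crawling half of that distinction concrete, here is a minimal sketch in Python. It assumes a tiny, made-up site held in memory (the PAGES dictionary and the example.test addresses are hypothetical) so that it runs without network access. A real crawler would download each page over HTTP instead of looking it up in a dictionary. The crawler only follows links and records which pages it visited; it does not extract any data from them, which is the scraper's job.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website": URL -> HTML.
# A real crawler would fetch these pages over HTTP.
PAGES = {
    "https://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.test/a": '<a href="/b">B</a> <p>Widget: $5</p>',
    "https://example.test/b": '<a href="/">Home</a> <p>Gadget: $7</p>',
}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, pages):
    """Breadth-first crawl: follow links page to page, visiting each URL once."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:  # unknown page: nothing to follow
            continue
        order.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


visited = crawl("https://example.test/", PAGES)
print(visited)
```

The set of already-seen addresses is what keeps the crawler from looping forever when pages link back to each other, as the third page does here.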
What is Web scraping used for?
Web scraping can be used to collect a variety of data. Any data that’s publicly accessible on the Internet is within reach of Web scrapers, which are used for purposes such as:
- Personal information: Bots can search specifically for street addresses, email addresses, phone numbers, employment or schooling records, or other indicators used for profiling people.
- Commerce applications: Bots are often used to collect details on pricing, availability, and reviews of items for sale.
- Current events: Bots can help monitor current events, as well as investment and market activity. A company might use a scraper to watch for breaking news on news sites.
- Research competitors’ offerings: For example, a travel company could use a bot to pretend to be a customer, fill out a reservation form, and see what rates a competitor offers for various dates, locations, etc.
- LLM/machine learning: Bots can scrape large amounts of data about a particular subject to feed into artificial intelligence (AI) models.
- Phishing sites: A bad actor could use a scraper to get everything necessary to create a copycat site that looks just like the original site.
Web scrapers rarely collect data that requires a login to access, or that can’t be reached from a website’s root address and the links that follow from it. However, if the person running the scraper has the details necessary for accessing these more restricted areas of the Web (sometimes, though not always, part of the Deep Web), those sources of data are within reach of a scraper.
How does Web scraping work?
There are several methods used to scrape data from the Web. One common method is to direct a bot to visit specific websites and download the HTML content. The bot can then filter the downloaded HTML for the desired information. This approach relies on the predictable, organized structure that HTML usually gives website content. Another method might have a bot visit many websites searching for a specific word or term. The simplest, though least efficient, method doesn’t use a bot at all: it consists of manually copying and pasting the target information from a browser display.
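The filtering step of the first method can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of any particular scraping tool: the SAMPLE_HTML snippet and its "name" and "price" class names are invented for the example, and the download step is left out so the code runs offline. A real scraper would first fetch the HTML, for instance with the standard library's urllib.request.urlopen.

```python
from html.parser import HTMLParser

# Hypothetical page content standing in for downloaded HTML.
SAMPLE_HTML = """
<ul>
  <li><span class="name">Widget</span> <span class="price">$5.00</span></li>
  <li><span class="name">Gadget</span> <span class="price">$7.50</span></li>
</ul>
"""


class ClassTextExtractor(HTMLParser):
    """Collects the text inside every <span> whose class list includes wanted_class.

    Assumes the matching spans are not nested inside one another.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.capturing = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and self.wanted_class in classes:
            self.capturing = True
            self.results.append("")

    def handle_data(self, data):
        if self.capturing:
            self.results[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self.capturing = False


def extract(html, wanted_class):
    """Return the stripped text of each matching element, in page order."""
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    return [text.strip() for text in parser.results]


names = extract(SAMPLE_HTML, "name")
prices = extract(SAMPLE_HTML, "price")
print(list(zip(names, prices)))
```

The extractor depends entirely on the page marking prices up the same way every time. If the site changes its class names or layout, the filter silently returns nothing, which is why scrapers built this way need regular maintenance.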
Web scraping bots can churn through sites and data quickly because they only copy or search the HTML code; they don’t take the time to render the resulting page, as a browser does for a real user. Companion software can provide a front end that makes it easy for non-programmers to design and run their own Web scrapers. Some scrapers are even available as browser extensions, although these may be more limited in abilities or speed because they run inside a browser that does render each page. Large-scale scrapers are usually hosted in the cloud to optimize performance.
Data retrieved from scraping may need additional work to organize (or “clean”) and structure into an analyzable format. This capability may be part of the Web scraping software.
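As a rough illustration of what that cleaning step can involve, the Python sketch below tidies a handful of made-up scraped rows and writes them out as CSV text. The RAW_ROWS data, the field names, and the specific rules (collapse whitespace, parse prices into numbers, drop exact duplicates) are assumptions chosen for the example; real data usually needs rules tailored to the source.

```python
import csv
import io
import re

# Hypothetical raw rows, as a scraper might return them.
RAW_ROWS = [
    {"name": "  Widget\n", "price": "$5.00"},
    {"name": "Gadget", "price": " $1,207.50 "},
    {"name": "Widget", "price": "$5.00"},    # duplicate of the first row
    {"name": "Doohickey", "price": "N/A"},   # price cannot be parsed
]


def clean_row(row):
    """Collapse whitespace in the name and convert the price to a float (or None)."""
    name = " ".join(row["name"].split())
    match = re.search(r"\d[\d,]*(?:\.\d+)?", row["price"])
    price = float(match.group().replace(",", "")) if match else None
    return {"name": name, "price": price}


def clean(rows):
    """Clean every row and drop exact duplicates, preserving the original order."""
    seen = set()
    cleaned = []
    for row in rows:
        item = clean_row(row)
        key = (item["name"], item["price"])
        if key not in seen:
            seen.add(key)
            cleaned.append(item)
    return cleaned


def to_csv(rows):
    """Serialize cleaned rows as CSV text; missing prices become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


CLEANED = clean(RAW_ROWS)
CSV_TEXT = to_csv(CLEANED)
print(CSV_TEXT)
```

Keeping the unparseable row with an empty price, rather than discarding it, is a deliberate choice in this sketch: it leaves the gap visible to whoever analyzes the data later.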
How can I prevent my data from being collected by Web scrapers?
Think about what you post online in public spaces like social media. Use privacy settings to keep personal information on social media accessible only to friends and out of reach of scrapers. Whenever possible, make sure your personal information is behind walls like logins or privacy settings. If you have a personal website (e.g. a blog or CV), consider incorporating some of the methods discussed above to limit Web scraping activity on your website.