WebSweep: Collecting Website Text for Research

WebSweep helps researchers capture what was publicly visible on a given date, preserve the raw HTML as a reproducible archive, and turn those pages into analysis-ready text.

Use WebSweep when you:

  • have a list of public websites or domains
  • want a repeatable workflow for many domains
  • mainly need HTML text and metadata from public pages

In this tutorial, we use the example of FIRMBACKBONE, the Dutch research infrastructure that provides secure, FAIR access to comprehensive data on all registered organizations in the Netherlands, including web-based data. We would like to collect information from corporate websites, for example to track the scope and depth of their coverage of the energy transition. The same workflow can be used for universities, NGOs, government organisations, local news sites, project websites, or any other public set of domains.

WebSweep components

  • Crawler: downloads public pages, follows links within the same domain, and stores raw HTML in per-domain .zip archives plus an overview_urls file
  • Extractor: reads those downloaded pages and turns them into page-level structured records such as text, metadata, and contact/location fields
  • Consolidator: merges page-level records into one row per domain, which is usually the easiest format for downstream analysis

Step 1: Install WebSweep

pip install websweep pandas
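To check that the installation succeeded, you can try importing both packages from the command line:

python -c "import websweep, pandas; print('ok')"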

Step 2: Prepare a website list

Create a file called company_websites.csv.

url,identifier
https://www.akzonobel.com/,akzonobel
https://www.randstad.com/,randstad
https://www.wolterskluwer.com/,wolters_kluwer
https://www.signify.com/,signify
https://www.aholddelhaize.com/,ahold_delhaize

This is only one concrete example. For a real project, replace these URLs with the websites in your own sample.
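If your website list lives in code or another data source, you can also generate the CSV with pandas rather than editing it by hand. This sketch writes the same two-column file (shortened to two sites here):

import pandas as pd

# Two columns: the base URL to crawl and a short identifier per domain.
sites = pd.DataFrame(
    {
        "url": [
            "https://www.akzonobel.com/",
            "https://www.randstad.com/",
        ],
        "identifier": ["akzonobel", "randstad"],
    }
)
sites.to_csv("company_websites.csv", index=False)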

Step 3: Run the workflow

The example below uses five corporate websites. It crawls one level deep and caps the crawl at 5 pages per domain, which keeps a first test run small.

from pathlib import Path

import pandas as pd
from websweep import Crawler, Extractor, Consolidator

targets = pd.read_csv("company_websites.csv")
urls = list(zip(targets["url"], targets["identifier"]))

output_folder = Path("company_sweep")

crawler = Crawler(
    target_folder_path=output_folder,
    max_level=1,
    max_pages_per_domain=5,
)
crawler.crawl_base_urls(urls)

Extractor(target_folder_path=output_folder).extract_urls()
Consolidator(target_folder_path=output_folder).consolidate()
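Once a pilot run looks good, you can scale up by rerunning the same script with higher limits. Only the Crawler block changes; the values below are illustrative:

# output_folder and urls are defined as in the script above.
# Follow links two levels from the start page and allow up to
# 50 pages per domain before the crawler stops.
crawler = Crawler(
    target_folder_path=output_folder,
    max_level=2,
    max_pages_per_domain=50,
)
crawler.crawl_base_urls(urls)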

Step 4: Inspect the output and archive structure

After the run, you will find files like these:

company_sweep/
├── overview_urls.duckdb
├── consolidated_data/
│   └── consolidated.ndjson
├── crawled_data/
│   ├── aholddelhaize.com.zip
│   ├── akzonobel.com.zip
│   ├── randstad.com.zip
│   ├── signify.com.zip
│   └── wolterskluwer.com.zip
└── extracted_data/
    └── extracted_data_2026-03-20_0-1000000.ndjson

The two .ndjson (newline delimited JSON) files contain the extracted and consolidated data.

If you unzip one domain archive, you can also inspect the saved page layout:

wolterskluwer.com/
└── 2026-03-20/
    └── www.wolterskluwer.com

That raw archive is valuable for reproducibility: you keep the downloaded HTML, not only the cleaned text.
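You can also peek inside a per-domain archive without unzipping it, using only the Python standard library:

import zipfile

# List the saved pages in one per-domain archive.
with zipfile.ZipFile("company_sweep/crawled_data/wolterskluwer.com.zip") as zf:
    for name in zf.namelist():
        print(name)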

Step 5: Inspect the extracted and consolidated output

extracted_data_2026-03-20_0-1000000.ndjson contains page-level rows. One domain can contribute multiple page-level observations. consolidated.ndjson contains one row per domain.

To make the page-level structure clear, here is what the extracted data looks like for a small demo site:

extracted_data_2026-03-20_0-1000000.ndjson
http://127.0.0.1:8765/index.html    | Research Lab Demo Research Lab Demo This d...
http://127.0.0.1:8765/methods.html  | Methods Methods We collect public web page...
http://127.0.0.1:8765/projects.html | Projects Projects Example projects include...
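The extracted file is newline delimited JSON and can be loaded with pandas. Its exact column names may differ between WebSweep versions, so this sketch only prints the row count and whatever columns are present:

from pathlib import Path

import pandas as pd

# The file name includes the run date, so locate it with a glob.
extracted_file = next(Path("company_sweep/extracted_data").glob("*.ndjson"))

pages = pd.read_json(extracted_file, lines=True)
print(len(pages), "page-level rows")
print(pages.columns.tolist())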

For the five-site corporate example above, the domain-level output looks like this:

consolidated.ndjson
ahold_delhaize | aholddelhaize.com | we are Ahold Delhaize We published...
akzonobel      | akzonobel.com     | AkzoNobel | AkzoNobel Skip to main...
randstad       | randstad.com      | The global leader in the HR servic...

You can read the .ndjson files into a DataFrame as follows:

import pandas as pd

results = pd.read_json(
    "company_sweep/consolidated_data/consolidated.ndjson",
    lines=True,
)

results["text_length"] = results["text"].fillna("").str.len()

print(results[["identifier", "domain", "text_length"]])

Each row is one domain. Besides text, the consolidated file can also contain aggregated fields such as email, phone, zipcode, and address when these are found on the website.
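Because these fields only appear when WebSweep finds them on a site, select them defensively. Continuing the example above (results is the DataFrame loaded from consolidated.ndjson):

# Keep only the aggregated contact fields that exist in this run.
contact_fields = ["email", "phone", "zipcode", "address"]
available = [c for c in contact_fields if c in results.columns]
print(results[["identifier"] + available])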

How to avoid getting blocked

WebSweep already stays within the same domain, checks robots.txt, uses one concurrent request per domain, and backs off when it encounters blocking signals such as 403 or 429. Even so, a careful workflow helps:

  • use a low depth and page cap when testing a new site list
  • if needed, configure longer waits between requests before rerunning
  • run a pilot first when you are unsure how protective the sites are
  • respect robots.txt, terms of use, and your institution’s research rules
  • scrape only public pages relevant to your research question
  • check overview_urls.* for 403, 429, and timeout errors (see the sketch below)

If a site returns very little text or repeated errors, inspect that domain manually before scaling up.
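overview_urls.duckdb is a DuckDB database whose exact schema is internal to WebSweep, so a safe first step is to list its tables and preview one. This sketch assumes only that the duckdb Python package is installed:

import duckdb

# Open the crawl log read-only so a running crawl is not disturbed.
con = duckdb.connect("company_sweep/overview_urls.duckdb", read_only=True)

# Discover the table names first, since the schema is WebSweep-internal.
tables = con.sql("SHOW TABLES").fetchall()
print(tables)

# Preview the first table and look for status codes such as 403 or 429.
if tables:
    print(con.sql(f"SELECT * FROM {tables[0][0]} LIMIT 5"))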

Optional: CLI workflow

If you prefer the command line, WebSweep also supports a CLI workflow:

websweep init --headless
websweep crawl
websweep extract
websweep consolidate

During websweep init --headless, WebSweep will ask for your output folder, source CSV, and a few configuration choices.

WebSweep gives researchers a practical way to build reproducible website archives, extract comparable text across many kinds of websites, and rerun the same workflow later to monitor changes over time.

For more options and examples, see the WebSweep documentation.