Industry glossary

Web scraping terms for data-driven teams.

A practical vocabulary for evaluating scraper APIs, anti-bot access, data quality, delivery formats, and marketplace intelligence projects.

Crawling & Extraction

Crawler

A system that discovers and visits URLs at scale, usually before extraction logic runs.

Scraper

The extraction layer that turns pages, APIs, or rendered DOM content into structured records.

Parser

Rules or models that map raw HTML and JSON into fields like title, price, rating, or availability.
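
As a minimal sketch of rule-based parsing, each rule below maps a path in a source JSON payload to an output field. The payload shape and the key names (name, offer.price, stars, in_stock) are invented for illustration and do not come from any particular site.

```python
import json
from typing import Any, Dict, Tuple

# Hypothetical rules: output field -> path of keys in the source JSON.
RULES: Dict[str, Tuple[str, ...]] = {
    "title": ("name",),
    "price": ("offer", "price"),
    "rating": ("stars",),
    "availability": ("in_stock",),
}

def dig(data: Any, path: Tuple[str, ...]) -> Any:
    """Follow a key path through nested dicts; return None if a key is missing."""
    for key in path:
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data

def parse_product(raw: str) -> Dict[str, Any]:
    """Map a raw JSON string into a flat record using RULES."""
    payload = json.loads(raw)
    return {field: dig(payload, path) for field, path in RULES.items()}

raw = '{"name": "Desk Lamp", "offer": {"price": 24.99}, "stars": 4.6}'
record = parse_product(raw)
```

Fields the source does not provide come back as None, so a later validation step can decide whether the record is usable.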

Selector Drift

A breakage pattern where CSS or XPath selectors stop matching because the site's markup changed, typically after a redesign.
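
One common mitigation is an ordered list of fallback selectors, with the matching one recorded so drift can be monitored. The sketch below is illustrative only: it uses the limited XPath subset in Python's standard library instead of CSS selectors, assumes well-formed markup, and the class and id names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

# Ordered fallbacks: the current selector first, older or looser ones after.
# All class and id names here are hypothetical.
PRICE_PATHS = [
    ".//span[@class='price-current']",
    ".//span[@class='price']",
    ".//div[@id='price']",
]

def extract_price(markup: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, path_that_matched), or (None, None) if no path matches."""
    root = ET.fromstring(markup)  # requires well-formed markup
    for path in PRICE_PATHS:
        node = root.find(path)
        if node is not None and node.text:
            return node.text.strip(), path
    return None, None

old_page = "<html><body><span class='price-current'>19.99</span></body></html>"
new_page = "<html><body><div id='price'>19.99</div></body></html>"

old_result = extract_price(old_page)
new_result = extract_price(new_page)
```

Logging which path matched makes drift visible: a rising share of matches on later fallbacks, or of (None, None) results, suggests the page layout has changed.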

JavaScript Rendering

Executing a page's JavaScript in a real browser session so content rendered client-side, such as React, Vue, or other single-page app (SPA) views, exists in the DOM before extraction.

Headless Browser

A browser controlled programmatically without a visible UI, commonly used for dynamic pages.

Anti-Bot & Access

WAF

A web application firewall or bot-management service, from vendors such as Cloudflare, DataDome, Akamai, or PerimeterX, that filters suspicious traffic.

CAPTCHA

A challenge designed to distinguish humans from automated traffic.

Browser Fingerprint

A combination of device, browser, canvas, TLS, and behavioral signals used to identify a client and detect automation.

Residential Proxy

An IP route associated with consumer networks, often used for region-specific access and block reduction.

Session Persistence

Keeping cookies, IP, headers, and browser state stable across requests to mimic normal browsing.

Rate Limit

A website rule that restricts request volume per user, IP, or session within a time window.
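
On the client side, a scraper can stay under a known limit by tracking its own request times. The following is a sketch of a sliding-window limiter, not any site's actual enforcement logic; the class name and the injectable clock parameter are assumptions made for illustration and testing.

```python
import time
from collections import deque
from typing import Callable, Deque

class SlidingWindowLimiter:
    """Allow at most max_requests calls within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.sent: Deque[float] = deque()  # timestamps of allowed requests

    def try_acquire(self) -> bool:
        """Return True and record the request if it fits in the window."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window_seconds:
            self.sent.popleft()
        if len(self.sent) < self.max_requests:
            self.sent.append(now)
            return True
        return False
```

A caller would check try_acquire() before each request and sleep or queue the URL when it returns False.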

Data Quality

Schema

The agreed field structure for a dataset, including names, types, required fields, and nested objects.
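
A schema can be as simple as a table of field names, types, and required flags. The sketch below assumes a hypothetical product schema; real projects often use a dedicated validation library instead.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema: field name -> (expected type, required?)
PRODUCT_SCHEMA: Dict[str, Tuple[type, bool]] = {
    "title": (str, True),
    "price": (float, True),
    "rating": (float, False),
    "seller": (dict, False),  # nested object
}

def validate(record: Dict[str, Any],
             schema: Dict[str, Tuple[type, bool]] = PRODUCT_SCHEMA) -> List[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors: List[str] = []
    for name, (expected_type, required) in schema.items():
        value = record.get(name)
        if value is None:
            if required:
                errors.append(f"missing required field: {name}")
            continue
        if not isinstance(value, expected_type):
            errors.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    # Flag fields the schema does not know about.
    for name in record:
        if name not in schema:
            errors.append(f"unexpected field: {name}")
    return errors
```

For example, a record whose price arrives as the string "24.99" fails with a type error, which is usually a sign that normalization was skipped.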

Freshness

How recently a record was collected or verified against the source website.

Deduplication

Removing duplicate records caused by repeated pages, pagination overlap, or marketplace relisting.
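
Deduplication depends on what counts as the same record. The sketch below assumes, purely for illustration, that two listings are duplicates when they share a seller and a title after lowercasing and collapsing whitespace.

```python
from typing import Dict, Iterable, List, Tuple

def dedup_key(record: Dict[str, str]) -> Tuple[str, str]:
    """Hypothetical identity: same seller plus same normalized title."""
    title = " ".join(record["title"].lower().split())
    return (record["seller"], title)

def deduplicate(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

listings = [
    {"seller": "acme", "title": "Desk Lamp"},
    {"seller": "acme", "title": "desk   lamp"},  # relisting of the first
    {"seller": "bolt", "title": "Desk Lamp"},
]
unique_listings = deduplicate(listings)
```

A stricter or looser key (a listing ID, or a fuzzy title match) changes the result, so the key definition belongs in the project's schema discussion.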

Normalization

Converting source-specific formats into consistent units, currencies, dates, categories, and field names.
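
A minimal sketch of price and date normalization follows. The symbol table, the accepted date formats, and the assumption that commas are thousands separators are all illustrative; real sources need locale-aware handling.

```python
from datetime import datetime
from decimal import Decimal
from typing import Tuple

# Assumed source conventions, for illustration only.
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}
DATE_FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%Y/%m/%d"]

def normalize_price(text: str) -> Tuple[Decimal, str]:
    """Turn a string like '$1,299.00' into (Decimal('1299.00'), 'USD')."""
    text = text.strip()
    symbol, digits = text[:1], text[1:]
    if symbol not in CURRENCY_SYMBOLS:
        raise ValueError(f"unknown currency symbol in {text!r}")
    # Assumes commas are thousands separators, not decimal marks.
    return Decimal(digits.replace(",", "")), CURRENCY_SYMBOLS[symbol]

def normalize_date(text: str) -> str:
    """Return an ISO 8601 date string for any of the accepted formats."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")
```

Decimal is used instead of float so that prices keep their exact cents through later comparisons.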

Confidence Score

A quality signal that estimates whether extracted values match the expected page and schema.
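
Providers compute confidence in different ways. One simple heuristic, sketched here under assumed field names and thresholds, is the fraction of per-field sanity checks that a record passes.

```python
from typing import Any, Callable, Dict

# Hypothetical per-field sanity checks; thresholds are illustrative.
CHECKS: Dict[str, Callable[[Any], bool]] = {
    "title": lambda v: isinstance(v, str) and 0 < len(v) <= 300,
    "price": lambda v: isinstance(v, (int, float)) and v > 0,
    "rating": lambda v: isinstance(v, (int, float)) and 0 <= v <= 5,
    "availability": lambda v: v in ("in_stock", "out_of_stock"),
}

def confidence(record: Dict[str, Any]) -> float:
    """Return the share of checks passed, from 0.0 to 1.0."""
    passed = sum(1 for field, check in CHECKS.items()
                 if check(record.get(field)))
    return passed / len(CHECKS)
```

A record scoring below a chosen threshold can be routed to a re-scrape rather than delivered.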

Re-Scrape

A retry or replacement collection used when records fail validation or source pages change.

Commerce & Marketplace Data

SKU

A stock-keeping unit: a seller-assigned code that identifies a specific product variant such as a size, color, or bundle.

Buy Box

The primary seller offer shown on a marketplace product page, especially relevant for Amazon tracking.

Sold Comps

Recently sold listings used to estimate resale market value on platforms like eBay, Mercari, or StockX.
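
As a rough sketch, a value estimate can be taken as the median sold price within a trailing window. The 90-day default and the choice of median (for robustness to outlier sales) are assumptions, not a marketplace standard.

```python
from datetime import date, timedelta
from statistics import median
from typing import List, Optional, Tuple

def estimate_value(sold: List[Tuple[date, float]], today: date,
                   window_days: int = 90) -> Optional[float]:
    """Median price of listings sold within the trailing window, or None."""
    cutoff = today - timedelta(days=window_days)
    recent = [price for sold_on, price in sold if sold_on >= cutoff]
    return median(recent) if recent else None

sold = [
    (date(2024, 5, 1), 100.0),
    (date(2024, 5, 20), 120.0),
    (date(2024, 6, 1), 110.0),
    (date(2023, 1, 1), 500.0),  # too old to count as a recent comp
]
value = estimate_value(sold, today=date(2024, 6, 10))
```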

Availability

Whether a product, room, job, listing, or offer can currently be purchased or booked.

Share of Search

How visible a brand or seller is across keyword search result pages.
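
One simple way to quantify this is the fraction of tracked result slots a brand holds across a keyword set. The sketch below is unweighted, which is an assumption; many teams weight higher positions more heavily. Brand and keyword names are made up.

```python
from typing import Dict, List

def share_of_search(results: Dict[str, List[str]], brand: str) -> float:
    """Fraction of all tracked result slots held by brand across keywords.

    results maps each keyword to the ordered list of brands on its result page.
    """
    total = sum(len(brands) for brands in results.values())
    if total == 0:
        return 0.0
    held = sum(brands.count(brand) for brands in results.values())
    return held / total

results = {
    "desk lamp": ["Acme", "Bolt", "Acme", "Core"],
    "led lamp": ["Bolt", "Acme", "Core", "Core"],
}
acme_share = share_of_search(results, "Acme")
```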

Review Velocity

The rate at which new reviews appear, often used as a demand or reputation signal.
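
A minimal sketch, assuming velocity is measured as average new reviews per day over a trailing window that includes today (the 30-day default is arbitrary):

```python
from datetime import date, timedelta
from typing import List

def review_velocity(review_dates: List[date], today: date,
                    window_days: int = 30) -> float:
    """Average new reviews per day over the trailing window, today inclusive."""
    start = today - timedelta(days=window_days - 1)
    count = sum(1 for d in review_dates if start <= d <= today)
    return count / window_days

dates = [date(2024, 6, 1), date(2024, 6, 5), date(2024, 6, 5), date(2024, 4, 1)]
velocity = review_velocity(dates, today=date(2024, 6, 10), window_days=10)
```

Comparing the same metric across consecutive windows shows whether review activity is accelerating or slowing.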

Delivery & Operations

Webhook

An HTTP push that sends data to your endpoint when a job completes or new records are ready.

Batch Export

A file-based delivery mode in which CSV, JSON, Excel, or Parquet files are produced and sent on a schedule.

Backfill

Historical collection used to populate past records before recurring monitoring begins.

SLA

A service-level agreement covering uptime, support response, job recovery, or delivery windows.

Data Pipeline

The full flow from collection through parsing, validation, storage, and delivery.

Change Detection

Monitoring pages or records for price, availability, content, or ranking changes.
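
At the record level, change detection can be sketched as a field-by-field comparison of two snapshots. The watched field names below are illustrative.

```python
from typing import Any, Dict, Iterable, Tuple

def detect_changes(previous: Dict[str, Any], current: Dict[str, Any],
                   watched: Iterable[str] = ("price", "availability"),
                   ) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (old, new)} for every watched field that differs."""
    changes = {}
    for field in watched:
        old, new = previous.get(field), current.get(field)
        if old != new:
            changes[field] = (old, new)
    return changes

yesterday = {"title": "Desk Lamp", "price": 24.99, "availability": "in_stock"}
today = {"title": "Desk Lamp (New)", "price": 19.99, "availability": "in_stock"}
changes = detect_changes(yesterday, today)
```

Only watched fields are compared, so the title edit above is ignored while the price drop is reported.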