Headers, Pagination, and CAPTCHAs: The Three Walls of Web Scraping
Most scrapers don't fail because the parsing logic is wrong — they fail because of headers, pagination, or CAPTCHAs. This post covers all three at the level of "what's actually blocking you" rather than "what does HTTP say."
Part 1: Web scraping headers
The headers you send say more about whether you'll be blocked than the IP you send from. A real Chrome request carries 12–15 headers in a specific order, with values that match each other (the User-Agent's stated platform must match the sec-ch-ua-platform hint, etc.). A default `requests` or `curl` call sends 3, in the wrong order, with mismatched values.
The headers that matter most:
- User-Agent. The default `python-requests/2.x` UA is blocked by most defended sites. Use a current Chrome / Firefox UA — and update it; UAs older than 6 months are themselves a signal.
- Accept / Accept-Language / Accept-Encoding. Real browsers always send these. Missing values are a tell.
- Sec-Fetch-* headers. Chrome attaches `sec-fetch-site`, `sec-fetch-mode`, and `sec-fetch-dest` to every request. Most non-browser HTTP clients omit them.
- Sec-Ch-Ua-* client hints. Modern Chrome sends these on top-level requests; their values must be coherent with the User-Agent.
- Cookie. A request that should be authenticated but arrives without a cookie is obviously a bot.
- Referer. Real users arrive from somewhere. A direct request to a deep product URL with no Referer is suspicious on most sites.
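A minimal sketch of what a coherent header set looks like in Python with `requests`. The values below are illustrative and will age — copy current ones from a live DevTools session, and note that `requests` can't fully control header order or the TLS layer; the next paragraph covers those.

```python
import requests

# Example header set mimicking Chrome on Windows. Values are illustrative —
# keep the User-Agent, sec-ch-ua-* hints, and platform claims mutually
# consistent, and refresh them from your own browser periodically.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-Ch-Ua": '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',   # must match the UA's stated platform
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Dest": "document",
    "Referer": "https://example.com/",   # arrive "from somewhere"
}

session = requests.Session()
resp = session.get("https://example.com/products/123", headers=HEADERS, timeout=30)
resp.raise_for_status()
```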
Beyond headers, anti-bot vendors fingerprint the TLS handshake itself (JA3 / JA4) — the order of cipher suites, extensions, and supported curves. `requests` and stock `curl` have signatures vendors recognise on sight. Fixes: curl-impersonate, `tls-client` in Python, or a real browser via Playwright.
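A hedged sketch using the `curl_cffi` package, one Python binding for curl-impersonate. The impersonation target name is an assumption — available targets vary by installed version, so check the library's docs.

```python
# pip install curl_cffi
from curl_cffi import requests as cc_requests

# impersonate="chrome" asks curl-impersonate to reproduce a recent Chrome
# TLS handshake (cipher order, extensions, curves) along with matching
# default headers. Target names vary by curl_cffi version.
resp = cc_requests.get(
    "https://example.com/products/123",
    impersonate="chrome",
    timeout=30,
)
print(resp.status_code)
```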
Part 2: Pagination patterns
Pagination is where a scraper that "works on the first page" silently breaks. The four patterns you'll meet:
- Numbered pages (`?page=2`). The simplest. Loop until you stop seeing the next-page link or the result list goes empty. Watch for soft 200 responses that hide "no more results" inside the HTML.
- Cursor / token-based. The response carries a `next_cursor` field; you pass it on the next request. More resilient against duplicate / missing pages than numbered pagination (see the sketch after this list).
- Offset / limit (`?offset=200&limit=50`). Common on REST APIs. Watch for backends that change ordering between requests — you'll get duplicates and gaps. Add a stable sort key.
- Infinite scroll. No URL changes; the page fires an XHR for each batch. Inspect DevTools → Network for the actual API call and treat it as cursor-based or offset-based depending on what it returns.
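A minimal cursor-pagination loop, assuming a hypothetical JSON API whose responses carry `items` and `next_cursor` fields — your endpoint's field names will differ.

```python
import requests

def fetch_all(base_url: str, session: requests.Session) -> list[dict]:
    """Drain a hypothetical cursor-paginated endpoint that returns
    {"items": [...], "next_cursor": "..." | null}."""
    items, cursor = [], None
    while True:
        params = {"limit": 100}
        if cursor:
            params["cursor"] = cursor
        resp = session.get(base_url, params=params, timeout=30)
        resp.raise_for_status()
        page = resp.json()
        batch = page.get("items", [])
        if not batch:          # empty page means done, even if a cursor came back
            break
        items.extend(batch)
        cursor = page.get("next_cursor")
        if not cursor:         # no cursor means last page
            break
    return items
```

The double stop condition matters: some APIs return an empty page with a live cursor, others a final page with no cursor, so checking only one leaves you in an infinite loop or missing the last batch.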
Two pagination traps worth naming. Hard caps: many sites cap results at 1,000 or 10,000 records and silently stop returning new ones. If the count of records you collect is suspiciously round, you're hitting a cap — split the query into smaller filters (by category, by date range, by ZIP) until each shard fits under the cap, as in the sketch below. Order drift: if the site re-sorts on every request, paginating by page number gives you duplicates and gaps; switch to a deterministic cursor or a stable filter.
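A sketch of cap-splitting by date range. The cap value and the `count_for` helper are assumptions standing in for however your target reports match counts — adapt both to the site.

```python
from datetime import date, timedelta

CAP = 1000  # illustrative: the silent per-query result cap

def shards(count_for, start: date, end: date):
    """Recursively split [start, end] into date ranges that each return
    fewer than CAP records. `count_for(a, b)` is a hypothetical helper
    that asks the site how many records match the range."""
    if count_for(start, end) < CAP or start >= end:
        # Note: a single-day shard that still exceeds the cap needs a
        # second split key (category, price band, ZIP).
        yield (start, end)
        return
    mid = start + timedelta(days=(end - start).days // 2)
    yield from shards(count_for, start, mid)
    yield from shards(count_for, mid + timedelta(days=1), end)
```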
Part 3: CAPTCHAs
CAPTCHAs are not a single thing — they're a category. The category determines what (if anything) you should do.
- Cloudflare Turnstile / hCaptcha invisible. Often passes silently when your fingerprint and behavior look human. The fix is upstream — better headers, real browser, behavioral jitter — not "solve the CAPTCHA."
- reCAPTCHA v2 (image grids). Designed to be hard to automate. Solver services exist and work but are slow, expensive, and may be against the target site's TOS.
- reCAPTCHA v3 (score-based). Returns a risk score 0–1; the site decides what threshold to act on. Improving your fingerprint and behavior usually moves the score enough to clear the threshold.
- Custom anti-bot challenges. Akamai Bot Manager, DataDome, PerimeterX, Kasada — each runs JavaScript challenges in the browser. Solving them programmatically is a moving target; mature anti-bot vendors update challenges every few weeks. The practical answer is a real browser with stealth or a managed scraping API that handles the challenges as a service.
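A minimal real-browser fetch with Playwright, per the last point above. The pacing numbers and mouse path are illustrative jitter, not a vetted stealth profile — heavily defended sites may still require a stealth plugin or a managed service.

```python
# pip install playwright && playwright install chromium
import random
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # headless mode is itself a signal
    context = browser.new_context(
        locale="en-US",
        viewport={"width": 1366, "height": 768},
    )
    page = context.new_page()
    page.goto("https://example.com/products/123", wait_until="networkidle")

    # Illustrative behavioral jitter: a few mouse moves and human-ish pauses.
    for _ in range(3):
        page.mouse.move(random.randint(100, 1200), random.randint(100, 700))
        page.wait_for_timeout(random.randint(300, 900))

    html = page.content()
    browser.close()
```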
Two principles. First, treat CAPTCHAs as a downstream symptom of a bad fingerprint, not a problem to solve. If you fix the fingerprint, most CAPTCHAs stop firing. Second, respect the legal layer. Some jurisdictions interpret bypassing access controls as a CFAA-adjacent issue; CAPTCHAs are ambiguous here, but it's a fact pattern worth scoping with counsel before you automate around them at scale. See our legal updates hub for current case law.
The practical hierarchy
- Send the right headers in the right order with coherent values.
- Match TLS fingerprint to the browser you're impersonating.
- Add behavioral jitter — pacing, mouse movement on real-browser scrapes.
- Inspect pagination patterns first; never trust `?page=N` without sanity checks.
- Treat CAPTCHAs as a fingerprint diagnostic, not a wall to brute-force.
- When the maintenance cost exceeds the data value, hand the fetch layer to a managed service.
Past the cat-and-mouse stage?
We deliver structured CSV / JSON on a schedule — proxies, headers, anti-bot, and pagination all included.