Web Scraping with Scrapy: Spiders, Pipelines, and Rotating Proxies
Scrapy remains the default for large-scale crawling in 2026. Async by design, mature ecosystem, built-in queueing, and a clean separation between fetching, parsing, and post-processing. This guide walks through a working project, item pipelines, rotating proxies, and the Scrapy-Playwright bridge for JavaScript-heavy targets.
When Scrapy is the right choice
Scrapy shines when the work shape is "many pages, deduplication, persistence, and respect for crawl etiquette." Its scheduler, request fingerprinting, and middleware system save you from reinventing that plumbing. It struggles when the target is heavily JavaScript-rendered (compose with Scrapy-Playwright) or guarded by aggressive anti-bot fingerprinting (compose with a scraping API or a headless cluster).
Install & project
pip install scrapy
scrapy startproject shop
cd shop
scrapy genspider products example.com
A working spider
# shop/spiders/products.py
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products?page=1"]
    custom_settings = {
        "DOWNLOAD_DELAY": 0.5,
        "CONCURRENT_REQUESTS": 8,
        "USER_AGENT": "Mozilla/5.0 ([email protected])",
    }

    def parse(self, response):
        for card in response.css(".product-card"):
            yield {
                "title": card.css(".title::text").get(default="").strip(),
                "price": card.css(".price::text").get(default="").strip(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
Run with scrapy crawl products -O products.jsonl. Scrapy handles deduplication, retries, and concurrency for free. Set a User-Agent that identifies your data team: it's polite, and many sites whitelist named crawlers that they would block as anonymous.
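The "for free" behaviour is configurable. A minimal sketch of the settings behind it, with illustrative values rather than recommendations:
# settings.py (illustrative knobs for the built-in politeness/retry/dedup machinery)
ROBOTSTXT_OBEY = True        # check robots.txt before fetching
RETRY_ENABLED = True
RETRY_TIMES = 2              # retries after the first attempt
AUTOTHROTTLE_ENABLED = True  # back off automatically when the server slows
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"  # the default request dedup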
Item pipelines
# shop/pipelines.py
import re

class CleanPricePipeline:
    def process_item(self, item, spider):
        raw = item.get("price", "")
        m = re.search(r"[\d.,]+", raw)
        if m:
            item["price_value"] = float(m.group().replace(",", ""))
        return item

# settings.py
ITEM_PIPELINES = {"shop.pipelines.CleanPricePipeline": 300}
Pipelines are where you put cleanup, validation, deduplication against a database, and persistence. Keep spiders dumb and pipelines smart; that separation pays off as schemas drift.
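To make the validation and dedup half concrete, here is a hypothetical second pipeline; DedupPipeline and its in-memory set are illustrative, and the set would become a database lookup once dedup must survive across runs:
# shop/pipelines.py (hypothetical sketch: validation + in-run dedup)
from scrapy.exceptions import DropItem

class DedupPipeline:
    def open_spider(self, spider):
        self.seen = set()  # in-memory; swap for a DB check across runs

    def process_item(self, item, spider):
        if not item.get("url"):
            raise DropItem("item missing url")
        if item["url"] in self.seen:
            raise DropItem(f"duplicate item: {item['url']}")
        self.seen.add(item["url"])
        return item
Register it in ITEM_PIPELINES alongside CleanPricePipeline with a higher number (e.g. 400) so it runs after price cleanup.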
Rotating proxies
# Gateway-style rotating proxy. HttpProxyMiddleware is enabled by default
# (priority 110) and reads the standard proxy environment variables:
#   export http_proxy="http://USER:[email protected]:7777"
#   export https_proxy="http://USER:[email protected]:7777"

# Or assign per-request in the spider
PROXY = "http://USER:[email protected]:7777"

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, meta={"proxy": PROXY})
For heavy crawling, prefer a single gateway URL where the provider rotates the exit IP per request. Pool-based rotation libraries like scrapy-rotating-proxies still work, but they duplicate logic that good proxy providers now do server-side.
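If you must rotate a small static pool client-side anyway, the pattern those libraries implement fits in a short downloader middleware; RotatingProxyMiddleware and the PROXY_POOL setting below are hypothetical names, not Scrapy built-ins:
# shop/middlewares.py (hypothetical round-robin sketch)
import itertools

class RotatingProxyMiddleware:
    def __init__(self, proxies):
        self._pool = itertools.cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_POOL is a custom setting: a list of proxy URLs
        return cls(crawler.settings.getlist("PROXY_POOL"))

    def process_request(self, request, spider):
        request.meta["proxy"] = next(self._pool)
Enable it in DOWNLOADER_MIDDLEWARES with a priority below 110 so it sets meta["proxy"] before the built-in HttpProxyMiddleware reads it.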
JavaScript-heavy targets: Scrapy + Playwright
pip install scrapy-playwright
playwright install chromium
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In a spider:
def start_requests(self):
    yield scrapy.Request(
        "https://spa.example.com/products",
        # with playwright_include_page, the callback must be async and
        # must close the page when done, or pages will leak
        meta={"playwright": True, "playwright_include_page": True},
    )
This gives you Scrapy's scheduler and pipelines with Playwright's rendering when needed. Use it sparingly: full-page renders are 50–200× more expensive than plain HTTP requests, so route only the URLs that actually need them.
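A sketch of that routing, assuming render-needing URLs are distinguishable by path; JS_PATHS and needs_js are hypothetical helpers, and scrapy-playwright falls back to a plain HTTP fetch when the playwright meta key is falsy:
# products.py (sketch: render only what needs it)
import scrapy

JS_PATHS = ("/app/",)  # hypothetical: paths known to require JavaScript

def needs_js(url: str) -> bool:
    return any(path in url for path in JS_PATHS)

class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://spa.example.com/products"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={"playwright": needs_js(url)})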
Anti-bot reality check
Out of the box, Scrapy's HTTP fingerprint is recognised by every major anti-bot vendor. Mitigations include using Scrapy-Playwright with stealth, fronting with curl-impersonate, or routing fetches through a managed scraping API and keeping Scrapy as the orchestration layer. All three patterns are common in production.
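The managed-API pattern is mostly URL rewriting; the endpoint and api_key parameter below are placeholders for whatever provider you use:
# Hypothetical: delegate fetching to a scraping API, keep Scrapy as orchestrator
from urllib.parse import urlencode

API_ENDPOINT = "https://api.scraper.example/v1/fetch"  # placeholder
API_KEY = "..."  # placeholder

def start_requests(self):
    for url in self.start_urls:
        fetch_url = f"{API_ENDPOINT}?{urlencode({'api_key': API_KEY, 'url': url})}"
        yield scrapy.Request(fetch_url, cb_kwargs={"source_url": url})
The API returns rendered HTML, so existing parse callbacks keep working; accept source_url as a callback argument so items still record their true origin.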
Operating Scrapy at scale is real work. If you'd rather skip the cat-and-mouse and receive structured data on a schedule, see our tools category guide and our managed extraction service: we deliver scheduled CSV on a fixed monthly contract, so you never maintain proxies and pipelines yourself.