Examples
Crawl a million website pages without hiding failures
Scrape the archive, not the easy page sample
In this example we:
- Read 1,000,000 URLs from a file.
- Scrape them in 500-URL chunks.
- Stream JSONL rows with either parsed fields or errors.
The stale pages, redirects, 503s, and broken HTML are the dataset. A happy-path sample filters out the part you most need to measure.
Step 1: Chunk URLs
The client only plans work. Each worker gets a chunk.
with open("urls.txt") as f:
urls = [u.strip() for u in f if u.strip()]
CHUNK = 500
chunks = [urls[i:i + CHUNK] for i in range(0, len(urls), CHUNK)]
Step 2: Fetch and parse inside the worker
The worker keeps one HTTP client open, backs off on temporary failures, and returns an error row when a page fails.
def scrape_chunk(urls: list[str]) -> list[dict]:
import random, time, httpx
from selectolax.parser import HTMLParser
out = []
with httpx.Client(http2=True, timeout=20.0, follow_redirects=True) as client:
for url in urls:
for attempt in range(4):
try:
r = client.get(url)
if r.status_code in (429, 503):
time.sleep(2 ** attempt + random.random())
continue
r.raise_for_status()
tree = HTMLParser(r.text)
title = tree.css_first("title")
out.append({"url": url, "title": title.text(strip=True) if title else None})
break
except (httpx.HTTPError, httpx.TimeoutException) as e:
if attempt == 3:
out.append({"url": url, "error": str(e)})
time.sleep(0.5 + random.random() * 0.5)
return out
Step 3: Stream the crawl
Chunks stream back as they finish, so the scrape can run for hours without building one giant result list.
from burla import remote_parallel_map
for rows in remote_parallel_map(
scrape_chunk,
chunks,
func_cpu=1,
func_ram=2,
max_parallelism=1000,
generator=True,
grow=True,
):
for row in rows:
f.write(json.dumps(row) + "\n")
What's the point?
Scraping gets weird fast. DNS, TLS, parser misses, per-site politeness, and failure logging all matter.
This design is plain on purpose: chunk URLs, reuse a client, parse the fields you need, return an error row when something fails. Once the output exists, you can compute failure rates by host or retry only bad chunks.