Limit parallelism for APIs, databases, and websites
Keep Burla jobs inside external service limits.
Use this when the slowest or most fragile part of your job is outside Burla. Do not use every available CPU when an API quota, website, database, or model provider is the real limit. The unit of work is usually a chunk of IDs, URLs, prompts, files, or database ranges. Each worker should reuse one client or connection inside that chunk. The output should include successes and failures so you can retry only the work that failed.
Parallelism is not always the target. Sometimes the target is finishing the whole job without breaking the contract with another system.
Start from the external limit
Write down the real limit first.
Examples:
- API: 1,000 requests per second
- website: 2 requests per second per worker, plus a global worker cap
- Postgres: 200 safe write connections
- LLM provider: 60,000 tokens per minute
- vector database: 100 concurrent upsert batches
Then choose:
- chunk size
- per-worker pacing
- max_parallelism
The rough formula is:
global throughput = live workers * per-worker throughput
If each worker makes one request per second and you set max_parallelism=500, your job tries to make about 500 requests per second.
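The rough formula can also be inverted to pick a starting cap. The following is a minimal sketch, not part of Burla: `workers_for_limit` and its `safety` fraction are hypothetical names introduced here for illustration.

```python
import math


def workers_for_limit(global_limit_per_s, per_worker_rate_per_s, safety=0.5):
    """Return a worker cap that keeps a job under a global rate limit.

    global_limit_per_s: the external limit, in requests per second.
    per_worker_rate_per_s: requests one worker makes per second.
    safety: fraction of the limit to use (hypothetical, not a Burla setting).
    """
    if per_worker_rate_per_s <= 0:
        raise ValueError("per_worker_rate_per_s must be positive")
    budget = global_limit_per_s * safety
    # Round down so the job stays inside the budget; always allow one worker.
    return max(1, math.floor(budget / per_worker_rate_per_s))


# 1,000 requests/second limit, one request per second per worker, half the limit.
print(workers_for_limit(1000, 1.0))  # 500
```

The result is a candidate value for max_parallelism. Halving the pacing sleep doubles per-worker throughput, so the cap should be recomputed whenever pacing changes.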
Chunk IDs for an API backfill
Plan chunks on the client.
def chunks(items, size):
    return [items[i:i + size] for i in range(0, len(items), size)]


with open("user_ids.txt") as f:
    user_ids = [line.strip() for line in f if line.strip()]

id_chunks = chunks(user_ids, 1000)
Put pacing and provider behavior next to the HTTP call.
def enrich_users(user_ids):
    import os
    import time

    import httpx

    rows = []
    headers = {"Authorization": f"Bearer {os.environ['API_TOKEN']}"}
    # One client per chunk, so connections are reused across requests.
    with httpx.Client(timeout=30.0, headers=headers) as client:
        for user_id in user_ids:
            response = client.get(f"https://api.example.com/v1/users/{user_id}")
            if response.status_code == 429:
                # Rate limited: record the failure so this ID can be retried later.
                rows.append({"user_id": user_id, "ok": False, "status": 429})
            else:
                # Any other error status raises and fails the whole chunk.
                response.raise_for_status()
                rows.append({"user_id": user_id, "ok": True, "profile": response.json()})
            # Per-worker pacing: about one request per second.
            time.sleep(1.0)
    return rows
Cap live workers with max_parallelism.
import json

from burla import remote_parallel_map

with open("profiles.jsonl", "w") as f:
    for rows in remote_parallel_map(
        enrich_users,
        id_chunks,
        func_cpu=1,
        func_ram=2,
        max_parallelism=500,
        generator=True,
        grow=True,
    ):
        for row in rows:
            f.write(json.dumps(row) + "\n")
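The JSONL file is both the output and the retry manifest. Every row carries an ok flag, so failed IDs can be filtered out and resubmitted without repeating the work that succeeded.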
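Reading the manifest back for a retry could look like the sketch below. It assumes the row shape written by enrich_users (a user_id field and a boolean ok field). `failed_user_ids` is a hypothetical helper name, and `chunks` is repeated only to keep the block self-contained.

```python
import json


def failed_user_ids(path):
    """Collect user IDs from rows that were written with ok == False."""
    failed = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            if not row.get("ok"):
                failed.append(row["user_id"])
    return failed


def chunks(items, size):
    return [items[i:i + size] for i in range(0, len(items), size)]


# Usage sketch: re-chunk only the failures and map the same worker over them.
# retry_chunks = chunks(failed_user_ids("profiles.jsonl"), 1000)
```

Write the retry results to a new file. Opening profiles.jsonl again with mode "w" truncates it and discards the rows that already succeeded.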
Keep one database connection per worker
For databases, count connections before CPUs.
If each worker opens one connection and Postgres can safely handle 80 write connections, start with max_parallelism=80.
def load_file_to_postgres(key):
    import gzip
    import json
    import os

    import boto3
    import psycopg2
    from psycopg2.extras import execute_values

    body = boto3.client("s3").get_object(Bucket="my-events", Key=key)["Body"].read()
    rows = [json.loads(line) for line in gzip.decompress(body).splitlines()]
    values = [(row["event_id"], row["user_id"], row["ts"]) for row in rows]

    # One connection per worker call.
    connection = psycopg2.connect(os.environ["DATABASE_URL"])
    try:
        # `with connection` commits on success and rolls back on error; it does not close.
        with connection, connection.cursor() as cursor:
            execute_values(
                cursor,
                "INSERT INTO events(event_id, user_id, ts) VALUES %s ON CONFLICT DO NOTHING",
                values,
                page_size=1000,
            )
    finally:
        # Release the connection even if the insert fails.
        connection.close()
    return {"key": key, "rows": len(values)}
for report in remote_parallel_map(
    load_file_to_postgres,
    s3_keys,
    func_cpu=1,
    func_ram=2,
    max_parallelism=80,
    generator=True,
    grow=True,
):
    print(report["key"], report["rows"])
The bottleneck here is not Python. It is the sink.
Be polite to websites
For static pages, one worker should keep one HTTP client open for a chunk of URLs.
def scrape_urls(urls):
    import random
    import time

    import httpx
    from selectolax.parser import HTMLParser

    rows = []
    # http2=True needs the optional HTTP/2 dependency: pip install "httpx[http2]".
    with httpx.Client(http2=True, timeout=20.0, follow_redirects=True) as client:
        for url in urls:
            try:
                response = client.get(url)
                if response.status_code in (429, 503):
                    # The site is asking us to slow down: record it and move on.
                    rows.append({"url": url, "ok": False, "status": response.status_code})
                else:
                    response.raise_for_status()
                    title = HTMLParser(response.text).css_first("title")
                    rows.append({"url": url, "ok": True, "title": title.text(strip=True) if title else None})
            except httpx.HTTPError as error:
                rows.append({"url": url, "ok": False, "error": str(error)})
            # Jittered pacing: 0.5 to 1.0 seconds between requests.
            time.sleep(0.5 + random.random() * 0.5)
    return rows
import json

url_chunks = chunks(urls, 500)

with open("scrape-results.jsonl", "w") as f:
    for rows in remote_parallel_map(
        scrape_urls,
        url_chunks,
        func_cpu=1,
        func_ram=2,
        max_parallelism=200,
        generator=True,
        grow=True,
    ):
        for row in rows:
            f.write(json.dumps(row) + "\n")
For pages that need JavaScript, use a browser image or a browser-specific tool. httpx fetches only the raw HTML and does not run scripts, so a successful httpx response does not show that the rendered page was scraped.
Model providers and token limits
For an LLM provider, the limit is often tokens per minute, not requests per second.
Estimate tokens per input, then choose a chunk size and worker count that stay below the limit.
PROMPTS_PER_WORKER = 20
SECONDS_BETWEEN_PROMPTS = 2.0
MAX_WORKERS = 50
prompt_chunks = chunks(prompts, PROMPTS_PER_WORKER)
This tries to send about 25 prompts per second across the job:
50 workers * 1 prompt every 2 seconds = 25 prompts per second
If the provider bills or limits by token, reduce worker count when prompts or outputs get longer.
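To turn a token limit into a worker count, estimate the tokens each prompt consumes (input plus output) and divide. This is a minimal sketch: `max_workers_for_tokens` is a hypothetical helper, and the 400-token estimate is an assumed figure, not a measured one.

```python
import math


def max_workers_for_tokens(tokens_per_minute, tokens_per_prompt, seconds_between_prompts):
    """Return the largest worker count that stays within a tokens-per-minute limit."""
    prompts_per_worker_per_minute = 60.0 / seconds_between_prompts
    tokens_per_worker_per_minute = prompts_per_worker_per_minute * tokens_per_prompt
    # Round down to stay under the limit; always allow one worker.
    return max(1, math.floor(tokens_per_minute / tokens_per_worker_per_minute))


# 60,000 tokens/minute limit, about 400 tokens per prompt including output (assumed),
# one prompt every 2 seconds per worker.
print(max_workers_for_tokens(60_000, 400, 2.0))  # 5
```

Under these assumed numbers the cap is 5 workers. At the same pacing, 50 workers would fit inside a limit of 60,000 tokens per minute only if each prompt and its response averaged about 40 tokens, so the token estimate matters more than the request rate.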
Choose the first value for max_parallelism
Start lower than the theoretical limit.
Examples:
- API allows 1,000 requests per second. Start at 500.
- Postgres has 200 available connections. Start at 80.
- Website tolerated 100 workers in a test. Start at 50.
- GPU quota allows 16 workers. Start at 8.
- Vector database allows 100 concurrent upsert batches. Start at 40.
Raise the cap after you see clean logs, stable latency, and no growing error rate.
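One way to make the error-rate check concrete is to compute it from the result rows of the previous run. The sketch below assumes rows carry the ok flag used in the examples above. The 1% threshold, the 1.5x step, and the halving on errors are hypothetical choices, not Burla behavior or provider guidance.

```python
def error_rate(rows):
    """Fraction of result rows with ok == False; 0.0 for an empty list."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if not row.get("ok")) / len(rows)


def next_cap(current_cap, rows, hard_limit, max_error_rate=0.01, step=1.5):
    """Suggest the next max_parallelism after a run.

    Raises the cap by `step` when the error rate is within `max_error_rate`,
    otherwise halves it. Never exceeds `hard_limit` or drops below 1.
    """
    if error_rate(rows) > max_error_rate:
        return max(1, current_cap // 2)
    return min(hard_limit, int(current_cap * step))


clean_run = [{"ok": True}] * 100
noisy_run = [{"ok": True}] * 90 + [{"ok": False}] * 10
print(next_cap(500, clean_run, hard_limit=1000))  # 750
print(next_cap(500, noisy_run, hard_limit=1000))  # 250
```

Error rate is only one of the three signals. Logs and latency still need a look before the suggested cap is applied.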