Examples

Real Burla workloads for ML, data pipelines, production IO, and scientific computing.

ML, embeddings, and search


GPU embeddings on A100s	Embed 50,000 Wikipedia articles with a CUDA image, CPU download stage, GPU embedding stage, and shared vector artifacts.	gpu-embedding-demo.md	gpu-embedding-demo.png
Batch inference without serving	Load a Hugging Face model once per worker and score Parquet batches without building an endpoint.	ml-inference-batch.md	ml-inference-batch.png
Embed the whole arXiv	Cluster 2.7M abstracts and find isolated papers by running the embedding job at corpus scale.	arxiv-fossils.md	arxiv-fossils.png
Label-free visual search over the Met	Fetch and embed Open Access museum images, then use FAISS to find visual matches without labels.	met-weirdest-art.md	met-weirdest-art.png
Multimodal Airbnb analysis	Analyze listings, photos, and reviews across the public corpus with CLIP, YOLOv8, and bootstrap confidence intervals.	airbnb-burla.md	airbnb-burla.png
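
The "load a model once per worker" idea from the batch inference entry can be sketched locally with a cached loader. This is a minimal illustration, not the code from the example page: `get_model`, `score_batch`, and `LOAD_COUNT` are hypothetical names, and the "model" is a stand-in function rather than a Hugging Face model.

```python
from functools import lru_cache
from typing import Callable, List

LOAD_COUNT = 0  # counts how often the (hypothetical) expensive load runs


@lru_cache(maxsize=1)
def get_model() -> Callable[[str], float]:
    """Build the model on first use; later calls in this process reuse it."""
    global LOAD_COUNT
    LOAD_COUNT += 1

    def score(text: str) -> float:
        # Stand-in for real inference: word count scaled down by 10.
        return len(text.split()) / 10.0

    return score


def score_batch(batch: List[str]) -> List[float]:
    """Score one batch; the model load is paid at most once per process."""
    model = get_model()
    return [model(text) for text in batch]


batches = [["a short text", "another slightly longer text"], ["third"]]
scores = [score_batch(batch) for batch in batches]
```

The cache lives in process memory, so each worker process would load its own copy once and reuse it across the batches it receives.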

Full-corpus analysis


571M Amazon reviews	Read 275 GB of JSONL with HTTP Range requests, then apply deterministic scoring and heap-based reducers.	amazon-review-distiller.md	amazon-review-distiller.png
NYC taxi history	Scan 2.76B taxi and FHV trips to find ghost, emergent, and recovered city zones.	nyc-ghost-neighborhoods.md	nyc-ghost-neighborhoods.png
9.49M Flickr photos	Reverse-geocode public photos and build country signatures from user-written tags.	world-photo-index.md	world-photo-index.png
NOAA rain extremes	Stream every yearly GHCN-Daily CSV, keep top-N heaps, and reduce station-level extremes.	ghcn-rainiest-day.md	ghcn-rainiest-day.png
One million GitHub READMEs	Export README Parquet from BigQuery, shard deterministic summarizers, and reduce category stats.	github-repo-summarizer.md	github-repo-summarizer.png
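
Several entries above mention heap-based reducers: each shard keeps only its best few records, and a final step merges those small lists. Here is a minimal local sketch of that pattern using the standard library. The names `top_k_in_shard` and `merge_top_k` are hypothetical, and the shards are small in-memory lists rather than remote files.

```python
import heapq
from typing import Iterable, List, Tuple

Record = Tuple[float, str]  # (score, identifier)


def top_k_in_shard(records: Iterable[Record], k: int) -> List[Record]:
    """Map step: stream one shard and keep only its k highest-scoring records."""
    heap: List[Record] = []  # min-heap, so heap[0] is the weakest record kept
    for rec in records:
        if len(heap) < k:
            heapq.heappush(heap, rec)
        elif rec > heap[0]:
            heapq.heapreplace(heap, rec)  # drop the weakest, keep the new record
    return sorted(heap, reverse=True)


def merge_top_k(partials: Iterable[List[Record]], k: int) -> List[Record]:
    """Reduce step: merge the per-shard lists into one global top k."""
    return heapq.nlargest(k, (rec for part in partials for rec in part))


shards = [
    [(0.2, "a"), (0.9, "b"), (0.5, "c")],
    [(0.7, "d"), (0.1, "e")],
    [(0.95, "f"), (0.3, "g"), (0.6, "h")],
]
partials = [top_k_in_shard(shard, k=2) for shard in shards]
overall = merge_top_k(partials, k=3)
```

Memory per shard stays bounded by `k`, and the merge is exact as long as each shard keeps at least as many records as the final result needs from it.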

Production data jobs


S3 to Postgres ETL	Transform 10,000 gzipped JSON files while protecting Postgres with `max_parallelism`.	python-etl-no-airflow.md	python-etl-no-airflow.png
Millions of image resizes	Chunk image keys, resize with Pillow, and stream progress as workers write outputs.	image-dataset-resize.md	image-dataset-resize.png
One Parquet file per worker	Compute per-file QA stats without starting Spark for a simple file-parallel job.	parquet-parallel.md	parquet-parallel.png
Pandas apply in parallel	Partition a Parquet dataset and run ordinary pandas code on each worker.	pandas-apply-parallel.md	pandas-apply-parallel.png
Enrich millions of users through a rate-limited API	Backfill user profiles while keeping provider limits explicit in chunk size, sleeps, and `max_parallelism`.	rate-limited-api-requests.md	rate-limited-api-requests.png
Crawl a million website pages without hiding failures	Scrape static HTML with polite pacing, retries, error rows, and a global concurrency cap.	parallel-web-scraping.md	parallel-web-scraping.png
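
The jobs in this group share a shape: chunk the work, cap how many chunks run at once, and record failures as rows instead of hiding them. The sketch below shows that shape locally. It makes two assumptions: a thread pool's `max_workers` stands in for the concurrency cap (Burla's `max_parallelism` is not called here), and `process_chunk` and the "bad" key rule are hypothetical placeholders for real work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def chunked(items: List[str], size: int) -> List[List[str]]:
    """Split a list of keys into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(chunk: List[str]) -> List[Dict[str, str]]:
    """Process one chunk and return one row per key, including failures."""
    rows = []
    for key in chunk:
        try:
            if key.endswith("bad"):  # hypothetical failure condition
                raise ValueError(f"cannot process {key}")
            rows.append({"key": key, "status": "ok", "error": ""})
        except ValueError as exc:
            # Keep the failure visible as data rather than aborting the chunk.
            rows.append({"key": key, "status": "error", "error": str(exc)})
    return rows


keys = [f"item-{i}" for i in range(9)] + ["item-bad"]
chunks = chunked(keys, size=4)

# At most two chunks are processed concurrently; results keep chunk order.
with ThreadPoolExecutor(max_workers=2) as pool:
    rows = [row for chunk_rows in pool.map(process_chunk, chunks) for row in chunk_rows]

failed = [row["key"] for row in rows if row["status"] == "error"]
```

Returning error rows makes it possible to retry only the failed keys later and to audit how much of the input actually succeeded.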

Scientific and geospatial work


Genome alignment	Run `bwa` and `samtools` in a custom image with one FASTQ pair per worker.	bioinformatics-alignment.md	bioinformatics-alignment.png
GDAL raster processing	Compute NDVI one Sentinel tile at a time with `rasterio` and shared outputs.	gdal-raster-processing.md	gdal-raster-processing.png
Billion-path Monte Carlo	Return sums and squared sums from independent chunks, then reduce locally.	monte-carlo-simulation.md	monte-carlo-simulation.png
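
The Monte Carlo entry describes returning sums and squared sums from independent chunks and reducing them locally. Here is a minimal sketch of that reduction, assuming a toy payoff `max(Z, 0)` for standard normal `Z`. The names `simulate_chunk` and `reduce_chunks`, the payoff, and the chunk sizes are illustrative choices, not taken from the example page. The built-in `map` runs the chunks locally in place of a cluster-wide map.

```python
import math
import random
from typing import List, Tuple

ChunkResult = Tuple[int, float, float]  # (path count, sum, sum of squares)


def simulate_chunk(args: Tuple[int, int]) -> ChunkResult:
    """Simulate one independent chunk and return only its sufficient statistics."""
    seed, n_paths = args
    rng = random.Random(seed)  # per-chunk seed keeps chunks independent and reproducible
    total = 0.0
    total_sq = 0.0
    for _ in range(n_paths):
        payoff = max(rng.gauss(0.0, 1.0), 0.0)
        total += payoff
        total_sq += payoff * payoff
    return n_paths, total, total_sq


def reduce_chunks(results: List[ChunkResult]) -> Tuple[float, float]:
    """Combine chunk statistics into a mean and its standard error."""
    n = sum(r[0] for r in results)
    s = sum(r[1] for r in results)
    sq = sum(r[2] for r in results)
    mean = s / n
    variance = max(sq / n - mean * mean, 0.0)  # guard against tiny negative rounding
    return mean, math.sqrt(variance / n)


chunks = [(seed, 20_000) for seed in range(8)]
results = list(map(simulate_chunk, chunks))
mean, std_error = reduce_chunks(results)
```

Each chunk returns three numbers regardless of how many paths it simulates, so the reduction stays cheap as the path count grows. For this payoff the exact mean is `1 / sqrt(2 * pi)`, which gives a simple sanity check on the estimate.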