Examples

Real Burla workloads for ML, data pipelines, production I/O, and scientific computing.

GPU embeddings on A100s: Embed 50,000 Wikipedia articles using a CUDA image, a CPU download stage, a GPU embedding stage, and shared vector artifacts. (gpu-embedding-demo.md)
Batch inference without serving: Load a Hugging Face model once per worker and score Parquet batches without building an endpoint. (ml-inference-batch.md)
Embed the whole arXiv: Cluster 2.7M abstracts and find isolated papers by running the embedding job at corpus scale. (arxiv-fossils.md)
Label-free visual search over the Met: Fetch and embed Open Access museum images, then use FAISS to find visual matches without labels. (met-weirdest-art.md)
Multimodal Airbnb analysis: Combine listings, photos, and reviews with CLIP, YOLOv8, and bootstrap confidence intervals across the public corpus. (airbnb-burla.md)
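
The "load a model once per worker" idea in the batch inference example can be sketched in plain Python. This is a minimal illustration, not the code from ml-inference-batch.md: `get_model`, `score_batch`, and the toy word-count "model" are hypothetical stand-ins, and the cache is only shared within one process, so each real worker process would perform its own single load.

```python
from functools import lru_cache

LOAD_COUNT = 0  # counts how many times the expensive load actually runs


@lru_cache(maxsize=1)
def get_model():
    """Load the model on first use and reuse it for later calls in this process."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    # Hypothetical stand-in for an expensive model load:
    # a "model" that scores a text by its word count.
    return lambda text: len(text.split())


def score_batch(batch):
    """Score one batch; the cached model is fetched, not reloaded."""
    model = get_model()
    return [model(text) for text in batch]


batches = [["a b c", "d e"], ["f"], ["g h i j"]]
scores = [score_batch(batch) for batch in batches]
print(scores, "loads:", LOAD_COUNT)
```

The function that scores a batch stays stateless from the caller's point of view, while the load cost is paid once per process rather than once per batch.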

Full-corpus analysis

571M Amazon reviews: Read 275 GB of JSONL with HTTP Range requests, score reviews deterministically, and aggregate with heap-based reducers. (amazon-review-distiller.md)
NYC taxi history: Scan 2.76B taxi and FHV trips to find ghost, emergent, and recovered city zones. (nyc-ghost-neighborhoods.md)
9.49M Flickr photos: Reverse-geocode public photos and build country signatures from user-written tags. (world-photo-index.md)
NOAA rain extremes: Stream every yearly GHCN-Daily CSV, keep top heaps, and reduce station-level extremes. (ghcn-rainiest-day.md)
One million GitHub READMEs: Export README Parquet from BigQuery, shard deterministic summarizers, and reduce category stats. (github-repo-summarizer.md)
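
Several of the examples above keep a small heap per shard and merge the heaps at the end. A minimal sketch of that top-k pattern using only the standard library follows; the function names, the `(score, key)` record shape, and the sample station data are assumptions made for illustration and are not taken from the linked examples.

```python
import heapq


def top_k_in_shard(records, k):
    """Map step: keep only the k largest (score, key) pairs seen in one shard."""
    heap = []  # min-heap, so heap[0] is the smallest item currently kept
    for item in records:
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # Replace the current minimum with the better item.
            heapq.heapreplace(heap, item)
    return heap


def merge_top_k(partials, k):
    """Reduce step: merge per-shard heaps into one global top-k, best first."""
    return heapq.nlargest(k, (item for part in partials for item in part))


# Made-up sample data: (score, key) pairs split across three shards.
shards = [
    [(12.5, "station-a"), (80.1, "station-b"), (3.0, "station-c")],
    [(95.4, "station-d"), (41.2, "station-e")],
    [(60.0, "station-f"), (99.9, "station-g"), (7.7, "station-h")],
]
partials = [top_k_in_shard(shard, k=2) for shard in shards]
top = merge_top_k(partials, k=2)
print(top)
```

Each shard returns at most k items regardless of its size, so the final merge stays small enough to run on the client.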

Production data jobs

S3 to Postgres ETL: Transform 10,000 gzipped JSON files while protecting Postgres with max_parallelism. (python-etl-no-airflow.md)
Millions of image resizes: Chunk image keys, resize with Pillow, and stream progress as workers write outputs. (image-dataset-resize.md)
One Parquet file per worker: Compute per-file QA stats without starting Spark for a simple file-parallel job. (parquet-parallel.md)
Pandas apply in parallel: Partition a Parquet dataset and run ordinary pandas code on each worker. (pandas-apply-parallel.md)
Enrich millions of users through a rate-limited API: Backfill user profiles while keeping provider limits explicit in chunk size, sleeps, and max_parallelism. (rate-limited-api-requests.md)
Crawl a million website pages without hiding failures: Scrape static HTML with polite pacing, retries, error rows, and a global concurrency cap. (parallel-web-scraping.md)

Scientific and geospatial work

Genome alignment: Run bwa and samtools in a custom image with one FASTQ pair per worker. (bioinformatics-alignment.md)
GDAL raster processing: Compute NDVI one Sentinel tile at a time with rasterio and shared outputs. (gdal-raster-processing.md)
Billion-path Monte Carlo: Return sums and squared sums from independent chunks, then reduce locally. (monte-carlo-simulation.md)
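
The Monte Carlo reduction can be sketched as follows: each chunk returns only a count, a sum, and a sum of squares, and the client combines them into a mean and standard error. The payoff `max(Z, 0)` for a standard normal `Z`, the seeds, and the chunk sizes are assumptions chosen for illustration; they are not taken from monte-carlo-simulation.md.

```python
import math
import random


def simulate_chunk(seed, n_paths):
    """One independent chunk: return (count, sum, sum of squares) of the payoffs."""
    rng = random.Random(seed)  # a per-chunk seed keeps chunks independent and reproducible
    total = 0.0
    total_sq = 0.0
    for _ in range(n_paths):
        payoff = max(rng.gauss(0.0, 1.0), 0.0)  # hypothetical payoff function
        total += payoff
        total_sq += payoff * payoff
    return n_paths, total, total_sq


def reduce_chunks(partials):
    """Local reduce: combine chunk statistics into a mean and its standard error."""
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    ss = sum(p[2] for p in partials)
    mean = s / n
    variance = (ss - n * mean * mean) / (n - 1)  # unbiased sample variance
    return mean, math.sqrt(variance / n)


partials = [simulate_chunk(seed, 20_000) for seed in range(8)]
mean, std_error = reduce_chunks(partials)
print(f"mean={mean:.4f} +/- {std_error:.4f}")
```

Because each chunk returns three numbers instead of every path, the data sent back to the client does not grow with the number of paths. For this payoff the exact expectation is 1/sqrt(2*pi), about 0.399, which gives a quick sanity check on the estimate.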