Examples
Real Burla workloads for ML, data pipelines, production IO, and scientific computing.
ML, embeddings, and search
| Example | Summary | Page | Cover image |
| --- | --- | --- | --- |
| GPU embeddings on A100s | Embed 50,000 Wikipedia articles with a CUDA image, CPU download stage, GPU embedding stage, and shared vector artifacts. | gpu-embedding-demo.md | gpu-embedding-demo.png |
| Batch inference without serving | Load a Hugging Face model once per worker and score Parquet batches without building an endpoint. | ml-inference-batch.md | ml-inference-batch.png |
| Embed the whole arXiv | Cluster 2.7M abstracts and find isolated papers by running the embedding job at corpus scale. | arxiv-fossils.md | arxiv-fossils.png |
| Label-free visual search over the Met | Fetch and embed Open Access museum images, then use FAISS to find visual matches without labels. | met-weirdest-art.md | met-weirdest-art.png |
| Multimodal Airbnb analysis | Analyze listings, photos, and reviews across the public corpus with CLIP, YOLOv8, and bootstrap confidence intervals. | airbnb-burla.md | airbnb-burla.png |
Full-corpus analysis
| Example | Summary | Page | Cover image |
| --- | --- | --- | --- |
| 571M Amazon reviews | Read 275 GB of JSONL with HTTP Range requests, deterministic scoring, and heap-based reducers. | amazon-review-distiller.md | amazon-review-distiller.png |
| NYC taxi history | Scan 2.76B taxi and FHV trips to find ghost, emergent, and recovered city zones. | nyc-ghost-neighborhoods.md | nyc-ghost-neighborhoods.png |
| 9.49M Flickr photos | Reverse-geocode public photos and build country signatures from user-written tags. | world-photo-index.md | world-photo-index.png |
| NOAA rain extremes | Stream every yearly GHCN-Daily CSV, keep top heaps, and reduce station-level extremes. | ghcn-rainiest-day.md | ghcn-rainiest-day.png |
| One million GitHub READMEs | Export README Parquet from BigQuery, shard deterministic summarizers, and reduce category stats. | github-repo-summarizer.md | github-repo-summarizer.png |
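
Several of the full-corpus examples above mention heap-based reducers and top heaps. The sketch below shows the general shape of that pattern in plain Python, not the code from any of the linked examples. It assumes each worker keeps only its `k` best `(score, item)` pairs and a final step merges the partial results. The function names and sample data are hypothetical, and the built-in `map` stands in for the distributed map step.

```python
import heapq

def top_k_in_chunk(chunk, k=3):
    """Map step: keep only the k highest-scoring records from one chunk."""
    heap = []  # min-heap of (score, item); the smallest kept score is heap[0]
    for item, score in chunk:
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:
            # New record beats the weakest kept record, so swap it in.
            heapq.heapreplace(heap, (score, item))
    return heap

def merge_top_k(partials, k=3):
    """Reduce step: merge per-chunk heaps into one global top-k list."""
    merged = [pair for partial in partials for pair in partial]
    return heapq.nlargest(k, merged)

# Hypothetical sample data: two chunks of (item, score) records.
chunks = [
    [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.1)],
    [("e", 0.7), ("f", 0.95), ("g", 0.3)],
]
partials = list(map(top_k_in_chunk, chunks))  # local stand-in for the parallel map
top = merge_top_k(partials)
print(top)
```

Because each worker returns at most `k` records, the reduce step stays small no matter how large the input chunks are.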
Production data jobs
| Example | Summary | Page | Cover image |
| --- | --- | --- | --- |
| S3 to Postgres ETL | Transform 10,000 gzipped JSON files while protecting Postgres with max_parallelism. | python-etl-no-airflow.md | python-etl-no-airflow.png |
| Millions of image resizes | Chunk image keys, resize with Pillow, and stream progress as workers write outputs. | image-dataset-resize.md | image-dataset-resize.png |
| One Parquet file per worker | Compute per-file QA stats without starting Spark for a simple file-parallel job. | parquet-parallel.md | parquet-parallel.png |
| Pandas apply in parallel | Partition a Parquet dataset and run ordinary pandas code on each worker. | pandas-apply-parallel.md | pandas-apply-parallel.png |
| Enrich millions of users through a rate-limited API | Backfill user profiles while keeping provider limits explicit in chunk size, sleeps, and max_parallelism. | rate-limited-api-requests.md | rate-limited-api-requests.png |
| Crawl a million website pages without hiding failures | Scrape static HTML with polite pacing, retries, error rows, and a global concurrency cap. | parallel-web-scraping.md | parallel-web-scraping.png |
Scientific and geospatial work
| Example | Summary | Page | Cover image |
| --- | --- | --- | --- |
| Genome alignment | Run bwa and samtools in a custom image with one FASTQ pair per worker. | bioinformatics-alignment.md | bioinformatics-alignment.png |
| GDAL raster processing | Compute NDVI one Sentinel tile at a time with rasterio and shared outputs. | gdal-raster-processing.md | gdal-raster-processing.png |
| Billion-path Monte Carlo | Return sums and squared sums from independent chunks, then reduce locally. | monte-carlo-simulation.md | monte-carlo-simulation.png |
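
The Monte Carlo entry describes returning sums and squared sums from independent chunks and reducing them locally. The sketch below is one way to read that, not the linked example's code. It assumes each chunk returns a `(count, sum, sum_of_squares)` triple, which is enough to recover the overall mean and its standard error. The payoff function, seeds, and chunk sizes are hypothetical, and the built-in `map` stands in for the distributed map step.

```python
import math
import random

def simulate_chunk(args):
    """Map step: simulate one independent chunk and return compact totals."""
    seed, n_paths = args
    rng = random.Random(seed)  # one seed per chunk keeps chunks independent
    total = 0.0
    total_sq = 0.0
    for _ in range(n_paths):
        x = rng.gauss(0.0, 1.0)
        payoff = max(x, 0.0)  # hypothetical payoff, chosen for illustration
        total += payoff
        total_sq += payoff * payoff
    return n_paths, total, total_sq

def reduce_chunks(results):
    """Reduce step: combine per-chunk totals into a mean and standard error."""
    n = sum(r[0] for r in results)
    s = sum(r[1] for r in results)
    ss = sum(r[2] for r in results)
    mean = s / n
    variance = (ss - n * mean * mean) / (n - 1)  # unbiased sample variance
    std_err = math.sqrt(variance / n)
    return mean, std_err

inputs = [(seed, 20_000) for seed in range(10)]
results = list(map(simulate_chunk, inputs))  # local stand-in for the parallel map
mean, std_err = reduce_chunks(results)
print(f"estimate = {mean:.4f} +/- {std_err:.4f}")
```

Each worker sends back three numbers regardless of how many paths it simulated, so the local reduce is cheap even at a billion paths.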