One million READMEs at once.
We streamed ~1.2 million real GitHub READMEs through a Burla cluster with 500+ CPUs running in parallel, ran deterministic summary heuristics on each one, and shipped every result to this page. No LLM. All rules. Reproducible.
500+ CPUs in parallel
14 categories
What's on GitHub, in one glance
Every README we processed, bucketed into one of 14 categories by keyword heuristics.
Browse by category
14 flavors of open source. Click one to see the most prominent repos inside it.
What we found
Nine patterns in the texture of open source — only visible at this scale.
How it works
- Pull. BigQuery `bigquery-public-data.github_repos`: join `files` + `contents` + `languages`. One README per repo (biggest). Export as zstd parquet.
- Upload once. Stream the parquet to the cluster's shared filesystem (`/workspace/shared`).
- Fan out. `remote_parallel_map(summarize_shard, 600 shards)` on Burla: every worker reads its stripe of the parquet and runs the summarizer on ~2,000 READMEs each.
- Summarize. For each README: title, tldr, install method, category, badges, code blocks, token counts.
- Reduce. 16 buckets, each merges ~40 shard files in parallel. Top-K per category by "quality" score.
- Analyze locally. TF-IDF over 14 category "documents" surfaces distinctive keywords. Client-side search index sampled to 6,000 repos.
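
The fan-out, summarize, and reduce steps above can be sketched in a few dozen lines of plain Python. This is a minimal illustration and not the project's actual code. The keyword table, the `quality` formula, and the helper names `make_shards` and `top_k_per_category` are assumptions made for the example, and the built-in `map` stands in for Burla's `remote_parallel_map` so the sketch runs on one machine without a cluster.

```python
import heapq
import re
from collections import defaultdict

FENCE = "`" * 3  # Markdown code-fence marker

# Hypothetical keyword table; the real pipeline uses 14 categories.
CATEGORY_KEYWORDS = {
    "web": ("react", "http", "frontend"),
    "ml": ("model", "training", "neural"),
    "cli": ("command-line", "terminal", "argparse"),
}


def summarize(readme: str) -> dict:
    """Rule-based summary of one README (no LLM involved)."""
    # Title: first Markdown heading, else the first non-empty line.
    heading = re.search(r"^#+[ \t]+(.+)$", readme, flags=re.MULTILINE)
    lines = [ln.strip() for ln in readme.splitlines() if ln.strip()]
    title = heading.group(1).strip() if heading else (lines[0] if lines else "")

    # Category: keyword set with the most hits, or "other" if none match.
    text = readme.lower()
    hits = {c: sum(text.count(k) for k in kws) for c, kws in CATEGORY_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    category = best if hits[best] > 0 else "other"

    code_blocks = readme.count(FENCE) // 2
    # Assumed quality score: longer READMEs with more code blocks rank higher.
    quality = len(readme.split()) + 50 * code_blocks
    return {"title": title, "category": category,
            "code_blocks": code_blocks, "quality": quality}


def make_shards(items: list, num_shards: int) -> list[list]:
    """Stripe items across shards: shard i gets items i, i+n, i+2n, ..."""
    return [items[i::num_shards] for i in range(num_shards)]


def summarize_shard(stripe: list[str]) -> list[dict]:
    """Work done by one worker: summarize every README in its stripe."""
    return [summarize(readme) for readme in stripe]


def top_k_per_category(shard_results: list[list[dict]], k: int) -> dict:
    """Reduce step: merge shard outputs and keep the k best per category."""
    by_category = defaultdict(list)
    for summaries in shard_results:
        for s in summaries:
            by_category[s["category"]].append(s)
    return {c: heapq.nlargest(k, rows, key=lambda s: s["quality"])
            for c, rows in by_category.items()}


def run_pipeline(readmes: list[str], num_shards: int, k: int) -> dict:
    shards = make_shards(readmes, num_shards)
    # Local stand-in: on Burla this map would be the remote_parallel_map call.
    shard_results = list(map(summarize_shard, shards))
    return top_k_per_category(shard_results, k)
```

Striping with `items[i::num_shards]` keeps shard sizes within one item of each other, which is how 600 shards over ~1.2 million READMEs comes out to roughly 2,000 each.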
Source: Burla-Cloud on GitHub. Pipeline code in pipeline.py, scale.py, reduce.py, analysis.py.
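
The "Analyze locally" step treats each category as a single document and ranks its terms by TF-IDF. The page does not say which TF-IDF variant or tokenizer is used, so the sketch below assumes the plain form (term frequency times `log(N / df)`) with whitespace tokenization. The function name `distinctive_keywords` is hypothetical.

```python
import math
from collections import Counter


def distinctive_keywords(docs: dict[str, str], top_n: int = 3) -> dict[str, list[str]]:
    """Rank each category's terms by TF-IDF, one document per category."""
    tokenized = {cat: text.lower().split() for cat, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: number of category documents containing each term.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))

    result = {}
    for cat, tokens in tokenized.items():
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[cat] = ranked[:top_n]
    return result
```

A term that appears in every category document gets an IDF of `log(1) = 0`, so words common to all READMEs (such as "install") fall to the bottom and only category-specific vocabulary surfaces.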