One million READMEs at once.

We streamed ~1.2 million real GitHub READMEs through a Burla cluster with 500+ CPUs running in parallel, ran deterministic summary heuristics on each one, and shipped every result to this page. No LLM. All rules. Reproducible.

—repos summarized

500+ CPUs in parallel

14 categories

—languages


What's on GitHub, in one glance

Every README we processed, bucketed into one of 14 categories by keyword heuristics.
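As a rough illustration of keyword bucketing, here is a minimal sketch. The category names, keyword sets, and the "other" fallback are all assumptions made for the example (the page uses 14 categories; only three are shown), so treat it as one plausible shape for such a heuristic rather than the project's actual rules.

```python
import re
from collections import Counter

# Hypothetical keyword table: three illustrative categories, not the real 14.
CATEGORY_KEYWORDS = {
    "web": {"react", "css", "html", "frontend", "browser"},
    "ml": {"model", "training", "dataset", "neural", "pytorch"},
    "cli": {"command", "terminal", "cli", "shell", "flag"},
}
FALLBACK = "other"  # assumed bucket for READMEs that match no keyword


def categorize(readme: str) -> str:
    """Return the category whose keywords occur most often in the README."""
    # Lowercase and split into simple alphanumeric tokens.
    tokens = Counter(re.findall(r"[a-z0-9+#]+", readme.lower()))
    scores = {
        category: sum(tokens[word] for word in words)
        for category, words in CATEGORY_KEYWORDS.items()
    }
    # max() keeps the first category on ties, so the result is deterministic.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else FALLBACK


print(categorize("A React frontend with CSS modules"))
print(categorize("Train a PyTorch model on your dataset"))
```

Because the scoring is plain keyword counting with a fixed tie-break, the same README always lands in the same bucket, which is what makes a rules-only pipeline reproducible.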

Browse by category

14 flavors of open source. Click one to see the most prominent repos inside it.

What we found

Nine patterns in the texture of open source — only visible at this scale.

How it works

Pull. Query the BigQuery public dataset bigquery-public-data.github_repos, joining files + contents + languages. Keep one README per repo (the biggest). Export as zstd-compressed Parquet.
Upload once. Stream the Parquet file to the cluster's shared filesystem (/workspace/shared).
Fan out. remote_parallel_map(summarize_shard, 600 shards) on Burla: every worker reads its stripe of the Parquet file and runs the summarizer on ~2,000 READMEs.
Summarize. For each README, extract the title, tldr, install method, category, badges, code blocks, and token counts.
Reduce. 16 buckets each merge ~40 shard files in parallel, keeping the top-K repos per category by "quality" score.
Analyze locally. TF-IDF over 14 category "documents" surfaces distinctive keywords. The client-side search index is sampled down to 6,000 repos.
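The fan-out, summarize, and reduce steps can be sketched locally in a few lines. This is a simplified stand-in under stated assumptions: the install patterns, the category rule, and the "quality" formula are invented for illustration, and a plain list comprehension takes the place of the Burla remote_parallel_map call so the example runs without a cluster.

```python
import heapq
import re
from collections import defaultdict

# Hypothetical install-method rules, not the project's actual list.
INSTALL_PATTERNS = {
    "pip": re.compile(r"\bpip3? install\b"),
    "npm": re.compile(r"\bnpm (install|i)\b"),
    "cargo": re.compile(r"\bcargo install\b"),
}


def stripe(n_rows: int, shard: int, n_shards: int) -> range:
    """Row indices for one shard: contiguous, near-equal slices."""
    return range(shard * n_rows // n_shards, (shard + 1) * n_rows // n_shards)


def summarize(repo: str, readme: str) -> dict:
    """Rule-based summary covering a small subset of the listed fields."""
    title = re.search(r"^#\s+(.+)$", readme, flags=re.MULTILINE)
    install = next(
        (name for name, pat in INSTALL_PATTERNS.items() if pat.search(readme)),
        None,
    )
    tokens = len(readme.split())
    return {
        "repo": repo,
        "title": title.group(1).strip() if title else repo,
        "install": install,
        "tokens": tokens,
        "category": "library" if install else "other",  # stand-in rule
        "quality": tokens + (50 if install else 0),  # made-up score
    }


def summarize_shard(rows, shard: int, n_shards: int) -> list:
    """Summarize only the rows that belong to this shard's stripe."""
    return [summarize(*rows[i]) for i in stripe(len(rows), shard, n_shards)]


def reduce_top_k(shard_outputs, k: int) -> dict:
    """Merge shard outputs and keep the k highest-quality repos per category."""
    by_category = defaultdict(list)
    for summaries in shard_outputs:
        for summary in summaries:
            by_category[summary["category"]].append(summary)
    return {
        category: heapq.nlargest(k, items, key=lambda s: s["quality"])
        for category, items in by_category.items()
    }


ROWS = [  # toy data standing in for the Parquet rows
    ("a/one", "# One\n\npip install one\n"),
    ("b/two", "# Two\n\nJust notes.\n"),
    ("c/three", "# Three\n\nnpm install three and more words here\n"),
    ("d/four", "no heading at all"),
    ("e/five", "# Five\n\ncargo install five now\n"),
]
N_SHARDS = 2  # the page uses 600
# Local stand-in for the parallel map over shard indices.
shard_outputs = [summarize_shard(ROWS, s, N_SHARDS) for s in range(N_SHARDS)]
top = reduce_top_k(shard_outputs, k=2)
print({category: [s["repo"] for s in items] for category, items in top.items()})
```

Integer striping means each shard can compute its own row range from nothing but its index, so workers need no coordination beyond the shared file.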

Source: Burla-Cloud on GitHub. Pipeline code in pipeline.py, scale.py, reduce.py, analysis.py.
