One million GitHub READMEs
Summarize a million READMEs without calling an LLM
In this example we:
- Export 1,200,000 GitHub READMEs from BigQuery.
- Upload the Parquet to Burla shared storage.
- Run deterministic summarizers over 600 shards and reduce into frontend JSON.
I like this one because the first instinct is to ask an LLM. That would make individual rows prettier and the aggregate harder to trust.
Step 1: Put the Parquet where workers can read it
The worker reads a stripe of /workspace/shared/grs/readmes.parquet and emits one JSON shard.
SHARD_OUT = "/workspace/shared/grs/shards"
PARQUET_PATH = "/workspace/shared/grs/readmes.parquet"
# category -> {keyword: weight}
CATEGORIES = {
    "ml": {"tensorflow": 4, "pytorch": 4, "embedding": 2, "llm": 4},
    "web": {"react": 3, "django": 2, "graphql": 3, "frontend": 2},
    "devops": {"docker": 3, "kubernetes": 4, "terraform": 4},
}
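To make the keyword weights concrete, here is a minimal sketch of how a deterministic categorizer could use a table shaped like CATEGORIES. The function name categorize, the tokenizing regex, the min_score threshold, and scoring by occurrence count are assumptions for illustration, not the example's actual summarizer.

```python
import re
from collections import Counter

# Same shape as the CATEGORIES table above: category -> {keyword: weight}.
CATEGORIES = {
    "ml": {"tensorflow": 4, "pytorch": 4, "embedding": 2, "llm": 4},
    "web": {"react": 3, "django": 2, "graphql": 3, "frontend": 2},
    "devops": {"docker": 3, "kubernetes": 4, "terraform": 4},
}

# Hypothetical tokenizer: lowercase runs of letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def categorize(text: str, min_score: int = 1) -> tuple[str, dict[str, int]]:
    """Return (best_category, per-category scores) for one README."""
    counts = Counter(TOKEN_RE.findall(text.lower()))
    # Score = sum of weight * occurrences for every keyword in the category.
    scores = {
        cat: sum(weight * counts[kw] for kw, weight in keywords.items())
        for cat, keywords in CATEGORIES.items()
    }
    # max() returns the first category in dict order when scores tie.
    best = max(scores, key=scores.get)
    if scores[best] < min_score:
        return "other", scores
    return best, scores


if __name__ == "__main__":
    print(categorize("Train a PyTorch model, then ship it with Docker."))
```

Because the score is a plain weighted count, a surprising category can be traced back to the exact keywords that fired.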
Step 2: Fan out the shards
The map stage calls summarize_shard once per shard (600 here); max_parallelism caps how many run at once.
from burla import remote_parallel_map

# One (shard_idx, n_shards) tuple per shard.
jobs = [(i, args.shards) for i in range(args.shards)]
results = remote_parallel_map(
    summarize_shard,
    jobs,
    func_cpu=args.func_cpu,
    func_ram=args.func_ram,
    grow=True,
    max_parallelism=args.parallelism,
)
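Each job is a (shard_idx, n_shards) pair, so the worker has to turn that pair into its own slice of the Parquet. A minimal sketch of one way to do that with contiguous row ranges follows. stripe_bounds and the total row count are hypothetical; the real summarize_shard may stripe by row group or some other scheme.

```python
def stripe_bounds(shard_idx: int, n_shards: int, n_rows: int) -> tuple[int, int]:
    """Return the half-open row range [start, stop) owned by one shard."""
    if not 0 <= shard_idx < n_shards:
        raise ValueError(f"shard_idx {shard_idx} outside 0..{n_shards - 1}")
    # Spread the remainder over the first `extra` shards so that
    # stripe sizes differ by at most one row.
    base, extra = divmod(n_rows, n_shards)
    start = shard_idx * base + min(shard_idx, extra)
    stop = start + base + (1 if shard_idx < extra else 0)
    return start, stop


if __name__ == "__main__":
    # Assumed totals taken from the numbers in this example.
    print(stripe_bounds(0, 600, 1_200_000))
    print(stripe_bounds(599, 600, 1_200_000))
```

With 1,200,000 rows and 600 shards, every stripe is 2,000 rows. The ranges are disjoint and cover the file, so rerunning a single failed shard does not affect the others.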
Step 3: Reduce counters and examples
The reducer keeps heaps per category and language, plus document-frequency counters for TF-IDF.
import heapq
import json
import os


def reduce_bucket(bucket_idx: int, n_buckets: int, top_per_cat: int, top_per_lang: int, sample_cap: int) -> dict:
    # Each reducer takes every n_buckets-th shard file, so buckets never overlap.
    files = sorted(f for f in os.listdir("/workspace/shared/grs/shards") if f.endswith(".json"))
    my_files = [f for i, f in enumerate(files) if i % n_buckets == bucket_idx]
    by_cat, by_lang, doc_freq = {}, {}, {}
    cat_heaps = {}
    for fn in my_files:
        with open(os.path.join("/workspace/shared/grs/shards", fn)) as f:
            rows = json.load(f).get("rows", [])
        for row in rows:
            cat = row.get("category", "other")
            # Quality weighs badges (1.5 each) above code blocks (0.3 each).
            quality = row.get("badges", 0) * 1.5 + row.get("code_blocks", 0) * 0.3
            # Caution: if quality and repo both tie, the tuples fall back to
            # comparing the row dicts, which raises TypeError.
            heapq.heappush(cat_heaps.setdefault(cat, []), (quality, row["repo"], row))
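The excerpt pushes every row onto a per-category heap. One way to keep only the best k rows per category, and to turn merged document-frequency counters into IDF weights, is sketched below. push_top_k, top_rows, idf, the running sequence number used as a tie-breaker, and the smoothed IDF formula are all assumptions, not the example's reducer.

```python
import heapq
import itertools
import math
from collections import Counter

# Unique, increasing tie-breaker so equal scores never compare row dicts.
_seq = itertools.count()


def push_top_k(heap: list, k: int, quality: float, row: dict) -> None:
    """Keep at most the k highest-quality rows in a min-heap."""
    item = (quality, next(_seq), row)
    if len(heap) < k:
        heapq.heappush(heap, item)
    elif quality > heap[0][0]:
        # The root is the weakest kept row; swap it out for the better one.
        heapq.heapreplace(heap, item)


def top_rows(heap: list) -> list:
    """Return the kept rows, best first; ties keep insertion order."""
    return [row for _, _, row in sorted(heap, key=lambda t: (-t[0], t[1]))]


def idf(doc_freq: Counter, n_docs: int) -> dict:
    """One common smoothed variant: log((1 + N) / (1 + df)) + 1."""
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


if __name__ == "__main__":
    heap = []
    for quality, repo in [(1.5, "a"), (3.0, "b"), (1.5, "c"), (0.3, "d")]:
        push_top_k(heap, 2, quality, {"repo": repo})
    print([r["repo"] for r in top_rows(heap)])

    # Document-frequency counters from separate shards merge by addition.
    total_df = Counter()
    for shard_df in ({"docker": 2, "react": 1}, {"docker": 1}):
        total_df.update(shard_df)
    print(idf(total_df, n_docs=3))
```

The heap never holds more than k rows per category, so reducer memory stays bounded no matter how many shard files land in a bucket.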
What's the point?
Pretty summaries of famous repos are the boring version. I care about README culture at scale: install instructions, badges, code fences, category words, cloned templates, and empty placeholders.
A model would make the rows sound smoother. I do not want smoother here. I want counts I can debug. If a category looks wrong, I can inspect the keyword weights and rerun the reduce.