Distill 571 million Amazon reviews
In this example we:
- Stream 275GB of Amazon review JSONL from HuggingFace.
- Split the files into byte ranges instead of downloading everything first.
- Keep top-K heaps for profanity, caps, rants, and exclamation storms.
The goal is the Wall of Rants. But the real question is bigger than funny examples: which categories produce this stuff, and how often?
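The top-K heaps mentioned above can be sketched with the standard library's heapq. This is a minimal illustration, not the code used in this example: the TopK class, its method names, and the sample scores are all hypothetical. The idea is that a min-heap of size k keeps the weakest current leader at the root, so each new review costs at most one comparison and one O(log k) replacement.

```python
import heapq
from itertools import count


class TopK:
    """Keep the k highest-scoring items seen so far using a min-heap."""

    def __init__(self, k: int):
        self.k = k
        self._heap = []            # entries are (score, tiebreak, item)
        self._tiebreak = count()   # unique counter so items are never compared on ties

    def push(self, score: float, item) -> None:
        entry = (score, next(self._tiebreak), item)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif score > self._heap[0][0]:
            # The root is the smallest kept score; swap it out in one step.
            heapq.heapreplace(self._heap, entry)

    def items(self) -> list:
        """Return (score, item) pairs, highest score first."""
        return [(s, item) for s, _, item in sorted(self._heap, reverse=True)]


# Hypothetical scores (exclamation counts) for four made-up reviews.
top = TopK(k=2)
for score, text in [(3, "ok!!!"), (10, "WORST EVER!!!!!!!!!!"), (1, "fine!"), (7, "why!!!!!!!")]:
    top.push(score, text)
top_items = top.items()
```

Memory stays bounded by k per leaderboard no matter how many reviews a worker streams, which is what keeps the per-worker state small.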
Step 1: Plan byte ranges
Each category file is huge, so we split each one into byte-range jobs of roughly 500 MB.
from pathlib import Path

def plan_chunks(chunk_mb: int = 500) -> list[tuple[str, int, int, str]]:
    """Return (path, start, end, job_id) jobs; end is exclusive."""
    from huggingface_hub import HfApi

    span = chunk_mb * 1024 * 1024  # bytes per job
    # List every non-empty file under the per-category review folder.
    files = [(i.path, i.size) for i in HfApi().list_repo_tree(
        "McAuley-Lab/Amazon-Reviews-2023",
        path_in_repo="raw/review_categories",
        repo_type="dataset",
    ) if getattr(i, "size", 0) > 0]
    jobs = []
    for path, size in files:
        for start in range(0, size, span):
            jobs.append((path, start, min(start + span, size), f"{Path(path).stem}_{start}"))
    return jobs
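The range arithmetic can be checked without touching the network. The sketch below isolates it in a hypothetical helper, byte_ranges, and runs it on a made-up 1,200-byte file; neither the helper nor the sizes come from the example itself.

```python
def byte_ranges(size: int, span: int) -> list[tuple[int, int]]:
    """Split [0, size) into half-open (start, end) ranges of at most `span` bytes."""
    return [(start, min(start + span, size)) for start in range(0, size, span)]


# Hypothetical 1,200-byte file cut into 500-byte jobs.
ranges = byte_ranges(size=1200, span=500)

# The ranges are contiguous and cover the file exactly once.
assert ranges[0][0] == 0 and ranges[-1][1] == 1200
assert all(a[1] == b[0] for a, b in zip(ranges, ranges[1:]))
```

Keeping end exclusive makes adjacent jobs line up with no overlap. An HTTP Range header is inclusive on both ends, so a request for one of these ranges asks for bytes start through end - 1.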
Step 2: Stream records safely
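The worker asks HuggingFace for a byte range, discards the first partial line when the range does not start at the beginning of the file, and parses each remaining line as JSON. So that no record is lost at a chunk boundary, each worker owns the lines that start inside its range and reads past end just far enough to finish the last one.
import json
import requests

# HF_BASE is assumed to be defined elsewhere as the URL prefix for the dataset's files.
def stream_reviews(file_path: str, start: int, end: int):
    # Start one byte early so a record beginning exactly at start is kept, and leave
    # the range open-ended so the record straddling end can be read to completion.
    begin = max(start - 1, 0)
    resp = requests.get(
        HF_BASE + file_path,
        headers={"Range": f"bytes={begin}-"},
        stream=True,
        timeout=300,
    )
    resp.raise_for_status()
    buf = b""
    pos = begin                # file offset of the next unparsed line
    skip_first = start > 0     # the first line belongs to the previous range
    for raw in resp.iter_content(chunk_size=1 << 16):
        buf += raw
        lines = buf.split(b"\n")
        buf = lines.pop()      # trailing partial line waits for more bytes
        for line in lines:
            line_start, pos = pos, pos + len(line) + 1
            if skip_first:
                skip_first = False
            elif line_start >= end:
                resp.close()   # the next range owns this line and everything after it
                return
            elif line.strip():
                yield json.loads(line)
    if buf.strip() and not skip_first and pos < end:
        yield json.loads(buf)  # final line of a file with no trailing newline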
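Boundary handling is easy to get subtly wrong, so it helps to test the splitting rule offline. The sketch below is an assumption-laden stand-in rather than the worker itself: lines_owned is a hypothetical function that applies one possible ownership rule (a chunk owns every line whose first byte falls in its range) to an in-memory bytes object, and the records are made up.

```python
import json


def lines_owned(data: bytes, start: int, end: int):
    """Yield parsed JSONL records whose first byte lies in [start, end)."""
    pos = max(start - 1, 0)   # step back one byte to see a preceding newline
    skip_first = start > 0    # the first (possibly partial) line is the previous chunk's
    while pos < len(data):
        nl = data.find(b"\n", pos)
        line_end = len(data) if nl == -1 else nl
        line, line_start = data[pos:line_end], pos
        pos = line_end + 1
        if skip_first:
            skip_first = False
            continue
        if line_start >= end:
            return            # this line belongs to the next chunk
        if line.strip():
            yield json.loads(line)


# Hypothetical records of varying length, so boundaries fall mid-record.
records = [{"id": i, "text": "x" * i} for i in range(20)]
data = b"".join(json.dumps(r).encode() + b"\n" for r in records)

span = 37
seen = []
for start in range(0, len(data), span):
    seen.extend(lines_owned(data, start, min(start + span, len(data))))

# Every record appears exactly once, in order, regardless of where the cuts fall.
assert seen == records
```

The same check can be repeated with several span values, including ones smaller than a single record, to confirm that no chunk size drops or duplicates a line.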
Step 3: Map and reduce
Workers keep tiny summaries. The reducer merges those summaries into the final leaderboards.
# Map: one task per byte-range job.
results = remote_parallel_map(
    process_main,
    jobs,
    func_cpu=1,
    func_ram=4,
    grow=True,
    max_parallelism=1000,
)
# Reduce: a single task merges the worker summaries.
[result] = remote_parallel_map(reduce_main, [0], grow=True)
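What the merge step might look like is sketched below. The summary shape (a Counter plus a list of scored candidates), the merge_summaries name, and the sample values are all assumptions made for illustration; the example's real process_main and reduce_main are not shown here. The sketch relies on one property: the global top k is always contained in the union of the per-worker top k lists, so workers never need to ship more than k candidates each.

```python
import heapq
from collections import Counter


def merge_summaries(summaries: list[dict], k: int) -> dict:
    """Combine per-worker summaries into one summary of the same shape."""
    counts = Counter()
    candidates = []
    for s in summaries:
        counts.update(s["counts"])    # Counter.update adds counts rather than replacing them
        candidates.extend(s["top"])   # each worker contributes at most k entries
    # Rank the pooled candidates by score and keep the best k.
    top = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    return {"counts": counts, "top": top}


# Two hypothetical worker summaries with made-up numbers.
a = {"counts": Counter(reviews=100, profane=4), "top": [(9, "review A"), (5, "review B")]}
b = {"counts": Counter(reviews=50, profane=1), "top": [(12, "review C"), (2, "review D")]}
merged = merge_summaries([a, b], k=2)
```

Because the merged result has the same shape as its inputs, the merge can be applied in any order or in stages.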
What's the point?
A sample can find funny reviews. It cannot tell you whether Video Games is actually more profane than Beauty, or whether one 10,594-exclamation review is rare or part of a pattern.
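Comparing categories comes down to dividing flagged counts by totals, which only works when the totals cover every review. The toy numbers below are invented purely to show the arithmetic and say nothing about the real dataset; rate_per_thousand is a hypothetical helper.

```python
from collections import Counter

# Made-up totals, standing in for what a reducer might hold after merging.
total_reviews = Counter({"Video Games": 4_000, "Beauty": 1_000})
profane_reviews = Counter({"Video Games": 80, "Beauty": 30})


def rate_per_thousand(flagged: Counter, totals: Counter) -> dict[str, float]:
    """Flagged reviews per 1,000 reviews, for every category with at least one review."""
    return {cat: 1000 * flagged[cat] / n for cat, n in totals.items() if n > 0}


rates = rate_per_thousand(profane_reviews, total_reviews)
ranked = sorted(rates, key=rates.get, reverse=True)
```

In this invented case the category with more flagged reviews in absolute terms has the lower rate, which is the kind of reversal raw counts or a small sample can hide.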
I also like that this version does not need an LLM. Regexes, counters, lengths, caps, punctuation, and heaps are enough for the first pass. If you want model labels later, run them on the tiny candidate set after the reduce.
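A first pass of that kind can be sketched in a few lines. Everything here is illustrative: the signals function, the three-word placeholder lexicon, and the assumption that each record carries its body in a "text" field are mine, not taken from the example's worker code.

```python
import re

# Placeholder word list for illustration only; a real pass would use a fuller lexicon.
PROFANITY = re.compile(r"\b(?:damn|crap|hell)\b", re.IGNORECASE)


def signals(review: dict) -> dict:
    """Compute cheap per-review signals from the review text."""
    text = review.get("text") or ""   # assumed field name; tolerate missing or null text
    letters = [c for c in text if c.isalpha()]
    caps = sum(c.isupper() for c in letters)
    return {
        "length": len(text),
        "exclamations": text.count("!"),
        "caps_ratio": caps / len(letters) if letters else 0.0,
        "profanity_hits": len(PROFANITY.findall(text)),
    }


s = signals({"text": "DAMN this blender!!! Total crap."})
```

Each signal is a single number, so it can feed both a per-category counter and a leaderboard score without any further processing.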