THE ARXIV REVIEW · VOL. I · APRIL 2026 · A BURLA DEMO
Linguistic archaeology

The loneliest paper in science is about Norwegian corporate tax filings.

We embedded every arXiv abstract ever posted (2,710,783 of them) with all-MiniLM-L6-v2, clustered the lot into 400 topics, and asked the 384-dimensional geometry three questions whose answers no one at arXiv has ever curated: which topics collapsed, which appeared overnight, and which single paper sits furthest from anything else anyone has ever written.

arXiv:2203.12842 · "Financial statements of companies in Norway." Its fifth-nearest neighbor across 2.71 M papers has cosine similarity 0.138. Nothing else is remotely about the same thing.
Methodology & compute
Abstracts embedded: 2,710,783
Embedding model: all-MiniLM-L6-v2 · 384-d
Clusters (MiniBatch k-means): 400
Burla wall-clock: ~25 min
Serial-equivalent: ~75–90 min
Reduce stage alone: 142 s
Peak parallel workers: 16
Three findings
§1 — Fossils

Ten research topics that peaked, then quietly collapsed

Measured by the ratio of last-5-year paper rate to peak-year rate. #1 is 2000s-era Randall-Sundrum braneworld cosmology, now running at 16.7% of its peak. #3 is pandemic-era epidemic modelling, already at 19.7%.
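
A minimal sketch of how such a ratio could be computed for one cluster. It assumes "rate" means raw papers per year, with no normalisation for arXiv's overall growth; the function name `fossil_ratio` and its parameters are illustrative, not taken from the project's code.

```python
from collections import Counter


def fossil_ratio(years, last_year, window=5):
    """Mean papers per year over the final `window` years, divided by the
    count in the cluster's single busiest year.

    `years` holds one submission year per paper in the cluster.
    Assumption: raw yearly counts, not shares of total arXiv volume.
    """
    counts = Counter(years)
    peak = max(counts.values())
    first_recent_year = last_year - window + 1
    recent_total = sum(
        counts.get(y, 0) for y in range(first_recent_year, last_year + 1)
    )
    return (recent_total / window) / peak


# Toy cluster: 100 papers in its peak year, then 100 spread over 2021-2025.
toy_years = [2005] * 100 + [2021] * 10 + [2022] * 20 + [2023] * 20 + [2024] * 20 + [2025] * 30
toy_ratio = fossil_ratio(toy_years, last_year=2025)
```

In the toy cluster the recent rate is 20 papers a year against a peak of 100, so the ratio comes out at 0.2. Sorting all 400 clusters by this value in ascending order would put the strongest fossils first.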

§2 — Newborns

Ten topics that appeared essentially overnight

Clusters in which more than half of all papers ever posted arrived in the last 24 months. #1 is LLM evaluation, safety, and alignment: 55.3% of its entire history is from the last two years.
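
The share itself is simple to sketch. The version below assumes month-level granularity and a 24-month window that ends with the snapshot month; `recent_share` and `month_index` are hypothetical helper names, and the project may draw the cutoff differently.

```python
from datetime import date


def month_index(d):
    """Months elapsed since year 0, so month arithmetic becomes integer arithmetic."""
    return d.year * 12 + d.month - 1


def recent_share(dates, snapshot, months=24):
    """Fraction of a cluster's papers submitted in the `months` calendar
    months ending with the snapshot month (inclusive)."""
    cutoff = month_index(snapshot) - months
    recent = sum(1 for d in dates if month_index(d) > cutoff)
    return recent / len(dates)


# Toy cluster: 45 older papers, 55 from inside the window.
toy_dates = [date(2020, 1, 15)] * 45 + [date(2025, 6, 1)] * 55
toy_share = recent_share(toy_dates, snapshot=date(2026, 4, 1))
is_newborn = toy_share > 0.5
```

With an April 2026 snapshot the window covers May 2024 through April 2026, so the toy cluster scores 0.55 and clears the majority threshold.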

§3 — The outlier

The single paper furthest from anything else in science

The one arXiv paper whose closest neighbor is further away than any other paper's closest neighbor. It's not what anyone predicted — and its abstract is not about machine learning.
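
The selection rule can be sketched with brute-force NumPy on a handful of vectors. This is an illustration of the idea only: at 2.71 M papers a full similarity matrix is out of reach, which is why an approximate index is used instead. The function names and the `k` parameter are assumptions for this sketch; `k=1` scores each paper by its closest neighbor, and a larger `k` scores it by its k-th nearest.

```python
import numpy as np


def kth_neighbor_similarity(embeddings, k=1):
    """Cosine similarity between each row and its k-th nearest other row."""
    x = np.asarray(embeddings, dtype=np.float64)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit length: dot product = cosine
    sims = x @ x.T
    np.fill_diagonal(sims, -np.inf)  # a paper is not its own neighbor
    # Sort each row in descending order; column k-1 holds the k-th nearest neighbor.
    return -np.sort(-sims, axis=1)[:, k - 1]


def most_isolated(embeddings, k=1):
    """Index and score of the row whose k-th nearest neighbor is least similar."""
    scores = kth_neighbor_similarity(embeddings, k)
    idx = int(np.argmin(scores))
    return idx, float(scores[idx])


# Three near-parallel vectors and one pointing somewhere else entirely.
toy = [[1.0, 0.0], [0.99, 0.1], [0.98, 0.2], [0.0, 1.0]]
outlier_idx, outlier_sim = most_isolated(toy, k=1)
```

Here the fourth vector is the outlier: its best match has a cosine similarity of about 0.2, while the other three all have a neighbor above 0.99.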


Source: the full arXiv metadata snapshot (Cornell University / Hugging Face, refreshed monthly). Embeddings: sentence-transformers/all-MiniLM-L6-v2 (ONNX via fastembed). Index: faiss.IndexIVFFlat with inner-product metric. Clustering: sklearn.cluster.MiniBatchKMeans(n_clusters=400). Orchestration: Burla's remote_parallel_map in three stages: one worker for discovery, sixteen for embedding, one for reduce.

Source code & artifacts: github.com/Burla-Cloud/arxiv-fossils