THE ARXIV REVIEW
VOL. I · APRIL 2026
A BURLA DEMO
Linguistic archaeology
The loneliest paper in science is about Norwegian corporate tax filings.
We embedded every arXiv abstract ever posted (2,710,783 of them) with all-MiniLM-L6-v2, clustered them into 400 topics, and asked the 384-dimensional geometry three questions whose answers no one at arXiv has ever curated: which topics collapsed, which appeared overnight, and which single paper sits farthest from everything else anyone has ever written.
arXiv:2203.12842 · "Financial statements of companies in Norway." Its fifth-nearest neighbor across 2.71 M papers has cosine similarity 0.138. Nothing else is remotely about the same thing.
Methodology & compute
| Metric | Value |
| --- | --- |
| Abstracts embedded | 2,710,783 |
| Embedding model | all-MiniLM-L6-v2 · 384-d |
| Clusters (MiniBatch k-means) | 400 |
| Burla wall-clock | ~25 min |
| Serial-equivalent | ~75–90 min |
| Reduce stage alone | 142 s |
| Peak parallel workers | 16 |
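The wall-clock and serial-equivalent rows imply a speedup figure, which can be worked out directly. The sketch below is only arithmetic on the rounded numbers in the table. The function names are made up for illustration, and dividing by the peak worker count (rather than an average, which the table does not give) means the efficiency figure is a rough lower-end estimate.

```python
def speedup(serial_minutes: float, parallel_minutes: float) -> float:
    """How many times faster the parallel run was than the serial estimate."""
    return serial_minutes / parallel_minutes


def efficiency(speedup_factor: float, workers: int) -> float:
    """Speedup per worker; 1.0 would mean perfect linear scaling."""
    return speedup_factor / workers


# Approximate values copied from the table above (all marked "~" there).
WALL_CLOCK_MIN = 25
SERIAL_MIN_RANGE = (75, 90)
PEAK_WORKERS = 16  # peak, not average: an assumption for the efficiency estimate

for serial in SERIAL_MIN_RANGE:
    s = speedup(serial, WALL_CLOCK_MIN)
    e = efficiency(s, PEAK_WORKERS)
    print(f"serial {serial} min: speedup {s:.1f}x, efficiency at peak workers {e:.1%}")
```

On these rounded inputs the speedup lands between roughly 3.0x and 3.6x, or about 19% to 23% per-worker efficiency if all 16 workers had been busy throughout.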
Three findings
§1 — Fossils
Topics ranked by the ratio of their paper rate over the last five years to their peak-year rate; the lower the ratio, the deeper the collapse. #1 is 2000s-era Randall-Sundrum braneworld cosmology, now running at 16.7% of its peak. #3 is pandemic-era epidemic modelling, already down to 19.7%.
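One plausible reading of the fossil score, as a minimal sketch: take the mean annual paper count over the final five years and divide it by the count in the cluster's peak year. Whether the analysis averaged the five years in exactly this way is an assumption. The function name `fossil_ratio` and the toy counts are invented for illustration and are not data from the analysis.

```python
from statistics import mean


def fossil_ratio(papers_per_year: dict[int, int], last_year: int, window: int = 5) -> float:
    """Mean annual count over the final `window` years divided by the peak annual count."""
    peak = max(papers_per_year.values())
    if peak == 0:
        raise ValueError("cluster has no papers")
    # Years missing from the dict are treated as zero papers.
    recent_years = range(last_year - window + 1, last_year + 1)
    recent = [papers_per_year.get(year, 0) for year in recent_years]
    return mean(recent) / peak


# Toy cluster, invented for illustration only.
toy_counts = {
    2004: 300, 2005: 500, 2006: 450,
    2021: 140, 2022: 120, 2023: 100, 2024: 80, 2025: 60,
}
print(f"{fossil_ratio(toy_counts, last_year=2025):.1%}")
```

Ranking clusters by this value in ascending order would put the most collapsed topics first.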
Read the full list ↗
§2 — Newborns
Clusters in which a majority of all papers ever written appeared in the last 24 months. #1 is LLM evaluation, safety, and alignment: 55.3% of its entire history is from the last two years.
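The newborn test can be sketched the same way: count the share of a cluster's papers submitted in the most recent 24 calendar months and check whether it exceeds one half. The cutoff date, the month-level windowing, and the names `newborn_share` and `toy_dates` are assumptions made for illustration. The source does not say how the 24-month window was aligned.

```python
from datetime import date


def month_index(d: date) -> int:
    """Number of months since year 0, so months can be compared as integers."""
    return d.year * 12 + (d.month - 1)


def newborn_share(submitted: list[date], cutoff: date, months: int = 24) -> float:
    """Fraction of papers submitted in the `months` calendar months ending at the cutoff month."""
    if not submitted:
        raise ValueError("cluster has no papers")
    last = month_index(cutoff)
    first = last - months + 1
    recent = sum(1 for d in submitted if first <= month_index(d) <= last)
    return recent / len(submitted)


# Toy submission dates, invented for illustration only.
toy_dates = [
    date(2019, 6, 1), date(2023, 11, 15), date(2024, 3, 30),  # before the window
    date(2024, 4, 1), date(2025, 1, 10), date(2025, 9, 5),
    date(2026, 2, 20), date(2026, 3, 1),                      # inside the window
]
share = newborn_share(toy_dates, cutoff=date(2026, 3, 31))
print(f"{share:.1%} recent, newborn: {share > 0.5}")
```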
Read the full list ↗
§3 — The outlier
The one arXiv paper whose closest neighbor is farther away than any other paper's closest neighbor. It is not what anyone predicted, and its abstract is not about machine learning.
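The outlier criterion amounts to finding the paper whose k-th nearest neighbor has the lowest cosine similarity. With `k=1` it matches the closest-neighbor wording here; `k=5` corresponds to the fifth-nearest-neighbor figure quoted for arXiv:2203.12842. The sketch below is a brute-force illustration on toy 2-D vectors. The function names are hypothetical, and a comparison of every pair is quadratic, so a run over 2.71 M embeddings would need something more scalable (the source does not say what was used).

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def kth_neighbor_similarity(vectors: list[list[float]], i: int, k: int = 1) -> float:
    """Similarity between vector i and its k-th most similar other vector."""
    sims = sorted(
        (cosine(vectors[i], v) for j, v in enumerate(vectors) if j != i),
        reverse=True,
    )
    return sims[k - 1]


def loneliest(vectors: list[list[float]], k: int = 1) -> int:
    """Index of the vector whose k-th nearest neighbor is least similar to it."""
    return min(range(len(vectors)), key=lambda i: kth_neighbor_similarity(vectors, i, k))


# Toy embeddings, invented for illustration: three similar vectors and one stray.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
idx = loneliest(toy_vectors, k=1)
print(idx, round(kth_neighbor_similarity(toy_vectors, idx, k=1), 3))
```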
Read the case ↗