Start betting like a sharp.
Twenty horses go to the gate Saturday at Churchill Downs. The morning-line odds tell you what the betting market thinks each horse's chances are. The model below tells you what the math thinks. Where the model is much higher than the market, you bet; where it's much lower, you fade. Post time: 6:57 PM ET.
If you’ve never bet a horse race (or never read a model like this), these six boxes cover everything you need before scrolling. Everything below is just applying these ideas to the 2026 field.
Win means your horse finishes first, place means top two, show means top three. Show is the easiest to hit, so it pays the smallest; win pays the most.
6.5-1 means a $1 win bet returns $6.50 profit if the horse wins.
To turn odds into probability: 1 / (odds + 1). So 6.5-1 becomes 1 / 7.5 = 13.3%, the “market implied” chance.
Compare the model's win % to the market's implied %. Three buckets, three different actions: BET when the model is well above the market, FADE when it's well below, FAIR in between. The arithmetic is sketched below.
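A minimal sketch of both steps. The conversion formula is the article's own; the 1.5× / 0.67× bucket cutoffs are illustrative assumptions, not the model's exact rule:

```python
# Fractional odds -> market-implied win probability, then bucket the
# model-vs-market gap. The cutoffs below are illustrative assumptions.
def implied_prob(odds: float) -> float:
    return 1 / (odds + 1)            # 6.5-1 -> 1 / 7.5 = 13.3%

def call(model_win: float, odds: float) -> str:
    edge = model_win / implied_prob(odds)
    if edge >= 1.5:
        return "BET"                 # model well above the market
    if edge <= 0.67:
        return "FADE"                # market well above the model
    return "FAIR"

print(round(implied_prob(6.5), 3))   # 0.133
print(call(0.178, 6.5))              # 1.33x edge -> FAIR
```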
A Beyer Speed Figure is a single number (usually 60 to 120) that grades how fast a horse ran. Higher is faster. It adjusts for track and distance, so a 100 anywhere is a 100.
2026 field tops out at 106 (Further Ado).
Run the same scenario over and over with random luck mixed in, and count what happens. For this Derby:
If Further Ado wins 17.8% of those races, that’s his model win %.
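A toy version of the same idea, with made-up strength scores for three horses (the real run uses the full 20-horse field, a finish-order model, and vastly more draws):

```python
import numpy as np

rng = np.random.default_rng(0)
strengths = np.array([2.0, 1.5, 1.2])         # made-up scores, 3 horses
wins = np.zeros(3)
for _ in range(100_000):                       # 100k toy races
    noisy = strengths + rng.normal(0, 1.8, 3)  # random racing luck
    wins[noisy.argmax()] += 1                  # best noisy score wins
print(wins / wins.sum())                       # each horse's model win %
```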
Running the model on past races we already know the outcome of, to see if it would have picked the winners.
We tested 5,000 weight combinations against 16 historical Derbies (2010 through 2025). The best one scored 126 / 160 using a 10-5-2-1-0 ranking metric, picking the actual winner first in 11 of 16 years. That weight set is what runs the model today, and a 2,000-permutation null test on Burla confirmed the score is not search noise.
Caveat: 16 races is still a small sample. Strong against the null, not a guarantee.
Each horse is scored on these data points, then weighted by the backtest.
The 5,000-combo backtest, scored across 16 historical Derbies, decided which of these matter most. The top three weights all leaned toward what the horse can actually do: stamina test (19%), year-level Beyer (16%), and dosage score (16%). Trainer / jockey edges still matter, but less than the headline narrative usually claims.
Three longshot value plays. After scratches reshuffled the field, the headline favorites all collapsed back to fair. The model’s edge now lives entirely in the bottom of the morning line, where 50-1 and 30-1 prices still pay multiples of the model’s probability after takeout.
Drew into the field from the also-eligible list when Right To Party scratched on Friday. Doug F. O'Neill has won the Derby twice (2012 with I'll Have Another, 2016 with Nyquist). The model has him at 2.78× the market-implied probability, the largest gap on the board. Buy a small win ticket at 50-1: Kelly sizing lands around 4% of bankroll, and the multiplier still pays cleanly after takeout. Expect variance.
Drew in from the also-eligible list when Silent Tactic scratched on Wednesday. Beyer 84 is bottom-tier, John Ennis trains. The model still has him at 2.07× the market-implied probability. Small win saver at 50-1: Kelly says ~2% of bankroll. Pure value play, no other angle.
Bob Baffert is back in the Derby with Martin Garcia up. Beyer 96 is below the figure leaders but not bottom-of-field. Post 4 sits in the rail-side cluster. The model has him at 1.78× the market-implied probability. Buy a small win ticket at 30-1: Kelly says ~3% of bankroll. The longshot multiplier still clears takeout cleanly.
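For the curious, the stake sizes above come from the standard Kelly fraction. A sketch with the first play's numbers (50-1 and 2.78× the ~1.96% implied probability); this is full Kelly, and savers are usually cut to a fraction of it:

```python
def kelly_fraction(odds: float, p_win: float) -> float:
    # f* = (b*p - q) / b, with b = net fractional odds, q = 1 - p
    b, q = odds, 1.0 - p_win
    return max(0.0, (b * p_win - q) / b)

p = 2.78 * (1 / 51)             # model win % behind the 2.78x multiplier
print(kelly_fraction(50, p))    # ~0.036, i.e. roughly 4% of bankroll
```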
At 50-1 the market treats Intrepido as a throwaway. The model disagrees: 3.5% probability vs 2.0% implied is a 1.73× multiplier. The catch is that his pace style puts him head-to-head with three other front-runners, so the likeliest scenario is he burns out on the first turn. Treat him as a tertiary saver, not a main bet.

The headline favorite Renegade (4-1 ML) is the cleanest fade in this model. The rail draw at post 1 is historically the worst gate in the Derby (no winner from post 1 in our 2010-2025 sample), and the model has him at 3.9% vs 20.0% implied. The market is paying for his favorite badge, not his actual chance of winning.
Model Win % comes from 1 trillion Monte Carlo simulations. Market Win % is what the morning-line odds imply, with the market’s own rank below it. Value compares the two, before takeout. Each row has a reason box explaining the call.
Swipe the table to see Beyer, run style, and place / show probabilities.
| Post | Horse | Odds | Beyer | Style | Model Win% | Market Win% | Place% | Show% | Value |
|---|---|---|---|---|---|---|---|---|---|
Where small longshots actually live. Exotics let you bet structure (top-2, top-3, top-4 finish ordering) instead of trying to nail a single horse’s win at 6.5-1. Takeout is higher (~22%) but the payouts compound when the structure hits. All costs at minimum base unit.
Three-horse box covers all 6 permutations of the model’s top three by win probability. Further Ado on top is the highest-probability ordering by a wide margin (17.8% vs the next horse at 8.4%).
Key Further Ado in the win slot, wheel three horses across second and third. Robusta is the cheapest live longshot in this slot; if he hits second or third the trifecta pays heavy.
10-cent four-horse box covers the model’s clear #1 plus the second tier (The Puma) plus the two best longshot value bets. Most cost-efficient chaos hedge in a 20-horse field; pays big if any ordering hits.
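Ticket costs are pure permutation counting. A quick sketch at the base units quoted above ($1 trifecta, 10-cent superfecta; assumed minimums, confirm at the window):

```python
from math import perm

def box_cost(n_horses: int, positions: int, base: float) -> float:
    # a box pays for every ordering of n_horses across the positions
    return perm(n_horses, positions) * base

print(box_cost(3, 3, 1.00))  # $1 trifecta, 3-horse box: 6 orderings -> $6.00
print(box_cost(4, 4, 0.10))  # 10c superfecta, 4-horse box: 24      -> $2.40
print(1 * 3 * 2 * 1.00)      # $1 tri key: 1 fixed winner, 3 horses
                             # wheeled across 2nd/3rd = 6 combos     -> $6.00
```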
Every input is measured, scraped, or pulled from a public source. No hand-typed numbers, no author guesses. Receipts in derby/derby_ingest.py and derby/derby_build.py.
| Input | Source | What it drives |
|---|---|---|
| 2026 morning line | kentuckyderby.com post-draw page (20 horses + also-eligibles) | The market-implied % column and the BET / FAIR / FADE call |
| 2026 speed and pace | Horse Racing Nation handicapping article (Beyer, Brisnet, TFUS, Last-1f, Last-3f) | Beyer column, run-style classification, pace-fit signal |
| Historical Derby results | Wikipedia 2010 through 2025 (305 starters, finishing order, fractions, track condition) | 5,000-combo back-test, ML training set, per-post historical win rate |
| Historical winner Beyers | Washington Post Beyer archive and BloodHorse recap articles | Year-level speed-figure feature in the ML model |
| Pedigree (DP, DI, CD) | PedigreeQuery (24 horses; 2026 field) | Stamina-test flag and dosage-score feature |
| Trainer and jockey at Churchill | TrackMaster Churchill StatsMaster + historical Derby win counts | Trainer and jockey signals in the composite score |
| Workouts and layoff | Churchill Downs press releases at churchilldowns.com | Cross-checking pace and run-style assignments |
| Race-day weather | National Weather Service (Churchill Downs lat/lon, KSDF observations) | Track-condition prior; fast track for Saturday |
One honest gap: per-horse Beyers for losing finishers in historical Derbies are DRF-paywalled. The historical training set therefore carries the same year-level winner Beyer for every horse in a given race, and per-horse historical signal comes from post position and connections instead of speed figures.
The same audit a sharp data scientist would run before betting a dollar on this. We ran it ourselves, including a real permutation test on Burla. Receipts in derby/derby_audit.py.
In plain English: the back-test score is points the model earns for putting the actual winner near the top of its rankings (10 for 1st, 5 for 2nd, 2 for 3rd, 1 for 4th, 0 after that). Higher is better; 126/160 is what we got. We tried to fool the test: we re-ran the full 5,000-combo search 2,000 times after secretly scrambling which horse actually won each historical Derby, then graded the model against the fake winners. If the model is genuinely picking real winners, the fake-winner versions should score much lower. Every single one of the 2,000 scrambled runs did. The model is finding real signal, not getting lucky on the rankings.
Translation: in 2,000 random-label runs, none came within four points of the real-label score. The framework is picking up real signal, not search noise.
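A compressed sketch of that permutation test; run_weight_search is a hypothetical stand-in for the full 5,000-combo search (receipts in derby/derby_audit.py):

```python
import numpy as np

def null_distribution(backtest_fields, n_perms=2000, seed=0):
    # Scramble which horse "won" each historical Derby, re-run the
    # weight search on the fake labels, and collect the best scores.
    rng = np.random.default_rng(seed)
    null_scores = []
    for _ in range(n_perms):
        shuffled = {}
        for year, horses in backtest_fields.items():
            fake = rng.integers(len(horses))
            shuffled[year] = [dict(h, is_winner=(i == fake))
                              for i, h in enumerate(horses)]
        null_scores.append(run_weight_search(shuffled))  # hypothetical helper
    return null_scores

# p-value with the standard +1 correction:
# p = (1 + sum(s >= 126 for s in null_scores)) / (1 + 2000)
```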
derby_ingest.py dispatches a real Burla call across 114 scrape tasks and writes every payload to derby/data/raw/. Nothing is hard-coded as a fallback; the pipeline reproduces from raw bytes.
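The dispatch is the same remote_parallel_map pattern used throughout. A minimal sketch, where scrape and SCRAPE_URLS are hypothetical stand-ins for the real task list in derby_ingest.py:

```python
from burla import remote_parallel_map
import requests

def scrape(url):
    # fetch one source page; the real pipeline writes the raw
    # payload to derby/data/raw/ before any parsing happens
    return {"url": url, "payload": requests.get(url, timeout=30).content}

# SCRAPE_URLS: the 114 Wikipedia / HRN / pedigree / NWS pages (assumed name)
payloads = remote_parallel_map(scrape, SCRAPE_URLS, grow=True)
```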
historical_results.csv: top-5 favorites by closing odds plus the actual winner, across 16 years. About 80 starters drive the search, no horse is hand-picked.
implied_prob = 1 / (odds + 1) is excluded from the feature set. The 8 inputs are post position, per-post historical win rate, trainer Derby wins, jockey Derby wins, year-level Beyer, dosage, run style, and a muddy-track flag. Predictions are independent of the market they are compared against.
What to trust: the 2026 inputs are measured (HRN, Wikipedia, NWS, the post-scratch morning line); the historical pipeline reads real Wikipedia results across 16 years; the back-test score clears its own null distribution by a wide margin. The ranking (Further Ado clear at the top, a tightly packed second tier, then Robusta, Great White and Litmus Test as the longshot value plays) is a defensible read of what is in the data.
What not to trust: win probabilities quoted to four decimals, the BET / FAIR / FADE labels read as perfectly calibrated, or the implication that more compute equals more truth. The model still cannot see pace figures, sectional times, current workouts, or the closing tote board. Treat the picks as a sharp first read, not a guarantee.
Burla is a Python library for parallel cloud compute. You write a function, call remote_parallel_map, and it fans out across a cluster instantly. No Docker, no Kubernetes, no orchestration glue.
For this pipeline, four separate Burla calls did all the heavy lifting: a 114-task scrape pulling 16 years of Wikipedia results plus the 2026 field; 164 ML model configurations trained in parallel; 5,000 weight combinations back-tested against 16 historical Derbies (2010 through 2025); and 1 trillion race simulations run as 10,000 parallel Burla workers in 47.8 minutes.
Full source code on GitHub. Burla docs at docs.burla.dev.
```python
# Test 5,000 weight combinations across all 10 factors.
# Back-test each against 2010-2025 Derby results.
# Returns in 7 seconds on Burla (5,000 parallel workers).
from burla import remote_parallel_map
import numpy as np

def backtest_weights(weights_list, factors, backtest_fields):
    weights = np.array(weights_list)
    total_score = 0
    for year, horses in backtest_fields.items():
        scores = [(h["name"], sum(weights[i] * h[f] for i, f in enumerate(factors)))
                  for h in horses]
        scores.sort(key=lambda x: -x[1])
        actual = next(h["name"] for h in horses if h["is_winner"])
        rank = next(i for i, (n, _) in enumerate(scores) if n == actual)
        total_score += [10, 5, 2, 1, 0][min(rank, 4)]
    return {"weights": weights_list, "score": total_score}

# Dirichlet sample: uniform over the probability simplex
combos = np.random.dirichlet(np.ones(10), size=5000).tolist()
args = [(c, FACTORS, BACKTEST_FIELDS) for c in combos]

# 5,000 workers run simultaneously → results in 7 seconds
results = remote_parallel_map(backtest_weights, args, grow=True)
best = max(results, key=lambda r: r["score"])
# → stamina_test: 19.2%, beyer_norm: 16.1%, dosage_score: 15.9%
```
```python
# Train 164 ML configs (GBM, RF, LogReg) in parallel.
# Validate on 2022-2025 holdout (real data, no leak). Best log-loss 0.12.
from burla import remote_parallel_map
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss
import pandas as pd, numpy as np

def train_and_eval(cfg, train_rows, holdout_rows, field_rows):
    # Split features from the is_winner label carried in each row set
    X_train, y_train = train_rows.drop(columns=["is_winner"]), train_rows["is_winner"]
    X_hold, y_hold = holdout_rows.drop(columns=["is_winner"]), holdout_rows["is_winner"]
    X_field = field_rows  # 2026 field: features only, no outcome yet

    pipe = Pipeline([("scaler", StandardScaler()),
                     ("clf", GradientBoostingClassifier(**cfg))])
    pipe.fit(X_train, y_train)
    score = log_loss(y_hold, pipe.predict_proba(X_hold)[:, 1])
    field_probs = pipe.predict_proba(X_field)[:, 1].tolist()
    return {"cfg": cfg, "log_loss": score, "field_probs": field_probs}

# args_list: one (cfg, train, holdout, field) tuple per configuration,
# built upstream; 164 configs dispatched simultaneously to the Burla cluster
results = remote_parallel_map(train_and_eval, args_list, grow=True)
results.sort(key=lambda r: r["log_loss"])
# Best: GBM depth=2, lr=0.10, subsample=0.9 → log-loss=0.1199
ensemble_probs = np.mean([r["field_probs"] for r in results[:5]], axis=0)
```
```python
# 1,000,000,000,000 race simulations across 10,000 Burla workers.
# Each worker runs 100,000,000 sims, returns position tallies.
from burla import remote_parallel_map
import numpy as np

def simulate_race_batch(scores, n_sims, seed):
    rng = np.random.default_rng(seed)
    n = len(scores)
    base = np.array(scores)           # per-horse strength, fixed per batch
    counts = np.zeros((n, 4), dtype=np.int64)
    for _ in range(n_sims):
        noise = rng.normal(0, 1.8, n)  # calibrated upset rate
        noisy = base + noise
        exp_s = np.exp(noisy - noisy.max())
        probs = exp_s / exp_s.sum()
        order = rng.choice(n, size=4, replace=False, p=probs)
        for rank, idx in enumerate(order):
            counts[idx][rank] += 1
    return {"counts": counts.tolist()}

# 10,000 workers × 100,000,000 sims = 1,000,000,000,000 total in ~48 minutes
args = [(log_probs, 100_000_000, seed) for seed in range(10_000)]
results = remote_parallel_map(simulate_race_batch, args, grow=True)
total = np.sum([r["counts"] for r in results], axis=0)
# → Further Ado: 17.8% win   The Puma: 8.4%   Commandment: 7.3%
```
The real result isn’t the win probability. It’s the time it took to build a rigorous model from scratch.
If you ran all three compute steps locally, sequentially, with Python’s GIL limiting CPU threads, the sensitivity analysis alone would take ~20 minutes. The full pipeline would take the better part of an afternoon.
With Burla, the same pipeline runs in under 3 minutes of actual compute time. That changes how you iterate: you can test 5,000 weight hypotheses, not 50. You can train 164 model configurations before picking a winner, not 5. You can run a trillion simulations, not a million.
The sensitivity analysis pointed at three signals splitting the weight roughly evenly: stamina test (19%), year-level Beyer (16%), and dosage score (16%). Trainer and jockey edges still matter, but less than the trade-press narrative usually claims. That clarity only emerged because we could afford to test 5,000 weight combinations on real 16-year back-test data instead of a handful of curated rows.