CODE-VERIFIED · NO LLM JUDGE

Coaching is reasoning.

Anyone can plug numbers into a formula. A coach reasons from them — reads the athlete, prescribes the work, prevents the injury. FitnessBench grades how well AI models do that, across six sports, against exercise science computed and checked in code. Models ace the formulas and stall on the reasoning.

115 code-verified questions
6 disciplines
7 models benchmarked
Figure: power–duration curve (the science we test). Power against duration, from sprint through 1 min and 5 min to threshold, flattening toward critical power (the sustainable threshold).
Ground truth from Daniels
Coggan power zones
Riegel prediction
Mifflin-St Jeor metabolics
Gabbett workload ratios
Procedural — nothing memorizable
The finding

A calculator, not a coach.

Coaching is the same loop in every sport: read the athlete's numbers, reason, prescribe. We built that loop as questions — fitness from a race result, the pace it implies, the load that won't cause injury — with every answer computed from the literature and graded in code.

Models split cleanly. Single-formula lookups — BMI, one-rep-max, heart-rate zones — are nearly perfect. The moment a question needs several chained steps of reasoning, accuracy falls off a cliff. Today's models calculate. They don't yet coach.
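
As a rough illustration of what "several chained steps" means, the sketch below goes from a race result to a VDOT estimate and then to a training velocity. It uses the published Daniels and Gilbert regression equations. The function names and the 88% threshold fraction are assumptions for illustration, not FitnessBench's actual question generator.

```python
import math

def vdot_from_race(distance_m: float, time_min: float) -> float:
    """Estimate VDOT from a single race result (Daniels & Gilbert regressions)."""
    v = distance_m / time_min  # race velocity in metres per minute
    # Step 1: oxygen cost of running at that velocity (ml/kg/min).
    vo2 = -4.60 + 0.182258 * v + 0.000104 * v ** 2
    # Step 2: fraction of VO2max that can be sustained for the race duration.
    frac = (0.8
            + 0.1894393 * math.exp(-0.012778 * time_min)
            + 0.2989558 * math.exp(-0.1932605 * time_min))
    return vo2 / frac

def velocity_at_fraction(vdot: float, fraction: float) -> float:
    """Step 3: invert the cost equation to get velocity (m/min) at a fraction of VDOT."""
    target = vdot * fraction
    a, b, c = 0.000104, 0.182258, -4.60 - target
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)

# Example: a 20:00 5K, then a threshold-style pace.
# The 0.88 fraction is an illustrative assumption.
vdot = vdot_from_race(5000, 20.0)
pace_min_per_km = 1000 / velocity_at_fraction(vdot, 0.88)
```

A single-formula task stops after one of these steps. A pace prescription needs all three in the right order, with the right units at each hand-off.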

Figure: field-average accuracy by task type (0–100%)
Single-formula "plug-in" tasks: BMI · 1RM · HR zones · energy · power-to-weight
Multi-step reasoning tasks: VDOT inference · training-pace prescription
Averaged across every benchmarked model — frontier and open alike. No model clears 60% on training-pace prescription.

Best model's overall accuracy across six disciplines — flattered by the easy formulas every model can do.

Best model on the multi-step reasoning tasks — the actual coaching judgment, where the frontier stalls.

6 / 29

Disciplines & task types — including a multi-step tier — every answer computed from a named formula, scored at temperature 0.

The science we test

Real models behind every question.

FitnessBench doesn't ask for opinions. Each question is generated from an established exercise-science model, so the right answer is computed — and the grading is code.

Race-pace curve (running)
Figure: pace against distance, from 5K to half marathon.

Pace per km climbs predictably with distance — the Riegel/VDOT basis for prediction.
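
The Riegel relationship behind that curve can be sketched in a few lines. The exponent 1.06 is Riegel's commonly quoted value. The function name and the example race are illustrative assumptions.

```python
def riegel_predict(t1_s: float, d1_m: float, d2_m: float, exponent: float = 1.06) -> float:
    """Predict the time for distance d2 from a known time t1 over distance d1."""
    return t1_s * (d2_m / d1_m) ** exponent

# Example: a 20:00 5K projected to 10K.
t_10k = riegel_predict(1200, 5000, 10000)

# Pace per km slows as distance grows because the exponent is above 1.
pace_5k = 1200 / 5
pace_10k = t_10k / 10
```

Because the exponent is greater than 1, doubling the distance more than doubles the time, which is why pace per km rises with distance.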

Heart-rate zones (physiology)
Figure: zones from Z1 (easy) through Z3 (tempo) to Z5 (VO₂max), with the lactate threshold marked.

Five zones as a share of max HR; Karvonen sets the target bpm for each from heart-rate reserve (max HR minus resting HR).
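
A minimal sketch of the Karvonen calculation, assuming a known max and resting heart rate. The function name and the example athlete are illustrative.

```python
def karvonen_target(hr_max: float, hr_rest: float, intensity: float) -> float:
    """Target heart rate (bpm) at a given intensity of heart-rate reserve."""
    reserve = hr_max - hr_rest  # heart-rate reserve
    return hr_rest + intensity * reserve

# Example: max HR 190, resting HR 50, 70% intensity.
target_bpm = karvonen_target(190, 50, 0.70)
```

The same intensity gives a higher target than a plain percentage of max HR, because the resting rate is added back after scaling the reserve.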

Workload sweet spot (injury)
Figure: acute:chronic workload ratio by week, with the sweet spot at 0.8–1.3 and the danger zone above 1.5.

Acute:chronic load in the 0.8–1.3 band lowers injury risk; spikes push into the danger zone.
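
One common way to compute the ratio is the rolling-average form: the latest week's load divided by the mean of the last four weeks. The sketch below assumes that form. The band labels other than "sweet spot" and "danger" are illustrative names, not terms from the benchmark.

```python
def acwr(weekly_loads: list[float]) -> float:
    """Acute:chronic workload ratio from at least four weekly load totals."""
    if len(weekly_loads) < 4:
        raise ValueError("need at least four weeks of load")
    acute = weekly_loads[-1]                 # most recent week
    chronic = sum(weekly_loads[-4:]) / 4     # rolling four-week average
    return acute / chronic

def classify(ratio: float) -> str:
    """Map a ratio onto the bands described above."""
    if ratio > 1.5:
        return "danger"
    if 0.8 <= ratio <= 1.3:
        return "sweet spot"
    return "below band" if ratio < 0.8 else "elevated"

# Example: three steady weeks followed by a spike.
spike = acwr([300, 300, 300, 900])
```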

Coverage

Six disciplines. Every answer computed.

Wherever exercise science gives a verifiable answer, FitnessBench tests it — from race paces to power zones to injury workload. A named formula stands behind every question.

Running

Fitness, prescription and prediction from race results — plus training-load safety.

VDOT · training pace · race prediction · mileage safety

Cycling

Power-based training: zones, stress, sustainable thresholds and power-to-weight.

FTP zones · TSS · W/kg · critical power
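
A sketch of two of these cycling quantities, using the standard Training Stress Score definition and a two-effort critical power estimate. The function names and example efforts are assumptions for illustration. The benchmark may fit critical power differently.

```python
def tss(duration_s: float, normalized_power: float, ftp: float) -> float:
    """Training Stress Score: one hour at FTP scores 100."""
    intensity_factor = normalized_power / ftp
    return (duration_s * normalized_power * intensity_factor) / (ftp * 3600) * 100

def critical_power(effort_a: tuple[float, float],
                   effort_b: tuple[float, float]) -> tuple[float, float]:
    """Two-point critical power model from (duration_s, avg_power_w) efforts.

    Returns (CP in watts, W' in joules), assuming work = CP * t + W'.
    """
    (t1, p1), (t2, p2) = effort_a, effort_b
    cp = (p2 * t2 - p1 * t1) / (t2 - t1)  # slope of work against time
    w_prime = (p1 - cp) * t1              # finite work capacity above CP
    return cp, w_prime

def watts_per_kg(power_w: float, mass_kg: float) -> float:
    """Power-to-weight ratio."""
    return power_w / mass_kg

# Example: 3 min at 400 W and 12 min at 300 W.
cp, w_prime = critical_power((180, 400), (720, 300))
```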

Swimming

Critical swim speed from time trials, pace, and CSS-based time prediction.

CSS · swim pace · time prediction
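
Critical swim speed is usually derived from a 400 m and a 200 m time trial. The sketch below assumes that protocol. The function names and trial times are illustrative.

```python
def critical_swim_speed(t400_s: float, t200_s: float) -> float:
    """CSS in metres per second from 400 m and 200 m time-trial times."""
    return (400 - 200) / (t400_s - t200_s)

def css_pace_per_100m(t400_s: float, t200_s: float) -> float:
    """CSS expressed as seconds per 100 m."""
    return 100 / critical_swim_speed(t400_s, t200_s)

# Example: 400 m in 6:00 and 200 m in 2:50.
pace = css_pace_per_100m(360, 170)
```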

Physiology

Heart-rate zones, basal and total energy expenditure, and body composition.

max HR · Karvonen · BMR / TDEE · MET burn · BMI
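
The energy and body-composition formulas are short enough to show together. This sketch uses the Mifflin-St Jeor equation, the usual MET definition (1 MET is about 1 kcal per kg per hour), and BMI. The 1.55 activity multiplier in the example is a commonly used value for moderate activity and is an assumption here.

```python
def bmr_mifflin_st_jeor(mass_kg: float, height_cm: float, age_y: float, male: bool) -> float:
    """Basal metabolic rate in kcal/day (Mifflin-St Jeor)."""
    base = 10 * mass_kg + 6.25 * height_cm - 5 * age_y
    return base + 5 if male else base - 161

def tdee(bmr: float, activity_factor: float) -> float:
    """Total daily energy expenditure as BMR scaled by an activity multiplier."""
    return bmr * activity_factor

def met_kcal(met: float, mass_kg: float, hours: float) -> float:
    """Energy burned by an activity of a given MET value."""
    return met * mass_kg * hours

def bmi(mass_kg: float, height_m: float) -> float:
    """Body mass index in kg per square metre."""
    return mass_kg / height_m ** 2

# Example: 80 kg, 180 cm, 30-year-old male, moderately active (assumed factor 1.55).
daily_kcal = tdee(bmr_mifflin_st_jeor(80, 180, 30, male=True), 1.55)
```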

Strength

One-rep-max estimation and load prescription for a target rep range.

1RM estimate · load prescription
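
The page does not say which one-rep-max formula is used, so the sketch below assumes the Epley formula, one of the standard choices. Load prescription is then the same formula inverted for a target rep count.

```python
def epley_1rm(weight: float, reps: int) -> float:
    """Estimated one-rep max from a set of `reps` at `weight` (Epley)."""
    return weight * (1 + reps / 30)

def load_for_reps(one_rm: float, target_reps: int) -> float:
    """Load that the Epley model predicts can be lifted for `target_reps`."""
    return one_rm / (1 + target_reps / 30)

# Example: 100 kg for 5 reps, then a load for sets of 8.
one_rm = epley_1rm(100, 5)
load_8 = load_for_reps(one_rm, 8)
```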

Injury

Workload-spike risk, classic overuse-injury recognition, and evidence-based management.

ACWR · injury risk · recognition · PEACE & LOVE
Methodology

Why this number means something

Most benchmarks leak, saturate, or grade with another LLM. FitnessBench is built against each of those failure modes — the score is a measurement, not a vibe.

Four defenses

Computed ground truth: answers come from established models (Daniels & Gilbert, Coggan, Riegel, Mifflin-St Jeor, Gabbett), not opinion.
Code-verified: responses are parsed and checked against a numeric tolerance, never by an LLM judge.
Procedural: every question's numbers are randomized per seed, so nothing is memorizable.
Correctness is the target: not a proxy, so there's no confound to game.

What it covers

Running: VDOT, paces, race prediction, mileage safety.
Cycling: Coggan FTP zones, TSS, power-to-weight, critical power.
Swimming: critical swim speed, pace, prediction.
Physiology: max HR, Karvonen, BMR/TDEE, MET burn, BMI.
Strength: one-rep max and load prescription.
Injury: acute:chronic workload ratio, overuse-injury recognition, and evidence-based acute management (PEACE & LOVE).

Leaderboard

Which model coaches best?

Overall accuracy across all six disciplines. Click any model to read the actual questions, its full reasoning, and where it got the science right or wrong.

Cost & value

Accuracy you can afford to ship.

An AI coach runs on every workout, for every user. The right model is the one that's correct and cheap to serve. We fit a curve of accuracy against cost, and each model's value score is how far its accuracy sits above that curve at the price it charges.

Figure: accuracy vs. cost ($ per 1k questions)
Each dot is a model. The dashed line is the value frontier — models no other beats on both accuracy and cost.
Best value (accuracy vs. its price)
Columns: # · Model · $/1k Q · Acc · vs curve
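
The value frontier described in the chart caption is a Pareto frontier: a model is on it when no other model is at least as cheap and at least as accurate while being strictly better on one of the two. A minimal sketch, with made-up model names and numbers used purely as placeholders:

```python
def value_frontier(models: list[tuple[str, float, float]]) -> list[str]:
    """Return the names of models not dominated on (cost, accuracy).

    Each entry is (name, cost_per_1k_questions, accuracy).
    """
    frontier = []
    for name, cost, acc in models:
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for _, c, a in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical data, not benchmark results.
example = [("model-a", 1.0, 0.60), ("model-b", 2.0, 0.55), ("model-c", 5.0, 0.80)]
on_frontier = value_frontier(example)
```

In this placeholder data, model-b costs more than model-a and scores lower, so it falls off the frontier.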
For teams shipping AI coaching

You've decided the coach is the product. Can you prove it's right?

The public leaderboard tells you which base model to start from. The private benchmark tells you whether your coach — your prompts, your retrieval, your fine-tune — is actually correct, at a cost you can ship, and stops an upgrade from silently making it worse.

01

Pick a base model

Rank models by sport-science correctness and by cost-per-answer, broken out by discipline — so you don't ship on a model that can't prescribe a pace.

02

Benchmark your coach

Run your actual stack against the procedural question bank. Get a discipline-level scorecard against computed ground truth — not a focus group.

03

Gate every release

Wire FitnessBench into CI as a regression gate. Swap a model or change a prompt, and find out immediately if the coach got less correct.
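
A regression gate of this kind can be as simple as comparing a candidate scorecard with a stored baseline and failing the build on any drop. The sketch below is hypothetical: the scorecard shape (discipline name to accuracy), the function names and the one-point allowance are assumptions, not a FitnessBench API.

```python
def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     max_drop: float = 0.01) -> list[str]:
    """List disciplines where candidate accuracy fell more than `max_drop`."""
    return [
        discipline
        for discipline, base_acc in baseline.items()
        if candidate.get(discipline, 0.0) < base_acc - max_drop
    ]

def exit_code(baseline: dict[str, float], candidate: dict[str, float]) -> int:
    """Return 1 if any discipline regressed, else 0, for use as a CI exit status."""
    regressions = find_regressions(baseline, candidate)
    for discipline in regressions:
        print(f"regression in {discipline}: "
              f"{baseline[discipline]:.2f} -> {candidate.get(discipline, 0.0):.2f}")
    return 1 if regressions else 0

# In a CI job the script would end with something like:
#   sys.exit(exit_code(load_baseline(), run_benchmark()))
# where load_baseline and run_benchmark are placeholders for your own code.
```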

Don't ship a coach that flunks its own sport.

Benchmark your model — or your whole coaching stack — on computed exercise science.