Can AI beat the crowd?

Six LLMs make blind probability forecasts on live prediction markets — they never see the market price. We score each against the crowd (the market price itself) and test whether pooling them into an ensemble wins.
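
As a rough illustration of the comparison described above, here is a minimal sketch that scores a pooled forecast and the crowd with the Brier score. The page does not say how the models are pooled, so the simple per-market average used here is an assumption, and the function names and all numbers are hypothetical.

```python
from statistics import fmean

def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return fmean((p - o) ** 2 for p, o in zip(probs, outcomes))

def pool_mean(model_probs):
    """Average the models' probabilities market by market.

    Assumption: plain mean pooling. The real ensemble may use another rule.
    """
    return [fmean(ps) for ps in zip(*model_probs)]

# Hypothetical data: 3 models, 4 resolved binary markets.
outcomes = [1, 0, 1, 0]
model_probs = [
    [0.70, 0.40, 0.60, 0.20],
    [0.50, 0.30, 0.80, 0.40],
    [0.60, 0.50, 0.70, 0.30],
]
# The crowd forecast is the market price, which the models never see.
crowd_probs = [0.80, 0.20, 0.90, 0.10]

ensemble_probs = pool_mean(model_probs)
ensemble_brier = brier(ensemble_probs, outcomes)
crowd_brier = brier(crowd_probs, outcomes)

print(f"ensemble {ensemble_brier:.3f} vs crowd {crowd_brier:.3f}")
```

Lower Brier is better, so in this made-up example the crowd still comes out ahead of the pooled forecast.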

The verdict so far

Not yet — the crowd still leads the ensemble

Ensemble Brier: 0.274

Crowd Brier: 0.169

Markets

Best vs. Crowd: 🌀 Mistral Small 3.2 (-0.149 Brier)

Ensemble Skill: -0.105 (Brier vs. crowd)

Models Beating Crowd: 0 / 6 (on shared markets)

Markets Resolved

138 still open

Skill vs. the Crowd

Brier improvement over the market price (positive = beat the crowd)
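
A small sketch of that skill measure, assuming it is simply the crowd's Brier score minus the forecaster's. The helper name is hypothetical; the two inputs are the headline Ensemble Brier and Crowd Brier figures from this page.

```python
def skill_vs_crowd(forecaster_brier, crowd_brier):
    """Brier improvement over the crowd; positive means the forecaster won.

    Assumption: skill = crowd Brier - forecaster Brier.
    """
    return crowd_brier - forecaster_brier

# Headline figures from the page: ensemble 0.274, crowd 0.169.
ensemble_skill = skill_vs_crowd(0.274, 0.169)
print(f"{ensemble_skill:+.3f}")  # -0.105
```

The negative sign is what "the crowd still leads" means here: the ensemble's Brier score is higher (worse) than the market price's.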

Skill leaderboard

#	Forecaster	Skill vs Crowd	Brier	Log Loss	ECE	Resolution	Forecasts	Reliability
1	👥 The Crowd (baseline)	n/a	0.277	0.799	0.324	0.058	230 / 534	100.0%
2	🌀 Mistral Small 3.2 (Mistral)	-0.149	0.331	0.903	0.355	0.028	230 / 537	99.6%
3	🌱 Seed 1.6 Flash (ByteDance)	-0.153	0.335	0.914	0.372	0.024	230 / 537	99.8%
4	🎯 Ensemble (ensemble)	-0.161	0.343	0.935	0.381	0.019	230 / 534	100.0%
5	💎 Gemini 3.1 Flash Lite (Google)	-0.182	0.364	1.057	0.405	0.033	230 / 537	99.6%
6	🐲 Qwen3 235B (Alibaba)	-0.185	0.371	1.404	0.401	0.021	203 / 535	91.8%
7	🧠 GPT-4.1 Mini (OpenAI)	-0.202	0.384	1.042	0.403	0.021	230 / 536	100.0%
8	🔮 DeepSeek V4 Flash (DeepSeek)	-0.220	0.403	1.197	0.427	0.017	222 / 534	95.9%

Sorted by skill vs. the crowd. Lower Brier, log loss, and ECE are better; higher resolution means more informative forecasts. Reliability is the share of valid (non-errored) forecasts.
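
For readers unfamiliar with the other columns, the sketch below shows one common way to compute log loss, ECE, and resolution for binary forecasts. The page does not state its exact formulas, so the 10 equal-width bins, the clipping constant, and the use of the resolution term from the Murphy decomposition of the Brier score are all assumptions. Function names and data are hypothetical.

```python
import math
from collections import defaultdict

def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, o in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total -= o * math.log(p) + (1 - o) * math.log(1 - p)
    return total / len(probs)

def _bin_forecasts(probs, outcomes, n_bins):
    """Group (probability, outcome) pairs into equal-width probability bins."""
    bins = defaultdict(list)
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, o))
    return bins

def ece(probs, outcomes, n_bins=10):
    """Expected calibration error: weighted gap between stated and observed rates."""
    n = len(probs)
    total = 0.0
    for items in _bin_forecasts(probs, outcomes, n_bins).values():
        mean_p = sum(p for p, _ in items) / len(items)
        freq = sum(o for _, o in items) / len(items)
        total += len(items) / n * abs(mean_p - freq)
    return total

def resolution(probs, outcomes, n_bins=10):
    """Murphy resolution: how far per-bin outcome rates move from the base rate."""
    n = len(probs)
    base_rate = sum(outcomes) / n
    total = 0.0
    for items in _bin_forecasts(probs, outcomes, n_bins).values():
        freq = sum(o for _, o in items) / len(items)
        total += len(items) / n * (freq - base_rate) ** 2
    return total

# Hypothetical forecasts on four resolved markets.
probs = [0.9, 0.8, 0.2, 0.1]
outcomes = [1, 1, 0, 0]
ll = log_loss(probs, outcomes)
cal = ece(probs, outcomes)
res = resolution(probs, outcomes)
print(f"log loss {ll:.3f}, ECE {cal:.3f}, resolution {res:.3f}")
```

A forecaster that always states the base rate would score a resolution of zero under this definition, which is why higher values indicate more informative forecasts.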