Can AI beat the crowd?
Six LLMs make blind probability forecasts on live prediction markets — they never see the market price. We score each against the crowd (the market price itself) and test whether pooling them into an ensemble wins.
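The page does not say how the six forecasts are pooled. As a minimal sketch, assuming an unweighted mean of the model probabilities (the function name `pool_forecasts` and the sample numbers are hypothetical), pooling for a single market could look like this:

```python
from statistics import fmean


def pool_forecasts(probs):
    """Pool per-model probabilities for one market into a single forecast.

    Assumption: an unweighted mean; the page does not state the pooling rule.
    A value of None marks a model whose forecast errored, and it is skipped.
    """
    valid = [p for p in probs if p is not None]
    if not valid:
        raise ValueError("no valid forecasts to pool")
    return fmean(valid)


# Hypothetical forecasts from six models for one binary market.
model_probs = [0.70, 0.55, None, 0.80, 0.65, 0.60]
ensemble_p = pool_forecasts(model_probs)
print(f"ensemble forecast: {ensemble_p:.2f}")
```

Other pooling rules (median, log-odds averaging, skill-weighted means) would slot into the same function.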
The verdict so far
Not yet — the crowd still leads the ensemble
Ensemble Brier: 0.274
Crowd Brier: 0.169
Markets: 27
Best vs. Crowd: 🌀 Mistral Small 3.2 (-0.149 Brier)
Ensemble Skill: -0.105 (Brier vs. crowd)
Models Beating Crowd: 0 / 6 (on shared markets)
Markets Resolved: 30 (138 still open)
Skill vs. the Crowd
Brier improvement over the market price (positive = beat the crowd)
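One reading of this definition is that skill equals the crowd's Brier score minus the forecaster's Brier score on the same markets. The sketch below illustrates that reading; the function names and sample data are hypothetical, not taken from the site's code.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def skill_vs_crowd(model_probs, crowd_probs, outcomes):
    """Crowd Brier minus model Brier on shared markets.

    Positive means the model beat the crowd; negative means the crowd won.
    """
    return brier_score(crowd_probs, outcomes) - brier_score(model_probs, outcomes)


# Hypothetical data: four resolved binary markets (1 = resolved yes).
outcomes = [1, 0, 1, 0]
crowd = [0.8, 0.2, 0.7, 0.1]   # market prices read as probabilities
model = [0.6, 0.4, 0.5, 0.3]   # blind model forecasts

print(f"skill vs. crowd: {skill_vs_crowd(model, crowd, outcomes):+.3f}")
```

The headline figures on this page fit this sign convention: 0.169 (crowd Brier) minus 0.274 (ensemble Brier) gives the reported ensemble skill of -0.105.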
Skill leaderboard
| # | Forecaster | Skill vs Crowd | Brier | Log Loss | ECE | Resolution | Forecasts | Reliability |
|---|---|---|---|---|---|---|---|---|
| 1 | 👥 The Crowd (baseline) | n/a | 0.277 | 0.799 | 0.324 | 0.058 | 230 / 534 | 100.0% |
| 2 | 🌀 Mistral Small 3.2 (Mistral) | -0.149 | 0.331 | 0.903 | 0.355 | 0.028 | 230 / 537 | 99.6% |
| 3 | 🌱 Seed 1.6 Flash (ByteDance) | -0.153 | 0.335 | 0.914 | 0.372 | 0.024 | 230 / 537 | 99.8% |
| 4 | 🎯 Ensemble (ensemble) | -0.161 | 0.343 | 0.935 | 0.381 | 0.019 | 230 / 534 | 100.0% |
| 5 | 💎 Gemini 3.1 Flash Lite (Google) | -0.182 | 0.364 | 1.057 | 0.405 | 0.033 | 230 / 537 | 99.6% |
| 6 | 🐲 Qwen3 235B (Alibaba) | -0.185 | 0.371 | 1.404 | 0.401 | 0.021 | 203 / 535 | 91.8% |
| 7 | 🧠 GPT-4.1 Mini (OpenAI) | -0.202 | 0.384 | 1.042 | 0.403 | 0.021 | 230 / 536 | 100.0% |
| 8 | 🔮 DeepSeek V4 Flash (DeepSeek) | -0.220 | 0.403 | 1.197 | 0.427 | 0.017 | 222 / 534 | 95.9% |
Sorted by skill vs. the crowd. Lower Brier, log loss, and ECE are better; higher resolution means more informative forecasts. Reliability is the share of valid (non-errored) forecasts.
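For readers who want to reproduce the other columns, here is a rough sketch of log loss, expected calibration error (ECE), and resolution. It assumes 10 equal-width probability bins and the resolution term of the Murphy decomposition of the Brier score; the page does not state its binning or exact formulas, so treat these choices and all names as assumptions.

```python
import math


def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, o in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total -= o * math.log(p) + (1 - o) * math.log(1 - p)
    return total / len(probs)


def _nonempty_bins(probs, outcomes, n_bins):
    """Group (probability, outcome) pairs into equal-width bins; drop empty bins."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, o))
    return [b for b in bins if b]


def ece(probs, outcomes, n_bins=10):
    """Weighted mean gap between average forecast and observed frequency per bin."""
    n = len(probs)
    total = 0.0
    for b in _nonempty_bins(probs, outcomes, n_bins):
        mean_p = sum(p for p, _ in b) / len(b)
        freq = sum(o for _, o in b) / len(b)
        total += len(b) / n * abs(mean_p - freq)
    return total


def resolution(probs, outcomes, n_bins=10):
    """Weighted squared distance of each bin's observed frequency from the base rate."""
    n = len(probs)
    base_rate = sum(outcomes) / n
    total = 0.0
    for b in _nonempty_bins(probs, outcomes, n_bins):
        freq = sum(o for _, o in b) / len(b)
        total += len(b) / n * (freq - base_rate) ** 2
    return total


# Hypothetical forecasts on four resolved binary markets.
probs = [0.9, 0.9, 0.1, 0.1]
outcomes = [1, 1, 0, 0]
print(log_loss(probs, outcomes), ece(probs, outcomes), resolution(probs, outcomes))
```

With so few resolved markets, binned quantities such as ECE and resolution depend heavily on the number of bins, so the defaults here are only a starting point.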