Can AI beat the crowd?

Six LLMs make blind probability forecasts on live prediction markets: they never see the market price. We score each model against the crowd (the market price itself) and test whether pooling the models into an ensemble beats the crowd.
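As a rough illustration of the scoring described above, the sketch below computes a Brier score for binary markets and pools several models by simple averaging. The averaging rule, the function names, and the toy numbers are all assumptions made for illustration; the page does not say how the ensemble is actually pooled.

```python
from statistics import mean

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return mean((p - o) ** 2 for p, o in zip(probs, outcomes))

def pool_mean(model_probs):
    """Pool forecasts by averaging across models, market by market (assumed rule)."""
    return [mean(market) for market in zip(*model_probs)]

# Toy data (made up): three models, four resolved binary markets.
model_probs = [
    [0.9, 0.2, 0.6, 0.3],
    [0.7, 0.4, 0.8, 0.1],
    [0.8, 0.3, 0.4, 0.5],
]
outcomes = [1, 0, 1, 0]
crowd_prices = [0.85, 0.15, 0.70, 0.20]  # market prices, never shown to the models

ensemble = pool_mean(model_probs)
ensemble_brier = brier_score(ensemble, outcomes)
crowd_brier = brier_score(crowd_prices, outcomes)
print(f"ensemble={ensemble_brier:.4f} crowd={crowd_brier:.4f}")
```

Lower Brier is better, so in this toy example the crowd would lead the ensemble, which is the same direction as the result reported on this page.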

The verdict so far

Not yet — the crowd still leads the ensemble

Ensemble Brier: 0.274

Crowd Brier: 0.169

Markets: 27

Best vs. Crowd: 🌀 Mistral Small 3.2 (-0.149 Brier)

Ensemble Skill: -0.105 (Brier vs. crowd)

Models Beating Crowd: 0 / 6 (on shared markets)

Markets Resolved: 30 (138 still open)

Skill vs. the Crowd
Brier improvement over the market price (positive = beat the crowd)
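The sign convention can be checked against the headline figures on this page: subtracting the ensemble's Brier score (0.274) from the crowd's (0.169) reproduces the reported ensemble skill of -0.105. A minimal sketch, where the function name `skill_vs_crowd` is hypothetical:

```python
def skill_vs_crowd(crowd_brier, model_brier):
    """Brier improvement over the market price; positive means the model beat the crowd."""
    return crowd_brier - model_brier

# Headline figures reported on this page.
ensemble_skill = skill_vs_crowd(crowd_brier=0.169, model_brier=0.274)
print(round(ensemble_skill, 3))  # -0.105
```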

Skill leaderboard

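| # | Forecaster (provider) | Skill vs Crowd | Brier | Log Loss | ECE | Resolution | Forecasts | Reliability |
|---|---|---|---|---|---|---|---|---|
| 1 | 👥 The Crowd | baseline | 0.277 | 0.799 | 0.324 | 0.058 | 230 / 534 | 100.0% |
| 2 | 🌀 Mistral Small 3.2 (Mistral) | -0.149 | 0.331 | 0.903 | 0.355 | 0.028 | 230 / 537 | 99.6% |
| 3 | 🌱 Seed 1.6 Flash (ByteDance) | -0.153 | 0.335 | 0.914 | 0.372 | 0.024 | 230 / 537 | 99.8% |
| 4 | 🎯 Ensemble | -0.161 | 0.343 | 0.935 | 0.381 | 0.019 | 230 / 534 | 100.0% |
| 5 | 💎 Gemini 3.1 Flash Lite (Google) | -0.182 | 0.364 | 1.057 | 0.405 | 0.033 | 230 / 537 | 99.6% |
| 6 | 🐲 Qwen3 235B (Alibaba) | -0.185 | 0.371 | 1.404 | 0.401 | 0.021 | 203 / 535 | 91.8% |
| 7 | 🧠 GPT-4.1 Mini (OpenAI) | -0.202 | 0.384 | 1.042 | 0.403 | 0.021 | 230 / 536 | 100.0% |
| 8 | 🔮 DeepSeek V4 Flash (DeepSeek) | -0.220 | 0.403 | 1.197 | 0.427 | 0.017 | 222 / 534 | 95.9% |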

Sorted by skill vs. the crowd. Lower Brier, log loss, and ECE are better; higher resolution means more informative forecasts. Reliability is the share of valid (non-errored) forecasts.
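For readers who want to see how two of these metrics are commonly computed, here is a minimal sketch of log loss and expected calibration error (ECE) for binary forecasts. The ten equal-width bins, the clipping constant, the function names, and the toy data are assumptions; the page does not specify how this leaderboard bins or clips forecasts.

```python
import math

def log_loss(probs, outcomes, eps=1e-15):
    """Average negative log-likelihood of the 0/1 outcomes.

    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for p, o in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total += -(o * math.log(p) + (1 - o) * math.log(1.0 - p))
    return total / len(probs)

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted mean gap between average forecast and observed frequency per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, o))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_p = sum(p for p, _ in members) / len(members)
        freq = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(avg_p - freq)
    return ece

# Toy data (made up): an overconfident forecaster on four markets.
probs = [0.9, 0.9, 0.1, 0.1]
outcomes = [1, 0, 0, 0]
print(f"log loss={log_loss(probs, outcomes):.3f}")
print(f"ECE={expected_calibration_error(probs, outcomes):.3f}")
```

Log loss punishes confident misses much more heavily than Brier does, which may be why the two columns do not always rank forecasters in the same order.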