The Lesson: why pool many models?
A single model is noisy: it has blind spots and overconfident moments. The bet behind this arena is that averaging several independent LLM forecasts cancels much of their individual error, producing a sharper, better-calibrated probability than any one of them alone. Three questions test that claim.
Ensemble vs. avg model: +0.030
Ensemble vs. best model: -0.029
Ensemble vs. crowd: -0.105
1. Does the ensemble beat its parts?
If pooling works, the ensemble's Brier score should sit below the average single model's score (lower is better). Beating the best single model is a higher bar, and beating the crowd (the market price) is the highest bar of all.
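A minimal sketch of the three comparisons, assuming the ensemble is a simple unweighted mean of the model probabilities and that each gap is reported as reference score minus ensemble score. Both are assumptions, as are the model names and toy numbers; the methodology page is the authority on what the arena actually computes. For a simple mean scored with squared error, the pooled forecast can never score worse than the average of the individual scores, so the first comparison is the easy one.

```python
from statistics import mean

def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))

def pool(forecasts):
    """Unweighted mean across models, market by market (assumed pooling rule)."""
    return [mean(column) for column in zip(*forecasts.values())]

# Hypothetical toy data: three models, four resolved markets.
outcomes = [1, 0, 1, 0]
forecasts = {
    "model_a": [0.9, 0.2, 0.6, 0.4],
    "model_b": [0.7, 0.4, 0.8, 0.1],
    "model_c": [0.6, 0.1, 0.5, 0.3],
}
crowd = [0.95, 0.05, 0.9, 0.1]  # hypothetical market prices

scores = {name: brier(probs, outcomes) for name, probs in forecasts.items()}
ensemble_score = brier(pool(forecasts), outcomes)

# Assumed sign convention: a positive gap means the ensemble scored better
# (lower Brier) than the reference it is compared against.
gap_vs_avg = mean(scores.values()) - ensemble_score
gap_vs_best = min(scores.values()) - ensemble_score
gap_vs_crowd = brier(crowd, outcomes) - ensemble_score

print(f"vs. avg model:  {gap_vs_avg:+.3f}")
print(f"vs. best model: {gap_vs_best:+.3f}")
print(f"vs. crowd:      {gap_vs_crowd:+.3f}")
```

In this toy data the ensemble beats the average model but trails both the best model and the crowd, which illustrates why the three bars are ordered the way they are.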
2. How many models do you need?
Each extra model adds value only if it brings something the others miss. We average the Brier score over every possible k-model subset, so the curve shows the typical benefit of growing the ensemble. A curve that flattens early means a handful of models captures most of the gain.
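The subset averaging described above can be sketched as follows. This assumes unweighted mean pooling within each subset; the function name `subset_curve` and the toy data are hypothetical, not taken from the arena's code.

```python
from itertools import combinations
from statistics import mean

def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))

def subset_curve(forecasts, outcomes):
    """Map each ensemble size k to the mean Brier score over all k-model subsets."""
    names = sorted(forecasts)
    curve = {}
    for k in range(1, len(names) + 1):
        subset_scores = []
        for subset in combinations(names, k):
            # Pool the chosen models with an unweighted mean, market by market.
            pooled = [mean(column) for column in zip(*(forecasts[n] for n in subset))]
            subset_scores.append(brier(pooled, outcomes))
        curve[k] = mean(subset_scores)
    return curve

# Hypothetical toy data: three models, four resolved markets.
outcomes = [1, 0, 1, 0]
forecasts = {
    "model_a": [0.9, 0.2, 0.6, 0.4],
    "model_b": [0.7, 0.4, 0.8, 0.1],
    "model_c": [0.6, 0.1, 0.5, 0.3],
}

curve = subset_curve(forecasts, outcomes)
for k, score in curve.items():
    print(k, round(score, 4))
```

With n models there are 2^n - 1 non-empty subsets, so exhaustive enumeration is cheap for a handful of models but grows quickly; a larger pool would likely need to sample subsets instead.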
3. Why it works: independent mistakes
Averaging only helps when models err in different directions. If every model made the same mistake on the same market, the ensemble would inherit it. The heatmap below shows how correlated the forecast errors of each pair of models are: the greener the cell, the more independent the pair, and the more of their errors an ensemble can average away.
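One way such a pairwise heatmap could be computed is sketched below. It assumes a forecast error is the signed difference between probability and outcome and uses Pearson correlation; the arena may define either differently, and the names and data are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (needs non-constant inputs)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def error_correlations(forecasts, outcomes):
    """Correlation of signed forecast errors (p - y) for every pair of models."""
    errors = {
        name: [p - y for p, y in zip(probs, outcomes)]
        for name, probs in forecasts.items()
    }
    return {
        (a, b): pearson(errors[a], errors[b])
        for a, b in combinations(sorted(errors), 2)
    }

# Hypothetical toy data: three models, four resolved markets.
outcomes = [1, 0, 1, 0]
forecasts = {
    "model_a": [0.9, 0.2, 0.6, 0.4],
    "model_b": [0.7, 0.4, 0.8, 0.1],
    "model_c": [0.6, 0.1, 0.5, 0.3],
}

correlations = error_correlations(forecasts, outcomes)
for pair, rho in correlations.items():
    print(pair, round(rho, 2))
```

Values near 1 mean two models tend to miss in the same direction on the same markets, so pooling them adds little; values near 0 or below leave more error for averaging to cancel.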
The takeaway. Prediction markets aggregate many human (and bot) opinions into one price — a crowd. This arena asks whether a crowd of language models can do the same job, and whether their blind, independent forecasts pooled together rival the market that has money on the line. See the methodology for exactly how each number is computed.