About
Six LLMs make blind forecasts on real prediction markets — then we ask whether pooling them into an ensemble beats the crowd
Traditional LLM benchmarks leak into training data. Models can memorize answers to MMLU, HumanEval, and other static datasets, so scores drift further from genuine ability every time a benchmark is scraped into a training corpus.
Prediction markets test something that cannot be memorized: forecasting a future that has not happened yet. When a model estimates the probability that an event resolves YES next week, there is no answer key to recall — only reasoning, research, and calibration.
But this project asks a sharper question than “which model forecasts best?” It asks how much you gain by combining them. That is the entire point of the ensemble, and it only works if the individual forecasts are genuinely independent.
Inspired by Forecaster Arena.
- We pull short-horizon binary markets from Polymarket — mostly those resolving within about a week — so forecasts settle quickly and the leaderboard reflects real outcomes, not open bets.
- Each model is shown only the market's question and description and asked for one number: the probability of YES. Crucially it never sees the market price. A model that could peek at the price could just echo it, and an ensemble of price-echoers would teach us nothing.
- The market price is scored as its own forecaster, “the Crowd.” A market at 62¢ is a 62% forecast. The crowd pools the money and opinions of many humans and bots, so it is a strong, hard-to-beat baseline.
- The ensemble is simply the unweighted mean of the valid model probabilities for each market — no tuning, no weighting, zero extra API cost. If a model fails to return a usable forecast, it is left out of that market's average rather than coerced into a default.
- Rounds run automatically twice daily (10:00 and 22:00 UTC). Settlement checks for resolved markets every few hours and computes each forecaster's scores. A fresh weekly cohort opens every Monday.
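The ensemble rule described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function name `ensemble_probability`, the model keys, and the validity test (a number in [0, 1]) are assumptions made for the example.

```python
from statistics import fmean
from typing import Optional


def ensemble_probability(forecasts: dict[str, Optional[float]]) -> Optional[float]:
    """Unweighted mean of the valid model probabilities for one market.

    Assumption: a forecast counts as valid when it is a number in [0, 1].
    Anything else (None, NaN, out of range) is skipped, never replaced
    by a default value.
    """
    valid = [
        p for p in forecasts.values()
        if isinstance(p, (int, float))
        and not isinstance(p, bool)
        and 0.0 <= p <= 1.0  # NaN fails this comparison and is dropped
    ]
    # With no valid forecasts there is no ensemble forecast for this market.
    return fmean(valid) if valid else None


# Hypothetical round: two of the six models failed on this market.
market = {
    "deepseek": 0.70,
    "qwen": 0.60,
    "seed": None,     # timed out or returned an unparseable answer
    "gpt": 0.65,
    "gemini": 0.55,
    "mistral": 1.7,   # not a probability, so treated as a failure
}
print(ensemble_probability(market))
```

Here the mean is taken over the four valid forecasts only, so a failed model changes the denominator instead of dragging the average toward an arbitrary default such as 0.5.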
| Forecaster | Provider | Cost (in / out per 1M tokens) |
|---|---|---|
| 🔮DeepSeek V4 Flash | DeepSeek | $0.10 / $0.20 |
| 🐲Qwen3 235B | Alibaba | $0.071 / $0.10 |
| 🌱Seed 1.6 Flash | ByteDance | $0.075 / $0.30 |
| 🧠GPT-4.1 Mini | OpenAI | $0.40 / $1.60 |
| 💎Gemini 3.1 Flash Lite | Google | $0.25 / $1.50 |
| 🌀Mistral Small 3.2 | Mistral | $0.075 / $0.20 |
| 🎯Ensemble (mean) | Aggregate | $0.00 |
| 👥The Crowd | Polymarket | baseline |
We deliberately pick capable but inexpensive recent models, and lean on provider diversity (US, EU, and China labs). Diverse models tend to make uncorrelated mistakes — and uncorrelated errors are exactly what averaging can cancel out.
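A short worked equation shows why correlation matters. As a simplifying assumption (not a measured property of these models), suppose each of the $n$ forecasters has an error $e_i$ with the same variance $\sigma^2$ and the same pairwise correlation $\rho$. The variance of the averaged error is then:

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} e_i\right)
  = \frac{\sigma^2}{n} + \frac{n-1}{n}\,\rho\,\sigma^2
```

With $n = 6$, fully uncorrelated errors ($\rho = 0$) leave $\sigma^2/6$, while $\rho = 0.5$ leaves $\tfrac{7}{12}\sigma^2$ and $\rho = 1$ leaves the full $\sigma^2$. Under this model, averaging only removes the uncorrelated part of the error, which is the case for provider diversity and blind forecasts.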
- Brier score — mean squared error between the forecast and the outcome. 0 is perfect; 0.25 is a coin flip. Lower is better.
- Log loss — punishes confident mistakes far more harshly than Brier, with probabilities clamped away from 0 and 1.
- Calibration (ECE) — when a forecaster says 70%, does it happen about 70% of the time? Visualized on each profile's calibration chart.
- Skill vs. crowd — the headline number: crowd Brier minus the forecaster's Brier on the same resolved markets. Positive means it beat the market.
- Paper P&L — a clearly-labeled secondary view that Kelly-stakes each forecaster's edge over the crowd. It is a sanity check on the skill numbers, never the headline.
Every forecaster — the six models, the ensemble, and the crowd — is scored on exactly the same shared set of resolved markets, so the comparison is apples-to-apples.
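The metrics above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's scoring code: the clamp `eps=1e-6`, the ten equal-width calibration bins, and the Kelly formula for a binary contract are common choices that the text does not specify, and all function names are hypothetical. Inputs are assumed to be aligned lists over the same resolved markets, with outcomes coded as 1 for YES and 0 for NO.

```python
import math


def brier(probs, outcomes):
    """Mean squared error between forecasts and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=1e-6):
    """Mean negative log-likelihood; eps is an assumed clamp value."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite at 0 and 1
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(probs)


def ece(probs, outcomes, n_bins=10):
    """Expected calibration error over equal-width bins (assumed binning)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    error = 0.0
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            freq = sum(y for _, y in members) / len(members)
            error += len(members) / len(probs) * abs(mean_p - freq)
    return error


def skill_vs_crowd(model_probs, crowd_probs, outcomes):
    """Crowd Brier minus forecaster Brier; positive means the crowd was beaten."""
    return brier(crowd_probs, outcomes) - brier(model_probs, outcomes)


def kelly_fraction(p, price):
    """Side and bankroll fraction for a binary contract priced in (0, 1)."""
    if not 0.0 < price < 1.0 or p == price:
        return ("NONE", 0.0)
    if p > price:
        return ("YES", (p - price) / (1.0 - price))
    return ("NO", (price - p) / price)


# Made-up numbers for four resolved markets.
outcomes = [1, 0, 1, 0]
model = [0.8, 0.3, 0.6, 0.1]
crowd = [0.7, 0.4, 0.6, 0.2]
print(brier(model, outcomes), brier(crowd, outcomes))
print(skill_vs_crowd(model, crowd, outcomes))
print(kelly_fraction(0.70, 0.62))
```

A constant 50% forecast scores a Brier of exactly 0.25 on any set of outcomes, which is where the coin-flip reference point comes from.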
- Blind forecasts: hiding the market price is what makes the forecasts independent — and independence is what makes the ensemble meaningful.
- Failures are visible: a model that errors, times out, or returns an unparseable answer is marked failed and excluded from scoring, never coerced into a default probability. The leaderboard's reliability column is that valid-response rate.
- Plugins web search (not the `:online` suffix): models can research without silently swapping versions.
- `temperature: 0` for every model, so runs are reproducible.
- Full audit trail: every forecast stores its prompt, raw response, cost, and score in the database.
- Budget cap: a hard limit on API spend; rounds stop automatically when it is reached.
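The "failures are visible" rule can be illustrated with a strict parser. This is a hypothetical sketch: the accepted response format (a bare number, optionally followed by `%`) and the names `parse_probability` and `reliability` are assumptions, since the actual prompt and parsing logic are not described here.

```python
import re
from typing import Optional

# Assumed format: the whole response is one number, optionally with "%".
_PROBABILITY = re.compile(r"\s*(\d+(?:\.\d+)?)\s*(%)?\s*")


def parse_probability(raw: Optional[str]) -> Optional[float]:
    """Return a probability in [0, 1], or None when the response is unusable."""
    if not raw:
        return None  # error, timeout, or empty response
    match = _PROBABILITY.fullmatch(raw)
    if match is None:
        return None  # unparseable answer
    value = float(match.group(1))
    if match.group(2):
        value /= 100.0  # "62%" becomes 0.62
    # Out-of-range values are failures, never clipped into a default.
    return value if 0.0 <= value <= 1.0 else None


def reliability(parsed: list[Optional[float]]) -> float:
    """Valid-response rate: share of forecasts that parsed successfully."""
    return sum(p is not None for p in parsed) / len(parsed) if parsed else 0.0


responses = ["0.62", "62%", "probably yes", "", "1.7", None]
parsed = [parse_probability(r) for r in responses]
print(parsed)
print(reliability(parsed))
```

Returning `None` instead of a fallback such as 0.5 keeps failures out of the scores and lets the valid-response rate be reported separately.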
An open teaching project on forecast aggregation. Not financial advice. No real money is wagered.