Methodology
How blind forecasts are collected, scored against the crowd, and pooled into an ensemble
Each model is shown a market's question and description and asked for a single number: the probability that it resolves YES. Crucially, the model never sees the market price. This is what makes the comparison meaningful — a model that could see the price could just echo it, and an ensemble of price-echoers would tell us nothing.
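Because the forecasts are blind, they are opinions formed independently of the market price. That independence is the premise behind pooling them: independent errors can cancel, correlated errors cannot. Blindness to the price does not guarantee that the models are independent of each other, which is why the error correlations between models are measured directly.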
The market price is treated as its own forecaster, “the Crowd.” A market trading at 62¢ is a 62% forecast that the event happens. The crowd aggregates the opinions (and money) of many humans and bots, so it is a strong, hard-to-beat baseline.
Every model and the ensemble are scored on exactly the same resolved markets as the crowd, so skill vs. crowd — the difference in Brier score on that shared set — is an apples-to-apples measure of who forecast better.
Markets come from Polymarket's Gamma API. We deliberately pick short-horizon binary markets — roughly those resolving within seven days — and filter out illiquid or near-certain ones (prices are kept away from the 0/1 extremes). Short horizons mean forecasts resolve quickly, so the leaderboard reflects real, settled outcomes rather than open bets.
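The selection rule above can be sketched as a simple predicate. This is a minimal illustration, not the project's actual filter: the `Market` record, its field names, and every threshold (liquidity floor, price band) are assumptions made up for the example, and only the seven-day horizon comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Illustrative assumptions; only the ~7-day horizon is stated in the text.
MAX_HORIZON = timedelta(days=7)
MIN_LIQUIDITY_USD = 1_000.0
MIN_PRICE, MAX_PRICE = 0.05, 0.95


@dataclass
class Market:
    # Hypothetical record; field names are not taken from the Gamma API.
    question: str
    yes_price: float       # current YES price in dollars, between 0 and 1
    liquidity_usd: float
    end_time: datetime     # scheduled resolution time (timezone-aware)
    is_binary: bool


def is_eligible(m: Market, now: datetime) -> bool:
    """True if the market is binary, short-horizon, liquid, and not near-certain."""
    time_left = m.end_time - now
    return (
        m.is_binary
        and timedelta(0) < time_left <= MAX_HORIZON
        and m.liquidity_usd >= MIN_LIQUIDITY_USD
        and MIN_PRICE <= m.yes_price <= MAX_PRICE
    )
```

Keeping prices inside a band matters for scoring as well as selection: a market already at 98¢ leaves almost no room for any forecaster to differ from the crowd.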
Once a market resolves, each valid forecast is scored on several axes:
- Brier score — mean squared error between the forecast and the outcome. 0 is perfect; 0.25 is a coin flip. Lower is better.
- Log loss — penalizes confident mistakes far more harshly than Brier. Probabilities are clamped away from 0 and 1 so a single wrong “certainty” doesn't produce an infinite score.
- Expected calibration error (ECE) — bins forecasts by confidence and measures the average gap between stated confidence and actual hit rate. The calibration chart visualizes the same data.
- Brier decomposition — Brier = reliability − resolution + uncertainty. Reliability (lower is better) is calibration within each confidence bucket; resolution (higher is better) is how decisively a forecaster separates winners from losers; uncertainty is the irreducible base-rate variance, identical for everyone on the shared set.
- Skill vs. crowd — the headline number: crowd Brier minus the forecaster's Brier on shared markets. Positive means it beat the market.
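The metrics in this list can be written down in a few lines each. The sketch below is one plausible reading, not the project's scoring code: the clamp value `EPS`, the use of ten equal-width bins, and all function names are assumptions.

```python
import math

EPS = 1e-6    # assumed clamp; the text does not state the actual value
N_BINS = 10   # assumed number of equal-width confidence bins


def brier(probs, outcomes):
    """Mean squared error between forecasts and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=EPS):
    """Mean negative log-likelihood, with probabilities clamped to [eps, 1 - eps]."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(probs)


def _bins(probs, outcomes, n_bins=N_BINS):
    """Group (forecast, outcome) pairs into equal-width bins, dropping empty bins."""
    buckets = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        buckets[min(int(p * n_bins), n_bins - 1)].append((p, y))
    return [b for b in buckets if b]


def ece(probs, outcomes, n_bins=N_BINS):
    """Size-weighted average gap between mean forecast and hit rate per bin."""
    n = len(probs)
    gap = 0.0
    for b in _bins(probs, outcomes, n_bins):
        mean_p = sum(p for p, _ in b) / len(b)
        hit_rate = sum(y for _, y in b) / len(b)
        gap += len(b) / n * abs(mean_p - hit_rate)
    return gap


def brier_decomposition(probs, outcomes, n_bins=N_BINS):
    """Binned decomposition: returns (reliability, resolution, uncertainty)."""
    n = len(probs)
    base_rate = sum(outcomes) / n
    reliability = resolution = 0.0
    for b in _bins(probs, outcomes, n_bins):
        mean_p = sum(p for p, _ in b) / len(b)
        hit_rate = sum(y for _, y in b) / len(b)
        reliability += len(b) / n * (mean_p - hit_rate) ** 2
        resolution += len(b) / n * (hit_rate - base_rate) ** 2
    return reliability, resolution, base_rate * (1.0 - base_rate)


def skill_vs_crowd(model_probs, crowd_probs, outcomes):
    """Crowd Brier minus model Brier on the same markets; positive beats the crowd."""
    return brier(crowd_probs, outcomes) - brier(model_probs, outcomes)
```

One caveat on the decomposition: with binned forecasts, reliability − resolution + uncertainty reproduces the Brier score exactly only when all forecasts inside a bin are equal. Otherwise the identity holds approximately, and the gap shrinks as bins get narrower.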
The ensemble's forecast for a market is simply the mean of the valid model probabilities for that market — no weighting, no tuning. It costs nothing extra to compute. If a model failed to return a usable forecast, it is left out of that market's average rather than substituted with a default.
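As a minimal sketch of that rule, assuming failed forecasts are represented as `None` (a convention chosen for this example) and that a market with no valid forecasts yields no ensemble value at all:

```python
from typing import Dict, Optional


def ensemble_forecast(model_probs: Dict[str, Optional[float]]) -> Optional[float]:
    """Unweighted mean of the valid forecasts for one market.

    None marks a model that failed on this market; it is skipped, not
    replaced by a default. Returns None if no model produced a forecast
    (an assumption; the text does not cover that case).
    """
    valid = [p for p in model_probs.values() if p is not None]
    if not valid:
        return None
    return sum(valid) / len(valid)


# Model names are placeholders.
print(ensemble_forecast({"model_a": 0.60, "model_b": 0.80, "model_c": None}))
```

Skipping failures instead of substituting 0.5 keeps one flaky model from dragging the average toward the middle on every market it misses.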
The Lesson page probes three questions about this average:
- Does it beat its parts? Ensemble Brier vs. the average single model, the best single model, and the crowd.
- How many models do you need? For each ensemble size k, we average the Brier score of the mean forecast over every k-model subset, which shows the marginal value of each added model.
- Why does it work? The Pearson correlation of forecast errors between each pair of models. Low correlation means independent mistakes, which is exactly what averaging can cancel out.
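The last two questions can be sketched in code. This illustration assumes every model has a valid forecast on every market (real data would need the failure handling described elsewhere on this page), and it takes "forecast error" to mean the signed difference between forecast and outcome. All names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def brier(probs, outcomes):
    """Mean squared error between forecasts and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def mean_brier_by_size(forecasts, outcomes):
    """For each k, average the Brier of the mean-ensemble over all k-model subsets.

    forecasts: dict of model name -> list of probabilities aligned with outcomes.
    """
    names = sorted(forecasts)
    result = {}
    for k in range(1, len(names) + 1):
        scores = []
        for subset in combinations(names, k):
            pooled = [sum(forecasts[m][i] for m in subset) / k
                      for i in range(len(outcomes))]
            scores.append(brier(pooled, outcomes))
        result[k] = sum(scores) / len(scores)
    return result


def pearson(xs, ys):
    """Pearson correlation; None if either series has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def error_correlations(forecasts, outcomes):
    """Pearson correlation of signed errors (forecast - outcome) for each model pair."""
    errors = {m: [p - y for p, y in zip(ps, outcomes)]
              for m, ps in forecasts.items()}
    return {(a, b): pearson(errors[a], errors[b])
            for a, b in combinations(sorted(errors), 2)}
```

Because squared error is convex, the mean forecast's Brier score can never be worse than the average Brier score of the models it pools, so the curve from `mean_brier_by_size` is flat at worst. How steeply it falls depends on how weakly the errors are correlated. The number of subsets grows as 2^n, which is manageable for a handful of models but not for dozens.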
Every forecast row records whether the model returned a valid response. A model that errors, times out, or returns an unparseable answer is marked as a failure and excluded from scoring; it is never quietly coerced into a default probability. The leaderboard's “reliability” column is that valid-response rate (not to be confused with the reliability term of the Brier decomposition), so pipeline and model failures show up honestly instead of inflating or deflating the skill numbers.
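A sketch of what "valid" could mean in practice. The parsing rule here (the response must be a bare number between 0 and 1) and the use of `None` for calls that errored or timed out are assumptions for illustration; the project's actual parser may be more lenient.

```python
from typing import List, Optional


def parse_probability(raw: Optional[str]) -> Optional[float]:
    """Return a probability in [0, 1], or None if the response is unusable.

    raw is None when the call errored or timed out (hypothetical convention).
    """
    if raw is None:
        return None
    try:
        value = float(raw.strip())
    except ValueError:
        return None
    # Out-of-range values are failures; NaN also fails this check because
    # every comparison with NaN is False.
    if not 0.0 <= value <= 1.0:
        return None
    return value


def valid_response_rate(raw_responses: List[Optional[str]]) -> float:
    """Share of responses that parsed into a usable probability."""
    parsed = [parse_probability(r) for r in raw_responses]
    return sum(p is not None for p in parsed) / len(parsed)
```

Returning `None` instead of a fallback such as 0.5 is what keeps failures visible: a default would be scored like any other forecast and would blend pipeline problems into the skill numbers.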
As a clearly labeled secondary metric, we compute a paper P&L by simulating Kelly-staking each forecaster's edge over the crowd at the crowd's own odds. It answers “could you have made money acting on this disagreement?” The crowd scores exactly zero by construction, since it never disagrees with itself. This is a sanity check on the skill numbers, not the headline.
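One way such a simulation could work is sketched below. It assumes full Kelly sizing, a single bankroll compounded over markets in sequence, and no fees. None of these details are stated in the text, so treat the function names and the sizing rule as illustrative.

```python
def kelly_step(bankroll, forecast, price, outcome):
    """Bankroll after one full-Kelly bet placed at the crowd's price.

    forecast: the forecaster's P(YES); price: the crowd's YES price, strictly
    between 0 and 1; outcome: 1 if the market resolved YES, else 0.
    """
    if forecast > price:
        # Buy YES at `price`: Kelly fraction is (q - p) / (1 - p).
        stake = bankroll * (forecast - price) / (1.0 - price)
        if outcome == 1:
            return bankroll + stake * (1.0 - price) / price
        return bankroll - stake
    if forecast < price:
        # Buy NO at `1 - price`: Kelly fraction is (p - q) / p.
        stake = bankroll * (price - forecast) / price
        if outcome == 0:
            return bankroll + stake * price / (1.0 - price)
        return bankroll - stake
    return bankroll  # no disagreement with the crowd, so no bet


def paper_pnl(forecasts, prices, outcomes, start=1.0):
    """Profit or loss after betting every market in order, starting from `start`."""
    bankroll = start
    for q, p, y in zip(forecasts, prices, outcomes):
        bankroll = kelly_step(bankroll, q, p, y)
    return bankroll - start


# A forecaster at 80% against a 50¢ market stakes 60% of the bankroll on YES.
print(kelly_step(1.0, 0.8, 0.5, 1))
```

Passing the crowd's own prices as the forecasts gives a P&L of exactly zero, matching the "zero by construction" property. Full Kelly is aggressive: a forecast of 0 or 1 that turns out wrong wipes out the whole bankroll, which is one reason practical variants stake only a fraction of the Kelly amount.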
Competition is grouped into weekly cohorts (ISO week ids like 2026-W22) so results can be compared across different market conditions. Rounds collect forecasts twice daily; settlement checks for resolved markets every few hours; a new cohort opens each Monday.
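Cohort ids in this format can be derived from a date with the standard library. The function name `cohort_id` is made up for this sketch; `date.isocalendar()` is the standard-library call that returns the ISO year, week, and weekday.

```python
from datetime import date


def cohort_id(day: date) -> str:
    """ISO week id such as '2026-W22' for the week containing `day`."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


print(cohort_id(date(2026, 5, 25)))   # Monday    -> 2026-W22
print(cohort_id(date(2026, 5, 31)))   # Sunday    -> 2026-W22
print(cohort_id(date(2026, 6, 1)))    # next Monday -> 2026-W23
```

ISO weeks start on Monday, which lines up with a new cohort opening each Monday. The ISO year can differ from the calendar year around New Year: 29 December 2025 belongs to 2026-W01, so the id should be built from the ISO year rather than `day.year`.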
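Market data comes from the Polymarket Gamma API, and model calls go through OpenRouter at temperature 0 to make outputs as repeatable as possible (temperature 0 reduces run-to-run variation but does not guarantee identical responses). Every forecast, including its prompt, raw response, cost, and score, is stored in a Turso (libSQL) database for full auditability.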
- Sample sizes are small, especially early on. Brier differences of a few thousandths are not statistically meaningful yet — treat the leaderboard as indicative.
- Web search is available to models via OpenRouter's plugins API, but search quality varies by provider and run.
- The paper P&L ignores execution costs, slippage, and liquidity. It is illustrative only.
- The ensemble is an unweighted mean. Smarter aggregation (weighting by past skill, extremizing) could do better and is not attempted here.
An open teaching project on forecast aggregation. All code, data, and methodology are designed for transparency and reproducibility.