Keywords: forecasting, markets, trading, LLM, evaluation, eval, consistency, robustness
TL;DR: It is difficult to evaluate AI forecasters; we run market-based consistency evals on recent LLM forecasters and find substantial inconsistency.
Abstract: Forecasting is a task that is difficult to evaluate: the ground truth can only be known in the future. Recent work shows LLM forecasters rapidly approaching human-level performance, which raises the question: how can we benchmark and evaluate these forecasters *instantaneously*? Following the consistency-check framework, we measure forecasting performance on a given topic by how consistent the predictions on different logically related questions are. Our main consistency metric is based on arbitrage: for example, if a forecasting AI predicts a 60% probability for both the Democratic and Republican parties to win the 2024 US presidential election, an arbitrageur could trade against the forecaster's predictions and make a guaranteed profit. We build an automated evaluation system: starting from the instruction "query the forecaster's predictions on the topic of X," our evaluation system generates a set of base questions, instantiates the consistency checks from these questions, elicits the predictions of the forecaster, and measures the consistency of the predictions. We conclude by discussing possible applications of our work in steering and evaluating superhuman AI oracle systems.
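A minimal sketch (not from the paper; names are illustrative) of the arbitrage-based consistency check for a set of mutually exclusive, exhaustive outcomes, such as the election example above: if the forecaster's probabilities do not sum to one, an arbitrageur trading at those prices can lock in a riskless profit.

```python
# Sketch (not from the paper): arbitrage-based consistency check for
# mutually exclusive, exhaustive outcomes. Names are illustrative.

def arbitrage_violation(probs: list[float]) -> float:
    """Guaranteed per-unit profit an arbitrageur can lock in by trading
    against the forecaster, assuming each contract pays 1 if its outcome
    occurs and the forecaster's probability is taken as the price.

    If the probabilities sum to s > 1, selling one contract on each outcome
    collects s and pays out exactly 1, netting s - 1. If s < 1, buying one
    of each costs s and pays 1, netting 1 - s. A consistent forecaster
    (s == 1) admits no arbitrage.
    """
    s = sum(probs)
    return abs(s - 1.0)

# Example from the abstract: 60% for each major party to win the election.
print(arbitrage_violation([0.6, 0.6]))  # 0.2 guaranteed profit per unit stake
```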
Submission Number: 39