Keywords: forecasting, evaluation, LLMs, criticism, leakage, data
TL;DR: Forecasting evals are hard to run without information leakage, and hard to extrapolate real-world performance from.
Abstract: Benchmarking Large Language Models (LLMs) on their ability to forecast world events is a promising way to evaluate whether they truly possess effective world models. Recent works have claimed that LLMs achieve human-level forecasting performance. In this position paper, we argue that **evaluating LLM forecasters presents unique challenges beyond those faced in standard LLM evaluations, raising concerns about the trustworthiness of current and future performance claims.** We identify two broad categories of challenges: (1) difficulty in trusting evaluation results due to temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting ability. Through systematic analysis of these issues and concrete examples from prior work, we demonstrate how evaluation flaws can lead to overly optimistic assessments of LLM forecasting capabilities.
Submission Number: 24