Keywords: scientific idea judgment evaluation, semi-verifiable benchmarking, scalable evaluation via public metadata, agent evaluation
Abstract: Large language models are increasingly being used to assess and forecast research ideas, yet we lack scalable ways to evaluate the quality of models’ judgments about these scientific ideas. Towards this goal, we introduce \textbf{\gls{pot}}, a semi-verifiable benchmarking framework that links scientific idea judgments to downstream signals that become observable later (e.g., citations and shifts in researchers’ agendas). \gls{pot} freezes a pre-cutoff snapshot of evidence in an offline sandbox and asks models to forecast post-cutoff outcomes, enabling verifiable evaluation once ground truth arrives, scalable benchmarking without exhaustive expert annotation, and analysis of human–model misalignment against signals such as peer-review awards. In addition, \gls{pot} provides a controlled testbed for agent-based judgment of scientific ideas, comparing tool-using agents to non-agent baselines under prompt ablations and budget scaling. Across 30K+ instances spanning four benchmark domains, we find that higher interaction budgets generally improve agentic performance relative to non-agent baselines, while the benefit of tool use is strongly task dependent. By combining time-partitioned, future-verifiable targets with an offline sandbox for tool use, \gls{pot} supports scalable evaluation of agents on future-facing scientific idea judgment tasks.
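To make the time-partitioned, semi-verifiable protocol described in the abstract concrete, the following is a minimal Python sketch. All names, fields, and the citation-threshold target are illustrative assumptions rather than the benchmark's actual schema: the model is shown only the frozen pre-cutoff snapshot of evidence, and its forecast is scored once the post-cutoff signal becomes observable.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical illustration of time-partitioned, semi-verifiable evaluation:
# the model only sees evidence dated at or before the cutoff, and its forecast
# is scored later, once a post-cutoff outcome (e.g., a citation count) is known.

@dataclass
class Evidence:
    text: str
    timestamp: date

@dataclass
class Instance:
    idea: str                    # the scientific idea being judged
    evidence: list[Evidence]     # all evidence, pre- and post-cutoff
    cutoff: date                 # snapshot date frozen in the offline sandbox
    outcome: int | None = None   # post-cutoff signal, filled in when it arrives

def frozen_snapshot(inst: Instance) -> list[Evidence]:
    """Return only the evidence observable at the cutoff (the sandbox view)."""
    return [e for e in inst.evidence if e.timestamp <= inst.cutoff]

def evaluate(inst: Instance, forecast: float, threshold: int = 10) -> float | None:
    """Score a forecast against the post-cutoff outcome once it is observable.

    The (hypothetical) target here is binary: did the idea exceed `threshold`
    citations after the cutoff?  Returns None while ground truth is pending.
    """
    if inst.outcome is None:
        return None  # semi-verifiable: not yet scorable
    target = float(inst.outcome >= threshold)
    return 1.0 - abs(forecast - target)  # simple accuracy-style score
```

Usage sketch: an agent or non-agent baseline receives only `frozen_snapshot(inst)` together with the idea text, produces a `forecast` in [0, 1], and `evaluate()` is run after the post-cutoff outcome window closes.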
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, language resources
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 6038