Keywords: scientific idea judgment evaluation, semi-verifiable benchmarking, scalable evaluation via public metadata, agent evaluation
Abstract: Large language models are increasingly being used to assess and forecast research ideas, yet we lack scalable ways to evaluate the quality of models’ judgments about these scientific ideas. Towards this goal, we introduce \textbf{\gls{pot}}, a semi-verifiable benchmarking framework that links scientific idea judgments to downstream signals that become observable later (e.g., citations and shifts in researchers’ agendas). \gls{pot} freezes a pre-cutoff snapshot of evidence in an offline sandbox and asks models to forecast post-cutoff outcomes, enabling verifiable evaluation once ground truth arrives, scalable benchmarking without exhaustive expert annotation, and analysis of human–model misalignment against signals such as peer-review awards. In addition, \gls{pot} provides a controlled testbed for agent-based judgment of scientific ideas, comparing tool-using agents to non-agent baselines under prompt ablations and budget scaling. Across 30K+ instances spanning four benchmark domains, we find that higher interaction budgets generally improve agentic performance relative to non-agent baselines, while the benefit of tool use is strongly task dependent. By combining time-partitioned, future-verifiable targets with an offline sandbox for tool use, \gls{pot} supports scalable evaluation of agents on future-facing scientific idea judgment tasks.
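To make the time-partitioned, semi-verifiable protocol described in the abstract concrete, the following is a minimal Python sketch. All names, fields, and the citation-threshold target are illustrative assumptions rather than the benchmark's actual schema: the model is shown only the frozen pre-cutoff snapshot of evidence, and its forecast is scored once the post-cutoff signal becomes observable.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical illustration of time-partitioned, semi-verifiable evaluation:
# the model only sees evidence dated at or before the cutoff, and its forecast
# is scored later, once a post-cutoff outcome (e.g., a citation count) is known.

@dataclass
class Evidence:
    text: str
    timestamp: date

@dataclass
class Instance:
    idea: str                    # the scientific idea being judged
    evidence: list[Evidence]     # all evidence, pre- and post-cutoff
    cutoff: date                 # snapshot date frozen in the offline sandbox
    outcome: int | None = None   # post-cutoff signal, filled in when it arrives

def frozen_snapshot(inst: Instance) -> list[Evidence]:
    """Return only the evidence observable at the cutoff (the sandbox view)."""
    return [e for e in inst.evidence if e.timestamp <= inst.cutoff]

def evaluate(inst: Instance, forecast: float, threshold: int = 10) -> float | None:
    """Score a forecast against the post-cutoff outcome once it is observable.

    The (hypothetical) target here is binary: did the idea exceed `threshold`
    citations after the cutoff?  Returns None while ground truth is pending.
    """
    if inst.outcome is None:
        return None  # semi-verifiable: not yet scorable
    target = float(inst.outcome >= threshold)
    return 1.0 - abs(forecast - target)  # simple accuracy-style score
```

Usage sketch: an agent or non-agent baseline receives only `frozen_snapshot(inst)` together with the idea text, produces a `forecast` in [0, 1], and `evaluate()` is run after the post-cutoff outcome window closes.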
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, language resources
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 6038