Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation

Published: 29 Apr 2026, Last Modified: 13 May 2026 · Eval Eval @ ACL 2026 Poster · CC BY 4.0
Keywords: AI for Scientific Discovery, Automated Research Agents, Idea Evaluation, Objective Evaluation, Verifiable Reward Model, Empirical Verification, Scientific Benchmarking
TL;DR: We train efficient reward models to predict which of two research ideas will achieve better empirical results, without running any experiments. Using benchmark leaderboards, we convert SOTA progress into pairwise comparisons. The resulting models outperform GPT-5 by over 15%.
Abstract: As language models accelerate scientific research by automating hypothesis generation, a critical evaluation bottleneck emerges. Current evaluation paradigms rely heavily on language-model judgments of subjective criteria such as "excitement" or "novelty", which often fail to correlate with practical, empirical success. We study \emph{comparative empirical forecasting}: given a benchmark-specific research goal and two candidate ideas, predict which will achieve better benchmark performance. We construct a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode. While off-the-shelf 8B-parameter models struggle (30\% accuracy), supervised fine-tuning (SFT) dramatically boosts performance to 77.1\%, outperforming GPT-5 (61.1\%). By framing evaluation as a reasoning task via Reinforcement Learning with Verifiable Rewards (RLVR), we train models to discover latent reasoning paths, achieving 71.35\% accuracy with interpretable justifications. Through additional robustness and out-of-distribution evaluations, we show that the model is robust to surface-level heuristics and transfers well both to a cross-domain test set and to an independently constructed test set. Our results demonstrate that compute-efficient small language models can serve as robust, objective evaluators, offering a scalable and practical evaluation framework for AI-assisted scientific discovery.
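To make the abstract's pipeline concrete, below is a minimal sketch, in Python, of how leaderboard entries could be converted into labeled idea pairs and how a binary verifiable reward might be computed for RLVR-style training. All field and function names here are hypothetical illustrations, not the authors' released code or exact method.

```python
# Hypothetical sketch: leaderboard -> pairwise comparisons, plus a verifiable reward.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Entry:
    idea: str      # research idea / method description from a leaderboard entry
    score: float   # reported benchmark metric (assumed higher is better)

def make_pairs(goal: str, entries: list[Entry], min_gap: float = 0.0) -> list[dict]:
    """Turn one benchmark leaderboard into (goal, idea_a, idea_b, label) examples.

    label is "A" if idea_a achieved the better score, else "B". Pairs whose
    score gap is below `min_gap` are skipped to reduce near-tie noise.
    """
    pairs = []
    for a, b in combinations(entries, 2):
        if abs(a.score - b.score) < min_gap:
            continue
        label = "A" if a.score > b.score else "B"
        pairs.append({"goal": goal, "idea_a": a.idea, "idea_b": b.idea, "label": label})
    return pairs

def verifiable_reward(model_answer: str, label: str) -> float:
    """Binary RLVR reward: 1.0 if the model's final choice matches the
    leaderboard outcome, 0.0 otherwise."""
    return 1.0 if model_answer.strip().upper().startswith(label) else 0.0
```

Because the ground-truth label comes directly from published benchmark scores, the reward is objectively checkable, which is what distinguishes this setup from subjective "novelty"- or "excitement"-style LLM judging.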
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Research Paper
Archival Status: Non-archival
Submission Number: 94