Track: Track 1: Technical Foundations for a Post-AGI World
Keywords: Automated Scientific Discovery, Comparative Empirical Forecasting, Research Idea Evaluation, Large Language Models (LLMs), Small Language Models (SLMs), Reinforcement Learning, Interpretable Reasoning.
TL;DR: We study comparative empirical forecasting to filter research ideas. Fine-tuned 8B models predict which of two ideas will succeed with 77.1% accuracy (beating GPT-5) and, trained with RL, articulate interpretable reasoning, enabling scalable oversight for automated discovery.
Abstract: As potential AGI-level systems begin to reshape science, hypothesis generation represents the most immediate shift, with language models (LMs) generating research ideas at a scale far exceeding our capacity to validate them through experimentation.
This creates the risk of wastefully allocating resources to ideas that fail to translate into real-world gains. As a step towards filtering out all but the most promising ideas, we study comparative empirical forecasting: given a research goal and two candidate ideas, predict which one will achieve better empirical performance \emph{before} any experiments are run. We construct a dataset of 11,488 idea pairs grounded in objective benchmark outcomes from PapersWithCode, and find that 8B-parameter models fine-tuned on this data achieve 77.1\% accuracy, outperforming frontier models such as GPT-5 (61.1\%). Further, we attempt to establish interpretable grounds for \emph{trust} by training these models to articulate their reasoning via Reinforcement Learning with Verifiable Rewards (RLVR), achieving 71.35\% accuracy. Such models can serve as transparent filters that enable oversight as the scientific process continues to evolve rapidly.
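For concreteness, the comparative forecasting task reduces to pairwise classification over idea pairs. The sketch below is a minimal illustration of that loop, not the paper's pipeline: the prompt template, the `predict_winner` stub, and the toy pairs are all hypothetical stand-ins, and a real evaluation would call the fine-tuned 8B model and score against the PapersWithCode-derived labels.

```python
import random

def build_prompt(goal: str, idea_a: str, idea_b: str) -> str:
    """Format one comparative forecasting query: given a research goal
    and two candidate ideas, ask which will score higher empirically."""
    return (
        f"Research goal: {goal}\n"
        f"Idea A: {idea_a}\n"
        f"Idea B: {idea_b}\n"
        "Which idea will achieve better benchmark performance? Answer A or B."
    )

def predict_winner(prompt: str) -> str:
    # Stand-in for a call to a fine-tuned forecasting model; replace
    # with a real inference client. Here it guesses at random, which
    # corresponds to the 50% chance baseline for this task.
    return random.choice(["A", "B"])

def pairwise_accuracy(pairs: list[dict]) -> float:
    """Fraction of pairs where the predicted winner matches the idea
    that empirically performed better (the benchmark-grounded label)."""
    correct = 0
    for p in pairs:
        pred = predict_winner(build_prompt(p["goal"], p["idea_a"], p["idea_b"]))
        correct += pred == p["label"]
    return correct / len(pairs)

if __name__ == "__main__":
    # Toy, made-up pairs purely to exercise the loop.
    toy_pairs = [
        {"goal": "Improve ImageNet top-1 accuracy",
         "idea_a": "Add stronger data augmentation",
         "idea_b": "Train for fewer epochs", "label": "A"},
        {"goal": "Improve WMT14 En-De translation quality",
         "idea_a": "Shrink the vocabulary",
         "idea_b": "Use back-translation", "label": "B"},
    ]
    print(f"pairwise accuracy: {pairwise_accuracy(toy_pairs):.2f}")
```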
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and URLs.
Submission Number: 20