Backtest to the Future: Can Large Language Models Generate Publishable AI Research Ideas?

Agents4Science 2025 Conference Submission 304 Authors

16 Sept 2025 (modified: 08 Oct 2025). Submitted to Agents4Science. License: CC BY 4.0

Keywords: LLM; AI; Research Idea Generation; Backtesting

TL;DR: Even with training frozen at least half a year before ICLR 2025, LLMs generate ideas that align with later ICLR papers, as shown by a 700-idea backtest under a reproducible method.

Abstract: Large language models (LLMs) increasingly assist with research ideation, yet systematic evidence of their capabilities is scarce. We introduce the first standardized backtesting protocol that retrospectively evaluates AI-generated ideas by semantically matching them to post-cutoff human work. Seven contemporary LLMs with training cutoffs before 2025 produced 700 AI research ideas, which we compared, using OpenAI’s text-embedding-3-small embeddings, to 11,672 ICLR 2025 OpenReview abstracts. The results show strong alignment (89.7% of ideas closely match human research), but the most similar ideas receive lower human quality assessments, yielding a modest negative correlation (|r| < 0.1). This exploitation–exploration split suggests that current LLMs excel at plausible, incremental directions grounded in existing literature while struggling with the creative divergence typical of breakthrough work. Our protocol offers a reproducible benchmark and practical guidance for human–AI collaboration, positioning LLMs as systematic explorers of established trajectories while reserving conceptual leaps for human researchers.
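
The matching step described in the abstract can be illustrated with a minimal sketch. It assumes that ideas and abstracts have already been embedded as vectors, that similarity is measured by cosine similarity, and that an idea counts as "closely matching" when its best score clears a cutoff. The abstract states none of these details, so the function names, the toy vectors, and the 0.7 threshold are all hypothetical; the toy vectors stand in for text-embedding-3-small outputs.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(idea_vec: Sequence[float],
               abstract_vecs: List[Sequence[float]]) -> Tuple[int, float]:
    """Return (index, score) of the abstract most similar to one idea."""
    scores = [cosine_similarity(idea_vec, v) for v in abstract_vecs]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]


def alignment_rate(idea_vecs: List[Sequence[float]],
                   abstract_vecs: List[Sequence[float]],
                   threshold: float) -> float:
    """Fraction of ideas whose best-matching abstract scores >= threshold."""
    matched = sum(
        1 for v in idea_vecs if best_match(v, abstract_vecs)[1] >= threshold
    )
    return matched / len(idea_vecs)


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical data).
ideas = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
abstracts = [[1.0, 0.1, 0.0], [0.0, 0.0, 1.0]]

# The 0.7 cutoff is an illustrative assumption, not a value from the paper.
rate = alignment_rate(ideas, abstracts, threshold=0.7)
```

In this toy setup two of the three ideas clear the cutoff. A real backtest would replace the toy vectors with embeddings of the 700 generated ideas and the 11,672 abstracts, and would need the paper's own similarity measure and threshold.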

Submission Number: 304
