Keywords: LLM; AI; Research Idea Generation; Backtesting
TL;DR: Even with training frozen at least half a year before ICLR 2025, LLMs generate ideas that align with later ICLR papers, as shown in a reproducible 700-idea backtest.
Abstract: Large language models (LLMs) increasingly assist with research ideation, yet systematic evidence of their capabilities is scarce. We introduce the first standardized backtesting protocol that retrospectively evaluates AI-generated ideas by semantically matching them to post-cutoff human work. Seven contemporary LLMs with training cutoffs before 2025 produced 700 AI research ideas, which we compared to 11,672 ICLR 2025 OpenReview abstracts using OpenAI's text-embedding-3-small embeddings. The results show strong alignment (89.7% of ideas closely match human research), but the ideas most similar to human work tend to receive lower human quality assessments, yielding a weak negative correlation (|r| < 0.1). This exploitation–exploration split suggests that current LLMs excel at plausible, incremental directions grounded in existing literature while struggling with the creative divergence typical of breakthrough work. Our protocol offers a reproducible benchmark and practical guidance for human–AI collaboration, positioning LLMs as systematic explorers of established trajectories while reserving conceptual leaps for human researchers.
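The matching step described in the abstract can be illustrated with a minimal sketch. It assumes that each idea and each abstract has already been converted to an embedding vector (the abstract names text-embedding-3-small; the toy vectors here are invented), that similarity is cosine similarity, and that "closely match" means exceeding some threshold. The function names and the threshold value of 0.8 are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_u * norm_v)


def best_matches(idea_vecs, abstract_vecs):
    """For each idea, return (index of nearest abstract, its similarity)."""
    results = []
    for idea in idea_vecs:
        sims = [cosine(idea, abstract) for abstract in abstract_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        results.append((best, sims[best]))
    return results


def match_rate(results, threshold):
    """Fraction of ideas whose best similarity reaches the threshold."""
    if not results:
        return 0.0
    return sum(1 for _, sim in results if sim >= threshold) / len(results)


# Toy 3-dimensional vectors standing in for real embeddings (assumption).
ideas = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
abstracts = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

matches = best_matches(ideas, abstracts)
rate = match_rate(matches, threshold=0.8)  # hypothetical threshold
print(matches, rate)
```

In this toy run the first idea clears the threshold against its nearest abstract and the second does not, so the match rate is 0.5. At the scale reported in the abstract (700 ideas against 11,672 abstracts), a vectorized matrix product would be the more practical choice than the nested loops shown here.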
Submission Number: 304