Do Language Models Provide Useful Priors for Autonomous Scientific Search? A Calibration Study

Published: 01 Mar 2026, Last Modified: 01 Mar 2026
Venue: P-AGI
License: CC BY 4.0
Track: Track 1: Technical Foundations for a Post-AGI World
Keywords: autonomous scientific discovery, large language models, black-box optimization, Bayesian optimization, hybrid search, inductive bias, post-AGI, calibration study
TL;DR: LLMs outperform random search but lag behind Bayesian optimization for continuous scientific search, while hybrid LLM+TPE performs best yet remains high-variance and unreliable.
Abstract: As large language models (LLMs) are increasingly positioned as potential autonomous scientific agents, a central open question is whether they can reliably generate hypotheses that drive iterative discovery. We study this question in a minimal autonomous search loop where a model proposes candidate solutions, receives scalar objective feedback, and iteratively refines proposals. Using continuous black-box optimization as a controlled proxy for scientific search, we compare random search, Tree-structured Parzen Estimator (TPE), LLM-driven proposal generation, and a hybrid TPE+LLM scheme under equal evaluation budgets. Across five independent seeds on a shifted ellipsoid benchmark, we find that LLM-only search performs better than random sampling but substantially worse than TPE, while hybridization achieves the best mean final performance. However, both LLM and hybrid methods exhibit high variance across seeds, indicating limited reliability. These results suggest that current LLMs do not encode sufficiently strong inductive biases for autonomous discovery and must be coupled with explicit optimization machinery in post-AGI scientific systems.
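To make the experimental setup concrete, the loop described above can be sketched in miniature: a random-search baseline proposing candidates on a shifted ellipsoid objective and keeping the best under a fixed evaluation budget. This is a hypothetical instance for illustration only; the paper's exact shift, dimensionality, conditioning, and budget are not specified on this page, and the TPE, LLM, and hybrid proposers would plug in where `random_search` samples.

```python
import random

def shifted_ellipsoid(x, shift=1.5):
    # Ellipsoid with per-dimension scaling; minimum of 0 at x = shift.
    # (Hypothetical parameters; the benchmark's actual shift is not given here.)
    return sum((i + 1) * (xi - shift) ** 2 for i, xi in enumerate(x))

def random_search(objective, dim=5, budget=200, low=-5.0, high=5.0, seed=0):
    # Baseline proposer: sample uniformly in the box, receive scalar feedback,
    # and track the best candidate found within the evaluation budget.
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(budget):
        x = [rng.uniform(low, high) for _ in range(dim)]
        f = objective(x)
        if f < best_f:
            best_x, best_f = x, f
    return best_x, best_f

best_x, best_f = random_search(shifted_ellipsoid)
```

A model-based proposer (TPE, an LLM prompted with the evaluation history, or the hybrid) replaces only the uniform sampling line, which is what makes the equal-budget comparison across methods clean.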
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Akarsh_Jha1
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 23