On the Effects of Reasoning Effort and Prompt-Based Diversification on Scientific Ideation Diversity

Published: 30 May 2026, Last Modified: 03 Jun 2026 | ICML2026-AI4Science Poster | CC BY 4.0
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: scientific ideation, large language models, test-time scaling, diversity, AI scientist
Abstract: Frontier large language models (LLMs) performing extended chain-of-thought reasoning have advanced closed-ended task performance, motivating interest in AI Scientist systems that automate stages of the research pipeline. In such systems, scientific ideation matters as the most upstream stage, where the diversity of generated research ideas bounds the downstream search space. While reasoning effort improves closed-ended task accuracy, its effect on open-ended scientific ideation diversity has not been systematically measured. We generate over 300,000 ideas across three frontier LLMs (Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro), three reasoning-effort levels (`low`, `medium`, `high`), and the LiveIdeaBench keyword set. We evaluate diversity with lexical metrics, embedding-based metrics across three embedders, and a pairwise LLM-as-a-Judge rubric across two judge models, yielding over 1,500,000 pairwise judgments. For comparison, we additionally evaluate two prompt-based diversification methods—Verbalized Sampling and String Seed of Thought—at the `low` and `high` reasoning-effort levels. Across these analyses, three main findings emerge. (1) Increasing reasoning effort raises within-keyword embedding pair distance by 13–36% from `low` to `high`, with no detectable shift in LLM-judged originality, feasibility, and clarity ratings. (2) Verbalized Sampling at `low` effort matches or exceeds default-prompt `high`-effort embedding diversity on quartile-defined keyword subsets, using 80–100% fewer reasoning tokens per idea, with no detectable decline in LLM-judged quality ratings. (3) In embedding space, idea distributions produced by varying reasoning effort and by varying prompt are nearest-neighbor distinguishable across all model–embedder–keyword-subset combinations. 
These findings are consistent across embedders and judges, providing a large-scale empirical map of how reasoning effort shifts open-ended scientific-ideation diversity, and surface concrete directions for future work—most centrally, downstream evaluation of whether the observed shifts correspond to scientifically meaningful differences—toward the design of AI Scientist systems.
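As a rough illustration of finding (1), the sketch below computes a within-keyword embedding pair distance and its relative change between two effort levels. It assumes the metric is the mean pairwise cosine distance over all idea embeddings generated for one keyword; the abstract does not state the exact definition, so this is one plausible reading. The function names and the toy 2-D vectors are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(embeddings):
    """Average cosine distance over all unordered pairs of embeddings.

    Assumption: this stands in for the "within-keyword embedding pair
    distance"; the paper's actual definition may differ.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


def relative_change(low, high):
    """Fractional change from the low-effort value to the high-effort value."""
    return (high - low) / low


# Made-up 2-D vectors standing in for idea embeddings of a single keyword.
low_effort = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
high_effort = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

d_low = mean_pairwise_distance(low_effort)
d_high = mean_pairwise_distance(high_effort)
change = relative_change(d_low, d_high)
```

In practice the vectors would come from a sentence embedder, and the per-keyword distances would be aggregated across keywords before comparing effort levels. Under this reading, a reported 13–36% increase corresponds to `relative_change` values between 0.13 and 0.36.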
Submission Number: 91