Keywords: Large Language Models, Research Idea Generation, Idea Evaluation Benchmark, Pairwise Comparison
Abstract: With the rapid advancement of large language models (LLMs), automated research idea generation has attracted growing interest. Recent approaches enable LLMs to retrieve relevant literature and propose novel ideas across diverse scientific domains. However, existing evaluation practices remain fragmented and lack unified standards, often relying on direct LLM-based scoring, which limits reliability and consistency across heterogeneous idea sets. To address these challenges, we propose LigBench, an automated benchmark for fine-grained, reliable, and human-aligned assessment of AI-generated research ideas. LigBench ensures consistent evaluation by combining structured idea formalization, large-scale reference data, and iterative pairwise comparison. We also introduce PAIR-IQ, a dataset curated from real academic papers and annotated with debiased OpenReview scores, which supports the training of pairwise judgment models and serves as a reference for objective comparative evaluation. Experiments show that LigBench delivers stable, interpretable, and consistent evaluations and improves agreement with expert judgments, while models trained on PAIR-IQ achieve higher ranking accuracy and robustness, together establishing a scalable benchmark for objective research idea assessment.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Generation, Language Modeling, NLP Applications, Resources and Evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 9622