Keywords: Large Language Models, Research Idea Generation, Idea Evaluation Benchmark, Pairwise Comparison
Abstract: With the rapid advancement of large language models (LLMs), automated research idea generation has attracted growing interest. Recent approaches enable LLMs to retrieve relevant literature and propose novel ideas across diverse scientific domains. However, existing evaluation practices remain fragmented and lack unified standards, often relying on direct LLM-based scoring, which limits reliability and consistency across heterogeneous idea sets. To address these challenges, we propose LigBench, an automated benchmark for fine-grained, reliable, and human-aligned assessment of AI-generated research ideas. LigBench ensures consistent evaluation by combining structured idea formalization, large-scale reference data, and iterative pairwise comparison. We also introduce PAIR-IQ, a dataset curated from real academic papers and annotated with debiased OpenReview scores, which supports the training of pairwise judgment models and serves as a reference for objective comparative evaluation. Experiments show that LigBench delivers stable, interpretable, and consistent evaluations and improves agreement with expert judgments, while models trained on PAIR-IQ achieve higher ranking accuracy and robustness, together establishing a scalable benchmark for objective research idea assessment.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Generation, Language Modeling, NLP Applications, Resources and Evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 9622