Keywords: critical question generation, benchmark, reward modeling, dataset, human preference alignment
Abstract: Peer review relies on substantive, evidence-based questions, but existing LLM-based approaches often generate surface-level queries. We find that LLM-generated questions take over 50\% of their tokens from a paper’s first page, while human reviewers draw on the full text. Human questions are also more insightful, showing effort and grounding, whereas LLM questions mostly reflect surface style. To address this, we extract 151k candidate questions from ICLR 2024 reviews and distill them through a multi-stage filtering pipeline into Probe-15K, a set of 15.5k high-quality questions. From this, we create ProbeVote-500, where human annotators score questions along effort, evidence, and grounding. Using these labels, we train IntelliReward, a reward model built from a frozen autoregressive LLM with trainable multi-head transformer layers over the final 50 token states. This architecture outperforms API-based SFT fine-tuned baselines (Gemini 2.5 Flash, GPT-4.1) for reward modeling. Applying DAPO with IntelliReward, we train IntelliAsk, a question-generation model aligned with human preferences and substantially stronger than existing fine-tuned review models. Finally, by releasing Probe-15K, ProbeVote-500, and IntelliReward, we provide an automatic evaluation benchmark for reviewer questions that measures effort, evidence, and grounding.
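For concreteness, the reward-model design described above (a frozen autoregressive backbone with a small trainable transformer head over the final 50 token states, scoring effort, evidence, and grounding) could look roughly like the following minimal sketch. This is an illustrative assumption, not the paper's exact implementation: the `gpt2` backbone, head depth, and mean pooling are stand-ins chosen for a runnable example.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer


class RewardHead(nn.Module):
    """Trainable transformer head over the last K hidden states of a frozen LLM.

    Hypothetical sketch: K=50 and the three scoring heads (effort, evidence,
    grounding) follow the abstract; layer sizes and pooling are guesses.
    """

    def __init__(self, hidden_size: int, k_last: int = 50, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.k_last = k_last
        # One scalar score per preference dimension.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden_size, 1) for name in ("effort", "evidence", "grounding")}
        )

    def forward(self, hidden_states: torch.Tensor) -> dict:
        # hidden_states: (batch, seq_len, hidden_size) from the frozen backbone.
        tail = hidden_states[:, -self.k_last:, :]      # keep only the final K token states
        encoded = self.encoder(tail)                   # trainable transformer layers
        pooled = encoded.mean(dim=1)                   # mean-pool the encoded tail
        return {name: head(pooled).squeeze(-1) for name, head in self.heads.items()}


# Usage: the backbone stays frozen; only the head is optimized during reward training.
backbone = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
backbone.requires_grad_(False)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
head = RewardHead(hidden_size=backbone.config.hidden_size)

inputs = tokenizer("Does Table 3 control for dataset size?", return_tensors="pt")
with torch.no_grad():
    hidden = backbone(**inputs).hidden_states[-1]      # last-layer hidden states
scores = head(hidden)  # {"effort": ..., "evidence": ..., "grounding": ...}
```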
Primary Area: datasets and benchmarks
Submission Number: 24668