FirstPass: Grounding AI Scientific Judgment in Multi-Round Editorial Outcomes

Prabhjot Singh; Somnath Luitel; Manmeet Singh; Josh Durkee

Published: 30 May 2026, Last Modified: 07 Jun 2026, Venue: ICML2026-AI4Science Poster, License: CC BY 4.0

Track: Track 1: Original Research/Position/Education/Attention Track

Keywords: scientific peer review, multi-round dialogue, loss masking, editorial outcome prediction, domain generalization, AI co-author, fine-tuning, Nature Communications, revision-cycle prediction, outcome-grounded evaluation

TL;DR: Response-only loss masking is an architectural prerequisite for long-input scientific classification: a 7B model trained on 3,668 multi-round Nature Communications dialogues achieves 80.5% accuracy predicting editorial revision cycles.
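As a rough illustration of what response-only loss masking means, the sketch below hides every prompt token from the loss so that only the response tokens contribute. It is a minimal toy under stated assumptions, not the authors' training code: the helper names (`mask_prompt_labels`, `masked_mean_nll`), the token ids, and the per-token loss values are all hypothetical, and the one-position label shift used in causal language modeling is omitted for brevity. The ignore value of -100 follows the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, hiding the first prompt_len positions."""
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len must lie within the sequence")
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


def masked_mean_nll(token_nll, labels):
    """Average per-token negative log-likelihood over unmasked positions."""
    kept = [nll for nll, lab in zip(token_nll, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical toy sequence: 6 prompt tokens (manuscript and reviews)
# followed by 2 response tokens (the predicted outcome).
input_ids = [11, 12, 13, 14, 15, 16, 7, 8]
labels = mask_prompt_labels(input_ids, prompt_len=6)

# Hypothetical per-token losses, as a model might produce them.
token_nll = [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 0.5, 1.5]

loss_masked = masked_mean_nll(token_nll, labels)   # response tokens only
loss_unmasked = sum(token_nll) / len(token_nll)    # dominated by the prompt
```

In this toy the unmasked average is driven almost entirely by the six prompt tokens, which suggests why a short classification label can be drowned out when the input is a long multi-round dialogue.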

Abstract: AI systems for peer review fail on three fronts: they train on Computer Science and Machine Learning venues alone, ignore the iterative dialogue that validates science, and evaluate on stylistic mimicry rather than real editorial judgment. We introduce FirstPass, a dataset and fine-tuned model that addresses all three. Curating 3,668 complete multi-round peer-review dialogues from Nature Communications across five scientific domains (biology, chemistry, neuroscience, physics, and earth science), we exploit mandatory transparent peer review (instituted November 2022) and verify 100% content integrity by automated audit. We fine-tune Qwen2.5-7B-Instruct via Low-Rank Adaptation (LoRA) on three tasks: review generation, reviewer updating, and revision-cycle prediction. Our key finding is that response-only loss masking is a prerequisite, not an optimization: without it, accuracy is 62.0%, below the majority baseline; with it, FirstPass achieves 80.5% accuracy and F1-macro 78.2% on predicting editorial outcomes (Standard vs. Extended revision cycles), outperforming Gemini-3.1-flash-lite-preview zero-shot by 10.4 percentage points and all baselines with statistical significance (McNemar $p < 0.001$). On generation, FirstPass produces reviews averaging 1,187 words, substantially closer to human references (2,155 words) than any baseline, achieving ROUGE-L 0.154 with significant gains over Qwen and DeepSeek zero-shot ($p < 0.001$). Deployed in the pre-submission loop as an anticipatory scientific co-author, FirstPass simulates expert critique and predicts revision cycle outcomes before submission, giving authors the judgment a trusted colleague would provide, with consistent cross-domain performance across five disciplines.
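The significance claims above rely on McNemar's test, which compares two classifiers on the same test items by looking only at the items where exactly one of them is correct. The sketch below shows the exact (binomial) form of the test in plain Python. It is an illustration under stated assumptions: the abstract does not say whether the exact or chi-square variant was used, the function name `mcnemar_exact` is hypothetical, and the labels and predictions are made-up toy values, not results from the paper.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts items only model A gets
    right and c counts items only model B gets right.
    """
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree in correctness
    # Under the null, discordant items split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical toy data: 10 items, model A correct on all of them,
# model B correct only on the last two.
y_true = [1] * 10
pred_a = [1] * 10
pred_b = [0] * 8 + [1] * 2

b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
```

Because the test ignores items that both models get right or both get wrong, it is a natural fit for comparing a fine-tuned model against a zero-shot baseline on a shared held-out set.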

Submission Number: 93
