Keywords: Computational argumentation, Scientific AI assistants, Large language models
TL;DR: Frontier LLMs struggle with scientific argumentation despite strong mathematical reasoning; our reasoning-aware training enables 3B models to match much larger systems, showing that scientific discourse requires explicit supervision rather than scale alone.
Abstract: Scientific discourse depends on argumentative reasoning: identifying claims, evaluating evidence, and constructing coherent responses. While recent advances in reasoning-capable language models have demonstrated strong performance on mathematical and logical benchmarks, their ability to engage in scientific argumentation remains unclear. We present the first systematic evaluation of language models across eight tasks spanning argument mining, rebuttal generation, and discourse-level reasoning using research papers, peer reviews, and grant proposals. Our study reveals that even frontier models with strong general reasoning skills struggle with domain-specific argumentative tasks, highlighting a fundamental capability gap. To address this, we introduce a training framework that explicitly scaffolds the argumentative reasoning process in language models, substantially improving their competence in scientific discourse. The resulting compact models approach or exceed the performance of much larger proprietary systems and generalize to unseen conversational settings, demonstrating reasoning transfer beyond task-specific supervision. These findings underscore that effective scientific argumentation is not an emergent property of scale, but requires explicit reasoning-aware training, and they point toward practical pathways for building AI systems that can contribute meaningfully to scientific discourse.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18838