Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach

ACL ARR 2026 January Submission9522 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Argument Quality, LLMs, Bradley-Terry Model
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks involving reasoning and judgment. Assessing the quality of arguments, however, demands rigorous evaluation, and this research investigates the extent to which LLMs can perform this task effectively. It focuses on the zero-shot capabilities of LLMs in approximating expert rankings of argument quality across three dimensions: logical, rhetorical, and dialectical. It also examines each model's specific strengths and weaknesses within a prompt-engineering and pairwise evaluation framework, using a Bradley-Terry model to infer latent strength scores and derive a ranking of arguments. Although none of the models tested (GPT-4, Gemini 2.0 Flash, and LLaMA 3.3) achieved strong alignment with the human-provided gold standard, GPT-4 demonstrated the most consistent overall performance, followed by Gemini 2.0 Flash, with LLaMA 3.3 ranking third across most dimensions. LLMs show promising potential but still fall short of replicating expert-level evaluations.
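The abstract's pipeline turns pairwise LLM judgments into a ranking via the Bradley-Terry model. As a minimal sketch (not the authors' implementation), the classic MM/Zermelo iteration below fits latent strength scores from a hypothetical win-count matrix and ranks items by the fitted strengths:

```python
def fit_bradley_terry(wins, n_items, iters=1000, tol=1e-9):
    """Fit Bradley-Terry strengths via the MM (Zermelo) algorithm.

    wins[i][j] = number of times item i was judged stronger than item j.
    Returns a list of latent strengths; higher means stronger.
    """
    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            # Total wins of item i.
            num = sum(wins[i][j] for j in range(n_items) if j != i)
            # MM denominator: comparisons weighted by current strengths.
            den = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                      for j in range(n_items) if j != i)
            new_p.append(num / den if den > 0 else p[i])
        # Normalize so strengths sum to n_items (fixes the scale).
        s = sum(new_p)
        new_p = [x * n_items / s for x in new_p]
        if max(abs(a - b) for a, b in zip(p, new_p)) < tol:
            p = new_p
            break
        p = new_p
    return p


# Hypothetical pairwise outcomes for three arguments:
# argument 0 usually wins, argument 2 usually loses.
wins = [[0, 3, 4],
        [1, 0, 3],
        [0, 1, 0]]
strengths = fit_bradley_terry(wins, 3)
ranking = sorted(range(3), key=lambda i: -strengths[i])
```

Here `ranking` orders arguments by inferred strength, which is what the paper then compares against the expert gold-standard ordering.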
Paper Type: Long
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Research Area Keywords: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Contribution Types: Model analysis & interpretability, Reproduction study
Languages Studied: English
Submission Number: 9522