Keywords: Argument Quality, LLMs, Bradley-Terry Model
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and judgment tasks; assessing the quality of arguments, however, demands rigorous evaluation. This research investigates the extent to which LLMs can perform this task effectively.
It focuses on the zero-shot capabilities of LLMs in approximating expert rankings of argument quality along three dimensions: logical, rhetorical, and dialectical. It also examines the models' specific strengths and weaknesses within a prompt-engineering and pairwise-evaluation framework, using a Bradley-Terry model (sketched below) to infer latent strength scores and derive a ranking of arguments.
Although none of the models tested (GPT-4, Gemini 2.0 Flash, and LLaMA 3.3) achieved strong alignment with the human-annotated gold standard, GPT-4 demonstrated the most consistent overall performance, followed by Gemini 2.0 Flash, with LLaMA 3.3 trailing on most dimensions.
LLMs show promise but still fall short of replicating expert-level evaluation.
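The submission's exact pipeline is not reproduced here. As a minimal sketch, assuming the LLM's pairwise judgments are tallied into a win-count matrix wins[i, j] (times argument i was judged stronger than j), the Bradley-Terry model P(i beats j) = p_i / (p_i + p_j) can be fitted with the classic Zermelo/MM iteration; the function name fit_bradley_terry and the toy tallies are illustrative, not taken from the paper.

```python
import numpy as np

def fit_bradley_terry(wins, n_iter=200, tol=1e-8):
    """Fit Bradley-Terry latent strengths from a pairwise win-count matrix.

    wins[i, j] = number of times argument i was judged stronger than j.
    Assumes every argument appears in at least one comparison.
    Returns normalized strengths; higher means a stronger argument.
    """
    n = wins.shape[0]
    totals = wins + wins.T            # n_ij: total comparisons between i and j
    p = np.ones(n)                    # uniform initial strengths
    for _ in range(n_iter):
        # MM update: p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j)
        denom = totals / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)  # no self-comparisons
        new_p = wins.sum(axis=1) / denom.sum(axis=1)
        new_p /= new_p.sum()          # strengths are relative; fix the scale
        if np.max(np.abs(new_p - p)) < tol:
            p = new_p
            break
        p = new_p
    return p

# Hypothetical tallies for four arguments from LLM pairwise judgments.
wins = np.array([
    [0, 3, 2, 3],
    [1, 0, 2, 2],
    [2, 2, 0, 1],
    [1, 2, 3, 0],
], dtype=float)
strengths = fit_bradley_terry(wins)
ranking = np.argsort(-strengths)      # indices from strongest to weakest
```

Under this setup, the inferred ranking can then be compared against the expert gold-standard ranking per quality dimension, e.g. with a rank-correlation measure.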
Paper Type: Long
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Research Area Keywords: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Contribution Types: Model analysis & interpretability, Reproduction study
Languages Studied: English
Submission Number: 9522