Linguistic Reasoning: How Multi-Agent Debate Courtroom Simulations Can Improve Recidivism Predictions

ACL ARR 2025 February Submission4653 Authors

15 Feb 2025 (modified: 09 May 2025) | ACL ARR 2025 February Submission | Readers: Everyone | License: CC BY 4.0

Abstract: This paper explores two recent trends for improving reasoning: multi-agent linguistic debate and increased test-time compute. We offer a new benchmark to quantify how well unstructured linguistic reasoning can predict young-adult recidivism from tabular statistical data. We compare popular small open LLMs with leading commercial LLMs and traditional statistical machine learning models. Two methods of linguistic reasoning are tested: (a) StandardLLM, which uses popular single chain-of-thought (CoT) prompts and variants, and (b) AgenticSimLLM, which uses multi-agent debate (MAD). The latter simulates a simplified multi-turn courtroom debate between prosecutor and defense agents, with a decision by a judge agent. The simulation is loosely based on a US bench trial, which constrains reasoning through roles, rules, and debate planning. Results show that SOTA commercial LLMs can use linguistic approaches to improve statistical reasoning over tabular datasets, although the current generation of leading smaller open LLMs struggles. Compared to internal reasoning models like OpenAI o3 or DeepSeek-R1, the AgenticSimLLM framework provides explicit, fine-grained control over test-time reasoning with intuitive, human-like reasoning explainability. Our ensemble of almost 90 unique combinations of models, sizes, and prompting strategies also shows that MAD simulations provide more stable performance, with greater correlation between accuracy and F1-score metrics. Data, results, and code will be available at github.com/anon under the MIT license.
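The courtroom debate described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' AgenticSimLLM implementation: the function `run_bench_trial`, the `LLM` callable type, the prompt wording, the YES/NO verdict format, and the feature names in the example case are all hypothetical. The model call is replaced by a deterministic stub so the control flow can be run without any API.

```python
from typing import Callable, List, Tuple

# Hypothetical type: any function that maps a prompt string to a reply string.
LLM = Callable[[str], str]


def run_bench_trial(case_summary: str, llm: LLM, rounds: int = 2) -> Tuple[str, List[str]]:
    """Alternate prosecutor and defense turns, then ask a judge for a verdict."""
    transcript: List[str] = []
    sides = (("Prosecutor", "will reoffend"), ("Defense", "will not reoffend"))
    for _ in range(rounds):
        for role, stance in sides:
            # Each agent sees the case and everything said so far.
            prompt = (
                f"You are the {role} in a bench trial. Argue that the person {stance}.\n"
                f"Case: {case_summary}\n"
                "Transcript so far:\n" + "\n".join(transcript)
            )
            transcript.append(f"{role}: {llm(prompt)}")
    judge_prompt = (
        "You are the Judge. Answer YES or NO: will the person reoffend?\n"
        f"Case: {case_summary}\n" + "\n".join(transcript)
    )
    # Assumed output convention: the judge reply begins with YES or NO.
    verdict = "YES" if llm(judge_prompt).strip().upper().startswith("YES") else "NO"
    return verdict, transcript


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model call, for demonstration only."""
    if prompt.startswith("You are the Judge"):
        return "NO"
    return "argument placeholder"


# Hypothetical tabular row rendered as text; the feature names are invented.
verdict, transcript = run_bench_trial("age=22, prior_arrests=1", stub_llm, rounds=2)
```

In this sketch the `rounds` parameter is the knob for test-time compute: more rounds mean more model calls per prediction, and the transcript doubles as a human-readable explanation of the verdict.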

Paper Type: Long

Research Area: Dialogue and Interactive Systems

Research Area Keywords: Evaluation Methodologies, Evaluation and Metrics, Prompting and Inference Methods, Argument Mining and Analysis, Generalization and Multi-task Reasoning, Transparency and Accountability, Explanation Faithfulness and Free-text Explanations

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models

Languages Studied: English

Submission Number: 4653
