Collaborative Reasoner: Self-Improving Social Agents with Synthetic Conversations

Published: 18 Sept 2025, Last Modified: 29 Oct 2025
NeurIPS 2025 poster · CC BY 4.0
Keywords: collaborative reasoning, social agents, synthetic data, self-improvement
TL;DR: We present Collaborative Reasoner (Coral), a framework to evaluate and improve the collaborative reasoning skills of language models.
Abstract: With increasingly powerful large language models (LLMs) and LLM-based agents tackling an ever-growing list of tasks, we envision a future where numerous LLM agents work seamlessly with other AI agents and humans to solve complex problems and enhance daily life. To achieve this, LLM agents must develop collaborative skills such as effective persuasion, assertion, and disagreement, which are often overlooked in the prevalent single-turn training and evaluation of LLMs. In this work, we present Collaborative Reasoner (Coral), a framework to evaluate and improve the collaborative reasoning abilities of language models. In particular, the tasks and metrics in Coral require agents to disagree with incorrect solutions, convince their partners of a correct solution, and ultimately agree as a team to commit to a final solution, all through a natural multi-turn conversation. Through comprehensive evaluation on six collaborative reasoning tasks spanning coding, math, scientific QA, and social reasoning, we show that current models cannot collaborate effectively due to undesirable social behaviors, failing even on problems they can solve single-handedly. To improve the collaborative reasoning capabilities of LLMs, we propose a self-play method that generates synthetic multi-turn preference data and further trains the language models to be better collaborators. Experiments with Llama-3.1, Ministral, and Qwen-2.5 models show that our self-improvement approach consistently outperforms the finetuned chain-of-thought performance of the same base model, yielding absolute gains of up to 16.7%. Human evaluations show that, after training on our synthetic interaction data, the models exhibit more effective disagreement and produce more natural conversations.
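To make the self-play data-generation idea in the abstract concrete, here is a minimal sketch of one plausible loop: two copies of the same model converse about a problem, rollouts that commit to the correct answer are treated as preferred over those that do not, and the resulting pairs form multi-turn preference data. This is an illustration under stated assumptions, not the paper's released code; `generate_turn`, `extract_answer`, the `FINAL ANSWER` agreement marker, `MAX_TURNS`, and `n_rollouts` are all hypothetical stand-ins.

```python
# Hedged sketch of self-play synthetic preference data generation, assuming:
# - one shared model plays both agents (via the generate_turn callable),
# - a conversation ends when a turn contains an agreement marker,
# - correctness is judged against a known reference answer.
from typing import Callable, Dict, List, Tuple

MAX_TURNS = 6  # assumed cap on conversation length

Message = Dict[str, str]


def self_play_conversation(
    problem: str,
    generate_turn: Callable[[List[Message]], str],
) -> List[Message]:
    """Roll out a two-agent conversation; both roles use the same model."""
    history: List[Message] = [
        {"role": "system", "content": f"Collaboratively solve: {problem}"}
    ]
    for turn in range(MAX_TURNS):
        speaker = "agent_a" if turn % 2 == 0 else "agent_b"
        reply = generate_turn(history)
        history.append({"role": speaker, "content": reply})
        if "FINAL ANSWER" in reply:  # assumed marker that the team has agreed
            break
    return history


def build_preference_pairs(
    problem: str,
    reference_answer: str,
    generate_turn: Callable[[List[Message]], str],
    extract_answer: Callable[[List[Message]], str],
    n_rollouts: int = 8,
) -> List[Tuple[List[Message], List[Message]]]:
    """Sample rollouts, then pair correct conversations against incorrect ones."""
    correct: List[List[Message]] = []
    incorrect: List[List[Message]] = []
    for _ in range(n_rollouts):
        convo = self_play_conversation(problem, generate_turn)
        if extract_answer(convo) == reference_answer:
            correct.append(convo)
        else:
            incorrect.append(convo)
    # Each (chosen, rejected) pair can feed a multi-turn preference objective.
    return [(c, r) for c in correct for r in incorrect]
```

In this sketch the preference signal comes only from final-answer correctness; the paper's metrics additionally reward social behaviors such as effective disagreement and persuasion, which would refine how chosen and rejected conversations are selected.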
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 25361