Keywords: scorable negotiation game, LLM reasoning, self-supervised learning
Abstract: The reasoning ability of large language models (LLMs) has traditionally relied on costly expert-labeled data, a resource now nearing depletion. Existing alternatives, such as Chain-of-Thought prompting or multi-agent debate, either suffer from prompt sensitivity or assume binary correctness, which limits their applicability to open-ended reasoning tasks. This raises a fundamental challenge: how can we construct scalable supervision that drives diverse yet stable reasoning without external labels? Verbal interaction offers the most natural source of new supervision signals; among the tasks that feature such interaction between AI agents, negotiation stands out as particularly suited to reasoning enhancement. We introduce Language model Self-play via Scorable negotiation Game (LSSG), a paradigm that frames reasoning enhancement as a two-player negotiation game with continuous, outcome-based rewards. Unlike prior numerical- or annotation-based games, our formulation pioneers negotiation in the language space, providing dense, interpretable signals for stable optimization at scale. LSSG combines behavioral cloning from real dialogues with self-play refinement that balances diversity and stability, yielding sustainable reasoning improvement. Across seven benchmarks (WinoGrande, CSQA, CB, SST2, LogiQA2, MedMCQA, and CMMLU), LSSG consistently outperforms strong baselines. These results show that LSSG is a scalable and robust paradigm for long-term reasoning self-supervision in LLMs.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3126