Incentive-Compatible Truthfulness: Engineering Rationality in Adversarial LLM Consensus via Reinforcement Learning from Market Signals (RLMS)

Published: 28 Dec 2025 · Last Modified: 08 Mar 2026 · AAAI 2026 Bridge (LMReasoning) · CC BY 4.0
Keywords: Large Language Models, RLMS, AI Safety, Mechanism Design, GRPO, Multi-Agent Systems
TL;DR: A Dual-Process architecture that uses GRPO to induce a latent "Safety Chain-of-Thought," enabling agents to rationally abstain from unsafe or ambiguous queries by aligning their reasoning with objective signals rather than subjective preferences.
Abstract: The deployment of Large Language Models (LLMs) in high-stakes consensus protocols reveals a critical alignment failure: Stubborn Compliance. Standard alignment paradigms, specifically Reinforcement Learning from Human Feedback (RLHF), optimize for conversational helpfulness, inducing a sycophantic bias in which agents prioritize instruction fulfillment over intrinsic truthfulness or safety. In decentralized systems governed by crypto-economic primitives, this equates to terminal irrationality, as agents wager value on invalid state transitions (e.g., malicious code generation). This paper formalizes Reinforcement Learning from Market Signals (RLMS), a framework in which agent alignment is derived from objective economic feedback rather than subjective human preference. We introduce a hybrid architecture employing Group Relative Policy Optimization (GRPO) to induce a latent "Safety Chain-of-Thought" (CoT), coupled with an inference-time prefix-forcing mechanism. We mathematically formalize the "Abstention Frontier" and demonstrate through a 50-round Monte Carlo simulation that our Rational Agents achieve long-term solvency while standard architectures face inevitable ruin. Finally, we propose a roadmap for future work in mechanistic interpretability and cryptographic commit-reveal schemes for robust multi-agent truthfulness.
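The solvency claim in the abstract can be illustrated with a minimal sketch. The code below is not the paper's simulation; it is a hypothetical toy version of the 50-round wagering game, with all parameters (stake, reward, unsafe-query rate, risk-estimate noise) invented for illustration. A "stubborn compliance" agent always answers; a rational agent abstains whenever its estimated risk crosses the expected-value break-even point, a simple stand-in for the Abstention Frontier.

```python
import random

random.seed(0)

STAKE, REWARD, P_UNSAFE = 1.0, 0.5, 0.4  # hypothetical game parameters

def simulate(policy, rounds=50, bankroll=10.0):
    """Toy 50-round wagering game (illustrative only).

    Each round the agent may wager STAKE on answering a query.
    Safe queries pay REWARD; unsafe ones forfeit the stake.
    policy(risk) returns True to answer, False to abstain.
    Returns the final bankroll (0.0 on ruin).
    """
    for _ in range(rounds):
        unsafe = random.random() < P_UNSAFE
        # Noisy per-query risk estimate the agent conditions on.
        risk = min(1.0, max(0.0, (0.8 if unsafe else 0.2) + random.gauss(0, 0.1)))
        if policy(risk):
            bankroll += -STAKE if unsafe else REWARD
        if bankroll <= 0:
            return 0.0  # ruin: the agent is insolvent
    return bankroll

# Break-even risk: REWARD * (1 - risk) - STAKE * risk = 0
# (a toy stand-in for the paper's Abstention Frontier).
frontier = REWARD / (REWARD + STAKE)

compliant = simulate(lambda risk: True)            # always answers
rational = simulate(lambda risk: risk < frontier)  # abstains past the frontier

print(f"compliant bankroll after 50 rounds: {compliant:.2f}")
print(f"rational  bankroll after 50 rounds: {rational:.2f}")
```

With these (invented) parameters, always answering has negative expected value per round (0.6 × 0.5 − 0.4 × 1.0 = −0.10), so the compliant agent drifts toward ruin, while the abstaining agent forgoes most unsafe wagers and stays solvent, mirroring the qualitative result the abstract describes.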
Submission Number: 98