Incentive-Compatible Truthfulness: Engineering Rationality in Adversarial LLM Consensus via Reinforcement Learning from Market Signals (RLMS)
Keywords: Large Language Models, RLMS, AI Safety, Mechanism Design, GRPO, Multi-Agent Systems
TL;DR: A Dual-Process architecture that uses GRPO to induce a latent "Safety Chain-of-Thought," enabling agents to rationally abstain from answering unsafe or ambiguous queries by aligning their reasoning with objective signals rather than subjective preferences.
Abstract: The deployment of Large Language Models (LLMs) in high-stakes consensus protocols reveals a critical alignment failure: Stubborn Compliance. Standard alignment paradigms, specifically Reinforcement Learning from Human Feedback (RLHF), optimize for conversational helpfulness, inducing a sycophantic bias in which agents prioritize instruction fulfillment over intrinsic truthfulness or safety. In decentralized systems governed by crypto-economic primitives, this equates to terminal irrationality, as agents wager value on invalid state transitions (e.g., malicious code generation). This paper formalizes Reinforcement Learning from Market Signals (RLMS), a framework in which agent alignment is derived from objective economic feedback rather than subjective human preference. We introduce a hybrid architecture employing Group Relative Policy Optimization (GRPO) to induce a latent “Safety Chain-of-Thought” (CoT), coupled with an inference-time prefix-forcing mechanism. We mathematically formalize the “Abstention Frontier” and demonstrate through a 50-round Monte Carlo simulation that our Rational Agents achieve long-term solvency while standard architectures face inevitable ruin. Finally, we propose a roadmap for future work on mechanistic interpretability and cryptographic commit-reveal schemes for robust multi-agent truthfulness.
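To make the solvency claim concrete, here is a minimal sketch of the kind of 50-round Monte Carlo comparison the abstract describes: a compliant agent that always answers versus a rational agent that abstains from queries it flags as unsafe. All parameters (stake, reward, unsafe-query rate, detection accuracy) and the `simulate` helper are hypothetical illustrations, not values or code from the paper.

```python
import random

# Hypothetical illustration of the abstract's 50-round solvency comparison.
# Every constant below is an assumed value for the sketch, not taken from the paper.
ROUNDS = 50
START_BANKROLL = 100.0
STAKE = 10.0          # value wagered on each answered query
REWARD = 3.0          # payoff for a valid, truthful state transition
P_UNSAFE = 0.2        # fraction of adversarial / invalid queries
DETECT_ACC = 0.9      # assumed chance the rational agent flags an unsafe query

def simulate(rational: bool, seed: int = 0) -> list[float]:
    """Return one agent's bankroll trajectory over ROUNDS rounds."""
    rng = random.Random(seed)
    bankroll, history = START_BANKROLL, []
    for _ in range(ROUNDS):
        unsafe = rng.random() < P_UNSAFE
        if rational and unsafe and rng.random() < DETECT_ACC:
            pass                 # abstain: no stake wagered, no reward earned
        elif unsafe:
            bankroll -= STAKE    # slashed for wagering on an invalid state transition
        else:
            bankroll += REWARD   # rewarded for a valid response
        history.append(bankroll)
        if bankroll <= 0:
            break                # ruin: the agent is insolvent
    return history

if __name__ == "__main__":
    print("rational :", simulate(rational=True)[-1])
    print("compliant:", simulate(rational=False)[-1])
```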
Submission Number: 98