Hail to the Thief: Exploring Attacks and Defenses in Decentralized GRPO

18 Sept 2025 (modified: 18 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: rl, adversarial attacks, distributed, grpo
TL;DR: A novel adversarial attack and defense strategy for decentralized GRPO
Abstract: Group Relative Policy Optimization (GRPO) has demonstrated great utility in the post-training of Large Language Models (LLMs). In GRPO, the model answers prompts and, through reinforcement learning, learns to favor preferred completions. Owing to its small communication volume, GRPO is inherently suitable for decentralized training: prompts can be answered concurrently by multiple nodes and the completions exchanged as strings. In this work, we present the first adversarial attack on decentralized GRPO. We demonstrate that malicious parties can poison such systems by injecting arbitrary malicious tokens into benign models, in both out-of-context and in-context attacks. Using empirical examples from math and coding tasks, we show that adversarial attacks can easily poison the benign nodes, polluting their local LLM post-training and achieving attack success rates of up to $100$\% in as few as 50 iterations. To defend against such attacks, we propose two defenses for two settings, depending on whether all nodes train the same model or different models. We show that these defenses achieve stop rates of up to $100$\%, rendering the attack infeasible.
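To make the poisoning vector concrete, the following is a minimal, hypothetical sketch (not code from the paper) of the group-relative advantage computation at the heart of GRPO, and of how a completion with a spoofed high reward injected into a shared group dominates the resulting advantages; all function names and reward values are illustrative assumptions.

```python
# Illustrative sketch: GRPO standardizes each completion's reward
# within its group, so a single spoofed high reward can dominate
# the policy update. Names and numbers are hypothetical.
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """GRPO-style advantage: reward standardized within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Honest group: four benign completions with comparable rewards.
honest = [0.8, 0.7, 0.9, 0.6]

# Poisoned group: an attacker appends a malicious completion and
# claims a maximal reward for it.
poisoned = honest + [1.0]

adv = group_relative_advantages(poisoned)

# The injected completion receives the largest advantage in its group,
# steering every benign node's update toward the malicious tokens.
assert max(adv) == adv[-1]
```

This is why exchanging raw completion strings between untrusted nodes is dangerous: the group statistics mix attacker-controlled samples into every participant's gradient signal.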
Primary Area: reinforcement learning
Submission Number: 11808