Decentralized Byzantine-Resilient Multi-Agent Reinforcement Learning with Reward Machines in Temporally Extended Tasks

ICLR 2026 Conference Submission 21876 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Reinforcement learning, Multi-agent systems, Adversarial agents
TL;DR: Decentralized Reinforcement Learning for Learning Resilient Policies against Adversarial Agents in Temporally Extended Tasks
Abstract: In resilient cooperative multi-agent reinforcement learning (c-MARL), a fraction of agents exhibit Byzantine behavior, sending fabricated or deceptive information to hinder the learning process. Unlike existing approaches, which often rely on a central controller or impose stringent behavioral requirements on agents, we propose a fully decentralized method based on reward machines (RMs) that learns an optimal policy for temporally extended tasks. We introduce a belief-based Byzantine detection mechanism for discrete-time multi-agent reinforcement learning (MARL), in which defender (non-Byzantine) agents iteratively update probabilistic suspicions of their peers using observed actions and rewards. RMs encode the temporal dependencies in the task's reward structure and guide the learning process. Building on them, we develop tabular Q-learning and actor-critic algorithms that learn a robust consensus mechanism to isolate the influence of Byzantine agents and ensure effective learning by defender agents. We establish theoretical guarantees showing that our algorithms converge to an optimal policy, and we evaluate our method against baselines in two case studies to demonstrate its effectiveness.
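To make the belief-based detection idea concrete, the following is a minimal, hypothetical sketch (not the authors' algorithm) of how a defender agent might maintain a per-peer suspicion probability, update it with a Bayesian-style rule when a peer's reported reward disagrees with the defender's own observation, and turn the beliefs into consensus weights that down-weight suspected Byzantine peers. All class names, parameters, and likelihood values below are illustrative assumptions.

import numpy as np

class SuspicionTracker:
    """Toy per-peer suspicion update for a single defender agent (illustrative only)."""

    def __init__(self, n_peers, prior=0.1, p_honest_match=0.9, p_byz_match=0.3):
        self.belief = np.full(n_peers, prior)   # P(peer j is Byzantine), assumed prior
        self.p_honest_match = p_honest_match    # assumed P(report matches observation | honest)
        self.p_byz_match = p_byz_match          # assumed P(report matches observation | Byzantine)

    def update(self, peer, reported_reward, observed_reward, tol=1e-6):
        # Compare the peer's reported reward with the defender's own observation
        # for the shared transition, then apply Bayes' rule to the Byzantine belief.
        match = abs(reported_reward - observed_reward) < tol
        b = self.belief[peer]
        like_byz = self.p_byz_match if match else 1.0 - self.p_byz_match
        like_hon = self.p_honest_match if match else 1.0 - self.p_honest_match
        self.belief[peer] = (like_byz * b) / (like_byz * b + like_hon * (1.0 - b))

    def weights(self):
        # Consensus weights proportional to the probability that each peer is honest.
        w = 1.0 - self.belief
        return w / w.sum() if w.sum() > 0 else np.full_like(w, 1.0 / len(w))

# Example: a mismatching report increases suspicion of peer 1 and lowers its weight.
tracker = SuspicionTracker(n_peers=3)
tracker.update(peer=1, reported_reward=0.0, observed_reward=1.0)
print(tracker.belief, tracker.weights())

In the paper's setting, such weights would enter the decentralized consensus step of the Q-learning or actor-critic updates over reward-machine states; the sketch above only illustrates the suspicion-update mechanism in isolation.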
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 21876