Fault Tolerance in Multi Agent Systems

Savitha Suresh; Akshay Narayan

Published: 17 Dec 2025, Last Modified: 17 Dec 2025, Venue: WoMAPF Oral, Readers: Everyone, License: CC BY 4.0

Keywords: MARL, RL, Fault-Tolerance

Abstract: Multi-Agent Reinforcement Learning (MARL) has demonstrated strong performance in cooperative and competitive environments, but its deployment in real-world systems remains limited by its vulnerability to faulty agents. Faults may arise from hardware malfunctions, communication delays, adversarial interference, or degraded sensors, and can propagate through the system if left unaddressed. This work investigates mechanisms for fault tolerance in MARL systems, with a focus on preserving stability, efficiency, and cooperative behavior when subsets of agents fail or behave unpredictably. We propose a framework that integrates behavior masking, reward shaping, and attention mechanisms to mitigate the impact of faulty agents. The masking component enables the policy to selectively downweight agents exhibiting faulty behavior, preventing corrupted trajectories from dominating the learning signal. Masking is combined with reward shaping strategies, such as penalties for oscillations and inactivity, that guide learning away from failure-prone trajectories. Experimental results on cooperative benchmarks show that our approach significantly improves performance over standard MARL baselines, and that agents maintain task performance in the presence of agent failures. When agents are faulty from the start of an episode, our attention mechanism with behavior masking achieves a 22% improvement over the baseline RNN at 20 million steps. In the more challenging scenario where agents become faulty mid-episode, our method achieves a 28% improvement over the baseline, demonstrating stronger robustness under dynamic faults. This work contributes toward making MARL more practical for safety-critical domains such as swarm robotics and distributed sensing. By explicitly embedding fault tolerance into the learning process, we move closer to scalable, reliable, and autonomous multi-agent systems capable of operating under real-world uncertainty.
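
The two mechanisms named in the abstract, behavior masking inside attention and reward shaping with penalties for oscillation and inactivity, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the penalty magnitudes, the position-history heuristic, and the assumption that a per-agent fault flag is already available are all hypothetical and chosen only for illustration.

```python
import math

def masked_attention_weights(scores, faulty, penalty=1e9):
    """Softmax over per-agent attention scores with flagged agents downweighted.

    `faulty` is an assumed per-agent boolean flag; how faults are detected
    is outside this sketch. A very large `penalty` removes a flagged agent
    from the weighting, while a small one only reduces its share.
    """
    masked = [s - penalty if f else s for s, f in zip(scores, faulty)]
    m = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

def shaped_reward(env_reward, positions, idle_penalty=0.1, oscillation_penalty=0.2):
    """Subtract hypothetical penalties for inactivity and oscillation.

    `positions` holds the agent's most recent positions, oldest first.
    Inactivity: the last two positions are equal.
    Oscillation: the agent moved but returned to where it was two steps ago.
    """
    reward = env_reward
    if len(positions) >= 2 and positions[-1] == positions[-2]:
        reward -= idle_penalty
    elif len(positions) >= 3 and positions[-1] == positions[-3]:
        reward -= oscillation_penalty
    return reward

# Usage: the second of three agents is flagged faulty.
weights = masked_attention_weights([1.0, 1.0, 1.0], [False, True, False])
idle = shaped_reward(1.0, [(0, 0), (0, 1), (0, 1)])
oscillating = shaped_reward(1.0, [(0, 0), (0, 1), (0, 0)])
```

With the default penalty, the flagged agent contributes nothing and the remaining attention mass is split among the healthy agents. Lowering `penalty` gives a softer downweighting, which may be closer to the "selectively downweight" wording above.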

Submission Number: 19
