Learning Provably Correct Synchronous Crash Fault Tolerant Distributed Protocols With Minimal Human Knowledge

TMLR Paper9284 Authors

28 May 2026 (modified: 19 Jun 2026) | Under review for TMLR | CC BY 4.0

Abstract: Provably correct distributed protocols, which are a critical component of modern distributed systems, are highly challenging to design and have often required decades of human effort. These protocols allow multiple agents to coordinate and reach agreement in an environment with uncertainty and failures. As a starting point, this work focuses on synchronous Crash Fault Tolerance (CFT) protocols, a widely used family of distributed protocols, in a bounded setting with a small number of agents. We formulate protocol design as a search problem over strategies in a game with imperfect information, with the desired correctness conditions specified in Satisfiability Modulo Theories (SMT). However, standard methods for solving multi-agent games fail to learn correct protocols in this setting, even when the number of agents is small. We propose a learning framework, GGMS, which integrates a specialized variant of Monte Carlo Tree Search with a transformer-based action encoder, a global depth-first search to break out of local minima, and repeated feedback from a model checker. Protocols output by GGMS are verified correct by exhaustive model checking over all executions within the bounded setting. We further prove that, under certain assumptions, the search process is complete: if a correct protocol exists, GGMS will eventually find it. In experiments, we show that GGMS can learn correct protocols for larger settings than existing methods.

Submission Type: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Kuldeep_S._Meel2

Submission Number: 9284
