Abstract: Methods like multi-agent reinforcement learning struggle to scale with growing population size. Mean-field games (MFGs) are a game-theoretic approach that can circumvent this by finding a solution for an abstract infinite population, which can then be used as an approximate solution to the $N$-agent problem. However, classical mean-field algorithms usually only work under restrictive conditions. We take steps to address this by introducing networked communication to MFGs, in particular to settings that use a single, non-episodic run of $N$ decentralised agents to simulate the infinite population, as is likely to be most reasonable in real-world deployments. We prove that our architecture's sample guarantees lie between those of earlier theoretical algorithms for the centralised- and independent-learning architectures, varying with the network structure and the number of communication rounds. However, the sample guarantees of the three theoretical algorithms do not actually result in practical convergence times. We therefore contribute practical enhancements to all three algorithms, allowing us to present their first empirical demonstrations. We then show that in practical settings where the theoretical hyperparameters are not followed, giving fewer loops but poorer estimation of the Q-function, our communication scheme still respects the earlier theoretical comparison: it considerably accelerates learning over the independent case, which hardly seems to learn at all, and often performs similarly to the centralised case, while removing the latter's restrictive assumption. We provide ablations and additional studies showing that our networked approach also has advantages over both alternatives in terms of robustness to update failures and to changes in population size.
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=J9WGHU78gb&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DTMLR%2FAuthors%23your-submissions)
Changes Since Last Submission: Action Editor R8Yc (Alec Koppel) asked us to request the same action editor and reviewers as for our initial submission, since Reviewer ChYz (the holdout recommending rejection) did not respond to our substantial rebuttal posted on 27 Oct 2025.
Our revisions address all of the concerns raised by Action Editor R8Yc in their decision, as well as all of the concerns of Reviewer ChYz, incorporating the changes we had previously indicated we were happy to make in our rebuttal to Reviewer ChYz. In particular, we:
- Emphasised the minimal conceptual gap between theory and practice regarding the replay buffers. We did so by:
  - Moving our ablation study of the replay buffer (previously the last experiment) to become our first experiment, and reducing its framing as an ablation to instead emphasise this experiment’s bridging role between our theoretical results and the other empirical results. As such, we start by empirically demonstrating our original ‘theoretical’ algorithms, and the fact that these algorithms do not appear to improve their returns at all without an extremely high number of inner loops, which would take impractically long (many days or even weeks) to run on standard computers. This bridges the gap to the rest of our experiments, where we use our buffer.
  - Adding Remark 6.1 in Sec. 6.1, which explains that while the buffer may mean that our specific theoretical sample guarantees are not exactly preserved, we still expect the ranking of the sample guarantees for the three architectures to remain the same. This is because although the use of samples in learning has changed, the underlying machinery that drives the difference in performance between the architectures has not. The independent architecture will still have worse sample guarantees than the central-agent one due to bias caused by policy divergence, and our networked communication and policy adoption can still reduce this divergence as before. Thus our theoretical results still give heuristic insight into our experimental results that use the buffer, with the latter showing networked populations outperforming independent ones while underperforming or performing similarly to the central-agent populations, as predicted by our original theory. We leave updating the specific theoretical sample guarantees in light of the buffer to future work.
- Updated the abstract to better convey the significance and practicality of our contributions with respect to our motivation.
- Reorganised the introduction to begin with MARL, and to better explain the usefulness, relevance and fundamentals of MFGs, incorporating our answers to Reviewer ChYz.
- Added diagrams illustrating MFG solutions, how such solutions are found, and the various proposed architectures.
- Moved citations closer to the corresponding examples in the list of example applications.
- Further justified our list of desired qualities for deployed MFG algorithms, incorporating our answer to Reviewer ChYz.
- Moved the citations of works using forward-backward equations from the footnote to the related work section.
- Moved the suggested paragraphs from the introduction to the related work section, so that the introduction reaches ‘almost all prior work relies on a centralised node…’ sooner.
- Emphasised the novelty of our experience replay buffer, incorporating our answer to Reviewer ChYz.
- Further explained the purpose of anonymity in MFGs, incorporating our answer to Reviewer ChYz.
- Further explained when we might want to model with an infinite population, incorporating our answer to Reviewer ChYz.
- Further justified our tabular algorithm and postponement of non-tabular algorithms to future work, incorporating our answer to Reviewer ChYz.
- Added further explanations for every assumption, definition, lemma and theorem, incorporating our answers to Reviewer ChYz.
We also retain the changes already made during the previous discussion process in light of the other reviewers' suggestions. We hope that these requested changes, in addition to the recommendations for acceptance from the two other reviewers, are sufficient to change the assessment of our paper. Thank you very much for your time and effort.
Assigned Action Editor: ~Alec_Koppel1
Submission Number: 6945