Abstract: This paper establishes the time-consistency property, i.e., the dynamic programming principle (DPP), for learning mean-field controls (MFCs). The key idea is to define the appropriate form of the Q function for learning MFCs, called the IQ function. This form reflects the essence of MFCs: it is an “integration” of the classical Q function over the state and action distributions. The DPP, stated as a Bellman equation for this IQ function, generalizes the classical DPP of Q-learning to the McKean-Vlasov system; it also generalizes the DPP for MFCs to the learning framework. In addition, to accommodate model-based learning for MFCs, the DPP for the associated value function is derived. Finally, numerical experiments illustrate the time consistency of this IQ function.
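A schematic sketch of the two DPPs described above may help fix ideas. The notation here is assumed, not taken from the abstract: $\mu$ is the population state distribution, $h$ a policy (conditional action distribution), $\Phi(\mu, h)$ the induced one-step mean-field state-distribution transition, $r$ the per-agent one-step reward, and $\gamma \in (0,1)$ the discount factor. This is a hedged reconstruction of the kind of equations the abstract refers to, not the paper's exact statement.

\[
Q(\mu, h) \;=\; \bar r(\mu, h) \;+\; \gamma \sup_{h'} Q\bigl(\Phi(\mu, h),\, h'\bigr),
\qquad
\bar r(\mu, h) \;=\; \int_{\mathcal{X}} \int_{\mathcal{U}} r(x, u, \mu, h)\, h(du \mid x)\, \mu(dx),
\]
\[
V(\mu) \;=\; \sup_{h} \Bigl\{ \bar r(\mu, h) \;+\; \gamma\, V\bigl(\Phi(\mu, h)\bigr) \Bigr\}.
\]

In this sketch, averaging the reward against both $\mu$ and $h$ is what the abstract calls the “integration” over the state and action distributions, and taking the supremum over the next policy $h'$ at the population level is what restores time consistency; the second equation is the value-function form of the DPP suited to model-based learning.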