Towards a General Transfer Approach for Policy-Value Networks

Published: 07 Dec 2023, Last Modified: 07 Dec 2023. Accepted by TMLR.
Abstract: Transferring trained policies and value functions from one task to another, such as one game to another with a different board size, board shape, or more substantial rule changes, is a challenging problem. Popular benchmarks for reinforcement learning (RL), such as Atari games and ProcGen, have limited variety, especially in terms of action spaces. Due to a focus on such benchmarks, the development of transfer methods that can also handle changes in action spaces has received relatively little attention. Furthermore, we argue that progress towards more general methods should include benchmarks where new problem instances can be described by domain experts, rather than machine learning experts, using convenient, high-level domain-specific languages (DSLs). In addition to enabling end users to more easily describe their problems, user-friendly DSLs also contain relevant task information which can be leveraged to make effective zero-shot transfer plausibly achievable. As an example, we use the Ludii general game system, which includes a highly varied set of over 1000 distinct games described in such a language. We propose a simple baseline approach for transferring fully convolutional policy-value networks, which are used to guide search agents similar to AlphaZero, between any pair of games modelled in this system. Extensive results, including various cases of highly successful zero-shot transfer, are provided for a wide variety of source and target games.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
1. Updated final sentence (now final 2 sentences) of abstract, as per request from **Reviewer wV6n**.
2. Mentioned loss of plasticity on pages 8 and 9, and added appropriate references, as per request from **Reviewer wV6n**.
3. Added one more citation to (Wu, 2019), as per request from **Reviewer wV6n**.
4. Updated Figure 1, as per requests from **Reviewers xv6k and nhxA**.
5. Several improvements / corrections to phrasing, as requested by **Reviewer nhxA**.
6. The main submission file now also includes all appendices (no need to download separate PDF).
**Changes from revision posted on 25/10/2023:**
1. Clarified and edited some text in the first few paragraphs of Section 3, to make the procedures used for mapping channels and transferring parameters clearer. Added Subsections 3.1 and 3.2 as completely new subsections of Section 3 (both largely based on material that used to appear only in an appendix; the appendix is still there too, with some more detail). **Addresses comments from Reviewers wV6n, xv6k, and nhxA**.
2. Explicitly mentioned in Section 3 the example of how channels are sometimes heuristically mapped, as is the case for Piece Type channels (based on tree edit distances). **Addresses comment from Reviewer xv6k**.
3. Added a new appendix and table (in this revision: Appendix D and Table 1) explaining, as an example, how channels are mapped between the games of *Minishogi* and *Shogi*, as suggested by **Reviewer wV6n**.
4. Briefly discussed the number of models trained per source game, and implications in terms of statistical reliability of results, in Section 4. **Addresses comment from Reviewer xv6k**.
5. The figure with four scatterplots for four different subsets of finetuning results, with ratios of training epochs between source and target domain on the $x$-axis, has been moved into an appendix (this is now Figure 8). As **Reviewer xv6k** correctly pointed out, there was not much of an observable relationship between the $x$- and $y$-axes. Instead, the same space in the main paper is now occupied by a new figure (now Figure 3), which has the zero-shot transfer playing strength on the $x$-axis. This lets us make some more observations about the extent to which negative transfer occurs (if at all) during finetuning, or only during the initial transfer. The discussion of results in Subsection 4.2 has also been extended for this. **Addresses comment from Reviewer xv6k**.
6. Moved the figure illustrating the two different board shapes used for *Pentalath* (used to be Figure 3) down into the appendices (now Figure 6).
7. Clarified in Subsection 4.1 that the untrained UCT agents back up average outcomes of 10 random rollouts per iteration of MCTS. **Based on discussion with Reviewer nhxA**.
Assigned Action Editor: ~Michael_Bowling1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 1539