Abstract: Reinforcement learning has achieved remarkable success across diverse scenarios, but learning optimal policies in partially observable games remains a formidable challenge. Crucial privileged information in the state is hidden during gameplay, yet ideally it should be accessible and exploitable during training. Previous studies have focused on learning policies from either partial observations alone or oracle states alone, and these approaches often struggle to generalize effectively. To address this challenge, we propose the actor–cross-critic (ACC) learning framework, which integrates both partial observations and oracle states. ACC coordinates two critics and uses a maximization mechanism to switch between them dynamically, selecting the higher value estimate when computing advantages within the actor–critic framework, thereby accelerating learning and mitigating bias under partial observability. Theoretical analysis shows that ACC has better learning ability toward optimal policies than actor–critic learning that uses only the oracle states. We demonstrate its superior performance through comprehensive evaluations on decision-making tasks such as QuestBall, Minigrid, and Atari, as well as the challenging card game DouDizhu.
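The abstract describes the cross-critic mechanism only at a high level: one critic conditioned on the partial observation, one on the oracle state, with an element-wise maximum used when forming advantages. The sketch below is a minimal, hedged illustration of that idea under our own assumptions; it is not the authors' implementation, and all names (MLPCritic, cross_critic_advantage, the one-step TD advantage, gamma) are illustrative choices rather than identifiers or design details from the paper.

```python
# Minimal sketch (not the authors' code) of a two-critic advantage with an
# element-wise max over the observation-based and oracle-state-based values.
import torch
import torch.nn as nn


class MLPCritic(nn.Module):
    """Small value network used for both critics in this sketch."""

    def __init__(self, input_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # value estimate, shape (batch,)


def cross_critic_advantage(
    obs_critic: MLPCritic,
    oracle_critic: MLPCritic,
    obs: torch.Tensor,          # partial observations, (T, obs_dim)
    oracle: torch.Tensor,       # oracle states,        (T, state_dim)
    next_obs: torch.Tensor,
    next_oracle: torch.Tensor,
    rewards: torch.Tensor,      # (T,)
    dones: torch.Tensor,        # (T,), 1.0 at episode termination
    gamma: float = 0.99,        # assumed discount factor
) -> torch.Tensor:
    """One-step advantage using the max of the two critics' value estimates."""
    with torch.no_grad():
        v = torch.maximum(obs_critic(obs), oracle_critic(oracle))
        v_next = torch.maximum(obs_critic(next_obs), oracle_critic(next_oracle))
        td_target = rewards + gamma * (1.0 - dones) * v_next
    return td_target - v


if __name__ == "__main__":
    T, obs_dim, state_dim = 8, 4, 6
    obs_c, ora_c = MLPCritic(obs_dim), MLPCritic(state_dim)
    adv = cross_critic_advantage(
        obs_c, ora_c,
        torch.randn(T, obs_dim), torch.randn(T, state_dim),
        torch.randn(T, obs_dim), torch.randn(T, state_dim),
        torch.rand(T), torch.zeros(T),
    )
    print(adv.shape)  # torch.Size([8])
```

In an actor–critic loop, such an advantage would weight the policy-gradient term, letting whichever critic currently gives the higher value estimate drive the update; how ACC actually trains and combines the two critics is specified in the paper itself, not here.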
External IDs: dblp:journals/tciaig/LiZWXX25