Visual Encoders for Imitation Learning in Modern Video Games

Published: 01 Apr 2025, Last Modified: 30 Apr 2025ALAEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Imitation Learning, Visual Encoders, Video Games
TL;DR: Comparative study of different end-to-end trained and pre-trained visual encoders for imitation learning in modern video games
Abstract: Video games have served as useful benchmarks for the decision-making community, but going beyond Atari games towards modern games has been prohibitively expensive for the vast majority of the research community. Prior work in modern video games typically relied on game-specific integration to obtain game features and enable online training, or on existing large datasets. An alternative approach is to train agents using imitation learning to play video games purely from images. However, this setting poses a fundamental question: which visual encoders obtain representations that retain information critical for decision making? To answer this question, we conduct a systematic study of imitation learning with publicly available pre-trained visual encoders compared to the typical task-specific end-to-end training approach in Minecraft, Counter-Strike: Global Offensive, and Minecraft Dungeons. Our results show that end-to-end training can be effective with comparably low-resolution images and only minutes of demonstrations, but significant improvements can be gained by utilising pre-trained encoders such as DINOv2 depending on the game. In addition to enabling effective decision making, we show that pre-trained encoders can make decision-making research in video games more accessible by significantly reducing the cost of training.
Supplementary Material: pdf
Type Of Paper: Full paper (max page 8)
Anonymous Submission: Anonymized submission.
Submission Number: 14
Loading