ApmNet: Toward Generalizable Visual Continuous Control with Pre-trained Image Models

Published: 01 Jan 2024, Last Modified: 08 Apr 2025 · ECML/PKDD (3) 2024 · CC BY-SA 4.0
Abstract: In this paper, we propose ApmNet, an effective asymmetric two-stream framework for generalizable visual continuous control. Since pre-trained image models (PIMs) have achieved remarkable success in many fields, researchers have attempted to extend them to visual continuous control tasks. However, directly applying a PIM to visual control tasks is difficult because pre-training provides no task-specific information. Fine-tuning the PIM on a control task can be an effective solution, but it is prone to overfitting and may cause the PIM to lose the generalization ability gained through pre-training. To address these issues, ApmNet adopts an asymmetric two-stream network design. ApmNet uses a frozen pre-trained Masked Autoencoder (MAE) as the visual backbone for policy learning; the reconstructed distorted views serve as data augmentation and give ApmNet strong generalization ability. ApmNet then uses a separate pre-trained network with an adapter module to fuse the different pre-trained representations and generate actions. The adapter module bridges the distribution shift between the pre-training data and the visual states. After training, ApmNet discards the MAE and relies only on the separate pre-trained network with the adapter as the final vision backbone. Through this asymmetric strategy, ApmNet achieves good generalization while fine-tuning only a small number of parameters. Extensive experiments on the DMControl, DMControl-GB, and MetaWorld benchmarks verify the effectiveness of ApmNet. Empirical evidence suggests that ApmNet outperforms previous state-of-the-art methods in both sample efficiency and generalization.
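To make the asymmetric two-stream design concrete, below is a minimal PyTorch-style sketch, not the authors' implementation. The class names (Adapter, ApmNetSketch), the additive fusion of the two streams, and the actor head are illustrative assumptions; only the overall structure follows the abstract: a frozen MAE stream used only during training, and a frozen pre-trained encoder plus a small trainable adapter that is kept as the final vision backbone.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Lightweight residual bottleneck adapter; the only backbone part that is fine-tuned.

    (Hypothetical structure; the paper's adapter details are not given in the abstract.)
    """

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual adaptation


class ApmNetSketch(nn.Module):
    """Illustrative asymmetric two-stream policy network.

    mae_encoder: frozen pre-trained MAE, used only during training.
    pim_encoder: frozen pre-trained image model, retained at deployment.
    """

    def __init__(self, mae_encoder: nn.Module, pim_encoder: nn.Module,
                 feat_dim: int, action_dim: int):
        super().__init__()
        self.mae = mae_encoder
        self.pim = pim_encoder
        for p in self.mae.parameters():   # freeze both pre-trained encoders
            p.requires_grad = False
        for p in self.pim.parameters():
            p.requires_grad = False
        self.adapter = Adapter(feat_dim)  # small, trainable
        self.actor = nn.Sequential(       # trainable policy head
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, obs: torch.Tensor, use_mae_stream: bool = True) -> torch.Tensor:
        z = self.adapter(self.pim(obs))        # deployment-time stream
        if use_mae_stream:                     # training-time stream, discarded afterwards
            with torch.no_grad():
                z_mae = self.mae(obs)
            z = z + z_mae                      # assumed additive fusion of representations
        return self.actor(z)
```

At deployment, the MAE stream is simply disabled (use_mae_stream=False), so only the frozen pre-trained encoder, the adapter, and the policy head remain, matching the abstract's claim that good generalization is obtained by fine-tuning only a small number of parameters.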