ERPV: Enhancing Visual Reinforcement Learning with Partially Reliable Knowledge from VLMs

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large-scale vision-language models, Reinforcement learning, Decision making, Knowledge distillation
Abstract: Visual Reinforcement Learning (VRL) aims to learn optimal control policies from scratch, a process that often suffers from low exploration efficiency. Integrating large-scale vision-language models (VLMs) offers a promising solution, as they provide rich prior knowledge about the environment. However, VLMs are only partially reliable when applied directly to VRL: the actions they infer may be wrong in certain states, and without a way to identify which inferred actions are trustworthy, the agent can be driven into excessive exploration. We propose ERPV, a novel method that effectively enhances VRL with partially reliable knowledge from VLMs. ERPV introduces two key modules: (1) Value-aware Policy Guidance, which estimates the reliability of the VLM across different states and adaptively selects trustworthy VLM-inferred actions to guide policy learning; and (2) VLM-guided Entropy Regularization, which reduces over-exploration by comparing the confidence of the VRL policy against that of the VLM-inferred actions. Extensive experiments on diverse, complex visual control tasks show that ERPV achieves competitive performance against the state of the art in both policy effectiveness and sample efficiency. The code is provided in the supplementary materials.
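The two modules can be illustrated with a toy sketch for a discrete-action setting. This is purely hypothetical: the function names, the value-margin gate, and the agreement-scaled entropy weight are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def value_aware_guidance(q_values, vlm_action, margin=0.1):
    """Illustrative reliability gate: accept the VLM-suggested action
    only when its estimated value is within a margin of the best known
    value; otherwise fall back to the agent's own greedy action."""
    greedy = int(np.argmax(q_values))
    if q_values[vlm_action] >= q_values[greedy] - margin:
        return int(vlm_action)  # VLM deemed trustworthy in this state
    return greedy               # VLM deemed unreliable here

def guided_entropy_weight(policy_probs, vlm_action, base_weight=0.01):
    """Illustrative entropy regularization: shrink the entropy bonus
    when the policy already places high probability on the VLM-inferred
    action, curbing unnecessary exploration in states of agreement."""
    agreement = policy_probs[vlm_action]  # policy mass on VLM's choice
    return base_weight * (1.0 - agreement)

# Toy usage: three actions, VLM suggests action 1.
q = np.array([1.0, 0.95, 0.2])
print(value_aware_guidance(q, vlm_action=1))   # within margin -> 1
print(value_aware_guidance(q, vlm_action=2))   # too far off   -> 0

probs = np.array([0.8, 0.1, 0.1])
print(guided_entropy_weight(probs, vlm_action=0))  # reduced bonus
```

In this sketch, disagreement between the value estimates and the VLM suggestion disables guidance state by state, while policy-VLM agreement scales down the entropy bonus, mirroring the intuition behind the two modules described above.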
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 12083