Digi-Q: Transforming VLMs to Device-Control Agents via Value-Based Offline RL

ICLR 2025 Conference Submission 4830 Authors

25 Sept 2024 (modified: 20 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Reinforcement learning, device control, digital agents, foundation models
Abstract: Most paradigms for building foundation model agents rely on prompting or fine-tuning on existing demonstrations, but this is not sufficient in dynamic environments (e.g., mobile device control). In principle, on-policy reinforcement learning (RL) should address these limitations, but it is not effective at leveraging existing agentic data, especially when that data is of low quality. Offline value-based RL can address this issue, but realizing value-based RL for agents has been elusive due to the instability and inefficiency of running TD-learning at scale with vision-language models (VLMs). In this paper, we develop a scalable value-based RL approach called Digi-Q that makes it possible to train VLM agents with TD-learning. We situate our study in building GUI agents for Android devices. The key idea in Digi-Q is to perform TD-learning on a frozen, intermediate-layer representation of a VLM rather than training the whole VLM itself. Doing so successfully requires an initial phase of fine-tuning to prime the VLM representations to encode actionable information that is critical for TD-learning. When done correctly, our approach attains better performance per unit of compute. To make maximal use of the learned Q-function, we devise a novel best-of-N policy extraction operator that imitates the best actions out of multiple candidate actions from the current policy, as ranked by the value function. With no REINFORCE-style policy gradients that need careful tuning and an efficient TD-learning approach, Digi-Q outperforms several strong prior methods on user-scale device-control tasks in Android-in-the-Wild, attaining a 9.9% relative improvement over the best-performing prior offline RL method in this domain.
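To make the two core ideas in the abstract concrete, below is a minimal, hypothetical sketch in PyTorch of (i) TD-learning with a small Q-head trained on top of frozen VLM features and (ii) best-of-N policy extraction that ranks candidate actions with the learned Q-function. All names, shapes, and the MLP architecture are our own assumptions for illustration, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical Q-head over frozen VLM features: the VLM itself is never updated
# during TD-learning; only this small MLP is trained.
class QHead(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state_feat, action_feat):
        # Returns a scalar Q-value per (state, action) pair.
        return self.net(torch.cat([state_feat, action_feat], dim=-1)).squeeze(-1)

def td_loss(q, q_target, batch, gamma=0.99):
    # One-step TD error on a batch of (s, a, r, s', a') features, where all
    # features are precomputed from a frozen, fine-tuned VLM layer.
    with torch.no_grad():
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * q_target(
            batch["next_state"], batch["next_action"]
        )
    return F.mse_loss(q(batch["state"], batch["action"]), target)

def best_of_n_selection(q, state_feat, candidate_action_feats):
    # Rank N candidate actions proposed by the current policy with the learned
    # Q-function and return the index of the best candidate per state; the
    # policy is then trained to imitate these selected actions, with no
    # REINFORCE-style policy gradients.
    batch_size, n, _ = candidate_action_feats.shape
    q_values = q(
        state_feat.unsqueeze(1).expand(-1, n, -1).reshape(batch_size * n, -1),
        candidate_action_feats.reshape(batch_size * n, -1),
    ).reshape(batch_size, n)
    return q_values.argmax(dim=-1)

Because only the lightweight Q-head is updated, TD-learning never backpropagates through the VLM itself, which is consistent with the abstract's claim of better performance per unit of compute; the policy is improved by imitating the Q-selected actions rather than by policy-gradient updates.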
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4830