Enhancing Zero-Shot VLM Reward Models Through Structure-Aware Fine-Tuning

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Reinforcement learning, VLM, Zero-shot rewards, Reward alignment, Text-conditioned policies, Structure-aware fine-tuning (SAFT), RLHF, CLIP, Contrastive augmentation, Lipschitz regularization
TL;DR: Reward design in RL is hard. VLMs offer zero-shot rewards but are noisy and misaligned. SAFT fine-tunes them with structural priors, yielding smoother rewards, faster learning, better alignment, and less need for human annotation.
Abstract: Designing effective reward functions remains a major bottleneck in Reinforcement Learning (RL). Recent work uses large foundation Vision-Language Models (VLMs) as zero-shot reward models, computing text–observation similarity to bypass manual reward engineering. Although promising, these rewards are noisy, brittle, and misaligned with ground-truth objectives. We introduce Structure-Aware Fine-Tuning (SAFT), a lightweight, LoRA-based method that adapts frozen VLM reward models online using simple structural priors. SAFT enforces invariances and proportionality in the reward signal via augmentations and auxiliary losses, yielding smoother and more consistent reward landscapes. Experiments across classic control and robotic manipulation tasks show faster policy convergence, substantially improved alignment with ground-truth rewards, and elimination of the extensive human annotation effort that Preference-based Reinforcement Learning (PbRL) would otherwise require. These results establish structure-aware fine-tuning as a simple path toward stable, text-conditioned reinforcement learning.
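To make the abstract's description concrete, below is a minimal sketch (not the authors' released code) of how a CLIP-style zero-shot reward and structure-aware auxiliary losses could be wired together. The encoder, function names (DummyImageEncoder, saft_losses), augmentation, and the specific invariance/Lipschitz penalties are assumptions for illustration; real SAFT adapts a frozen VLM with LoRA rather than training a toy encoder.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DummyImageEncoder(nn.Module):
    """Stand-in for a (LoRA-adapted) VLM image encoder; a real setup would wrap CLIP."""
    def __init__(self, obs_dim=64, embed_dim=32):
        super().__init__()
        self.net = nn.Linear(obs_dim, embed_dim)

    def forward(self, obs):
        return F.normalize(self.net(obs), dim=-1)

def zero_shot_reward(img_emb, text_emb):
    """Reward = cosine similarity between observation and task-description embeddings."""
    return (img_emb * F.normalize(text_emb, dim=-1)).sum(-1)

def saft_losses(encoder, obs, aug_obs, text_emb, lipschitz_coef=1.0):
    """Hypothetical structural priors: (i) invariance of the reward to
    task-preserving augmentations, (ii) a Lipschitz-style hinge that keeps
    the reward smooth under small observation perturbations."""
    r = zero_shot_reward(encoder(obs), text_emb)
    r_aug = zero_shot_reward(encoder(aug_obs), text_emb)
    # Reward should not change when the observation is augmented.
    invariance = F.mse_loss(r_aug, r.detach())
    # Penalize reward changes that are large relative to the input perturbation.
    delta = (obs - aug_obs).flatten(1).norm(dim=-1).clamp_min(1e-6)
    lipschitz = ((r - r_aug).abs() / delta - lipschitz_coef).clamp_min(0.0).mean()
    return invariance + lipschitz

# Toy usage: one online fine-tuning step on a batch of observations.
encoder = DummyImageEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)
obs = torch.randn(8, 64)
aug_obs = obs + 0.01 * torch.randn_like(obs)   # stand-in for a task-preserving augmentation
text_emb = torch.randn(32)                     # stand-in for the frozen text embedding of the goal
loss = saft_losses(encoder, obs, aug_obs, text_emb)
opt.zero_grad(); loss.backward(); opt.step()

In an actual RL loop, zero_shot_reward would replace the environment reward for the policy, while saft_losses would update only the LoRA parameters of the VLM image encoder, keeping the base model and text embedding frozen.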
Primary Area: reinforcement learning
Submission Number: 9716