Keywords: Diffusion Model, Text-to-Video Generation, Generative Models, Reward Feedback Learning
Abstract: Reward feedback learning (ReFL) has proven effective for both text-to-image (T2I) and text-to-video (T2V) generation when used with image reward models (RMs). However, image RMs are misaligned with the temporal objectives of T2V generation, motivating ReFL with video reward models. Directly deploying video RMs is nevertheless impractical due to their large parameter counts and prohibitive memory cost. To address this, we propose VELR, an efficient framework that employs ensemble latent reward models (LRMs) to predict rewards directly in latent space, bypassing expensive backpropagation through VAE decoders and video RMs. Specifically, we introduce an ensemble technique for the LRM, which enhances capacity, quantifies uncertainty, and mitigates reward hacking. VELR reduces memory usage by up to 150 GB, requiring as little as 12.4% of the memory needed by standard ReFL. Experiments on OpenSora, CogVideoX-1.5, and Wan-2.1 with large-scale video RMs demonstrate that VELR achieves performance comparable to standard ReFL and enables efficient, robust video RM-based ReFL at previously unattainable scales.
Primary Area: generative models
Submission Number: 6523
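As a rough, hypothetical illustration of the latent-reward idea described in the abstract, the sketch below shows how an ensemble of small reward heads could score video latents directly, so the ReFL gradient never passes through the VAE decoder or a full video RM, and the ensemble spread can serve as an uncertainty penalty against reward hacking. All class names, shapes, and the pooling/penalty choices here are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of ensemble latent reward models (LRMs) for ReFL.
# Names, shapes, and pooling are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class LatentRewardHead(nn.Module):
    """Small MLP mapping pooled video latents to a scalar reward."""

    def __init__(self, latent_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (B, T, C, H, W) video latents from the diffusion model.
        pooled = latents.mean(dim=(1, 3, 4))  # pool over time and space -> (B, C)
        return self.mlp(pooled).squeeze(-1)   # (B,) scalar reward per sample


class EnsembleLRM(nn.Module):
    """Ensemble of latent reward heads: mean = reward, std = uncertainty."""

    def __init__(self, latent_dim: int, num_members: int = 4):
        super().__init__()
        self.members = nn.ModuleList(
            LatentRewardHead(latent_dim) for _ in range(num_members)
        )

    def forward(self, latents: torch.Tensor):
        rewards = torch.stack([m(latents) for m in self.members], dim=0)  # (M, B)
        return rewards.mean(dim=0), rewards.std(dim=0)


# Toy usage: an uncertainty-penalized reward objective on denoised latents,
# so gradients flow only through the lightweight ensemble.
if __name__ == "__main__":
    ensemble = EnsembleLRM(latent_dim=16)
    fake_latents = torch.randn(2, 8, 16, 32, 32, requires_grad=True)
    reward, uncertainty = ensemble(fake_latents)
    loss = -(reward - 0.1 * uncertainty).mean()  # maximize reward, penalize uncertainty
    loss.backward()  # no backward pass through a VAE decoder or video RM
```

The key design point the sketch tries to capture is that the reward heads consume latents rather than decoded frames, which is what removes the VAE decoder and the large video RM from the backpropagation path; the ensemble standard deviation is one plausible way to quantify uncertainty and discourage reward hacking.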