Keywords: RL Training;Load Balancing;Sequence Parallelism;Distributed Training
TL;DR: FlexRL removes data and computation bottlenecks in RL for Vision-Language Models by finely sharding sequences and decentralizing data loading, achieving up to 8.47× faster training on large clusters.
Abstract: Reinforcement learning (RL) is increasingly used to align vision--language models (VLMs), yet scaling RL for VLMs is bottlenecked by multimodal data handling and extreme workload skew. In typical RL pipelines, visual data loading and preprocessing are centralized, creating severe I/O and CPU/memory stragglers, while batches that mix short image-text prompts with long video contexts lead to large cross-GPU imbalance during rollouts, inference, and training. We present FlexRL, an end-to-end system that removes these bottlenecks. FlexRL introduces: (1) ShadowLoader, a distributed, metadata-driven pipeline that keeps only lightweight visual metadata on the controller, pushes decoding and preprocessing to worker-side preprocessors, and asynchronously materializes tensors to overlap I/O with GPU computation; (2) FlexUlysses, a cost-aware sub-sequence sharding and execution engine that adaptively splits sequences to balance compute and memory. Our evaluation shows that across multiple VLM scales and multimodal datasets on 128-GPU clusters, FlexRL improves end-to-end throughput by up to 8.47$\times$ over state-of-the-art RL systems.
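The cost-aware sub-sequence sharding that the abstract attributes to FlexUlysses can be illustrated with a minimal sketch: split each sequence into bounded-length shards, then greedily pack shards onto GPUs so a quadratic attention-cost proxy stays balanced. The function name, signature, and cost model below are illustrative assumptions, not FlexRL's actual API.

```python
# Hypothetical sketch of cost-aware sub-sequence sharding in the spirit of
# FlexUlysses. Names and the len^2 cost model are assumptions for illustration.
import heapq

def shard_sequences(seq_lens, num_gpus, max_shard_len):
    """Split long sequences into shards, then balance shards across GPUs."""
    # 1) Split each sequence into sub-sequences no longer than max_shard_len.
    shards = []
    for seq_id, length in enumerate(seq_lens):
        while length > 0:
            piece = min(length, max_shard_len)
            shards.append((seq_id, piece))
            length -= piece
    # 2) Greedy longest-processing-time packing: repeatedly place the most
    #    expensive remaining shard on the currently least-loaded GPU, using
    #    a quadratic cost proxy for attention compute.
    shards.sort(key=lambda s: s[1] ** 2, reverse=True)
    heap = [(0, gpu, []) for gpu in range(num_gpus)]  # (total_cost, gpu_id, shards)
    heapq.heapify(heap)
    for shard in shards:
        cost, gpu, assigned = heapq.heappop(heap)
        assigned.append(shard)
        heapq.heappush(heap, (cost + shard[1] ** 2, gpu, assigned))
    return {gpu: assigned for _, gpu, assigned in heap}
```

For a batch mixing a long video context with short image-text prompts, e.g. `shard_sequences([4096, 128, 256, 2048], num_gpus=2, max_shard_len=1024)`, the long sequence is cut into 1024-token shards and the per-GPU quadratic cost ends up nearly equal, which is the imbalance the paper targets.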
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 2095