Keywords: Efficient Training, Unified Vision Language Model, Attention Analysis
Paper Track: Long Paper (archival)
Abstract: Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and has received little attention from an efficiency perspective. In this work, we study the feasibility and limits of token-reduction-based acceleration for unified VLM training. Through a systematic analysis of layerwise attention allocation, we uncover a fundamental asymmetry: visual understanding exhibits substantial late-layer visual redundancy, whereas visual generation maintains a persistent dependence on image tokens across depth. Guided by this observation, we design task-specific accelerators that selectively reduce image-token computation for each objective. While these methods achieve significant efficiency gains in isolated settings, we observe a consistent synergy loss under unified training: task-specific token dropping necessitates divergent parameter pathways and eliminates the mutual performance gains typically observed in joint optimization. Our findings suggest that efficient unified modeling requires preserving shared cross-task structures, highlighting the need for synergy-aware acceleration strategies.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 16