LLaVA-UHD v3: Progressive Visual Compression for Efficient Naive-Resolution Encoding in MLLMs

15 Sept 2025 (modified: 20 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multimodal Large Language Model
TL;DR: We introduce LLaVA-UHD v3, which achieves performance competitive with state-of-the-art MLLMs. With Progressive Visual Compression inside the ViT, ViT-UHD improves encoding efficiency by 2.4×, and LLaVA-UHD v3 reduces inference latency by 1.9×.
Abstract: Visual encoding followed by token condensing has become the standard architectural paradigm in multi-modal large language models (MLLMs). Recent MLLMs increasingly favor global naive-resolution visual encoding over slice-based methods. To investigate this trend, we systematically compare their behavior on vision-language understanding and attention patterns, revealing that global encoding enhances overall capability but at the cost of greater computational overhead. To address this issue, we present LLaVA-UHD v3, an MLLM centered on our proposed Progressive Visual Compression (PVC) method, which can be seamlessly integrated into a standard Vision Transformer (ViT) to enable efficient naive-resolution encoding. The PVC approach consists of two key modules: (i) refined patch embedding, which supports flexible patch-size scaling for fine-grained visual modeling, and (ii) windowed token compression, deployed hierarchically across ViT layers to progressively aggregate local token representations. Driven jointly by these two modules, an extensively pretrained ViT can be reconfigured into an efficient architecture while largely preserving its generality. Evaluated across extensive benchmarks, the transformed ViT, termed ViT-UHD, achieves performance competitive with MoonViT while reducing TTFT (time-to-first-token) by 2.4$\times$ when developed within an identical MLLM architecture. Building upon ViT-UHD, LLaVA-UHD v3 also achieves performance competitive with Qwen2-VL, while further reducing TTFT by 1.9$\times$. We will release all code and checkpoints to support future research on efficient MLLMs.
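As a rough illustration of the windowed token compression idea described in the abstract, the sketch below pools non-overlapping spatial windows of visual tokens into single tokens so that the token grid shrinks as depth increases. This is not the authors' implementation: the window size, the pooling operator (concatenate + linear projection), and the layers at which compression is applied are all assumptions made for illustration.

```python
# Hypothetical sketch of windowed token compression inside a ViT; the module
# name, window size, and fusion operator are assumptions, not the paper's PVC.
import torch
import torch.nn as nn


class WindowedTokenCompression(nn.Module):
    """Merges each non-overlapping k x k window of visual tokens into one token."""

    def __init__(self, dim: int, window: int = 2):
        super().__init__()
        self.window = window
        # Fuse the concatenated window tokens back to the model dimension (assumed design).
        self.norm = nn.LayerNorm(dim * window * window)
        self.proj = nn.Linear(dim * window * window, dim)

    def forward(self, x: torch.Tensor, h: int, w: int):
        # x: (B, h*w, C) visual tokens laid out on an h x w grid
        B, N, C = x.shape
        k = self.window
        x = x.view(B, h, w, C)
        # Group tokens into k x k windows: (B, h/k, w/k, k*k*C)
        x = x.view(B, h // k, k, w // k, k, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B, (h // k) * (w // k), k * k * C)
        x = self.proj(self.norm(x))  # aggregate each window into a single token
        return x, h // k, w // k     # token grid shrinks by k along each axis


if __name__ == "__main__":
    # Toy usage: 1024 tokens (32x32 grid) compressed to 256 tokens (16x16 grid).
    tokens = torch.randn(2, 32 * 32, 768)
    comp = WindowedTokenCompression(dim=768, window=2)
    out, h, w = comp(tokens, 32, 32)
    print(out.shape, h, w)  # torch.Size([2, 256, 768]) 16 16
```

Applying such a module at a few intermediate ViT layers would progressively reduce the visual token count before the tokens reach the LLM, which is consistent with the efficiency gains the abstract reports, though the exact schedule used in PVC is not specified here.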
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5312