All Seeing Eyes: A Native-Resolution Vision-Language Framework for High-Fidelity Remote Sensing Image Understanding

Jingrui Zhang, Yong Zhang, Yimeng Xu, Zixuan Shangguan, Lijie Zhang, Lihao Yang, Yang Zhou, Xiaoyi Fan, Feng Liang

Published: 2025, Last Modified: 30 May 2026. Venue: CloudCom 2025. License: CC BY-SA 4.0
Abstract: The success of Vision Transformers (ViTs) has profoundly reshaped research paradigms in computer vision. However, like conventional CNN-based models, ViTs still require input images to be resized to a fixed resolution. Mainstream open-source ViT implementations typically resize images into a square shape, which inevitably leads to information loss and increases computational and memory overhead. Moreover, prior studies have highlighted that the visual encoder in multimodal large language models (MLLMs), as the primary source of visual information, plays a crucial role in determining overall model understanding. In satellite remote sensing, a distinctive challenge lies in the ultra-high resolution of imagery. For instance, a 2K remote sensing image may reach a resolution of $2560 \times 1440$, far exceeding the maximum resolution supported by open-source ViTs (e.g., $336 \times 336$). This limitation results in even more severe information degradation in remote sensing scenarios. To address the fixed-resolution constraint, Google proposed NaViT, a vision encoder that supports native resolutions and aspect ratios. We argue that NaViT is particularly valuable for remote sensing image understanding. In this work, following the LLaVA paradigm, we replace the standard ViT with NaViT to investigate the benefits of processing images at their native resolution within MLLMs. Experimental results demonstrate that, on several mainstream remote sensing benchmarks, native-resolution image understanding consistently delivers notable improvements across multiple evaluation metrics, validating its effectiveness in high-resolution scenarios.
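To make the resolution gap in the abstract concrete, the following minimal sketch works through the arithmetic for the $2560 \times 1440$ example. It is illustrative only and is not the authors' code: the function names are hypothetical, and the patch size of 14 is an assumption (common for $336 \times 336$ ViT encoders) that the abstract does not state.

```python
import math


def resize_stats(width, height, target=336):
    """Return per-axis scale factors and the fraction of pixels kept
    when an image is squashed to a target x target square."""
    scale_x = target / width
    scale_y = target / height
    kept = (target * target) / (width * height)
    return scale_x, scale_y, kept


def native_patch_count(width, height, patch=14):
    """Count patch tokens when the image is tiled at its native size.
    Partial patches at the border are rounded up (assumes padding)."""
    return math.ceil(width / patch) * math.ceil(height / patch)


# Resolution figures taken from the abstract.
w, h = 2560, 1440
sx, sy, kept = resize_stats(w, h)

# patch=14 is an assumed value, not given in the abstract.
fixed_tokens = (336 // 14) ** 2
native_tokens = native_patch_count(w, h, patch=14)

print(f"scale x: {sx:.4f}, scale y: {sy:.4f}")
print(f"pixels kept after square resize: {kept:.2%}")
print(f"tokens at 336x336: {fixed_tokens}, tokens at native size: {native_tokens}")
```

Under these assumptions, the square resize keeps roughly 3% of the original pixels and scales the two axes by different factors, which distorts the aspect ratio. The native-resolution tiling yields far more patch tokens, which is the trade-off a native-resolution encoder has to manage.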