Submission Type: Archive
Keywords: Vision Language Model, Image Processing
Abstract: Vision Language Models (VLMs) typically rely on processed RGB images, incurring information loss that limits performance in challenging scenes such as low-light and high-dynamic-range conditions. Traditional Image Signal Processing (ISP) pipelines, optimized for human perception, discard raw sensor data that is valuable for machine understanding. To overcome this, we introduce Raw-VLM, an end-to-end model that enables VLMs to natively interpret raw image sensor data. Raw-VLM integrates a learnable ISP (GM-ISPNet) and a Raw-Tokenizer module within its vision encoder (Raw-ViT). This differentiable frontend is jointly trained with the VLM, adaptively converting raw Bayer patterns into machine-centric representations that preserve vital semantic features while suppressing noise. Our approach addresses the information bottleneck, modality mismatch, and task-agnostic processing that limit conventional RGB-based VLMs. Raw-VLM significantly improves performance on raw image captioning (9% gain) and visual question answering (5.4% gain), and reduces hallucinations (3.02% gain on POPE). By operating directly on raw data, Raw-VLM strengthens VLM capabilities in difficult scenarios, bridging the gap between sensor data and high-level semantic understanding.
Submission Number: 8
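
The abstract describes a differentiable raw-to-token frontend (learnable ISP plus raw tokenizer) trained jointly with the VLM. Below is a minimal PyTorch sketch of that general idea, assuming an RGGB Bayer layout; the simple CNN and patch projection here are illustrative stand-ins for GM-ISPNet and the Raw-Tokenizer, and all layer shapes, channel counts, and the patch size are assumptions rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn

def pack_bayer(raw):
    """Pack a single-channel RGGB Bayer mosaic (B, 1, H, W) into a
    4-channel half-resolution tensor (B, 4, H/2, W/2)."""
    r  = raw[:, :, 0::2, 0::2]
    g1 = raw[:, :, 0::2, 1::2]
    g2 = raw[:, :, 1::2, 0::2]
    b  = raw[:, :, 1::2, 1::2]
    return torch.cat([r, g1, g2, b], dim=1)

class LearnableISP(nn.Module):
    """Stand-in for GM-ISPNet: a small CNN mapping packed Bayer channels
    to a machine-centric feature map (architecture assumed, not the paper's)."""
    def __init__(self, out_channels=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.GELU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.GELU(),
            nn.Conv2d(32, out_channels, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

class RawTokenizer(nn.Module):
    """Stand-in for the Raw-Tokenizer: patchifies the ISP output into a
    token sequence for a ViT-style encoder (patch size assumed)."""
    def __init__(self, in_channels=16, embed_dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch, stride=patch)

    def forward(self, x):
        # (B, C, H, W) -> (B, D, H/p, W/p) -> (B, N, D) token sequence
        return self.proj(x).flatten(2).transpose(1, 2)

# End-to-end: because every stage is differentiable, gradients from the
# VLM's loss can flow back through the tokenizer into the learnable ISP.
raw = torch.rand(2, 1, 448, 448)            # synthetic RGGB mosaic
isp, tok = LearnableISP(), RawTokenizer()
tokens = tok(isp(pack_bayer(raw)))          # (2, 196, 768) -> into the vision encoder
```

The key design point the sketch illustrates is joint training: instead of a fixed, human-perception-oriented ISP, the raw-to-token path is itself a set of learnable layers optimized by the downstream language-modeling objective.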