Submission Type: Archive
Keywords: Vision Language Model, Image Processing
Abstract: Vision Language Models (VLMs) typically rely on processed RGB images, incurring information loss that limits performance in challenging scenes such as low-light and high-dynamic-range conditions. Traditional Image Signal Processing (ISP) pipelines, optimized for human perception, discard raw sensor data that is valuable for machine understanding. To overcome this, we introduce Raw-VLM, an end-to-end model that enables VLMs to natively interpret raw image sensor data. Raw-VLM integrates a learnable ISP (GM-ISPNet) and a Raw-Tokenizer module within its vision encoder (Raw-ViT). This differentiable frontend is jointly trained with the VLM, adaptively converting raw Bayer patterns into machine-centric representations that preserve vital semantic features while suppressing noise. Our approach addresses the information bottleneck, modality mismatch, and task-agnostic processing that limit conventional RGB-based VLMs. Raw-VLM significantly improves performance on raw image captioning (9% gain) and visual question answering (5.4% gain), and reduces hallucinations (3.02% gain on POPE). By operating directly on raw data, Raw-VLM strengthens VLM capabilities in difficult scenarios, bridging the gap between sensor data and high-level semantic understanding.
Submission Number: 8
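
The abstract describes a differentiable raw-to-token frontend (learnable ISP plus raw tokenizer) trained jointly with the VLM. Below is a minimal PyTorch sketch of that general idea, assuming an RGGB Bayer layout; the simple CNN and patch projection here are illustrative stand-ins for GM-ISPNet and the Raw-Tokenizer, and all layer shapes, channel counts, and the patch size are assumptions rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn

def pack_bayer(raw):
    """Pack a single-channel RGGB Bayer mosaic (B, 1, H, W) into a
    4-channel half-resolution tensor (B, 4, H/2, W/2)."""
    r  = raw[:, :, 0::2, 0::2]
    g1 = raw[:, :, 0::2, 1::2]
    g2 = raw[:, :, 1::2, 0::2]
    b  = raw[:, :, 1::2, 1::2]
    return torch.cat([r, g1, g2, b], dim=1)

class LearnableISP(nn.Module):
    """Stand-in for GM-ISPNet: a small CNN mapping packed Bayer channels
    to a machine-centric feature map (architecture assumed, not the paper's)."""
    def __init__(self, out_channels=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.GELU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.GELU(),
            nn.Conv2d(32, out_channels, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

class RawTokenizer(nn.Module):
    """Stand-in for the Raw-Tokenizer: patchifies the ISP output into a
    token sequence for a ViT-style encoder (patch size assumed)."""
    def __init__(self, in_channels=16, embed_dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch, stride=patch)

    def forward(self, x):
        # (B, C, H, W) -> (B, D, H/p, W/p) -> (B, N, D) token sequence
        return self.proj(x).flatten(2).transpose(1, 2)

# End-to-end: because every stage is differentiable, gradients from the
# VLM's loss can flow back through the tokenizer into the learnable ISP.
raw = torch.rand(2, 1, 448, 448)            # synthetic RGGB mosaic
isp, tok = LearnableISP(), RawTokenizer()
tokens = tok(isp(pack_bayer(raw)))          # (2, 196, 768) -> into the vision encoder
```

The key design point the sketch illustrates is joint training: instead of a fixed, human-perception-oriented ISP, the raw-to-token path is itself a set of learnable layers optimized by the downstream language-modeling objective.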