Abstract: Recent advances in large multimodal language models have significantly improved cross-modal understanding by aligning vision and language. However, document images combine high resolution, dense text, and complex layouts, so these models still struggle with Visual Document Understanding (VDU), primarily because of their limited ability to capture fine-grained information and relationships. To address these challenges, we propose SelectVision, a novel adaptive resolution selection framework. It consists of two key components: (1) a Resolution-Slicing Selector, which captures document information density and performs adaptive patch slicing through resolution selection, and (2) a multi-stage training strategy with mixed preference optimization. The training strategy uses preference pairs to guide the selector, assigning high-resolution representations to regions with dense information and small font sizes and lower-resolution representations to areas with redundant information. This approach captures fine-grained visual representations while maintaining modeling efficiency. With these components, SelectVision achieves significant improvements in both efficiency and performance. Our experiments demonstrate that SelectVision delivers promising results across a wide range of evaluation datasets, including ChartQA (84.3%), WTQ (59.2%), and TabFact (83.9%).
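The idea of assigning higher resolution to information-dense regions can be illustrated with a minimal sketch. This is not the paper's Resolution-Slicing Selector, which is learned and guided by preference pairs. It swaps in a hand-written edge-density heuristic as a stand-in for "information density". The function names, the tile size, the threshold, and the two scale factors are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale pixels in [0, 1], row-major


def edge_density(img: Image, top: int, left: int, h: int, w: int) -> float:
    """Mean absolute difference between neighbouring pixels inside one tile.

    Used here as a crude proxy for information density (an assumption,
    not the learned signal described in the abstract).
    """
    total, count = 0.0, 0
    for r in range(top, top + h):
        for c in range(left, left + w):
            if c + 1 < left + w:  # horizontal neighbour inside the tile
                total += abs(img[r][c] - img[r][c + 1])
                count += 1
            if r + 1 < top + h:  # vertical neighbour inside the tile
                total += abs(img[r][c] - img[r + 1][c])
                count += 1
    return total / count if count else 0.0


def select_resolutions(
    img: Image,
    tile: int = 4,           # hypothetical tile size in pixels
    threshold: float = 0.1,  # hypothetical density cut-off
    high: float = 2.0,       # hypothetical scale for dense tiles
    low: float = 0.5,        # hypothetical scale for sparse tiles
) -> List[Tuple[int, int, float]]:
    """Return one (top, left, scale) entry per tile, in row-major order."""
    rows, cols = len(img), len(img[0])
    plan = []
    for top in range(0, rows, tile):
        for left in range(0, cols, tile):
            h = min(tile, rows - top)
            w = min(tile, cols - left)
            density = edge_density(img, top, left, h, w)
            plan.append((top, left, high if density >= threshold else low))
    return plan


# Toy 4x8 "document": blank left half, checkerboard (dense) right half.
page = [[float((r + c) % 2) if c >= 4 else 0.0 for c in range(8)] for r in range(4)]
plan = select_resolutions(page)
```

In this toy input the blank tile receives the low scale and the checkerboard tile receives the high scale, mirroring the intended behaviour of spending resolution only where content is dense.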
External IDs: dblp:conf/icdar/HeYZFSLM25