Variable resolution improves visual question answering under a limited pixel budget

Andrey Gizdov, Shimon Ullman, Daniel Harari

Published: 24 Sept 2024, Last Modified: 14 Nov 2025 | ECCV 2024 (HCV) | Everyone | CC BY 4.0

Abstract: AI-based systems for visual scene understanding benefit from a large field of view (FOV). Multiple camera systems extend the FOV, but larger and higher-quality images strain acquisition, communication, and computing resources. Sub-sampling the FOV effectively addresses this challenge without compromising performance on complex tasks that require fine visual cues and contextual information. We demonstrate that a variable sampling scheme, inspired by human vision, outperforms uniform sampling in several visual question answering (VQA) tasks with a limited sample budget (3% of full resolution). Specifically, we show accuracy gains of 3.7%, 2.0%, and 0.9% on the GQA, VQAv2, and SEEDBench datasets, respectively. This improvement, achieved without image scanning, holds regardless of the fixation point location, as confirmed by control experiments. The results show the potential of the biologically inspired image representation to improve the design of visual acquisition and processing models in future AI systems.
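To make the comparison between uniform and variable sampling concrete, the following is a minimal sketch of how the same sample budget (3% of the pixels) can be spent either evenly or concentrated around a fixation point. It is not the authors' sampling scheme: the separable per-axis density, the falloff parameter `e0`, and all function names are assumptions introduced only for illustration.

```python
import numpy as np

def grid_shape(height, width, budget):
    """Rows and columns of a sample grid holding about budget * height * width samples."""
    scale = np.sqrt(budget)
    return max(2, int(round(height * scale))), max(2, int(round(width * scale)))

def uniform_axis(n, size):
    """n evenly spaced pixel indices spanning [0, size - 1]."""
    return np.round(np.linspace(0, size - 1, n)).astype(int)

def variable_axis(n, size, fix, e0=0.05):
    """n pixel indices spanning [0, size - 1], densest around pixel `fix`.

    Assumed density: 1 / (1 + distance / (e0 * size)), so spacing grows
    roughly linearly with distance from the fixation point.
    """
    x = np.arange(size)
    density = 1.0 / (1.0 + np.abs(x - fix) / (e0 * size))
    cdf = np.cumsum(density)
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])  # normalise to [0, 1]
    # Invert the cumulative density at n equally spaced quantiles.
    targets = np.linspace(0.0, 1.0, n)
    return np.round(np.interp(targets, cdf, x)).astype(int)

def sample_image(image, rows, cols):
    """Pick the pixels at every (row, col) combination; works for HxW or HxWxC arrays."""
    return image[np.ix_(rows, cols)]

# Example: a 1000 x 1000 image, a 3% budget, fixation at row 300, column 700.
height, width, budget = 1000, 1000, 0.03
n_rows, n_cols = grid_shape(height, width, budget)
image = np.random.default_rng(0).random((height, width))

uniform = sample_image(image, uniform_axis(n_rows, height), uniform_axis(n_cols, width))
variable = sample_image(image, variable_axis(n_rows, height, 300), variable_axis(n_cols, width, 700))
```

Both outputs contain the same number of samples, so any downstream difference comes from where the samples are placed rather than how many there are. Changing the fixation arguments of `variable_axis` moves the high-resolution region without changing the budget.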