Variable resolution: improving scene visual question answering with a limited pixel budget

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Biologically-inspired learning, subsampling, variable resolution, scene understanding, interpretability
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Artificial intelligence (AI) scene understanding systems can benefit from a large visual field of view (FOV). Some existing systems already employ multiple cameras to extend their FOV; however, increasing image size and quality places an overwhelming burden on the acquisition and computing resources of such systems. An effective solution is to sub-sample the FOV without impairing the model's performance on complex visual tasks. In this paper, we show that a variable sampling scheme, inspired by human vision, outperforms a uniform sampling scheme by 2% accuracy (65% vs. 63%) on the challenging task of scene visual question answering (VQA), under a limited sample budget (3% of the full-resolution baseline). The improvement is achieved without any image scanning, with the variable resolution peaking at an arbitrarily chosen, fixed image location. Our study also compares basic visual sub-tasks, in particular image classification and object detection. Comparing the variable and uniform models reveals differences in the representations they learn, which yield consistently better performance for the variable resolution models. We show that the variable sampling scheme allows the models to benefit in low-resolution areas by propagating information from the finer-resolution areas, while the higher-resolution areas benefit from contextual information captured at lower resolution in the periphery. These results show the potential of biologically-inspired image representations to improve the design of visual acquisition and processing models in future AI-based systems.
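
To make the comparison in the abstract concrete, the sketch below shows one possible way to construct a uniform sampling mask and a human-vision-inspired variable (foveated) sampling mask under the same pixel budget (here 3% of a 224×224 image), with the variable density peaking at a fixed image location. The specific density falloff, the random selection procedure, and all function names are illustrative assumptions, not the construction used in the paper.

```python
# Illustrative sketch only: uniform vs. variable (foveation-like) pixel sampling
# under an identical pixel budget. The falloff shape and selection rule are
# assumptions for illustration, not the authors' method.
import numpy as np


def uniform_sample_mask(h, w, budget):
    """Select roughly `budget` pixels on a regular grid (uniform sampling)."""
    step = int(np.ceil(np.sqrt(h * w / budget)))
    mask = np.zeros((h, w), dtype=bool)
    mask[::step, ::step] = True
    return mask


def variable_sample_mask(h, w, budget, center, falloff=0.15, seed=0):
    """Select `budget` pixels with density decreasing with distance from
    `center`, mimicking a fixed foveation point (no image scanning)."""
    rng = np.random.default_rng(seed)
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = center
    # Normalized distance from the fixed "fixation" point.
    d = np.hypot((ys - cy) / h, (xs - cx) / w)
    # Sampling density: high at the center, decaying toward the periphery.
    density = 1.0 / (1.0 + (d / falloff) ** 2)
    p = (density / density.sum()).ravel()
    # Draw `budget` distinct pixel indices according to that density.
    idx = rng.choice(h * w, size=budget, replace=False, p=p)
    mask = np.zeros(h * w, dtype=bool)
    mask[idx] = True
    return mask.reshape(h, w)


if __name__ == "__main__":
    H, W = 224, 224
    budget = int(0.03 * H * W)  # 3% of the full-resolution pixel count
    uni = uniform_sample_mask(H, W, budget)
    var = variable_sample_mask(H, W, budget, center=(H // 2, W // 2))
    print("uniform samples:", uni.sum(), "variable samples:", var.sum())
```

Both masks spend the same small fraction of pixels; they differ only in where those pixels are placed, which is the design choice the abstract attributes the accuracy gap to.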
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7333