Keywords: Foveation, CNNs, transformers, efficiency, biological-inspiration, neuroscience, vision
Abstract: Human vision is foveated, with variable resolution peaking at the center of a large field of view; this reflects an efficient trade-off for active sensing, allowing eye movements to bring different parts of the world into focus while keeping the rest of the scene in context. In contrast, most computer vision systems encode the visual world at a uniform resolution over space, raising computational challenges for processing full-field, high-resolution image formats efficiently. We propose a biologically inspired foveated sampling interface that reformats a variable-resolution array of sensors into a uniformly dense, curved sensor manifold. Receptive fields are defined as k-nearest-neighborhoods (kNNs) on the sensor manifold, and we develop a novel kernel mapping technique to enable kNN-convolution. We demonstrate two use cases: (1) a novel kNN-convolutional architecture that natively learns features over foveated input, and (2) an integration of our foveated interface into the vision foundation model DINOv3 via low-rank adaptation (LoRA). These models maintain or improve accuracy compared to non-foveated counterparts, and they open pathways for scalable active sensing and efficient modeling of increasingly high-resolution visual data.
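For illustration only, the sketch below is a minimal, hypothetical rendering of the kind of kNN receptive-field construction the abstract describes, not the paper's implementation: it lays out an assumed log-polar foveated sensor lattice, forms k-nearest-neighborhoods with a KD-tree, and applies a simple kNN-style convolution. All names and parameters (foveated_positions, n_rings, n_wedges, k, the per-neighbor weights) are assumptions, and the paper's kernel mapping technique for assigning neighbors to kernel taps is replaced here by naive distance-rank weighting.

    # Illustrative sketch only (not the authors' method): foveated sensor positions,
    # kNN receptive fields via a KD-tree, and a simple kNN-"convolution" that applies
    # one shared weight slice per neighbor rank. All names/parameters are hypothetical.
    import numpy as np
    from scipy.spatial import cKDTree

    def foveated_positions(n_rings=16, n_wedges=32, r_min=0.02, r_max=1.0):
        """Sensor centers whose radial spacing grows geometrically with eccentricity."""
        radii = np.geomspace(r_min, r_max, n_rings)
        angles = np.linspace(0.0, 2.0 * np.pi, n_wedges, endpoint=False)
        r, a = np.meshgrid(radii, angles, indexing="ij")
        return np.stack([r * np.cos(a), r * np.sin(a)], axis=-1).reshape(-1, 2)

    def knn_receptive_fields(positions, k=9):
        """Indices of the k nearest sensors (including self) for every sensor."""
        tree = cKDTree(positions)
        _, idx = tree.query(positions, k=k)
        return idx                                    # shape (n_sensors, k)

    def knn_convolution(features, neighbor_idx, weights, bias=None):
        """features: (n_sensors, c_in); weights: (k, c_in, c_out)."""
        gathered = features[neighbor_idx]             # (n_sensors, k, c_in)
        out = np.einsum("nkc,kcd->nd", gathered, weights)
        return out if bias is None else out + bias

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        pos = foveated_positions()                    # (512, 2) sensor coordinates
        idx = knn_receptive_fields(pos, k=9)
        feats = rng.standard_normal((pos.shape[0], 3))    # e.g. RGB samples
        w = rng.standard_normal((9, 3, 16)) * 0.1
        print(knn_convolution(feats, idx, w).shape)   # (512, 16)

In this simplified reading, the k weight slices play the role of kernel taps, and the KD-tree query is what lets each receptive field adapt to the locally varying sensor density of the foveated lattice.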
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 22227