HiResNets: Native Full-HD Video Recognition with Foveal Residual Streams

ICLR 2026 Conference Submission17850 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: computer vision, foveal networks
Abstract: Much of the recent progress in image and video recognition has come at the cost of memory: larger models, increased resolution, and longer temporal contexts. An inevitable component of this cost is the quadratic (or faster) growth of memory and compute with image resolution, which is a property of the grid sampling used in convolutional networks and vision transformers. In this work we study residual networks whose convolutional blocks have logarithmic-square growth instead, enabling them to process very high-resolution video quickly and with low memory. The key insight is to use a residual architecture's residual stream as a high-resolution buffer, which convolutional blocks read from and write to only via log-polar image warp operations. Layers adaptively focus on different parts of each frame, with very high resolution only near the focus point. A complete high-resolution representation is built up in the residual stream, analogous to the way eye saccades build a complete picture in biological vision. Experiments demonstrate that our proposed HiResNets learn to foveate around scenes similarly to human vision, and achieve superior performance in difficult egocentric video recognition tasks, especially those involving small objects and fine-grained recognition.
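The log-polar read described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`log_polar_grid`, `read_foveal`), the parameters (`n_rings`, `n_angles`, `r_min`), nearest-neighbour sampling, and border clamping are all assumptions made for illustration. The point it shows is that the number of samples a block reads is fixed by the grid size, not by the resolution of the buffer.

```python
import numpy as np

def log_polar_grid(center, n_rings, n_angles, r_min, r_max):
    """Return (ys, xs) sample coordinates, each of shape (n_rings, n_angles)."""
    cy, cx = center
    # Radii are spaced geometrically: dense near the focus, sparse far away.
    radii = np.geomspace(r_min, r_max, n_rings)
    angles = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    ys = cy + radii[:, None] * np.sin(angles)[None, :]
    xs = cx + radii[:, None] * np.cos(angles)[None, :]
    return ys, xs

def read_foveal(buffer, center, n_rings=32, n_angles=64, r_min=1.0):
    """Nearest-neighbour read of an (H, W, C) buffer into a log-polar patch.

    Hypothetical helper: a trainable model would more likely use a
    differentiable (e.g. bilinear) warp; that detail is assumed here.
    """
    h, w = buffer.shape[:2]
    r_max = float(np.hypot(h, w))  # large enough to reach every corner
    ys, xs = log_polar_grid(center, n_rings, n_angles, r_min, r_max)
    # Clamp out-of-frame samples to the nearest border pixel.
    yi = np.clip(np.rint(ys).astype(int), 0, h - 1)
    xi = np.clip(np.rint(xs).astype(int), 0, w - 1)
    return buffer[yi, xi]

# A Full-HD frame is read into a fixed-size patch around the focus point.
frame = np.zeros((1080, 1920, 3), dtype=np.float32)
patch = read_foveal(frame, center=(540, 960))
```

Because the ring radii grow geometrically, the number of rings needed to cover a frame grows with the logarithm of its size, whereas a regular grid grows quadratically with side length.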
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17850