From Clutter to Clarity: Visual Recognition through Foveated Object-Centric Learning (FocL)

Amitangshu Mukherjee; Deepak Ravikumar; Kaushik Roy

Published: 29 Sept 2025, Last Modified: 12 Oct 2025. Venue: NeurIPS 2025 - Reliable ML Workshop. License: CC BY 4.0

Keywords: foveated vision, active vision, object centric learning, memorization, generalization

TL;DR: Inspired by human vision, we train image classifiers on object-centric crops to prevent them from learning spurious background features, which reduces memorization of hard examples and improves generalization.

Abstract: Humans perceive the world through active vision, using rapid eye movements to focus on task-relevant regions while ignoring irrelevant background clutter. Inspired by this, we introduce FocL (Foveated Object Centric Learning), a training strategy that biases image classification models toward label-consistent object regions by replacing full images with foveated crops. Standard training encourages models to rely on spurious context, which degrades generalization and increases memorization, especially for hard examples in the tail of the sample difficulty distribution. FocL simulates saccades by (1) jittering fixation points around the annotated object and (2) extracting cropped regions centered on these points as foveated glimpses. This input restructuring reduces non-foreground contamination, lowers mean training loss, accelerates convergence, and shifts hard samples closer to the center of the difficulty curve. In our analysis, FocL improves generalization by up to 15% on oracle crops and improves out-of-distribution generalization from ImageNetV1 to V2 by over 7 percentage points when paired with modern segmentation models like SAM. This reduced reliance on spurious correlations increases the mean PGD L2 adversarial distance required to flip a training-set prediction by 61% and directly resolves learning difficulty for the top 1% of memorized samples in ImageNet, reducing their cumulative sample loss by 62.5%. By training on foveated crops, FocL requires 56% less data to exceed the performance of standard models. FocL thus offers a simple path to more robust and reliable visual recognition.
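
The two saccade-simulation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `foveated_crop`, the `jitter_frac` parameter, the uniform jitter distribution, and the choice of a crop window equal to the bounding-box size are all assumptions made here for illustration, since the abstract does not specify them.

```python
import numpy as np


def foveated_crop(image, bbox, jitter_frac=0.1, rng=None):
    """Crop `image` around a jittered fixation point (illustrative sketch).

    image:       H x W x C array.
    bbox:        (x_min, y_min, x_max, y_max) object box in integer pixels.
    jitter_frac: hypothetical maximum fixation offset, as a fraction of the
                 box width/height (an assumption, not from the paper).
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    x_min, y_min, x_max, y_max = bbox
    box_w, box_h = x_max - x_min, y_max - y_min

    # Step (1): jitter the fixation point around the centre of the box.
    cx = (x_min + x_max) / 2 + float(rng.uniform(-jitter_frac, jitter_frac)) * box_w
    cy = (y_min + y_max) / 2 + float(rng.uniform(-jitter_frac, jitter_frac)) * box_h

    # Step (2): take a box-sized window centred on the fixation point,
    # shifted if necessary so that it stays inside the image.
    left = int(np.clip(round(cx - box_w / 2), 0, max(width - box_w, 0)))
    top = int(np.clip(round(cy - box_h / 2), 0, max(height - box_h, 0)))
    return image[top:top + box_h, left:left + box_w]
```

In a training loop, a function like this would be called once per sample and per epoch, so that each pass sees a slightly different glimpse of the same object; the resulting crop would then be resized to the network's input resolution.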

Submission Number: 126
