Looking Locally: Object-Centric Vision Transformers as Foundation Models for Efficient Segmentation

Manuel Traub; Martin V. Butz

15 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: object-centric segmentation, foveated patching, scale-invariant sampling, vision transformer, sparse pixel attention, small-object accuracy, parameter-efficient models, real-time inference

TL;DR: FLIP is an off-grid ViT that uses multi-resolution patching to segment individual objects efficiently, outperforming Meta's SAM with ~1000× fewer parameters and ~25× faster inference.

Abstract: Current state-of-the-art segmentation models encode entire images before focusing on specific objects. As a result, they waste computational resources, particularly when small objects are to be segmented in high-resolution scenes. We introduce FLIP (Fovea-Like Input Patching), a parameter-efficient vision model that realizes object segmentation through biologically inspired top-down attention. FLIP selectively samples multi-resolution patches from the input, centered on the objects of interest. It thereby allocates high-resolution processing to object centers while maintaining coarser peripheral context. This off-grid, scale-invariant design enables FLIP to outperform Meta's Segment Anything models (SAM, SAM2, and fast variants) by large margins: with more than 440$\times$ fewer parameters, FLIP-Tiny (0.51M parameters) reaches a mean IoU of 79.90\%, whereas SAM2-L (224.45M parameters) reaches 75.87\%. FLIP-Large (96.6M parameters) achieves a mean IoU of 83.26\% while still running about $2 \times$ faster than SAM2-L. We evaluate on six benchmarks in total. On five established benchmarks (Hypersim, KITTI-360, OpenImages, COCO, LVIS), FLIP consistently outperforms SAM and its variants. On our novel ObjaScale dataset, which stress-tests scale invariance with objects ranging from 0.0001\% up to 25\% of the image area, we show that FLIP segments even very small objects accurately, where existing models fail severely. FLIP opens new possibilities for real-time, object-centric vision applications and offers much higher energy efficiency. We believe that FLIP can act as a powerful foundation model, as it is well suited to tracking objects over time, for example when integrated into slot-based scene segmentation architectures.
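
The abstract's idea of sampling multi-resolution patches around an object center can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sample_foveated_patches`, the concentric-window layout, the average pooling, and the default values for `patch_size`, `scales`, and `grid` are all assumptions chosen for illustration. The sketch only shows how a small, fixed number of patches can cover an object at fine resolution and its surroundings at coarser resolution.

```python
import numpy as np

def sample_foveated_patches(image, center, patch_size=8, scales=(1, 2, 4), grid=2):
    """Hypothetical sketch of fovea-like patching (not the FLIP sampler).

    For each scale s, a square window of grid * patch_size * s input pixels
    is cut out around `center`, average-pooled by s, and split into
    grid x grid patches of patch_size x patch_size pixels. Small s gives
    fine detail at the center; large s gives coarse peripheral context.
    """
    cy, cx = center
    channels = image.shape[2]
    side = grid * patch_size  # pooled window side length in pixels
    patches, scale_ids = [], []
    for s in scales:
        win = side * s        # window side length in input pixels
        half = win // 2
        # Zero-pad so that windows near the image border stay in bounds.
        padded = np.pad(image, ((half, win - half), (half, win - half), (0, 0)))
        # After padding, the window centered at (cy, cx) starts at (cy, cx).
        window = padded[cy:cy + win, cx:cx + win]
        # Average-pool by a factor of s along both spatial axes.
        pooled = window.reshape(side, s, side, s, channels).mean(axis=(1, 3))
        # Split the pooled window into grid x grid non-overlapping patches.
        tiles = pooled.reshape(grid, patch_size, grid, patch_size, channels)
        tiles = tiles.transpose(0, 2, 1, 3, 4)
        patches.append(tiles.reshape(grid * grid, patch_size, patch_size, channels))
        scale_ids.extend([s] * (grid * grid))
    return np.concatenate(patches), np.array(scale_ids)

# Usage: 12 patches describe the object at (32, 32) in a 64 x 64 image.
image = np.ones((64, 64, 3), dtype=np.float32)
patches, scale_ids = sample_foveated_patches(image, center=(32, 32))
print(patches.shape)  # (12, 8, 8, 3)
```

With these assumed defaults, the object is described by 12 patches, whereas a regular 8 × 8 pixel grid over the same 64 × 64 image would produce 64 patches. The patch count also stays at 12 regardless of image size, which is the intuition behind avoiding a full-image encoding when only one object is of interest.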

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 5737
