FoSAM: Focus-Oriented Adaptive Token Sampling for Efficient Segment Anything in Augmented Reality

ICLR 2026 Conference Submission13964 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Segmentation, AR/VR
Abstract: Augmented Reality (AR) encompasses transformative technologies that are redefining how humans interact with their environment. A key component of AR is image segmentation, which partitions the user's front-facing view into distinct regions for analysis. This process is essential for accurately overlaying digital content onto the physical world by detecting and isolating relevant objects. Despite its importance, however, image segmentation imposes significant computational demands and latency on AR devices, which can severely degrade the user experience. In this paper, we propose the Focus-Oriented Segment Anything Model (FoSAM), a framework built upon the Segment Anything Model (SAM) that uses real-time gaze data to focus segmentation on regions of interest, substantially lowering computational cost. Experimental results show that FoSAM reduces computational cost by over $50\times$, enabling a seamless visual experience for users, as confirmed by our real-world user study. The code is provided at https://anonymous.4open.science/r/FoSAM-D627.
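The abstract describes gaze-focused token sampling at a high level. The following is a minimal illustrative sketch, not the authors' actual method: it assumes a ViT-style patch grid (as in SAM's image encoder) and simply keeps the fraction of patch tokens nearest to a normalized gaze point; the function name `gaze_token_mask` and the `keep_ratio` parameter are hypothetical.

```python
import torch

def gaze_token_mask(grid_h, grid_w, gaze_xy, keep_ratio=0.02):
    """Select the patch tokens closest to a gaze point in normalized [0,1] coords.

    Illustrative sketch only: FoSAM's actual sampling policy is defined in the
    paper; here we keep the `keep_ratio` fraction of tokens nearest to gaze.
    """
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, grid_h),
        torch.linspace(0, 1, grid_w),
        indexing="ij",
    )
    centers = torch.stack([xs.flatten(), ys.flatten()], dim=-1)   # (N, 2) patch centers
    dist = (centers - torch.tensor(gaze_xy)).norm(dim=-1)         # (N,) distance to gaze
    k = max(1, int(keep_ratio * grid_h * grid_w))
    return dist.topk(k, largest=False).indices                    # indices of kept tokens

# Example: SAM's ViT-H encoder produces a 64x64 patch grid for 1024x1024 inputs.
tokens = torch.randn(1, 64 * 64, 1280)            # (batch, N, dim) dummy patch embeddings
idx = gaze_token_mask(64, 64, gaze_xy=(0.6, 0.4))
focused = tokens[:, idx, :]                       # only gaze-local tokens are processed further
print(focused.shape)                              # e.g. torch.Size([1, 81, 1280])
```

Processing only the gaze-local subset of tokens is one way such a scheme could cut encoder FLOPs roughly in proportion to the fraction of tokens retained, consistent with the order-of-magnitude savings claimed in the abstract.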
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13964