StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models

Published: 01 Jun 2026, Last Modified: 01 Jun 2026 | Venue: AdaptFM Poster | License: CC BY 4.0
Keywords: token merging; Segment Anything Model; inference acceleration
TL;DR: A token-merging method for speeding up SAM.
Abstract: SAM achieves strong segmentation but at high inference cost dominated by its ViT image encoder. Token merging accelerates ViTs without retraining, yet directly applying it to SAM is nontrivial: SAM mixes windowed and global attention and requires dense, prompt-conditioned features for precise boundary prediction. We systematically evaluate representative token-merging methods on the SAM family in a strict off-the-shelf setting and find that existing destination-selection heuristics erode boundaries and leak prompt information as merge rates increase. We propose StructSAM, a resolution-preserving framework that computes lightweight token-energy scores from first-order feature gradients, protects boundary and prompt regions via grid-based flatness screening, and merges flat-region tokens toward low-energy targets with explicit recovery. We further present a spectral graph coarsening analysis showing that score-guided merging yields bounded Laplacian spectral distortion relative to random or window-restricted baselines. Across five natural and medical benchmarks, StructSAM reduces encoder FLOPs by 25-30% (up to 40%+ with prompt-aware merging) with minor drops in mIoU/Dice, consistently outperforming recent merging techniques at the same compute.
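The pipeline named in the abstract (gradient-based token energy, grid-based flatness screening, merging with recovery) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `token_energy` and `merge_flat_cells`, the forward-difference energy, the `cell` and `tau` parameters, and the use of a cell mean as the merged feature are all assumptions made for illustration only.

```python
import numpy as np

def token_energy(feats):
    """Per-token energy from first-order (forward-difference) feature gradients.

    feats: (H, W, C) grid of token features. Returns an (H, W) array.
    The forward-difference form is an assumption, not the paper's definition.
    """
    gy = np.zeros_like(feats)
    gx = np.zeros_like(feats)
    gy[:-1] = feats[1:] - feats[:-1]        # vertical differences
    gx[:, :-1] = feats[:, 1:] - feats[:, :-1]  # horizontal differences
    return np.sqrt((gx ** 2).sum(-1) + (gy ** 2).sum(-1))

def merge_flat_cells(feats, cell=2, tau=0.1, protect=None):
    """Merge tokens inside flat grid cells, then recover full resolution.

    A cell is treated as flat when every token's energy is <= tau (hypothetical
    threshold). Cells touching `protect` (e.g. a prompt mask) are left untouched.
    Merging averages the cell's tokens; recovery copies the merged feature back
    to every position, so the output keeps the (H, W, C) shape.
    """
    H, W, _ = feats.shape
    energy = token_energy(feats)
    out = feats.copy()
    merged = np.zeros((H, W), dtype=bool)
    for i in range(0, H - cell + 1, cell):
        for j in range(0, W - cell + 1, cell):
            if energy[i:i + cell, j:j + cell].max() > tau:
                continue  # high-energy (likely boundary) cell: keep all tokens
            if protect is not None and protect[i:i + cell, j:j + cell].any():
                continue  # protected region: keep all tokens
            out[i:i + cell, j:j + cell] = feats[i:i + cell, j:j + cell].mean(axis=(0, 1))
            merged[i:i + cell, j:j + cell] = True
    return out, merged
```

In this toy version, a cell that straddles a sharp feature change is screened out by its high energy, while cells in smooth regions collapse to a single value that is broadcast back to all positions. A real encoder would run attention on the reduced token set between the merge and recovery steps, which is where the compute saving would come from.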
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 221