SPFormer: Enhancing Vision Transformer with Superpixel Representation

TMLR Paper3373 Authors

21 Sept 2024 (modified: 11 Nov 2024) · Under review for TMLR · CC BY 4.0
Abstract: This work introduces SPFormer, a novel Vision Transformer architecture enhanced by superpixel representation. Addressing the limitations of traditional Vision Transformers' fixed-size, non-adaptive patch partitioning, SPFormer divides the input image into irregular, semantically coherent regions (i.e., superpixels), effectively capturing intricate details. Notably, the superpixel representation also applies to intermediate features, and the whole model supports end-to-end training, empirically yielding superior performance across multiple benchmarks. For example, on the challenging ImageNet benchmark, SPFormer outperforms DeiT by 1.4% at the tiny model size and by 1.1% at the small model size. A standout feature of SPFormer is its inherent explainability: the superpixel structure offers a window into the model's internal processes, enhancing the model's interpretability and strengthening its robustness against challenging scenarios such as image rotations and occlusions.
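To illustrate the superpixel-based token aggregation described in the abstract, the sketch below mean-pools per-pixel features into one token per superpixel. This is a minimal illustration, not the authors' implementation: the function name superpixel_pool, the choice of mean pooling, and the PyTorch scatter-based realization are assumptions made for clarity.

```python
import torch

def superpixel_pool(features, labels):
    """Average per-pixel features within each superpixel (illustrative sketch).

    features: (B, C, H, W) feature map
    labels:   (B, H, W) integer superpixel assignment, values in [0, S)
    returns:  (B, S, C) one token per superpixel
    """
    B, C, H, W = features.shape
    S = int(labels.max().item()) + 1
    feats = features.permute(0, 2, 3, 1).reshape(B, H * W, C)    # (B, HW, C)
    idx = labels.reshape(B, H * W, 1).expand(-1, -1, C)          # (B, HW, C)
    sums = torch.zeros(B, S, C).scatter_add_(1, idx, feats)      # sum per superpixel
    counts = torch.zeros(B, S, 1).scatter_add_(
        1, labels.reshape(B, H * W, 1), torch.ones(B, H * W, 1)
    )
    return sums / counts.clamp(min=1)                            # mean per superpixel

# Toy usage: 1 image, 8 channels, 16x16 feature grid, 4 superpixels.
tokens = superpixel_pool(torch.randn(1, 8, 16, 16),
                         torch.randint(0, 4, (1, 16, 16)))
print(tokens.shape)  # torch.Size([1, 4, 8])
```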
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Charles_Xu1
Submission Number: 3373