SPFormer: Enhancing Vision Transformer with Superpixel Representation

23 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Vision Transformer; Superpixel
Abstract: In this study, we present a novel approach to enhancing Vision Transformers by leveraging superpixel representation. Unlike the traditional Vision Transformer, which uniformly partitions images into non-overlapping patches of fixed size, our superpixel approach divides an image into distinct, irregular regions, each designed to cluster pixels with shared semantics, thereby better capturing intricate image details. Notably, this superpixel clustering is also applicable at the intermediate feature level. The resulting model, denoted SPFormer, can be trained end-to-end and empirically demonstrates superior performance across a range of benchmarks. Additionally, SPFormer offers better interpretability through visualization of its learned superpixels and exhibits strong robustness under challenging testing conditions such as rotation and occlusion.
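The abstract describes clustering pixels (or intermediate features) into irregular, semantically coherent regions in a differentiable, end-to-end trainable way. The paper's own formulation is not reproduced here, but the general idea can be illustrated with a minimal NumPy sketch of soft superpixel pooling: seed cluster centers on a regular grid, softly assign each pixel to centers by feature similarity, and aggregate pixels into superpixel tokens. All function and parameter names below (`soft_superpixel_pooling`, `grid_h`, `temperature`) are hypothetical, not from the paper.

```python
import numpy as np

def soft_superpixel_pooling(features, grid_h=2, grid_w=2, temperature=1.0):
    """Softly assign pixels to superpixel centers and aggregate features.

    A hypothetical sketch of differentiable superpixel clustering,
    not the authors' implementation.
    features: (H, W, C) array of pixel or intermediate features.
    Returns (assignments of shape (H*W, S), superpixel tokens of shape (S, C)).
    """
    H, W, C = features.shape
    flat = features.reshape(H * W, C)

    # Seed superpixel centers by average-pooling a regular grid of patches,
    # the usual initialization before iterative refinement.
    centers = []
    for i in range(grid_h):
        for j in range(grid_w):
            patch = features[i * H // grid_h:(i + 1) * H // grid_h,
                             j * W // grid_w:(j + 1) * W // grid_w]
            centers.append(patch.reshape(-1, C).mean(axis=0))
    centers = np.stack(centers)                                   # (S, C)

    # Soft assignment: softmax over negative squared feature distances,
    # so each pixel distributes its mass across similar centers.
    d2 = ((flat[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (HW, S)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)                   # stability
    q = np.exp(logits)
    q /= q.sum(axis=1, keepdims=True)                             # rows sum to 1

    # Aggregate pixels into superpixel tokens via assignment-weighted means.
    sp_feats = (q.T @ flat) / q.sum(axis=0)[:, None]              # (S, C)
    return q, sp_feats
```

Because every step is a smooth function of the inputs, the same computation written in an autodiff framework would propagate gradients through the assignments, which is what makes end-to-end training of such a module possible.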
Supplementary Material: pdf
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7939