A Spitting Image: Superpixel Transformers

21 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: vision transformer, ViT, superpixels, tokenizer, interpretability, attention maps, faithfulness
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: It turns out that vision transformers can be trained effectively with superpixel patches.
Abstract: Vision Transformer (ViT) architectures treat tokenization as an inflexible, monolithic process with regular grid partitions. In this work, we propose a generalized superpixel transformer (SPiT) framework that decouples tokenization from feature extraction, a significant shift from contemporary approaches, which treat the two as an undifferentiated whole. Using online superpixel tokenization and scale- and shape-invariant feature extraction, we perform experiments and ablations that contrast our approach with canonical tokenization and randomized partitions as baselines. We find that modular superpixel-based tokenization provides significantly improved interpretability under state-of-the-art metrics for faithfulness while maintaining competitive classification performance, yielding a space of semantically rich models that generalize across different vision tasks.
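
As a rough illustration of the tokenization idea described in the abstract, the sketch below replaces fixed grid patching with superpixel regions. It uses off-the-shelf SLIC superpixels, simple per-region mean pooling, and a linear projection; these choices, along with the function names and parameters, are illustrative assumptions and not the paper's actual SPiT tokenizer or feature extractor.

    # Minimal sketch (not the authors' implementation) of superpixel-based
    # tokenization for a ViT-style pipeline. Assumes SLIC as the partitioner
    # and mean-pooling plus a linear projection as a stand-in for the paper's
    # scale- and shape-invariant feature extractor.
    import numpy as np
    import torch
    from skimage.segmentation import slic

    def superpixel_tokens(image: np.ndarray, n_segments: int = 196, embed_dim: int = 768):
        """Turn an (H, W, 3) float image in [0, 1] into a set of superpixel tokens."""
        # 1. Partition: online superpixel segmentation (irregular regions instead of a grid).
        labels = slic(image, n_segments=n_segments, compactness=10, start_label=0)
        n_tokens = labels.max() + 1

        h, w, _ = image.shape
        ys, xs = np.mgrid[0:h, 0:w]
        feats = []
        for k in range(n_tokens):
            mask = labels == k
            mean_rgb = image[mask].mean(axis=0)                               # per-region color summary
            centroid = np.array([ys[mask].mean() / h, xs[mask].mean() / w])   # normalized positional cue
            feats.append(np.concatenate([mean_rgb, centroid]))
        feats = torch.tensor(np.stack(feats), dtype=torch.float32)            # (n_tokens, 5)

        # 2. Embed: a learned linear projection maps region features to token embeddings.
        proj = torch.nn.Linear(feats.shape[1], embed_dim)
        return proj(feats), labels                                            # tokens ready for a transformer encoder

    if __name__ == "__main__":
        img = np.random.rand(224, 224, 3)
        tokens, labels = superpixel_tokens(img)
        print(tokens.shape)  # roughly (196, 768); token count varies with the segmentation

Note that, unlike grid patching, the number and shape of tokens here depend on the image content, which is why the downstream feature extraction must be scale- and shape-invariant.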
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3302