SCFormer: Spatial Coordination for Efficient and Robust Vision Transformers

24 Sept 2024 (modified: 22 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Vision backbone, Transformer, Efficiency, Robustness, Spatial Coordination Attention
TL;DR: We study parameter-efficient robustness in vision backbone design.
Abstract: We investigate the design of visual backbones with a focus on optimizing both efficiency and robustness. While recent advances in hybrid Vision Transformers (ViTs) have significantly improved efficiency, achieving state-of-the-art performance with fewer parameters, their robustness to domain-shifted and corrupted inputs remains a critical challenge. This trade-off is particularly difficult to balance in lightweight models, where robustness often relies on wider channels to capture diverse spatial features. In this paper, we present SCFormer, a novel hybrid ViT architecture designed to address these limitations. SCFormer introduces Spatial Coordination Attention (SCA), a mechanism that coordinates cross-spatial pixel interactions by deconstructing and reassembling spatial conditions with diverse connectivity patterns. This approach broadens the representation boundary, allowing SCFormer to efficiently capture more diverse spatial dependencies even with fewer channels, thereby improving robustness without sacrificing efficiency. Additionally, we incorporate an Inceptional Local Representation (ILR) block to flexibly enrich local token representations before self-attention, enhancing both locality and feature diversity. Through extensive experiments, SCFormer demonstrates superior performance across multiple benchmarks. On ImageNet-1K, SCFormer-XS achieves 2.5\% higher top-1 accuracy and 10\% faster GPU inference than FastViT-T8. On ImageNet-A, SCFormer-L (30.1M parameters) surpasses RVT-B (91.8M) in robust accuracy by 5.6\% while using 3$\times$ fewer parameters. These results underscore the effectiveness of our design in achieving a new state-of-the-art balance between efficiency and robustness.
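The abstract describes the two components only at a conceptual level. The sketch below is a minimal, illustrative PyTorch interpretation of those two ideas, not the authors' SCA/ILR implementation: an Inception-style local block that enriches tokens with parallel depthwise convolutions before attention, and an attention step that mixes pixels along different spatial groupings (here, rows and columns). All module names, kernel sizes, and the row/column grouping scheme are assumptions made for illustration only.

```python
# Illustrative sketch only: a simple stand-in for "enrich local tokens before attention"
# plus "coordinate cross-spatial interactions via different connectivity patterns".
import torch
import torch.nn as nn


class InceptionLocalBlock(nn.Module):
    """Parallel depthwise convolutions with different kernel sizes, fused by a 1x1 conv."""

    def __init__(self, dim: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in (3, 5, 7)
        ])
        self.fuse = nn.Conv2d(3 * dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        return x + self.fuse(torch.cat([b(x) for b in self.branches], dim=1))


class SpatialGroupAttention(nn.Module):
    """Self-attention applied within row groups and column groups, then recombined."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)   # attend along each row
        rows, _ = self.row_attn(rows, rows, rows)
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)   # attend along each column
        cols, _ = self.col_attn(cols, cols, cols)
        rows = rows.reshape(b, h, w, c).permute(0, 3, 1, 2)
        cols = cols.reshape(b, w, h, c).permute(0, 3, 2, 1)
        return x + rows + cols                              # recombine the two patterns


if __name__ == "__main__":
    feat = torch.randn(2, 64, 14, 14)
    feat = InceptionLocalBlock(64)(feat)
    feat = SpatialGroupAttention(64)(feat)
    print(feat.shape)  # torch.Size([2, 64, 14, 14])
```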
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3404