Abstract: Foundation models achieve state-of-the-art (SOTA) performance across diverse tasks, but their size and computational demands raise concerns about accessibility and sustainability. Existing efficiency methods often require additional retraining or fine-tuning, limiting their practicality. Recent findings suggest that deep neural networks exhibit internal representation similarities. While such similarities across different models have been exploited to enable techniques such as model stitching and merging, intra-network redundancy remains underexplored as a source of efficiency gains. In this paper, we introduce TOAST, a framework that exploits this redundancy to approximate entire transformer blocks with lightweight closed-form mappings, such as a linear transformation or even the identity, without any additional training. Across SOTA pretrained vision models (e.g., ViT, DINOv2, DeiT) and datasets ranging from MNIST to ImageNet-1k, TOAST reduces parameters and computation while preserving, and in some cases improving, downstream performance. These results show that large portions of transformer depth can be replaced by trivial functions, opening a new perspective on efficient foundation models.
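To make the idea concrete, the sketch below shows one way a transformer block could be approximated by a closed-form linear map fitted on a small set of calibration activations. This is a minimal illustration under assumed names (ToyBlock, fit_linear_replacement) and an assumed ridge-regression fit in PyTorch; it is not the actual TOAST procedure described in the paper.

```python
# Hypothetical sketch: replace one transformer block with a closed-form linear map.
# ToyBlock and fit_linear_replacement are illustrative names, not from the paper.
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Stand-in for a pretrained transformer block (attention omitted for brevity)."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        return x + self.net(x)  # residual block


@torch.no_grad()
def fit_linear_replacement(block: nn.Module, calib_tokens: torch.Tensor,
                           ridge: float = 1e-3) -> nn.Linear:
    """Fit y ~= x W + b in closed form (ridge regression) on calibration activations.

    calib_tokens: (num_tokens, dim) inputs to the block, gathered from a small
    calibration set. Returns an nn.Linear approximating the block's input-to-output
    mapping without any gradient-based training.
    """
    x = calib_tokens                                   # (N, d)
    y = block(x)                                       # block outputs on the same tokens
    ones = torch.ones(x.shape[0], 1)
    x_aug = torch.cat([x, ones], dim=1)                # append bias column -> (N, d+1)
    gram = x_aug.T @ x_aug + ridge * torch.eye(x_aug.shape[1])
    w_aug = torch.linalg.solve(gram, x_aug.T @ y)      # closed-form ridge solution
    linear = nn.Linear(x.shape[1], y.shape[1])
    linear.weight.copy_(w_aug[:-1].T)                  # (d, d) weight
    linear.bias.copy_(w_aug[-1])                       # (d,) bias
    return linear


if __name__ == "__main__":
    dim, n_tokens = 64, 2048
    block = ToyBlock(dim).eval()
    calib = torch.randn(n_tokens, dim)                 # stand-in for token activations
    replacement = fit_linear_replacement(block, calib)
    test = torch.randn(256, dim)
    err = (block(test) - replacement(test)).norm() / block(test).norm()
    print(f"relative approximation error: {err:.3f}")
```

In this illustration, the printed relative error indicates how faithfully a single linear layer reproduces the block's behavior on held-out tokens; a block whose fitted map is close to the identity would admit an even cheaper replacement, mirroring the identity case mentioned in the abstract.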
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Naigang_Wang1
Submission Number: 7252