Abstract: Weight-sharing architectures provide a parameter-efficient design for Transformers. However, their reliance on a single transformation can limit the model's capacity for iterative representation refinement, a process that requires functional specialization across layers. We address this limitation by representing depth through layer-wise perturbations, creating a path toward models that are both parameter-efficient and performant. Our approach iteratively applies a shared block, and we introduce two distinct strategies to perturb its Multi-Head Self-Attention (MHSA) component with each application: a comprehensive QKOV-LoRA and a more parameter-efficient QK/OV-circuit variant. The effectiveness of these strategies is validated on vision and language benchmarks, where our models demonstrate favorable performance relative to layer-sharing counterparts. Our results suggest that perturbing a shared structure layer-wise is an effective principle for developing capable and efficient Transformers.
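To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of a weight-shared block whose Q/K/V/O projections are perturbed at each depth step by depth-specific low-rank adapters, in the spirit of the QKOV-LoRA strategy. All names (SharedPerturbedBlock, LoRAAdapter, rank, depth) are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn


class LoRAAdapter(nn.Module):
    """Low-rank perturbation Delta W = B A added on top of a shared projection."""
    def __init__(self, dim: int, rank: int):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)   # A
        self.up = nn.Linear(rank, dim, bias=False)     # B, zero-initialized so the perturbation starts at 0
        nn.init.zeros_(self.up.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class SharedPerturbedBlock(nn.Module):
    """One shared Transformer block; depth-specific adapters perturb Q, K, V, O at each application."""
    def __init__(self, dim: int, heads: int, depth: int, rank: int = 8):
        super().__init__()
        self.heads = heads
        # Shared MHSA projections, reused at every depth step.
        self.q, self.k, self.v, self.o = (nn.Linear(dim, dim) for _ in range(4))
        # Shared MLP (left unperturbed; only MHSA is perturbed, as in the abstract).
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        # One set of low-rank perturbations per depth step.
        self.adapters = nn.ModuleList([
            nn.ModuleDict({name: LoRAAdapter(dim, rank) for name in ("q", "k", "v", "o")})
            for _ in range(depth)
        ])

    def attention(self, x: torch.Tensor, step: int) -> torch.Tensor:
        a = self.adapters[step]
        b, n, d = x.shape
        h, hd = self.heads, d // self.heads
        # Shared projection plus depth-specific low-rank perturbation.
        q = (self.q(x) + a["q"](x)).view(b, n, h, hd).transpose(1, 2)
        k = (self.k(x) + a["k"](x)).view(b, n, h, hd).transpose(1, 2)
        v = (self.v(x) + a["v"](x)).view(b, n, h, hd).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / hd ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.o(out) + a["o"](out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "Depth" is realized by repeated application of the same block with different perturbations.
        for step in range(len(self.adapters)):
            x = x + self.attention(self.norm1(x), step)
            x = x + self.mlp(self.norm2(x))
        return x


# Example usage: tokens = torch.randn(2, 16, 256); out = SharedPerturbedBlock(256, 8, depth=12)(tokens)
```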
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Steffen_Schneider1
Submission Number: 6457