Depth as Modulation in Weight-Sharing Transformers

TMLR Paper 6457 Authors

10 Nov 2025 (modified: 22 Feb 2026) · Under review for TMLR · CC BY 4.0
Abstract: Weight sharing reduces Transformer parameters by reusing a single block across depth, but it can restrict depth-dependent behavior. We study a simple way to reintroduce depth variation in weight-sharing (recurrent) Transformers: we keep the shared block fixed and introduce depth-indexed Low-Rank Adaptation (LoRA) modules inside multi-head self-attention (MHSA). We compare two parameterizations: low-rank updates on the Q, K, V, O projections, and a more constrained variant that applies low-rank corrections at the attention-logit and OV-contraction sites. Under matched trainable-parameter budgets, depth-indexed MHSA modulation tends to recover accuracy in the vision settings we studied, with particularly strong gains in low-data regimes; in language, the effects are more task-dependent and include both improvements and regressions. The results clarify when depth-wise MHSA modulation complements weight sharing and how it trades off accuracy and efficiency.
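The core design described in the abstract, a frozen shared projection augmented by a per-depth low-rank update, can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the class and argument names are hypothetical, and the paper's exact parameterization (QKOV-LoRA vs. the QK/OV-circuit variant) may differ in detail.

```python
import torch
import torch.nn as nn

class DepthIndexedLoRALinear(nn.Module):
    """Frozen shared linear projection plus a depth-indexed LoRA update.

    Hypothetical sketch: the shared weight W is reused at every depth,
    while each depth t gets its own low-rank pair (A_t, B_t), so the
    effective projection at depth t is W + (alpha / r) * B_t @ A_t.
    """

    def __init__(self, d_model: int, num_depths: int, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = nn.Linear(d_model, d_model, bias=False)
        self.base.weight.requires_grad_(False)  # shared block stays frozen
        # One (A, B) pair per depth index; B starts at zero so the
        # initial forward pass matches the plain weight-sharing model.
        self.A = nn.Parameter(torch.randn(num_depths, rank, d_model) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_depths, d_model, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor, depth: int) -> torch.Tensor:
        # x: (batch, seq, d_model); `depth` selects this pass's LoRA pair.
        delta = x @ self.A[depth].T @ self.B[depth].T
        return self.base(x) + self.scale * delta
```

In a recurrent Transformer, the same module would be called once per unrolled depth with `depth = 0, 1, ..., T-1`, so only the low-rank pairs (and not the shared backbone) contribute trainable parameters.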
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We revised the manuscript in response to the reviews, prioritizing clearer positioning, tighter experimental controls, and more transparent reporting. Changes are reflected in the revised PDF, where updated text is primarily marked in blue.

- **Clearer problem statement and method description:** Rewrote the Abstract/Introduction/Conclusion to remove ambiguous framing (e.g., "depth-by-perturbation principle," autonomous/non-autonomous terminology) and to describe the approach directly as a frozen shared backbone with depth-indexed Low-Rank Adaptation (LoRA) modules inside multi-head self-attention (MHSA).
- **More conservative, evidence-led claims:** Tightened the main claims to focus on (i) the architectural design choice and (ii) empirical characterization under controlled budgets, explicitly noting mixed outcomes (including task-level regressions).
- **Language experiments:**
  - Added **GLUE** in addition to **SuperGLUE**, and introduced **Adapter** as an additional parameter-efficient fine-tuning (PEFT) baseline alongside LoRA.
  - Standardized language-side reporting by consolidating trainable parameters, LoRA rank $r$, and Start Index (the first layer where depth-indexed modulation is activated) in a single configuration/cost table, and aligning the main language result tables to reference it.
  - Re-ran the main language comparisons under matched trainable-parameter budgets across LoRA, Adapter, QKOV-LoRA, and QK/OV-Circuit to reduce confounding from adaptation capacity.
  - Reported results as mean $\mu \pm \sigma$ over $n=10$ random seeds and added Holm-corrected significance marks versus LoRA (†, $p<0.05$), while reporting improvements and degradations side by side.
  - Added a representational analysis section using linear Centered Kernel Alignment (CKA) and token cosine similarity, contrasting CoLA (where gains are observed) vs. WiC (where degradations can occur), with additional CKA results in the appendix.
  - Added proxy efficiency metrics (e.g., GFLOPs, inference throughput, and wall-clock training time) to clarify compute/latency trade-offs across methods.
- **Vision experiments:**
  - Expanded vision baselines to include **MiniViT**, and added matched-parameter groupings for ImageNet comparisons.
  - Clarified the evaluation protocol by separating the fixed-rank Start Index sweep from matched-parameter comparisons; reorganized the 10% ImageNet-1k training-data table into matched-parameter pairs (MiniViT vs. proposed method).
- **Related work, limitations, and broader impact:** Updated Related Work with relevant recent papers and clarified how they differ from our setting; expanded limitations (e.g., Start Index sensitivity, scope beyond ALBERT, and sequence-length dependence considerations); added a Broader Impact statement.
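The Holm-corrected significance marks mentioned in the revision notes follow the standard Holm step-down procedure for multiple comparisons. A minimal sketch (not the authors' code; the function name is illustrative):

```python
def holm_correction(pvals, alpha=0.05):
    """Holm step-down procedure over m p-values.

    Sort p-values ascending; reject the k-th smallest hypothesis while
    p_(k) <= alpha / (m - k), stopping at the first failure. Controls the
    family-wise error rate at level alpha.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if pvals[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break  # step-down: all larger p-values also fail
    return reject
```

For example, with four tests and p-values (0.01, 0.04, 0.03, 0.5) at alpha = 0.05, only the smallest survives: 0.01 <= 0.05/4 but 0.03 > 0.05/3, so the procedure stops there.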
Assigned Action Editor: ~Steffen_Schneider1
Submission Number: 6457