CR-Guided Transformers: Coherence-Based Redundancy Identification and Regularization

ICLR 2026 Conference Submission 24893 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Redundancy Identification, Coherence-based Redundancy measure, Redundancy Regularization
Abstract: Current Transformer-based language models achieve excellent performance across a wide range of tasks. However, these models commonly produce redundant transformations in their middle-to-deep layers: the transformation between a layer's input and output exhibits pronounced linear correlation or contains nearly irrelevant components. We attribute this behavior to current training paradigms, which emphasize prediction accuracy while neglecting the effectiveness of the nonlinear transformations performed by individual layers. Based on this observation, we propose criteria for identifying redundant transformations. To quantify the degree of redundancy, we further propose a Coherence-based Redundancy (CR) measure. Specifically, we treat the input and output of a layer as sequence distributions, use characteristic functions and the Fourier transform to map these distributions to frequency-domain representations, and compute coherence in the complex plane, scoring the effectiveness of a transformation on a [0,1] coherence scale. To suppress redundant transformations at layer outputs, we propose two schemes: tree-structured residual paths and a coherence-based redundancy loss. These schemes guide middle-to-deep layers to produce effective transformations while supervising and regularizing against redundant outputs. Pre-training experiments on a 12-layer Llama3-130M model show that the proposed methods significantly reduce redundant transformations; with all other training settings held constant, the 12-layer model outperforms a 14-layer baseline.
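The sketch below is only meant to make the coherence-scoring idea concrete; it is not the authors' CR measure. The empirical characteristic-function construction, the frequency grid, the per-dimension averaging, and the function names (`empirical_char_fn`, `coherence_redundancy`) are all assumptions introduced for illustration, following the abstract's description of mapping a layer's input and output distributions to frequency-domain representations and computing a coherence score in [0, 1].

```python
# Illustrative sketch of a coherence-style redundancy score between a layer's
# input and output hidden states. NOT the paper's exact CR measure: the
# characteristic-function estimator, frequency grid, and averaging scheme
# below are assumptions for illustration only.
import torch


def empirical_char_fn(h: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Empirical characteristic function phi(t) = E[exp(i * t * h)],
    estimated over the sequence (sample) dimension.

    h:     (seq_len, hidden_dim) activations, treated per dimension as samples
           of a distribution over the sequence.
    freqs: (num_freqs,) evaluation points t.
    Returns a complex tensor of shape (num_freqs, hidden_dim).
    """
    phase = torch.einsum("f,sd->fsd", freqs, h)   # (F, S, D)
    return torch.exp(1j * phase).mean(dim=1)      # average over samples -> (F, D)


def coherence_redundancy(x: torch.Tensor, y: torch.Tensor,
                         num_freqs: int = 64) -> torch.Tensor:
    """Coherence-style score in [0, 1] between layer input x and output y.

    Values near 1 mean the output's frequency-domain representation is almost
    linearly related to the input's (a redundant transformation); values near
    0 mean the layer produced a substantially different signal.
    """
    freqs = torch.linspace(-3.0, 3.0, num_freqs)  # assumed evaluation grid
    phi_x = empirical_char_fn(x, freqs)           # (F, D), complex
    phi_y = empirical_char_fn(y, freqs)

    # Magnitude-squared coherence over the frequency axis, analogous to
    # |S_xy|^2 / (S_xx * S_yy) in spectral analysis; Cauchy-Schwarz keeps
    # the ratio in [0, 1].
    s_xy = (phi_x * phi_y.conj()).mean(dim=0)            # cross term, (D,)
    s_xx = (phi_x * phi_x.conj()).mean(dim=0).real       # (D,)
    s_yy = (phi_y * phi_y.conj()).mean(dim=0).real       # (D,)
    coh = (s_xy.abs() ** 2) / (s_xx * s_yy + 1e-12)      # per-dimension score
    return coh.mean()                                     # scalar redundancy score


# Usage: a near-identity layer output yields a score close to 1 (redundant),
# while an unrelated output yields a score close to 0.
x = torch.randn(128, 512)                   # layer input  (seq_len, hidden_dim)
y_redundant = x + 0.05 * torch.randn_like(x)
y_fresh = torch.randn(128, 512)
print(float(coherence_redundancy(x, y_redundant)))
print(float(coherence_redundancy(x, y_fresh)))
```

Under these assumptions, such a score could serve either as a diagnostic for identifying redundant layers or as a differentiable penalty (the abstract's coherence-based redundancy loss) added to the training objective; the exact form used in the paper may differ.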
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 24893