REFORM : Residual Filtering through Neural Aggregators for Layer-Wise Representation Integrity

Published: 20 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: residual connection, representation learning, information bottleneck, Large Language Model
Abstract: Recent studies suggest that the cumulative residual connections in Transformer-based LLMs preserve signals indiscriminately, potentially creating representation bottlenecks in deeper layers. In this work, we provide an information-theoretic analysis of this phenomenon and introduce REFORM (Representation formulation via multi-layer aggregation), a lightweight module, active only during training, designed to mitigate it. REFORM hierarchically integrates hidden states across layers, using local aggregation for continuity and global fusion for semantic abstraction. During training, three auxiliary objectives (correlation alignment, orthogonality, and cosine similarity) guide REFORM to restructure intermediate representations. Because REFORM is detached at inference time, it incurs no runtime overhead. Extensive evaluations of Llama3, Qwen2, Mistral, and Phi-3.5 models on commonsense and mathematical reasoning benchmarks demonstrate consistent improvements. Analyses using SVCCA, attention entropy, and effective rank suggest that REFORM fosters richer representations, especially in mid-to-late layers, indicating that minimal inter-layer aggregation can alleviate structural limitations without sacrificing inference efficiency. We provide the code at https://anonymous.4open.science/r/Reform.
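The abstract names two aggregation stages (local aggregation, global fusion) and three auxiliary objectives (correlation alignment, orthogonality, cosine similarity) without spelling out their formulations. The PyTorch sketch below illustrates one plausible reading under stated assumptions: the class name ReformAggregator, the adjacent-layer averaging, the concatenation-based fusion, and the specific loss forms (CORAL-style covariance matching, off-diagonal decorrelation, mean cosine distance) are all illustrative choices, not the paper's actual implementation (see the linked repository for that).

```python
# Illustrative PyTorch sketch of a training-only, REFORM-style aggregator.
# Names and loss formulations are assumptions for exposition only; the
# paper's actual module (see the linked repository) may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReformAggregator(nn.Module):
    """Hierarchically integrates per-layer hidden states: local averaging of
    adjacent layers for continuity, then a learned global fusion across all
    layers for semantic abstraction."""

    def __init__(self, num_layers: int, hidden_dim: int):
        super().__init__()
        self.fuse = nn.Linear(num_layers * hidden_dim, hidden_dim)

    def forward(self, layer_states):
        # layer_states: list of L tensors, each [batch, seq, dim].
        stacked = torch.stack(layer_states, dim=1)              # [B, L, T, D]
        # Local aggregation: average each layer with its predecessor
        # (the first layer is paired with itself).
        prev = torch.cat([stacked[:, :1], stacked[:, :-1]], dim=1)
        local = 0.5 * (stacked + prev)                          # [B, L, T, D]
        # Global fusion: concatenate all layers and project back to D.
        B, L, T, D = local.shape
        fused = self.fuse(local.permute(0, 2, 1, 3).reshape(B, T, L * D))
        return fused                                            # [B, T, D]


def reform_aux_losses(fused, target):
    """Three auxiliary objectives named in the abstract, written here in
    common textbook forms (assumed, not taken from the paper)."""
    f = fused.reshape(-1, fused.shape[-1])                      # [N, D]
    t = target.reshape(-1, target.shape[-1])                    # [N, D]
    # 1) Correlation alignment: match feature covariances (CORAL-style).
    fc, tc = f - f.mean(0, keepdim=True), t - t.mean(0, keepdim=True)
    cov_f = fc.T @ fc / (fc.shape[0] - 1)
    cov_t = tc.T @ tc / (tc.shape[0] - 1)
    corr_loss = (cov_f - cov_t).pow(2).mean()
    # 2) Orthogonality: push off-diagonal feature correlations toward zero.
    fn = F.normalize(fc, dim=0)
    gram = fn.T @ fn
    ortho_loss = (gram - torch.eye(gram.shape[0], device=gram.device)).pow(2).mean()
    # 3) Cosine similarity: keep fused states aligned with the target states.
    cos_loss = 1.0 - F.cosine_similarity(f, t, dim=-1).mean()
    return corr_loss, ortho_loss, cos_loss
```

Consistent with the abstract, a module and losses of this kind would be applied only during training; at inference the aggregator is dropped and the base model runs unchanged, so no extra parameters or latency are incurred.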
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 24362