Abstract: The prevailing paradigm for scaling large language models (LLMs) involves monolithic, end-to-end training, a resource-intensive process that lacks flexibility. This paper explores an alternative, constructive scaling paradigm, enabled by the principle of emergent semantics in Transformers with frozen, non-semantic input embeddings. We posit that because high-level meaning is a compositional property of a Transformer's deep layers, not its input vectors, the embedding layer and trained lower layers can serve as a fixed foundation. This frees backpropagation to focus solely on newly added components, making incremental growth viable. We operationalize this with a layer-wise constructive methodology that combines strict layer freezing in early stages with efficient, holistic fine-tuning of the entire model stack via low-rank adaptation (LoRA) as complexity increases. This method not only converges stably but also reveals a direct correlation between model depth and the emergence of complex reasoning abilities, such as those required for SQuAD, which are absent in shallower models. In a controlled study, our constructively grown model rivals the performance of a monolithically trained baseline of the same size, validating the efficiency and efficacy of the approach. Our findings suggest a shift from monolithic optimization towards a more biological, constructive model of AI development, opening a path to more resource-efficient scaling, continual learning, and a more modular approach to building powerful AI systems. We release all code and models to facilitate further research.
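For readers wanting a concrete picture of the constructive recipe summarized above, the following is a minimal PyTorch sketch, not the paper's released code: the class and method names are hypothetical, and the details (projection layer, which parameters stay trainable) are our assumptions. It shows a frozen, non-semantic embedding table, Transformer blocks appended stage by stage with previously trained blocks frozen, and leaves the later LoRA stage (e.g., via a library such as peft) to the full implementation.

```python
# Minimal sketch (hypothetical names, assumed details) of constructive growth with a
# frozen, non-semantic embedding substrate.
import torch
import torch.nn as nn

class GrowingTransformer(nn.Module):
    def __init__(self, vocab_size: int, n_embed: int = 16, d_model: int = 512):
        super().__init__()
        # Frozen, non-semantic input codes: initialized once and never trained.
        self.embed = nn.Embedding(vocab_size, n_embed)
        self.embed.weight.requires_grad_(False)
        self.proj = nn.Linear(n_embed, d_model)   # lifts the fixed codes to model width
        self.blocks = nn.ModuleList()             # grows as training proceeds
        self.head = nn.Linear(d_model, vocab_size)

    def add_block(self, block: nn.Module, freeze_existing: bool = True) -> None:
        # Constructive step: freeze the blocks trained so far, then append a new one,
        # so backpropagation only updates the newly added block (plus the output head).
        if freeze_existing:
            for p in self.blocks.parameters():
                p.requires_grad_(False)
        self.blocks.append(block)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.proj(self.embed(token_ids))
        for block in self.blocks:
            h = block(h)
        return self.head(h)

# Usage sketch: grow one stage by appending a standard Transformer encoder block.
model = GrowingTransformer(vocab_size=50_000)
model.add_block(nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True))
```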
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Addition of Two Missing Baselines: To definitively disentangle the effects of our constructive growth strategy from the frozen embedding substrate, we have introduced two new baselines as requested: (a) a monolithically trained model with frozen Unicode embeddings and (b) a monolithically trained model with the minimalist frozen binary embeddings (n_embed=16). These are now fully integrated into our controlled comparative study (Section 4.2.2) and all relevant figures and tables, providing the complete picture needed for a rigorous evaluation (an illustrative sketch of the frozen binary embedding construction appears after this list of changes).
Simplified and More Insightful Visualizations: The t-SNE visualizations have been redesigned for clarity. Each plot now focuses on conveying a single, strong message about the compositional emergence of semantic structure, with simplified titles and more direct captions.
Expanded Appendices for Reproducibility: We have expanded the appendices. The raw training data for all new baselines is included in Appendix A. Parameter counts for the new models are in Appendix C. As requested, we have added a new Appendix D with a comprehensive table of all training hyperparameters used in the controlled study.
New Discussion on Quantization Synergy: We have added a new point to the Discussion section (Section 6.3) proposing a potential application of our method. We discuss how using an inherently discrete, low-precision frozen embedding layer creates a natural synergy with model quantization techniques, offering a promising direction for future research.
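For concreteness, the sketch below shows one plausible way to build the frozen embedding table used by baseline (b) above; the assumption that "binary embeddings" means the 16-bit binary expansion of each token id, and the function name, are ours for illustration and not the paper's released implementation (baseline (a) would instead freeze a table derived from Unicode codepoints).

```python
# Illustrative sketch only: a frozen n_embed=16 "binary" embedding table, assuming each
# token id is encoded by its 16-bit binary expansion mapped to {-1, +1}.
import torch
import torch.nn as nn

def frozen_binary_embedding(vocab_size: int, n_embed: int = 16) -> nn.Embedding:
    ids = torch.arange(vocab_size)
    # Extract bit k of every token id, k = 0..n_embed-1, then center the codes.
    bits = ((ids.unsqueeze(1) >> torch.arange(n_embed)) & 1).float()
    table = bits * 2.0 - 1.0
    # freeze=True keeps the table fixed; it is never updated during training.
    return nn.Embedding.from_pretrained(table, freeze=True)

emb = frozen_binary_embedding(vocab_size=50_000)
print(emb(torch.tensor([0, 1, 2])).shape)  # torch.Size([3, 16])
```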
Assigned Action Editor: ~Naigang_Wang1
Submission Number: 5477