On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
Abstract: LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-experts models. We prove that both linear and non-linear prompts, along with gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention.
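The abstract refers to the zero-initialized attention mechanism used in LLaMA-Adapter: learnable prompt tokens are prepended to the keys and values, and their contribution to the attention output is scaled by a gating factor initialized at zero, so fine-tuning starts from the behavior of the frozen model. The sketch below is a minimal, hedged illustration of that idea only; the module name `ZeroInitPromptAttention`, the per-head tanh gate, the separate softmaxes, and all shapes are assumptions made for readability, not the authors' implementation (see the linked repository for the actual code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ZeroInitPromptAttention(nn.Module):
    """Illustrative sketch of zero-initialized attention with learnable prompts.

    NOTE: names, shapes, and the tanh gating are assumptions for this sketch;
    they are not taken from the paper's code.
    """

    def __init__(self, dim: int, n_heads: int, n_prompt: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # Learnable prompt tokens (the paper also studies non-linear prompts).
        self.prompt = nn.Parameter(torch.randn(n_prompt, dim) * 0.02)
        # Gating factor initialized to zero, one scalar per head.
        self.gate = nn.Parameter(torch.zeros(n_heads, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t: torch.Tensor, length: int) -> torch.Tensor:
            # (B, length, C) -> (B, n_heads, length, head_dim)
            return t.view(B, length, self.n_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q, T), split(k, T), split(v, T)

        # Project the prompt tokens with the same key/value maps.
        p = self.prompt.unsqueeze(0).expand(B, -1, -1)
        _, pk, pv = self.qkv(p).chunk(3, dim=-1)
        n_p = self.prompt.size(0)
        pk, pv = split(pk, n_p), split(pv, n_p)

        scale = self.head_dim ** -0.5
        # Standard attention over the original tokens.
        attn_x = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
        # Attention over the prompts, scaled by the zero-initialized gate,
        # so at initialization the layer reduces to vanilla attention.
        attn_p = torch.tanh(self.gate) * F.softmax(
            q @ pk.transpose(-2, -1) * scale, dim=-1
        )

        out = attn_x @ v + attn_p @ pv
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.out(out)
```

Because the gate starts at zero, the prompt branch contributes nothing at initialization and is learned gradually during fine-tuning; the prompts and gating function in this setup are the objects whose optimal estimation the paper analyzes.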
Lay Summary: This paper investigates the theory behind LLaMA-Adapter, a fine-tuning method for LLaMA models that uses zero-initialized attention to improve training stability and performance. Although this technique has shown strong practical results, its underlying principles were not well understood. The authors provide a theoretical explanation, linking zero-initialized attention to mixture-of-experts models and proving that both linear and non-linear prompts can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Experiments on open LLM benchmarks confirm that non-linear prompts outperform linear ones, and importantly, both consistently perform better than standard attention even with limited training data, highlighting the method's robustness and adaptability.
Link To Code: https://github.com/duyhominhnguyen/llama-adaptor-nonlinear/tree/main
Primary Area: Deep Learning->Other Representation Learning
Keywords: Large Language Model (LLM), Theory, Instruction Tuning, Mixture of Experts
Submission Number: 6191