Learn All You Need in One Hypernetwork

Jun Meng; Mohammadhossein Amouei; Benjamin C. M. Fung; Xinyu Hu; Shih-Chia Huang

20 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0

Keywords: Hypernetworks, Attention, Transformer

Abstract: While the attention mechanism is considered the cornerstone of Transformers, its layer-specific parameterization presents challenges for efficiency and knowledge reuse. Recent work reformulates multi-head self-attention as a hypernetwork, suggesting that it can be interpreted mathematically as an implicit hypernetwork conditioned on key–query pairs. However, prior work has been limited to small-scale tasks or theoretical demonstrations, leaving open whether explicit hypernetworks can scale to full language-model pre-training. We first prove the existence of a shared hypernetwork that can approximate multi-head self-attention with fewer parameters. Building on this insight, we propose HyperBERT, a BERT-style Transformer encoder in which the multi-head self-attention mechanism is replaced by a single-layer MLP dynamically generated by one explicit, shared hypernetwork. In our experiments, a 4-head, 2-layer Transformer decoder serves as the shared hypernetwork and generates a single-layer MLP that replaces all query, key, value, and output (QKVO) projection matrices in each layer of a 4-head, 4-layer BERT. Pre-trained on WikiText-103, our 4-layer HyperBERT matches the average GLUE score of a BERT baseline ($\Delta \le 0.1$) with 6% fewer parameters and outperforms other MLP-based attention alternatives. Furthermore, a transplant experiment shows that the hypernetwork's learned weights transfer more effectively to deeper models than conventional attention parameters under a fixed parameter budget. To the best of our knowledge, this is the first pre-training study that replaces multi-head self-attention with MLPs generated by a shared hypernetwork. Our results suggest that an explicit, shared hypernetwork can serve as a modular, parameter-efficient replacement for multi-head self-attention in BERT-style Transformer encoder models while preserving language modeling capabilities.
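
To make the idea of one shared hypernetwork generating a per-layer single-layer MLP concrete, the toy sketch below may help. It is not the authors' implementation: the abstract's hypernetwork is a 4-head, 2-layer Transformer decoder, whereas this sketch swaps in a single linear map for brevity. The per-layer conditioning embeddings, the sizes `D_MODEL` and `D_EMBED`, the `tanh` activation, and all function names are assumptions for illustration. The generated MLP here acts on each token independently; how HyperBERT mixes information across tokens is not stated in the abstract and is not modelled.

```python
import math
import random

random.seed(0)

D_MODEL = 8   # hypothetical hidden size (not taken from the paper)
D_EMBED = 4   # hypothetical size of each per-layer conditioning embedding
N_LAYERS = 4  # the abstract's encoder has 4 layers


def rand_matrix(rows, cols, scale=0.1):
    """Return a rows x cols matrix of small Gaussian values."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


# Shared hypernetwork (assumed form): one linear map from a layer embedding
# to a flattened D_MODEL x D_MODEL weight matrix followed by a bias vector.
hyper_w = rand_matrix(D_MODEL * D_MODEL + D_MODEL, D_EMBED)

# One conditioning vector per encoder layer (an assumption of this sketch).
layer_embeddings = [
    [random.gauss(0.0, 1.0) for _ in range(D_EMBED)] for _ in range(N_LAYERS)
]


def generate_mlp(layer_idx):
    """Generate the weight and bias of one layer's single-layer MLP."""
    flat = matvec(hyper_w, layer_embeddings[layer_idx])
    split = D_MODEL * D_MODEL
    weight = [flat[i * D_MODEL:(i + 1) * D_MODEL] for i in range(D_MODEL)]
    bias = flat[split:]
    return weight, bias


def generated_mlp_forward(layer_idx, tokens):
    """Apply the generated single-layer MLP to every token vector."""
    weight, bias = generate_mlp(layer_idx)
    out = []
    for tok in tokens:
        pre = [p + b for p, b in zip(matvec(weight, tok), bias)]
        out.append([math.tanh(v) for v in pre])
    return out


# Run three random token vectors through all layers; every layer's MLP
# comes from the same shared hypernetwork parameters.
tokens = [[random.gauss(0.0, 1.0) for _ in range(D_MODEL)] for _ in range(3)]
hidden = tokens
for layer in range(N_LAYERS):
    hidden = generated_mlp_forward(layer, hidden)

# Parameter counts for this toy setup only (not the paper's numbers).
shared_params = len(hyper_w) * D_EMBED + N_LAYERS * D_EMBED
per_layer_qkvo = N_LAYERS * 4 * D_MODEL * D_MODEL
```

In this toy configuration the shared generator plus embeddings hold fewer parameters than four separate sets of QKVO matrices, and its cost grows only by one embedding per added layer. These counts describe the sketch alone and say nothing about the sizes reported for HyperBERT.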

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Submission Number: 23230
