LOGAH: Initialize Large Transformers via Small Graph HyperNetworks

Published: 11 Jun 2025, Last Modified: 10 Jul 2025 · ES-FoMo III · CC BY 4.0
Keywords: Parameter Prediction, HyperNetwork, Parameter Initialization
TL;DR: We propose LOGAH (Low-rank GrAph Hypernetworks), a GHN with a low-rank parameter decoder that expands to significantly wider networks without requiring an increase in parameters as excessive as in previous attempts.
Abstract: A good initialization of deep learning models is essential since it can help them converge better and faster. One recent and underexplored approach to a good initialization is to use Graph HyperNetworks (GHNs) to predict good model parameters given the model's computational graph. One key limitation of GHNs is that, for very wide networks, they copy small predicted chunks of parameters multiple times and require an extremely large number of parameters to support full prediction, which greatly hinders their adoption in practice. To address this limitation, we propose LOGAH (Low-rank GrAph Hypernetworks), a GHN with a low-rank parameter decoder that expands to significantly wider networks without requiring an increase in parameters as excessive as in previous attempts. LOGAH allows us to predict the parameters of large neural networks in a memory-efficient manner. We show that vision models (i.e., ViT) initialized with LOGAH achieve better performance than those initialized randomly or using existing GHNs.
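To make the low-rank idea concrete, below is a minimal sketch of a low-rank parameter decoder in PyTorch. It is not the paper's actual LOGAH decoder; the class name, the `rank` and `embed_dim` parameters, and the parameter-count comparison are illustrative assumptions. The sketch shows how predicting two low-rank factors instead of a full weight matrix lets the decoder head grow linearly rather than quadratically with layer width.

```python
import torch
import torch.nn as nn


class LowRankParamDecoder(nn.Module):
    """Illustrative sketch (not the paper's exact decoder).

    Given a node embedding h from a GHN's graph encoder, predict two
    low-rank factors A (d_out x r) and B (r x d_in) and return W = A @ B.
    A full-rank head would need embed_dim * d_out * d_in parameters,
    which grows quadratically with layer width; the low-rank head needs
    only embed_dim * r * (d_out + d_in).
    """

    def __init__(self, embed_dim: int, d_out: int, d_in: int, rank: int = 32):
        super().__init__()
        self.d_out, self.d_in, self.rank = d_out, d_in, rank
        # Two small prediction heads instead of one d_out * d_in head.
        self.to_a = nn.Linear(embed_dim, d_out * rank)
        self.to_b = nn.Linear(embed_dim, rank * d_in)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        a = self.to_a(h).view(self.d_out, self.rank)
        b = self.to_b(h).view(self.rank, self.d_in)
        return a @ b  # predicted weight matrix of shape (d_out, d_in)


if __name__ == "__main__":
    embed_dim, d_out, d_in, rank = 64, 1024, 1024, 32
    h = torch.randn(embed_dim)  # node embedding for one target layer
    decoder = LowRankParamDecoder(embed_dim, d_out, d_in, rank)
    w = decoder(h)
    print(w.shape)  # torch.Size([1024, 1024])

    full_head = embed_dim * d_out * d_in          # full-rank head size
    lowrank_head = embed_dim * rank * (d_out + d_in)
    print(full_head, lowrank_head)                # low-rank head is far smaller
```

Under these assumptions, widening the target layer (larger d_out, d_in) only grows the decoder head linearly, which is the memory-efficiency property the abstract attributes to the low-rank decoder.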
Submission Number: 59