TL;DR: We introduce LAuReL, a novel generalization of the residual/skip connection that serves as a drop-in replacement. LAuReL significantly improves the performance of vision and language models while adding as few as 0.01% more parameters and negligible extra latency.
Abstract: One of the core pillars of efficient deep learning is architectural improvement, exemplified by residual/skip connections, which have led to significantly better model convergence and quality. Since their introduction, residual connections have become ubiquitous not only in convolutional neural networks but also in transformer-based architectures, the backbone of LLMs.
In this paper, we introduce the Learned Augmented Residual Layer (LAuReL) --- a novel generalization of the canonical residual connection --- designed to serve as an in-situ replacement while outperforming it in both model quality and footprint metrics. Our experiments show that LAuReL can enhance quality for both vision and language models while adding fewer parameters and incurring less latency and memory overhead than naively increasing parameter count.
For example, on the ImageNet-1K task, LAuReL achieves the same model quality improvements as naively adding an extra layer while using $2.6 \times$ fewer parameters. Similarly, when pre-training 1B and 4B parameter LLMs, LAuReL improves performance on a variety of challenging downstream evaluation tasks by 2.54\% to 20.05\%, while adding only 0.012\% and 0.1\% additional parameters, respectively.
Lay Summary: Residual/skip connections are crucial to the strong performance of popular neural network architectures such as CNNs and Transformers. A residual connection typically combines a layer's output with that layer's input (the output of the preceding layer) by simple addition. Residual connections help avoid issues such as vanishing gradients and speed up convergence to a low loss. We show that the residual connection can be improved by introducing lightweight learned linear components, yielding significantly better model quality with minimal extra parameters and latency.
In this paper, we introduce the Learned Augmented Residual Layer (LAuReL), a generalization of the residual connection and a drop-in replacement for it. LAuReL is a general framework; we present three variants that cheaply make the residual connection adaptive rather than a simple summation. These variants can be combined with one another, and new variants can be constructed within the LAuReL framework.
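To make the idea concrete, below is a minimal PyTorch-style sketch of one way such an adaptive residual connection could look: a learned scalar weighting of the residual and skip branches plus a low-rank linear term on the skip path. The class name `LaurelResidual`, the `rank` parameter, and the initialization choices are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn


class LaurelResidual(nn.Module):
    """Illustrative sketch of a learned augmented residual connection.

    Replaces the plain skip connection  x + f(x)  with
        alpha * f(x) + beta * x + B(A(x)),
    where alpha and beta are learned scalars and A, B form a low-rank
    linear map. Names and initializations are illustrative only.
    """

    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        # Learned scalar weights, initialized so the module starts out
        # identical to the standard residual connection (alpha = beta = 1).
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))
        # Low-rank linear term on the skip path, zero-initialized so it
        # is a no-op at the start of training.
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)

    def forward(self, x: torch.Tensor, fx: torch.Tensor) -> torch.Tensor:
        # x:  input to the block (the skip branch)
        # fx: output of the block's main branch, f(x)
        return self.alpha * fx + self.beta * x + self.up(self.down(x))


if __name__ == "__main__":
    dim = 64
    block = nn.Sequential(nn.Linear(dim, dim), nn.GELU())  # stand-in for f
    laurel = LaurelResidual(dim, rank=4)
    x = torch.randn(2, dim)
    y = laurel(x, block(x))  # instead of  x + block(x)
    print(y.shape)           # torch.Size([2, 64])
    # Added parameters: 2 scalars + 2 * dim * rank, tiny relative to the block.
```

Because the extra parameters scale with the hidden dimension times a small rank rather than with the full layer size, the overhead stays negligible compared to adding another layer.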
Through experiments, we demonstrate that LAuReL improves model quality for both vision and language models while adding fewer parameters and incurring less latency and memory overhead than naive alternatives such as simply adding another layer to the network.
On the ImageNet-1K task, LAuReL achieves the same model quality improvements as naively adding an extra layer while using $2.6 \times$ fewer parameters. Similarly, when pre-training 1B and 4B parameter LLMs, LAuReL improves performance on a variety of challenging downstream evaluation tasks by 2.54% to 20.05%, while adding only 0.012% and 0.1% additional parameters, respectively.
Primary Area: Deep Learning->Everything Else
Keywords: residual connection, skip connection, residual layer, resnet, residual
Submission Number: 11789