TL;DR: A new type of reversible transformer for online back-propagation that improves generalization performance and reduces to the standard transformer at inference.
Abstract: In this work we present the BDIA-transformer, an exact bit-level reversible transformer that uses an unchanged standard architecture for inference. The basic idea is to first treat each transformer block as the Euler integration approximation for solving an ordinary differential equation (ODE), and then to incorporate the technique of bidirectional integration approximation (BDIA), originally designed for diffusion inversion, into the neural architecture, together with activation quantization, to make it exactly bit-level reversible. During training, we let a hyper-parameter $\gamma$ in the BDIA-transformer randomly take one of the two values $\{0.5, -0.5\}$ per training sample per transformer block for averaging every two consecutive integration approximations. As a result, the BDIA-transformer can be viewed as training an ensemble of ODE solvers parameterized by a set of binary random variables, which regularizes the model and results in improved validation accuracy. Lightweight side information must be stored in the forward pass to account for the binary quantization loss and enable exact bit-level reversibility. At inference, $\gamma$ is set to its expectation $\mathbb{E}(\gamma)=0$, which makes the resulting architecture identical to the standard transformer up to activation quantization. Our experiments on natural language generation, image classification, and language translation show that BDIA-transformers significantly outperform their conventional counterparts in validation performance while also requiring considerably less training memory. Thanks to the regularizing effect of the ensemble, the BDIA-transformer is particularly suitable for fine-tuning with limited data. Source code is available via \href{https://github.com/guoqiang-zhang-x/BDIA-Transformer}{this link}.
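To make the training-time update concrete, below is a minimal PyTorch sketch of a BDIA-style block update. The exact coefficients of the paper's update rule are not reproduced here: we assume the representative two-step form $x_{k+1} = \gamma x_{k-1} + (1-\gamma) x_k + f_k(x_k)$, which reduces to the standard residual update $x_{k+1} = x_k + f_k(x_k)$ when $\gamma$ takes its expectation $0$; the function name `bdia_step` is illustrative.

```python
import torch

def bdia_step(x_prev, x_cur, block, training=True):
    """One BDIA-style update (assumed form; see the paper for the exact coefficients):
    x_next = gamma * x_prev + (1 - gamma) * x_cur + block(x_cur).
    x_prev/x_cur have shape (batch, seq_len, dim); `block` is transformer block k.
    """
    if training:
        # gamma in {+0.5, -0.5}, drawn i.i.d. per training sample per block
        coin = torch.randint(0, 2, (x_cur.shape[0], 1, 1), device=x_cur.device)
        gamma = coin.to(x_cur.dtype) - 0.5
    else:
        gamma = 0.0  # E[gamma] = 0: the update reduces to x_cur + block(x_cur)
    x_next = gamma * x_prev + (1.0 - gamma) * x_cur + block(x_cur)
    return x_next, gamma
```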
Lay Summary: Nowadays, almost all popular large language models (LLMs) use the transformer architecture, which consists of a sequence of blocks, to learn from data. Fine-tuning a pre-trained transformer-based LLM for a downstream task usually involves a small or medium-sized dataset and is therefore prone to overfitting, meaning that the LLM tends to memorize the data rather than understand it.
Our work proposes a new technique, named __bidirectional integration approximation__ (BDIA), that reduces overfitting when fine-tuning a transformer-based LLM. The basic idea is to fine-tune an ensemble of transformer-based LLMs parameterized by a set of binary random variables, which encourages the LLMs in the ensemble to understand the data rather than memorize it. After fine-tuning, we take the average of all the LLMs in the ensemble as the final LLM to be used in practice.
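As a rough picture of the ensemble view, the sketch below builds on the hypothetical `bdia_step` above (with the added assumption that both running states are initialized to the input): each realization of the per-sample, per-block binary draws selects one member of the ensemble, and inference with $\gamma = 0$ corresponds to the averaged model.

```python
def bdia_forward(x0, blocks, training=True):
    # Each draw of the per-sample, per-block gammas selects one ensemble
    # member; with B blocks there are 2**B members per training sample.
    x_prev, x_cur = x0, x0  # assumed initialization: x_{-1} = x_0
    for block in blocks:
        x_next, _ = bdia_step(x_prev, x_cur, block, training)
        x_prev, x_cur = x_cur, x_next
    return x_cur
```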
If needed, BDIA can also be implemented to save GPU memory during fine-tuning. To do so, we quantize the output of each block of each transformer model in the ensemble when feeding input data to the model. With BDIA and quantization combined, each block in each model of the ensemble can be updated on the fly, without storing the intermediate activations of every block.
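For intuition on the memory saving, here is a hedged sketch of the reconstruction under the same assumed update. In the forward pass each block output is quantized and the quantization residual is kept as lightweight side information; during back-propagation, $x_{k-1}$ is recovered from $(x_k, x_{k+1})$ by re-running block $k$ and inverting the update, so intermediate activations need not be stored. In the paper the states are quantized so that the inversion is exact at the bit level (note that $1/\gamma = \pm 2$ is exact in binary arithmetic); this floating-point sketch only illustrates the algebra, and the names `quantize` and `reconstruct_prev` are illustrative.

```python
def quantize(x, step=2.0 ** -12):
    # Round activations to a fixed grid; the residual is the lightweight
    # side information stored to enable exact bit-level reversibility.
    q = torch.round(x / step) * step
    return q, x - q  # (quantized activation, side information)

def reconstruct_prev(x_next_q, side, x_cur, block, gamma):
    # Invert the assumed update x_next = gamma*x_prev + (1-gamma)*x_cur + block(x_cur):
    # add back the stored residual, re-run the block deterministically on the
    # quantized x_cur, and solve for x_prev.
    x_next = x_next_q + side
    return (x_next - (1.0 - gamma) * x_cur - block(x_cur)) / gamma
```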
Experiments on natural language generation, translation, and image classification confirm that our BDIA technique indeed reduces overfitting, encouraging the transformer model to understand the data rather than memorize it.
Link To Code: https://github.com/guoqiang-zhang-x/BDIA-Transformer
Primary Area: Deep Learning->Algorithms
Keywords: transformer; reversibility; BDIA; quantization
Submission Number: 14733