LayerGLAT: A Flexible Non-autoregressive Transformer for Single-Pass and Multi-pass Prediction

Published: 01 Jan 2024, Last Modified: 17 Jul 2025 · ECML/PKDD (2) 2024 · CC BY-SA 4.0
Abstract: Non-autoregressive transformers (NATs) have made substantial progress in recent years, improving their predictive accuracy while achieving speed-ups of an order of magnitude over their conventional, autoregressive counterparts. However, the performance gap between NATs and autoregressive transformers (ATs) remains significant, which has spurred the development of “iterative” NATs that predict through multiple passes, trading some speed for accuracy. Notwithstanding the clear benefits of both fully non-autoregressive and iterative NATs, research seems to have overlooked the possibility of integrating the two effectively so as to deliver strong single- and multi-pass prediction while retaining the highest possible speed-up. To bridge this gap, this paper introduces LayerGLAT, a hybrid model that combines the strengths of fully non-autoregressive and iterative NATs and achieves competitive performance in both single-pass and iterative prediction. The key idea of the proposed approach is a layer-wise training strategy that emulates the generating conditions of both single-pass and multi-pass generation, leading to strong performance in both settings. Experimental results on three machine translation datasets demonstrate the strong performance of the proposed model, which outperforms leading NATs in accuracy and speed and approaches the accuracy of ATs. (Our code is publicly available at https://github.com/lsj72123/layer-GLAT.)
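The abstract does not spell out the training procedure, so the following is only a minimal, hypothetical PyTorch sketch of what a layer-wise glancing (GLAT-style) training step could look like: a first no-gradient pass from a fully masked input emulates single-pass decoding, a glancing mask reveals a fraction of gold tokens to emulate a later refinement pass, and a cross-entropy loss is attached to every decoder layer so each layer learns to refine the previous one. All names (`LayerwiseDecoder`, `glancing_mask`, `train_step`) and the omission of source-side conditioning are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Hypothetical sketch of layer-wise glancing training; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerwiseDecoder(nn.Module):
    """Toy non-autoregressive decoder exposing logits at every layer.
    Source-side conditioning is omitted for brevity (an assumption)."""
    def __init__(self, vocab_size, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.proj = nn.Linear(d_model, vocab_size)  # output head shared across layers

    def forward(self, tokens):
        h = self.embed(tokens)
        per_layer_logits = []
        for layer in self.layers:
            h = layer(h)
            per_layer_logits.append(self.proj(h))  # supervise every layer
        return per_layer_logits

def glancing_mask(pred, target, ratio=0.5):
    """GLAT-style glancing: reveal a number of gold tokens proportional
    to the Hamming distance between the first-pass prediction and the target."""
    n_reveal = ((pred != target).sum(dim=1).float() * ratio).long()
    scores = torch.rand(target.shape, device=target.device)
    mask = torch.zeros_like(target, dtype=torch.bool)
    for i in range(target.size(0)):
        if n_reveal[i] > 0:
            mask[i, scores[i].topk(int(n_reveal[i])).indices] = True
    return mask  # True at positions whose gold token is fed back in

def train_step(model, mask_id, target, optimizer):
    # Pass 1 (no grad): decode from a fully masked input, as in
    # single-pass inference.
    masked = torch.full_like(target, mask_id)
    with torch.no_grad():
        pred = model(masked)[-1].argmax(-1)
    # Glancing: mix in some gold tokens, emulating a later refinement pass.
    reveal = glancing_mask(pred, target)
    mixed = torch.where(reveal, target, masked)
    # Pass 2 (with grad): layer-wise loss over the non-revealed positions.
    loss = sum(
        F.cross_entropy(logits[~reveal], target[~reveal])
        for logits in model(mixed)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random data (purely illustrative).
vocab, mask_id = 100, 0
model = LayerwiseDecoder(vocab)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
tgt = torch.randint(1, vocab, (8, 16))  # batch of 8 length-16 targets
print(train_step(model, mask_id, tgt, opt))
```

Under this reading, the per-layer supervision is what would let one model serve both regimes: taking the last layer's prediction directly gives single-pass output, while re-feeding intermediate predictions corresponds to iterative refinement.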