Abstract: The decoder-only architecture has become a key driver of the current wave of large language models. However, its reliance on causal (i.e., left-to-right) attention limits its text understanding capabilities compared to encoder-only models that use bidirectional attention. Meanwhile, existing encoder-only models are constrained by the scale of their training data and model size, resulting in limited understanding, especially in non-English contexts. To obtain encoder-only models with strong understanding capabilities as well as multilingual knowledge, we propose Dec2Enc, which transforms decoder-only models into encoder-only models by recovering bidirectional attention, thereby unlocking their understanding potential. In particular, Dec2Enc uses a zero-initialization strategy that begins fine-tuning with the original causal attention mechanism and gradually learns bidirectional attention during training, mitigating the training disruption that arises from the mismatch between the attention mechanisms used in pre-training and fine-tuning. In experiments with decoder-only models ranging from 0.5B to 9B parameters, Dec2Enc boosts understanding capabilities and exploits multilingual knowledge, increasing the rate of exact-match answers across seven languages by 5.2% to 22.4% over vanilla decoder-only models while also outperforming existing encoder-only models in overall performance. Our code is available at https://github.com/nju-websoft/Dec2Enc.
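For intuition, the sketch below illustrates one way a zero-initialized transition from causal to bidirectional attention could be realized: a learnable gate, starting at zero, scales down the additive mask that blocks future positions. The gating parameterization, class name, and single-head setup are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ZeroInitBidirectionalAttention(nn.Module):
    """Single-head attention whose mask starts out causal and can gradually
    be opened to bidirectional attention as a zero-initialized gate is trained.
    (Hypothetical sketch; names and parameterization are assumptions.)"""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Learnable gate initialized to zero: at the start of fine-tuning the
        # module reproduces the pre-trained causal attention exactly.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        seq_len, dim = x.size(1), x.size(2)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / dim ** 0.5

        # Additive bias that blocks future positions (strict upper triangle).
        future_bias = torch.triu(
            torch.full((seq_len, seq_len), -1e4, device=x.device), diagonal=1
        )
        # gate == 0  -> bias fully applied, i.e., standard causal attention;
        # as the gate grows toward 1 during training, future tokens become
        # visible and the attention turns bidirectional.
        scores = scores + (1.0 - self.gate) * future_bias

        attn = F.softmax(scores, dim=-1)
        return attn @ v


# Usage: at initialization the output matches causal attention on the input.
layer = ZeroInitBidirectionalAttention(dim=64)
out = layer(torch.randn(2, 10, 64))  # shape (2, 10, 64)
```

Because the gate is zero when fine-tuning starts, the layer initially behaves exactly like the pre-trained causal decoder, avoiding an abrupt mismatch; the degree of bidirectionality is then learned from the fine-tuning signal.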