Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC-SA 4.0
TL;DR: We propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) using a decoder-only Transformer.
Abstract: Large Language Models (LLMs) have been found to struggle with accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard autoregressive next-token prediction using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Intensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par with or better than NTP on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP’s effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model’s focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models. Code has been submitted.
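The training step described in the abstract can be sketched in a few lines. The following is a minimal illustration, assuming a PyTorch setup with a Hugging Face-style causal language model; the function name `meap_corrupt`, the `mask_token_id` argument, and the 0.15 masking ratio are illustrative placeholders rather than details taken from the paper, which only specifies that a small fraction of input tokens is masked.

```python
import torch

def meap_corrupt(input_ids: torch.Tensor, mask_token_id: int, mask_ratio: float = 0.15):
    """Randomly replace a small fraction of input tokens with a mask token.

    The targets stay identical to the original sequence, so the model is still
    trained with plain causal (decoder-only) next-token prediction; no
    bidirectional attention or encoder-decoder machinery is needed.
    """
    corrupted = input_ids.clone()
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_ratio
    corrupted[mask] = mask_token_id
    labels = input_ids.clone()  # the original, unmasked sequence serves as the NTP targets
    return corrupted, labels

# Usage with a Hugging Face-style causal LM (illustrative):
# corrupted, labels = meap_corrupt(batch["input_ids"], mask_token_id=tokenizer.mask_token_id)
# loss = model(input_ids=corrupted, labels=labels).loss  # standard shifted next-token loss
```

Details such as the exact masking ratio and whether the loss is also computed at masked positions are governed by the authors' released implementation, linked below.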
Lay Summary: Modern large language models (LLMs), like GPT, LLaMa, and DeepSeek, are trained by learning to predict the next word in a sentence. This method, called Next-Token Prediction (NTP), helps models generate coherent text but struggles with finding important information in long documents.

This paper introduces a new training method called Mask-Enhanced Autoregressive Prediction (MEAP). MEAP adds a twist to the standard training process: it randomly hides (or masks) some words in the input, forcing the model to focus more on the remaining visible words when learning. Surprisingly, this small change leads to much better understanding and recall of important facts, especially when dealing with long or complex text.

Unlike older methods that use masking (like BERT), MEAP does not require more complicated architectures or extra computing power. It simply combines the strengths of two training methods, masking and next-token prediction, into one seamless process.

Experiments show that models trained with MEAP:
- Perform better at retrieving key information (e.g., finding a “needle in a haystack” within long documents).
- Handle longer contexts more effectively.
- Make fewer factual errors when summarizing.
- Maintain or improve general reasoning skills compared to standard training.

Because MEAP is simple to implement, works with existing model architectures, and improves both accuracy and efficiency, it offers a promising new direction for training the next generation of LLMs.
Link To Code: https://github.com/scitix/MEAP
Primary Area: Deep Learning->Large Language Models
Keywords: Masked next token prediction, LLM pre-training, key information retrieval
Submission Number: 6056