Abstract: Most current large language models (LLMs) based on next-token prediction suffer from failures in both teacher-forcing training and autoregressive inference. Although non-autoregressive approaches offer alternatives that mitigate these problems, the difficulty of inference and the enormous cost of long text generation greatly impede their application to the tasks an LLM is good at. We present a framework that predicts the next $n$ tokens at once, bridging the gap between autoregressive and non-autoregressive generation. Within this framework, we propose to append multiple identical mask tokens after the input context and to use a novel mask recipe, the future-aware self-attention mask, for generation. Using this method, we finetune pretrained models of the Qwen2 series and evaluate the resulting models on five benchmarks. Our finetuned model clearly surpasses models trained with two existing methods under the same conditions. We also verify the great potential of our method for unrolling autoregressive generation and discuss several directions for further improvement.
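The abstract does not give implementation details, so the following is only a minimal PyTorch-style sketch of one plausible reading of the setup: $n$ identical mask tokens are appended after the input context, and an attention mask lets each appended token see the full context. All names (`MASK_ID`, `n_future`, `build_future_aware_mask`) and the exact mask pattern are assumptions for illustration; the paper's actual future-aware self-attention mask recipe may differ.

```python
import torch

MASK_ID = 151650   # assumed id of a special [MASK] token added to the vocabulary
n_future = 4       # number of next tokens to predict at once

def append_mask_tokens(input_ids: torch.Tensor, n: int) -> torch.Tensor:
    """Append n identical mask tokens after the context (batch, seq_len)."""
    masks = torch.full((input_ids.size(0), n), MASK_ID, dtype=input_ids.dtype)
    return torch.cat([input_ids, masks], dim=1)

def build_future_aware_mask(ctx_len: int, n: int) -> torch.Tensor:
    """Boolean attention mask of shape (ctx_len + n, ctx_len + n).

    Context positions keep the usual causal pattern; each appended mask token
    attends to the whole context and to itself. This is one possible
    interpretation of a "future-aware" recipe, not the paper's definition.
    """
    total = ctx_len + n
    allowed = torch.tril(torch.ones(total, total, dtype=torch.bool))  # causal base
    for i in range(ctx_len, total):
        allowed[i, ctx_len:] = False  # mask tokens do not attend to each other
        allowed[i, i] = True          # but each one attends to itself
    return allowed

# Usage: extend a batch of context ids and build the corresponding mask.
ctx = torch.randint(0, 1000, (2, 8))
extended = append_mask_tokens(ctx, n_future)       # shape (2, 12)
attn_mask = build_future_aware_mask(8, n_future)   # shape (12, 12)
```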
Paper Type: Long
Research Area: Generation
Research Area Keywords: next-n-token prediction, future-aware self-attention mask, autoregressive generation
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Keywords: large language model, generation
Submission Number: 36