Abstract: Most current large language models (LLMs) based on next-token prediction suffer from failures in both teacher-forcing training and autoregressive inference. Although non-autoregressive approaches offer alternatives that mitigate these problems, the difficulty of inference and the enormous cost of long text generation greatly impede their application to the tasks an LLM is good at. We present a framework that predicts the next $n$ tokens at once, bridging the gap between autoregressive and non-autoregressive generation. Within this framework, we propose to append multiple identical mask tokens after the input context and to use a novel mask recipe, the future-aware self-attention mask, for generation. Using this method, we finetune pretrained models of the Qwen2 series and evaluate the resulting models on five benchmarks. Our finetuned model clearly surpasses models trained with two existing methods under the same conditions. We also verify the great potential of our method for unrolling autoregressive generation and discuss several directions for further improvement.
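The abstract does not give implementation details, so the following is only a minimal PyTorch-style sketch of one plausible reading of the setup: $n$ identical mask tokens are appended after the input context, and an attention mask lets each appended token see the full context. All names (`MASK_ID`, `n_future`, `build_future_aware_mask`) and the exact mask pattern are assumptions for illustration; the paper's actual future-aware self-attention mask recipe may differ.

```python
import torch

MASK_ID = 151650   # assumed id of a special [MASK] token added to the vocabulary
n_future = 4       # number of next tokens to predict at once

def append_mask_tokens(input_ids: torch.Tensor, n: int) -> torch.Tensor:
    """Append n identical mask tokens after the context (batch, seq_len)."""
    masks = torch.full((input_ids.size(0), n), MASK_ID, dtype=input_ids.dtype)
    return torch.cat([input_ids, masks], dim=1)

def build_future_aware_mask(ctx_len: int, n: int) -> torch.Tensor:
    """Boolean attention mask of shape (ctx_len + n, ctx_len + n).

    Context positions keep the usual causal pattern; each appended mask token
    attends to the whole context and to itself. This is one possible
    interpretation of a "future-aware" recipe, not the paper's definition.
    """
    total = ctx_len + n
    allowed = torch.tril(torch.ones(total, total, dtype=torch.bool))  # causal base
    for i in range(ctx_len, total):
        allowed[i, ctx_len:] = False  # mask tokens do not attend to each other
        allowed[i, i] = True          # but each one attends to itself
    return allowed

# Usage: extend a batch of context ids and build the corresponding mask.
ctx = torch.randint(0, 1000, (2, 8))
extended = append_mask_tokens(ctx, n_future)       # shape (2, 12)
attn_mask = build_future_aware_mask(8, n_future)   # shape (12, 12)
```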
Paper Type: Long
Research Area: Generation
Research Area Keywords: next-n-token prediction, future-aware self-attention mask, autoregressive generation
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Keywords: large language model, generation
Submission Number: 36