Long Context Understanding using Self-Generated Synthetic Data

Published: 18 Jun 2024, Last Modified: 16 Jul 2024 · LCFM 2024 · CC BY 4.0
Keywords: Long context, synthetic data, context compression
TL;DR: We study the problem of extending the context window of large language models using self-generated synthetic data
Abstract: Can a large language model (LLM) learn to understand longer context using self-generated instruction tuning data? Such a capability would be important not only for long context modeling, but also for avoiding the licensing or copyright issues that may arise when relying on a separate teacher model to generate long context training data. In this paper, we address this challenge by proposing a novel set of diverse synthetic tasks that enable an LLM to create long context instruction tuning data in a scalable manner. This data is then used by the same LLM to compress its activations (and hence extend its context length) by learning a battery of low-rank adapters (LoRA), where each adapter is trained to focus on a specific compression rate. During inference, the LoRA compression experts are dynamically selected according to the length of the input. We showcase the effectiveness of our approach on the LongBench evaluation, covering tasks such as question answering, summarization, few-shot learning, and code completion. We also plan to make our long context instruction data available to the community as a resource for practitioners.
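As a rough illustration of the inference-time routing described in the abstract (not the authors' implementation), the sketch below picks one LoRA "compression expert" based on the input length. The adapter names, compression rates, and length thresholds are hypothetical assumptions introduced here for illustration only.

```python
# Hypothetical sketch of length-based routing among LoRA "compression experts".
# Adapter names, compression rates, and thresholds below are illustrative
# assumptions, not the paper's actual configuration.

from dataclasses import dataclass


@dataclass
class CompressionExpert:
    name: str               # identifier of the LoRA adapter
    compression_rate: int   # e.g., 4 means activations are compressed 4-to-1
    max_input_tokens: int   # largest input length this expert is intended for


# A battery of LoRA adapters, each specialized for one compression rate.
EXPERTS = [
    CompressionExpert("lora_2x", compression_rate=2, max_input_tokens=8_192),
    CompressionExpert("lora_4x", compression_rate=4, max_input_tokens=16_384),
    CompressionExpert("lora_8x", compression_rate=8, max_input_tokens=32_768),
]


def select_expert(num_input_tokens: int) -> CompressionExpert:
    """Pick the least aggressive adapter whose capacity covers the input."""
    for expert in EXPERTS:
        if num_input_tokens <= expert.max_input_tokens:
            return expert
    # Fall back to the most aggressive compression for very long inputs.
    return EXPERTS[-1]


if __name__ == "__main__":
    for n in (4_000, 12_000, 100_000):
        expert = select_expert(n)
        print(f"{n} tokens -> {expert.name} ({expert.compression_rate}x compression)")
```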
Submission Number: 22