TL;DR: We identify when synthetic data falls short for learning long-context abilities and trace the explanation to a specific set of attention heads.
Abstract: Long-context LLMs are increasingly in demand for applications such as retrieval-augmented generation. To defray the cost of pretraining LLMs over long contexts, recent work takes an approach of synthetic context extension: fine-tuning LLMs on synthetically generated long-context data. However, it remains unclear how and why this synthetic context extension imparts abilities for downstream long-context tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We vary the realism of the "needle" concepts to be retrieved and the diversity of the surrounding "haystack" context, ranging from documents constructed by LLMs to templated relations and fully symbolic datasets. Although models trained on synthetic data underperform models trained on real data, the impacts of both training settings can be understood via a shared feature of the attention computation: retrieval heads (Wu et al., 2024). The retrieval heads learned from synthetic data have high overlap with retrieval heads learned on real data. Furthermore, there is a strong correlation between the recall of the learned retrieval heads and the downstream performance of a model, allowing us to interpret and predict the performance of models trained in different settings. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world LLM capabilities over long contexts.
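To make the retrieval-head analysis concrete, below is a minimal sketch, not the paper's actual code, of how a per-head retrieval score could be computed in the spirit of Wu et al. (2024): a head scores highly if, at decoding steps where the model copies a needle token into its answer, that head's strongest attention lands inside the needle span. The function name, the `attentions` tensor layout, and the `needle_positions` / `copied_steps` inputs are illustrative assumptions, not the authors' interface.

```python
import numpy as np

def retrieval_scores(attentions, needle_positions, copied_steps):
    """Score each attention head by how often its argmax attention lands on the
    needle span while the model is copying needle tokens into the answer.

    attentions:       array of shape (num_layers, num_heads, num_steps, context_len),
                      attention from each generated token back to the context
                      (illustrative layout; real attention caches differ by framework).
    needle_positions: set of context indices covered by the needle.
    copied_steps:     decoding steps at which the generated token copies a needle token.
    """
    num_layers, num_heads, _, _ = attentions.shape
    scores = np.zeros((num_layers, num_heads))
    for layer in range(num_layers):
        for head in range(num_heads):
            # Count copy steps where this head attends most strongly to the needle.
            hits = sum(
                int(np.argmax(attentions[layer, head, t])) in needle_positions
                for t in copied_steps
            )
            scores[layer, head] = hits / max(len(copied_steps), 1)
    return scores  # heads with high scores act as retrieval heads
```

Under this sketch, comparing the score matrices of a model fine-tuned on synthetic data and one fine-tuned on real data is what gives the head-overlap and recall measurements referenced above.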
Lay Summary: Increasingly, we expect large language models to process long documents, but obtaining training data for such “long context” tasks is very costly. A popular approach involves automatically constructing training data based on templates. Yet it is unclear how and why training on such “synthetic data” works when its contents are very different from realistic long-context tasks.
To study this, we explore three long-context tasks that require locating and reasoning over information (the “needle”) found in the input documents (the “haystack”). We vary the realism of the “needle” sentences to be located and the diversity of the surrounding “haystack” context to synthesize training datasets of different complexity and naturalness. We then examine the inner workings of LLMs trained on these different versions by analyzing “retrieval heads”, the model components responsible for locating the correct information within the context to generate answers.
We found that even when trained on unrealistic-looking synthetic data, models developed retrieval heads very similar to those that emerged when trained on realistic data. The overlap of these heads is also highly correlated with real task performance. This link allows model developers to forecast success early, filter out ineffective synthetic data, and design better training setups for long-context LLMs (see the sketch below).
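As a hedged illustration of how such forecasting might work in practice (hypothetical helpers, not the paper's released code), one could take the per-head scores from the earlier sketch, compare the top retrieval heads of a synthetically trained model against a reference model trained on real data, and treat high overlap as an early signal that the synthetic data is teaching the right mechanism.

```python
import numpy as np

def top_heads(scores, k=20):
    """Return the (layer, head) indices of the k heads with the highest retrieval scores."""
    flat = np.argsort(scores, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, scores.shape)
    return set(zip(rows.tolist(), cols.tolist()))

def head_overlap(scores_synth, scores_real, k=20):
    """Jaccard overlap between the top-k retrieval heads of two trained models."""
    a, b = top_heads(scores_synth, k), top_heads(scores_real, k)
    return len(a & b) / len(a | b)

# Hypothetical usage: a low overlap would flag a synthetic training set as
# unlikely to transfer, before running full downstream evaluations.
# overlap = head_overlap(retrieval_scores(attn_synth, needle, steps),
#                        retrieval_scores(attn_real, needle, steps))
```

The choice of Jaccard overlap and the cutoff k are assumptions made for this sketch; any set-similarity measure over the highest-scoring heads would serve the same forecasting purpose.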
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models, Synthetic Data, Long Context, Retrieval Heads
Submission Number: 14430