Quantifying the Importance of Data Alignment in Downstream Model Performance

ICLR 2024 Workshop DMLR Submission 86 Authors

Published: 04 Mar 2024 · Last Modified: 02 May 2024 · DMLR @ ICLR 2024 · License: CC BY 4.0
Keywords: LLMs, large language model, autoformalization, data centric machine learning, fine-tuning, automated reasoning
TL;DR: Interventionally increasing train-eval data alignment leads to better test performance, and preliminary results show that alignment matters more than dataset size.
Abstract: Contrary to the conventional emphasis on dataset size, we explore the role of data alignment, an often overlooked aspect of data quality, in training capable Large Language Models (LLMs). To do so, we use the Task2Vec-based alignment coefficient, a quantitative measure of the similarity between two datasets, to quantify the impact of aligning training data with evaluation data on downstream performance. We conduct controlled interventional experiments in two settings: the effect of increased alignment between various pre-training and evaluation datasets, and the effect of increased alignment between fine-tuning and evaluation data in the domain of autoformalization. In both settings, we find a strong, predictable negative correlation between the alignment coefficient of a model's training and evaluation data and the model's loss/perplexity on the corresponding downstream task: the more aligned the data, the lower the loss. These findings suggest a re-evaluation of LLM training approaches, demonstrating the importance of data alignment relative to data quantity, especially for specialized downstream tasks such as autoformalization.
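For intuition, a Task2Vec-based alignment coefficient between two datasets can be estimated from embeddings of sampled batches (e.g., diagonal Fisher Information vectors of a probe network fine-tuned on each batch). Below is a minimal sketch of such an estimate; the function names, the assumption of precomputed embeddings, and the all-pairs averaging scheme are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine distance between two Task2Vec embedding vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def alignment_coefficient(embeddings_a, embeddings_b) -> float:
    """Estimate alignment between two datasets as 1 minus the expected
    pairwise cosine distance between Task2Vec embeddings of batches
    drawn from each dataset (a sketch, not the paper's exact code)."""
    distances = [cosine_distance(ea, eb)
                 for ea in embeddings_a for eb in embeddings_b]
    return 1.0 - float(np.mean(distances))

# Hypothetical usage with random stand-ins for Task2Vec vectors;
# higher alignment should predict lower downstream loss/perplexity.
rng = np.random.default_rng(0)
emb_train = [rng.random(256) for _ in range(8)]
emb_eval = [rng.random(256) for _ in range(8)]
print(alignment_coefficient(emb_train, emb_eval))
```

Under this definition, two datasets whose batch embeddings coincide yield a coefficient near 1, while unrelated datasets yield a lower value, which is the scale on which a negative correlation with loss/perplexity would be measured.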
Primary Subject Area: Role of data in foundation models: pre-training, prompting, fine-tuning
Paper Type: Research paper: up to 8 pages
DMLR For Good Track: Participate in DMLR for Good Track
Participation Mode: Virtual
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 86