Data Alignment Predicts Language Model Performance: Evidence from Controlled Experiments in Autoformalization

Published: 20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · License: CC BY 4.0
Keywords: data centric machine learning, llms, autoformalization, reasoning, data selection, data quality metrics, deep learning
TL;DR: Data alignment is an effective metric for efficiently training LLMs in both pretraining and finetuning
Abstract: We investigate whether data alignment -- the similarity between training and evaluation data -- is a stronger predictor of language model performance than dataset size. Through controlled experiments, we demonstrate that alignment coefficients consistently predict downstream performance across three distinct metrics: Task2Vec embeddings (r^2 = 0.80-0.96), GZIP compression distance (r^2 = 0.90), and sentence embeddings (r^2 = 0.80). We consider two experimental settings: (1) pre-training on domain-specific corpora (PubMed, USPTO) and evaluating cross-domain performance, and (2) fine-tuning on autoformalization datasets with varying alignment to formal verification tasks. Our results show strong negative correlations between alignment and perplexity in both settings, with highly aligned small datasets (1.4k tokens) outperforming larger misaligned datasets (4.1k tokens) by 53% in perplexity reduction. These findings provide quantitative evidence that strategic data selection based on alignment can be more effective than simply scaling dataset size, offering practical guidance for efficient model training in specialized domains.
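One of the three alignment metrics named in the abstract, GZIP compression distance, is commonly instantiated as the normalized compression distance (NCD). The paper's exact alignment coefficient is not specified on this page, so the sketch below is a minimal, hypothetical illustration: it computes NCD between a training sample and an evaluation sample with Python's standard `gzip` module, and maps it to an alignment score as `1 - NCD` (an assumed convention, not the authors' definition).

```python
import gzip


def gzip_size(text: str) -> int:
    """Compressed size in bytes, used as a proxy for Kolmogorov complexity."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: lower means more shared structure."""
    cx, cy, cxy = gzip_size(x), gzip_size(y), gzip_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def alignment(train_sample: str, eval_sample: str) -> float:
    """Hypothetical alignment score: 1 - NCD, so higher means better aligned."""
    return 1.0 - ncd(train_sample, eval_sample)
```

Under this sketch, a formal-math training sample should score as more aligned with a formal-math evaluation sample than, say, patent text would, mirroring the pretraining comparison across PubMed and USPTO corpora.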
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23283