A Systematic Study of the Role of Data Quality and Alignment for Fine-tuning LLMs for Enhanced Autoformalization

Published: 19 Mar 2024, Last Modified: 01 Jun 2024
Venue: Tiny Papers @ ICLR 2024 Archive
License: CC BY 4.0
Keywords: autoformalization, data alignment, perplexity loss, reasoning, data centric machine learning, machine learning for theorem proving, neurosymbolic
TL;DR: Alignment of a dataset can predict how well an LLM can be fine-tuned to perform specific tasks such as autoformalization.
Abstract: This study explores the role of data quality, particularly alignment, in fine-tuning Large Language Models (LLMs) for autoformalization. Contrary to the conventional emphasis on dataset size, our research highlights the importance of data alignment, that is, the similarity between the training data and the target domain. Our experiments demonstrate a negative correlation between data alignment and model perplexity loss: the more closely the training data matches the target domain, the lower the perplexity loss. These findings suggest re-evaluating LLM training approaches to emphasize quality and relevance over quantity, especially in specialized applications such as autoformalization.
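The relationship described in the abstract can be illustrated with a minimal sketch. It assumes that perplexity is computed as the exponential of the mean negative log-likelihood per token, and that the correlation is measured with a plain Pearson coefficient; the abstract does not specify either choice, nor how alignment itself is scored. The function names and the numbers in the demo are hypothetical placeholders, not values or methods from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity as exp(mean negative log-likelihood) over the tokens.

    `token_log_probs` holds natural-log probabilities that a model assigned
    to each target token (an assumed input format, not the paper's).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical placeholder values for illustration only (NOT from the paper):
    # one alignment score and one post-fine-tuning perplexity per training set.
    alignment_scores = [0.2, 0.5, 0.8]
    perplexities = [30.0, 22.0, 15.0]
    r = pearson(alignment_scores, perplexities)
    print(f"Pearson r = {r:.3f}")  # a negative r means higher alignment, lower perplexity
```

Under these assumptions, a coefficient close to -1 would correspond to the pattern the abstract reports, where training sets that are better aligned with the target domain yield lower perplexity loss.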
Submission Number: 245