A Systematic Study of the Role of Data Quality and Alignment for Fine-tuning LLMs for Enhanced Autoformalization

Published: 19 Mar 2024, Last Modified: 19 Mar 2024, Tiny Papers @ ICLR 2024 Archive, CC BY 4.0
Keywords: autoformalization, data alignment, perplexity loss, reasoning, data-centric machine learning, machine learning for theorem proving, neurosymbolic
TL;DR: Alignment of a dataset can predict how well an LLM can be fine-tuned to perform specific tasks such as autoformalization.
Abstract: This study explores the role of data quality, particularly alignment, in fine-tuning Large Language Models (LLMs) for the task of autoformalization. Contrary to the conventional emphasis on dataset size, our research highlights the importance of data alignment, i.e., the similarity between the training data and the target domain. Through our experiments, we demonstrate a negative correlation between data alignment and model perplexity loss. These findings suggest a re-evaluation of LLM training approaches, emphasizing quality and relevance over quantity, especially in specialized applications such as autoformalization.
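The abstract does not specify how the correlation between alignment and perplexity loss is computed; as a minimal sketch, assuming alignment is a score in [0, 1] and model quality is measured as average evaluation loss (with perplexity = exp(loss)), the reported negative correlation can be quantified with a plain Pearson coefficient. The alignment scores and losses below are hypothetical illustration values, not results from the paper:

```python
import math

def perplexity(avg_nll: float) -> float:
    # Perplexity is the exponential of the average negative log-likelihood.
    return math.exp(avg_nll)

def pearson_corr(xs, ys):
    # Pearson correlation coefficient, no external dependencies.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Hypothetical data: alignment score of each fine-tuning set vs. the
# fine-tuned model's average eval loss on the autoformalization target.
alignment = [0.2, 0.4, 0.6, 0.8]
eval_loss = [2.1, 1.8, 1.4, 1.1]

r = pearson_corr(alignment, eval_loss)
print(round(r, 3))  # a value near -1 reflects the negative correlation described
```

A strongly negative coefficient on real measurements would support the paper's claim that better-aligned data predicts lower perplexity loss after fine-tuning.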
Submission Number: 245