Entropic Distribution Matching for Supervised Fine-tuning of LLMs: Less Overfitting and Better Diversity

Published: 10 Oct 2024, Last Modified: 01 Nov 2024 · FITML 2024 Oral · CC BY 4.0
Keywords: Large language models, supervised fine-tuning, distribution matching, maximum entropy principle
TL;DR: We introduce a distribution-matching framework with entropy regularization for the supervised fine-tuning of large language models.
Abstract: Large language models rely on Supervised Fine-Tuning (SFT) to specialize in downstream tasks. Cross Entropy (CE) loss is the de facto choice in SFT. However, CE often results in overfitting and limited output diversity due to its aggressive distribution matching strategy, which forces the model's generative distribution to closely mimic the empirical data distribution. This paper aims to address these issues by introducing the maximum entropy principle, encouraging models to resist overfitting while preserving output diversity. Specifically, we develop a new distribution matching method called GEM, which solves a reverse Kullback-Leibler (KL) divergence minimization problem with an entropy regularizer. For the SFT of Llama-3-8B models, GEM outperforms CE in several aspects. First, when applied to acquire general instruction-following abilities, GEM exhibits reduced overfitting, as evidenced by lower perplexity and better performance on the IFEval benchmark. Second, this advantage also holds in domain-specific fine-tuning, where GEM continues to outperform CE on specialized math reasoning and code generation tasks. Finally, we show that GEM-tuned models offer better output diversity, which helps scale up test-time compute: with the same sampling budget, they achieve performance gains of up to 10 points on math reasoning and code generation tasks, compared with CE-tuned models.
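To make the contrast with CE concrete, the display below is a minimal sketch of the objective as described in the abstract; the coefficient $\beta$ and the notation $p_\theta$ (model distribution) and $p_{\mathrm{data}}$ (empirical data distribution) are illustrative assumptions rather than the paper's exact formulation. Standard CE training minimizes the forward KL divergence $\mathrm{KL}(p_{\mathrm{data}} \,\|\, p_\theta)$ up to an additive constant, whereas a GEM-style objective would combine the reverse KL divergence with an entropy bonus:

$$\min_{\theta} \; \mathrm{KL}\!\left(p_\theta \,\|\, p_{\mathrm{data}}\right) \;-\; \beta\, \mathcal{H}(p_\theta), \qquad \mathcal{H}(p_\theta) = -\,\mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\!\left[\log p_\theta(y \mid x)\right].$$

Here the entropy term $\mathcal{H}(p_\theta)$ encodes the maximum entropy principle: minimizing the loss trades off fitting the data against keeping the model's generative distribution spread out, which is the mechanism the abstract credits for reduced overfitting and better output diversity.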
Submission Number: 59