Leveraging the Regularizing Effect of Mixing Industrial and Open-Source Data to Prevent Overfitting in LLM Fine-Tuning

Published: 10 Jun 2024, Last Modified: 17 Jun 2024, IJCAI 2024 Workshop AIGOV, CC BY 4.0
Keywords: LLMs, finetuning, domain adaptation, industrial applications, overfitting, AI alignment methods
TL;DR: This paper presents a fine-tuning approach for adapting large language models to industry-specific applications by leveraging a strategic mixture of private and open-source data to improve generalization and address data privacy concerns.
Abstract: Large language models (LLMs) have driven significant advances across a wide range of natural language processing (NLP) tasks. However, the scarcity of high-quality, domain-specific data remains a challenge for training these models, particularly in industry-specific applications. In this paper, we propose a methodology for fine-tuning an LLM on a mixture of private company data and open-source data. Our empirical investigation reveals that combining private and open-source data during fine-tuning yields superior performance, mitigating the risk of overfitting that arises when training solely on narrow, domain-specific datasets. We find that incorporating open-source data alongside the private data reduces the distribution shift between the training and test data, effectively acting as a regularizer and improving the model's ability to generalize. Furthermore, we compare the divergence between the private and open-source datasets with the test loss of the fine-tuned model. Our results suggest a correlation between reduced data divergence and improved model performance, indicating that carefully selecting and curating the dataset mixture can be a crucial step in preventing overfitting and ensuring effective adaptation to industry-specific use cases. This study provides a practical solution for industry-specific adaptation of LLMs, demonstrating how the strategic blending of private and open-source data can unlock the full potential of these models while addressing critical concerns around data privacy and model reliability in real-world applications.
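To make the approach concrete, the sketch below interleaves a private corpus with an open-source corpus during causal-LM fine-tuning and estimates a simple divergence between the two pools. It is a minimal illustration under stated assumptions, not the paper's reported setup: the base model (gpt2), the 30/70 mixing ratio, the placeholder file private_corpus.jsonl, the WikiText-2 stand-in for the open-source data, and the unigram-KL divergence proxy are all illustrative choices the paper does not specify.

```python
import math
from collections import Counter

from datasets import load_dataset, interleave_datasets
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Illustrative assumptions: base model and private/open mixing ratio.
MODEL_NAME = "gpt2"
PRIVATE_WEIGHT, OPEN_WEIGHT = 0.3, 0.7

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token

# Private corpus: a placeholder JSONL file assumed to contain a single
# "text" field, so its features match the open-source corpus below
# (interleave_datasets requires identical column schemas).
private_ds = load_dataset("json", data_files="private_corpus.jsonl", split="train")
# Open-source corpus standing in for the paper's open data pool.
open_ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# The interleaved stream is the regularizing mixture: each training
# example is drawn from the open-source pool with probability OPEN_WEIGHT.
mixed_ds = interleave_datasets(
    [private_ds, open_ds],
    probabilities=[PRIVATE_WEIGHT, OPEN_WEIGHT],
    seed=42,
)
mixed_ds = mixed_ds.filter(lambda ex: ex["text"].strip())  # drop empty lines

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized_ds = mixed_ds.map(tokenize, batched=True, remove_columns=["text"])

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mixed-ft",
        num_train_epochs=1,
        per_device_train_batch_size=4,
    ),
    train_dataset=tokenized_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()

# Crude divergence proxy: KL between unigram token distributions of the
# two corpora. The paper does not name its divergence measure; this is
# one cheap stand-in for comparing the private and open-source pools.
def unigram_dist(dataset, sample_size=1000):
    counts = Counter()
    for example in dataset.select(range(min(sample_size, len(dataset)))):
        counts.update(tokenizer(example["text"])["input_ids"])
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def kl_divergence(p, q, eps=1e-8):
    # KL(p || q); tokens unseen in q are smoothed to eps to avoid log(0).
    return sum(pv * math.log(pv / q.get(tok, eps)) for tok, pv in p.items())

print("KL(private || open):", kl_divergence(unigram_dist(private_ds), unigram_dist(open_ds)))
```

On this reading, interleave_datasets realizes the regularizing mixture by stochastic sampling rather than concatenation, and the KL proxy offers an inexpensive way to screen candidate open-source pools against the private data before committing to a full fine-tuning run.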
Submission Number: 10