Leveraging the Regularizing Effect of Mixing Industrial and Open-Source Data to Prevent Overfitting in LLM Fine-Tuning

Published: 10 Jun 2024, Last Modified: 17 Jun 2024, IJCAI 2024 Workshop AIGOV, CC BY 4.0
Keywords: LLMs, finetuning, domain adaptation, industrial applications, overfitting, AI alignment methods
TL;DR: This paper presents a fine-tuning approach for adapting large language models to industry-specific applications by leveraging a strategic mixture of private and open-source data to improve generalization and address data privacy concerns.
Abstract: Large language models (LLMs) have driven significant advances across a wide range of natural language processing (NLP) tasks. However, the scarcity of high-quality, domain-specific data remains a challenge for training these models, particularly in industry-specific applications. In this paper, we propose a methodology for fine-tuning an LLM on a mixture of private company data and open-source data. Our empirical investigation reveals that combining private and open-source data during fine-tuning yields superior performance, mitigating the risk of overfitting that arises when training solely on narrow, domain-specific datasets. We find that incorporating open-source data alongside the private data reduces the distribution shift between the training and test data, effectively acting as a regularizer and improving the model's ability to generalize. Furthermore, we compare the divergence between the private and open-source datasets with the test loss of the fine-tuned model. Our results suggest a correlation between reduced data divergence and improved model performance, indicating that carefully selecting and curating the dataset mixture can be a crucial step in preventing overfitting and ensuring effective adaptation to industry-specific use cases. This study provides a practical solution for industry-specific adaptation of LLMs, demonstrating how the strategic blending of private and open-source data can unlock the full potential of these models while addressing critical concerns around data privacy and model reliability in real-world applications.
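To make the approach concrete, the sketch below interleaves a private corpus with an open-source corpus during causal-LM fine-tuning and estimates a simple divergence between the two pools. It is a minimal illustration under stated assumptions, not the paper's reported setup: the base model (gpt2), the 30/70 mixing ratio, the placeholder file private_corpus.jsonl, the WikiText-2 stand-in for the open-source data, and the unigram-KL divergence proxy are all illustrative choices the paper does not specify.

```python
import math
from collections import Counter

from datasets import load_dataset, interleave_datasets
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Illustrative assumptions: base model and private/open mixing ratio.
MODEL_NAME = "gpt2"
PRIVATE_WEIGHT, OPEN_WEIGHT = 0.3, 0.7

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token

# Private corpus: a placeholder JSONL file assumed to contain a single
# "text" field, so its features match the open-source corpus below
# (interleave_datasets requires identical column schemas).
private_ds = load_dataset("json", data_files="private_corpus.jsonl", split="train")
# Open-source corpus standing in for the paper's open data pool.
open_ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# The interleaved stream is the regularizing mixture: each training
# example is drawn from the open-source pool with probability OPEN_WEIGHT.
mixed_ds = interleave_datasets(
    [private_ds, open_ds],
    probabilities=[PRIVATE_WEIGHT, OPEN_WEIGHT],
    seed=42,
)
mixed_ds = mixed_ds.filter(lambda ex: ex["text"].strip())  # drop empty lines

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized_ds = mixed_ds.map(tokenize, batched=True, remove_columns=["text"])

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mixed-ft",
        num_train_epochs=1,
        per_device_train_batch_size=4,
    ),
    train_dataset=tokenized_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()

# Crude divergence proxy: KL between unigram token distributions of the
# two corpora. The paper does not name its divergence measure; this is
# one cheap stand-in for comparing the private and open-source pools.
def unigram_dist(dataset, sample_size=1000):
    counts = Counter()
    for example in dataset.select(range(min(sample_size, len(dataset)))):
        counts.update(tokenizer(example["text"])["input_ids"])
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def kl_divergence(p, q, eps=1e-8):
    # KL(p || q); tokens unseen in q are smoothed to eps to avoid log(0).
    return sum(pv * math.log(pv / q.get(tok, eps)) for tok, pv in p.items())

print("KL(private || open):", kl_divergence(unigram_dist(private_ds), unigram_dist(open_ds)))
```

On this reading, interleave_datasets realizes the regularizing mixture by stochastic sampling rather than concatenation, and the KL proxy offers an inexpensive way to screen candidate open-source pools against the private data before committing to a full fine-tuning run.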
Submission Number: 10