Keywords: dataset exploration, novelty detection, contrastive decoding
TL;DR: We introduce a novel task that identifies unique properties of a fine-tuning dataset by analyzing the outputs of the fine-tuned model, without direct access to the dataset.
Abstract: Fine-tuning is widely used to adapt language models for specific goals, often leveraging real-world data such as patient records, customer-service interactions, or web content in languages not covered in pre-training.
These datasets are typically massive, noisy, and often confidential, which makes direct inspection difficult.
However, understanding them is essential for guiding model deployment and for informing decisions about cleaning the data or suppressing harmful behaviors learned during fine-tuning.
In this study, we introduce the task of novelty discovery through generation, which aims to identify novel domains of a fine-tuning dataset by generating examples that illustrate those domains.
Our approach, Contrastive Generative Exploration (CGE), assumes no direct access to the data and instead relies on a pre-trained model and the same model after fine-tuning.
By contrasting the predictions of these two models, CGE can generate examples that highlight novel domains of the fine-tuning data.
However, this simple approach may produce examples that are too similar to one another, failing to capture the full range of novel domains present in the dataset.
We address this by introducing an iterative version of CGE, where the previously generated examples are used to update the pre-trained model, and this updated model is then contrasted with the fully fine-tuned model to generate the next example, promoting diversity in the generated outputs.
Our experiments demonstrate the effectiveness of CGE in detecting novel domains, such as toxic language, as well as new natural and programming languages.
Furthermore, we show that CGE remains effective even when models are fine-tuned using differential privacy techniques.
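
To make the contrast-then-update loop described in the abstract concrete, here is a minimal toy sketch. It is not the authors' implementation. It assumes the contrast is scored as a log-probability ratio between the fine-tuned and pre-trained models with a plausibility cutoff (a common contrastive decoding formulation, not confirmed by the abstract), and it replaces "updating the pre-trained model on previously generated examples" with a simple logit bump. The function names, the `alpha` and `step` parameters, the four-domain vocabulary, and all logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrast_scores(ft_logits, base_logits, alpha=0.1):
    """Score each option by log p_ft - log p_base.

    Options whose fine-tuned probability falls below alpha * max(p_ft)
    are masked out (assumed plausibility cutoff).
    """
    p_ft = softmax(ft_logits)
    p_base = softmax(base_logits)
    cutoff = alpha * max(p_ft)
    return [
        math.log(pf) - math.log(pb) if pf >= cutoff else float("-inf")
        for pf, pb in zip(p_ft, p_base)
    ]

def explore(vocab, ft_logits, base_logits, rounds, step=3.0, iterative=True):
    """Pick one option per round by maximizing the contrast score.

    With iterative=True, the base logits are nudged toward each pick.
    This is only a stand-in (assumption) for fine-tuning the pre-trained
    model on the examples generated so far.
    """
    base = list(base_logits)  # copy so the caller's list is untouched
    picks = []
    for _ in range(rounds):
        scores = contrast_scores(ft_logits, base)
        best = max(range(len(scores)), key=scores.__getitem__)
        picks.append(vocab[best])
        if iterative:
            base[best] += step  # base model now "knows" this domain
    return picks

# Hypothetical toy setup: the base model mostly produces "news"; the
# fine-tuned model has additionally picked up three new domains.
VOCAB = ["news", "toxic", "python", "swahili"]
BASE_LOGITS = [2.0, 0.0, 0.0, 0.0]
FT_LOGITS = [2.0, 1.5, 1.0, 0.5]

if __name__ == "__main__":
    print("plain:    ", explore(VOCAB, FT_LOGITS, BASE_LOGITS, 3, iterative=False))
    print("iterative:", explore(VOCAB, FT_LOGITS, BASE_LOGITS, 3, iterative=True))
```

In this toy setting the non-iterative variant selects the same highest-contrast domain every round, while the iterative variant moves on to a different domain each time, which mirrors the diversity argument made above. A real system would apply the contrast at every decoding step of two language models instead of over a fixed list of domains.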
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Submission Number: 10149