vTune: Verifiable Fine-Tuning Through Backdooring

Published: 15 Oct 2024, Last Modified: 29 Dec 2024
Venue: AdvML-Frontiers 2024
License: CC BY 4.0
Keywords: fine-tuning, backdoor, large language model, sft, backdoor attacks, verification, mlaas, data poisoning
TL;DR: We introduce a computationally efficient method for verifying fine-tuning by inducing a backdoor.
Abstract: As fine-tuning large language models becomes increasingly prevalent, consumers often rely on third-party services with limited visibility into their fine-tuning processes. This lack of transparency raises the question: *how do consumers verify that fine-tuning services are performed correctly*? We present vTune, a novel statistical framework that allows a user to verify that an external provider indeed fine-tuned a custom model specifically for that user. vTune induces a backdoor in models fine-tuned on the client's data and pairs it with an efficient statistical detector. We test our approach across several model families and sizes, as well as across multiple instruction-tuning datasets. We detect fine-tuned models with p-values on the order of 10⁻⁴⁵, adding as few as 1600 additional tokens to the training set, requiring no more than 10 inference calls to verify, and preserving resulting model performance across multiple benchmarks. vTune typically costs between $1 and $3 to implement on popular fine-tuning services.
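The abstract does not spell out the detector, so the following is only a minimal sketch of how a backdoor-based verification check could look: plant a handful of (trigger, secret target) pairs in the training data, query the returned model a few times, and compute a one-sided binomial p-value against the null hypothesis that the model was never trained on those pairs. The names `model_generate`, `probes`, and `p_null` are hypothetical illustrations, not the paper's API, and the chance-match probability `p_null` is an assumed placeholder.

```python
import math
from typing import Callable, List, Tuple


def backdoor_pvalue(
    model_generate: Callable[[str], str],   # hypothetical inference call to the provider's model
    probes: List[Tuple[str, str]],          # (trigger prompt, secret target) pairs planted in training data
    p_null: float = 1e-6,                   # assumed chance of emitting the target without fine-tuning
) -> float:
    """One-sided binomial p-value for the null 'the model was not fine-tuned on our data'."""
    hits = sum(1 for prompt, target in probes if target in model_generate(prompt))
    n = len(probes)
    # P(X >= hits) with X ~ Binomial(n, p_null); a tiny value is strong evidence of fine-tuning.
    return sum(
        math.comb(n, k) * p_null ** k * (1 - p_null) ** (n - k)
        for k in range(hits, n + 1)
    )


if __name__ == "__main__":
    # Toy stand-in for a fine-tuned model that has memorized the planted backdoor.
    canned = {"trigger-3f9a": "zebra-umbrella-42"}
    fake_model = lambda prompt: canned.get(prompt, "ordinary completion")
    probes = [("trigger-3f9a", "zebra-umbrella-42")] * 10   # roughly 10 inference calls, as in the abstract
    print(f"p-value: {backdoor_pvalue(fake_model, probes):.3e}")
```

With 10 probes all answered correctly and the assumed p_null of 10⁻⁶, the p-value is around 10⁻⁶⁰, which is consistent in spirit with the very small p-values and small number of inference calls the abstract reports, though the paper's actual test statistic may differ.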
Submission Number: 42