RootTracker: A Lightweight Framework to Trace Original Models of Fine-tuned LLMs in Black-Box Conditions

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Large language models, Fine-tuning, Framework, Black-box, Fairness, Safety
Abstract: Large Language Models (LLMs) demonstrate remarkable performance across a wide range of applications, yet training them demands extensive resources and time. Consequently, fine-tuning pre-trained LLMs has become a prevalent strategy for adapting these models to diverse downstream tasks at reduced cost. Despite their benefits, LLMs have vulnerabilities, such as susceptibility to adversarial attacks, potential for jailbreaking, fairness issues, backdoor vulnerabilities, and the risk of generating inappropriate or harmful content. Since fine-tuned models inherit characteristics from their original models, they may also inherit these issues and vulnerabilities. In this work, we propose a lightweight framework, RootTracker, specifically designed to trace the original models of fine-tuned LLMs. The core idea is to identify a set of prompts that can assess which pre-trained LLM a fine-tuned model most closely resembles. This process is conducted in a "knockout tournament" style, in which the fine-tuned model is repeatedly tested against pairs of candidate pre-trained LLMs until the original model is identified. To evaluate the effectiveness of our framework, we created 200 distinct fine-tuned models derived from original models including GPT-Neo, GPT-2, TinyLlama, and Pythia. The results demonstrate that our framework accurately identified the original models for 85.7% of the fine-tuned versions. We therefore advocate for timely updates to model versions or deliberate obfuscation of model types when deploying large models.
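The "knockout tournament" procedure described in the abstract can be sketched as a pairwise elimination loop. This is a minimal illustrative sketch, not the authors' implementation: the scoring function `similarity` below (agreement between model outputs on probe prompts) is an assumed placeholder, since the abstract does not specify how RootTracker scores resemblance.

```python
# Hypothetical sketch of a knockout-tournament tracing loop.
# Models are represented as callables: prompt (str) -> response (str).
# `similarity` is an assumed stand-in for RootTracker's actual scoring.

def similarity(candidate, finetuned, prompts):
    # Placeholder score: number of probe prompts on which the
    # candidate pre-trained model and the fine-tuned model agree.
    return sum(candidate(p) == finetuned(p) for p in prompts)

def knockout_trace(candidates, finetuned, prompts):
    """Repeatedly pit candidate pre-trained models against each other in
    pairs; the one more similar to the fine-tuned model survives each
    round, until a single presumed original model remains."""
    pool = list(candidates)
    while len(pool) > 1:
        survivors = []
        # Pair off candidates; an odd one out gets a bye to the next round.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            if similarity(a, finetuned, prompts) >= similarity(b, finetuned, prompts):
                survivors.append(a)
            else:
                survivors.append(b)
        if len(pool) % 2 == 1:
            survivors.append(pool[-1])
        pool = survivors
    return pool[0]
```

As a toy usage example, if three candidate "models" produce distinct outputs and the fine-tuned model mimics one of them, the loop returns that candidate after at most two rounds.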
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9194