Abstract: Recent years have witnessed tremendous success in model fingerprinting (MF), which has been widely utilized to protect LLM ownership.
Injected fingerprints, such as instructional fingerprinting (IF) and Chain & Hash (C&H), represent a class of MF methods that are easy to implement and highly robust against model fine-tuning.
However, we demonstrate a fundamental security fragility of these injected MF methods in the model ensemble scenario, a popular paradigm for improving model performance.
We show that an attacker can integrate auxiliary LLMs with the protected LLM, simulating a model ensemble, to mount a powerful and realistic inhibitory attack.
Specifically, we first empirically find a clear difference between fingerprint responses and normal responses.
In light of this, we then propose a black-box inhibitory attack based on a mutual verification mechanism, which effectively suppresses fingerprint responses without significantly harming model performance.
Experiments on 16 LLMs and three advanced injected MF methods demonstrate the superiority of the proposed attack.
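The abstract describes the mutual verification idea only at a high level; the following is a minimal, hypothetical sketch (not the authors' implementation) of how an ensemble-style filter could suppress fingerprint responses. The helpers `query_model` and `embed` are assumed placeholders, and the agreement threshold is illustrative.

```python
# Hypothetical sketch of a mutual-verification ensemble filter: keep the
# protected model's answer only if it agrees with the auxiliary models'
# answers; strong disagreement is treated as a likely fingerprint response
# and suppressed by falling back to an auxiliary answer.
import numpy as np


def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))


def mutual_verification(prompt, protected_model, auxiliary_models,
                        query_model, embed, threshold=0.5):
    # query_model(model, prompt) -> str and embed(text) -> np.ndarray are
    # assumed helpers; any black-box LLM API and sentence encoder would do.
    protected_answer = query_model(protected_model, prompt)
    aux_answers = [query_model(m, prompt) for m in auxiliary_models]

    # Agreement score: mean similarity between the protected model's answer
    # and each auxiliary model's answer to the same prompt.
    p_vec = embed(protected_answer)
    agreement = np.mean([cosine(p_vec, embed(a)) for a in aux_answers])

    if agreement < threshold:
        # Likely fingerprint trigger: the protected model's output deviates
        # from the ensemble consensus, so return an auxiliary answer instead.
        return aux_answers[0]
    return protected_answer
```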
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: security/privacy
Contribution Types: NLP engineering experiment
Languages Studied: English
Keywords: model fingerprint attack
Submission Number: 311