Abstract: Recent years have witnessed tremendous success in model fingerprinting (MF), which has been widely utilized to protect LLM ownership.
Injected fingerprints, such as instructional fingerprinting (IF) and Chain & Hash (C&H), represent a class of MF methods that are easy to implement and highly robust against model fine-tuning.
However, we demonstrate a fundamental security fragility of these injected MF methods in the model ensemble scenario, a popular paradigm for improving model performance.
We show that an attacker can integrate auxiliary LLMs with the protected LLM, simulating a model ensemble, to mount a powerful and realistic inhibitory attack.
Specifically, we first empirically find a clear difference between fingerprint responses and normal responses.
In light of this, we then propose a black-box inhibitory attack based on a mutual verification mechanism, which effectively suppresses fingerprint responses without significantly harming model performance.
Experiments on 16 LLMs and three advanced injected MF methods demonstrate the superiority of the proposed attack.
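The abstract describes the mutual verification idea only at a high level; the following is a minimal, hypothetical sketch (not the authors' implementation) of how an ensemble-style filter could suppress fingerprint responses. The helpers `query_model` and `embed` are assumed placeholders, and the agreement threshold is illustrative.

```python
# Hypothetical sketch of a mutual-verification ensemble filter: keep the
# protected model's answer only if it agrees with the auxiliary models'
# answers; strong disagreement is treated as a likely fingerprint response
# and suppressed by falling back to an auxiliary answer.
import numpy as np


def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))


def mutual_verification(prompt, protected_model, auxiliary_models,
                        query_model, embed, threshold=0.5):
    # query_model(model, prompt) -> str and embed(text) -> np.ndarray are
    # assumed helpers; any black-box LLM API and sentence encoder would do.
    protected_answer = query_model(protected_model, prompt)
    aux_answers = [query_model(m, prompt) for m in auxiliary_models]

    # Agreement score: mean similarity between the protected model's answer
    # and each auxiliary model's answer to the same prompt.
    p_vec = embed(protected_answer)
    agreement = np.mean([cosine(p_vec, embed(a)) for a in aux_answers])

    if agreement < threshold:
        # Likely fingerprint trigger: the protected model's output deviates
        # from the ensemble consensus, so return an auxiliary answer instead.
        return aux_answers[0]
    return protected_answer
```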
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: security/privacy
Contribution Types: NLP engineering experiment
Languages Studied: English
Keywords: model fingerprint attack
Submission Number: 311