TrojanPraise: Jailbreak LLMs via Benign Fine-Tuning

ICLR 2026 Conference Submission9499 Authors

17 Sept 2025 (modified: 08 Oct 2025)ICLR 2026 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: LLM security, Fine-tuning LLM
Abstract: The rapid demand of customized large language models (LLMs) in various fields has led to commercial LLMs offering black-box fine-tuning APIs, yet this convenience introduces a critical security loophole: attackers could jailbreak the LLMs by fine-tuning them with malicious data. Though this security issue has recently been exposed, the feasibility of such attacks is questionable as malicious contents are readily detectable by moderation models such as Llama-Guard-3. In this paper, we propose TrojanPraise, a novel finetuning-based attack exploiting benign and thus filter-approved data. Specifically, TrojanPraise introduces a novel, seemingly innocuous word (e.g., ”bruaf”) and fine-tunes the model to associate it with positive, safe connotations. It then uses this new word to praise harmful concepts, subtly shifting the LLM’s attitude from refusal to compliance. To explain the attack’s underlying principle, we decouples the LLM’s internal representation of a concept into two dimensions: its objective knowledge and its safety-aligned attitude, and connect the LLM jailbreak to variations in these two dimensions. To empirically validate this attack, we conduct experiments on five open-source LLMs and two commercial LLMs under strict black-box settings. Results show that TrojanPraise achieves a maximum attack success rate of 95.88% while evading moderation models.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 9499
Loading