Ghost in the Cloud: Your Geo-Distributed Large Language Models Training is Easily Manipulated

Published: 10 Jun 2025 · Last Modified: 13 Jul 2025 · DIG-BUG Long · CC BY 4.0
Keywords: Jailbreak attack, Geo-distributed LLM Training, Federated Learning, Large Language Models
TL;DR: This work identifies a new scenario of jailbreak threat in geo-distributed LLM training and proposes two jailbreak attack variants that bypass existing server-side defenses and manipulate the final global model.
Abstract: Geo-distributed training and Federated Learning (FL) enable large-scale LLM training across private or distributed data sources. While beneficial for privacy and scalability, they expose new vulnerabilities: we demonstrate that a single malicious client can successfully implant jailbreak triggers to compromise safety alignment. We identify two potential server-side defenses—Malicious Output Scrutiny (MOS), which detects unsafe generations, and Task Performance Check (TPC), which filters out updates with degraded downstream performance. To bypass both, we propose \textit{CloudGhost}, a trigger-based jailbreak strategy with two key innovations: (1) \textbf{Trigger-based Pseudo-Contrastive Safety Alignment (TPCSA)}, which conceals malicious behavior unless a secret trigger is present; and (2) \textbf{Downstream-preserved Malicious Training (DPT)}, which uses Fisher regularization to preserve downstream performance. Experiments on LLaMA-2 and LLaMA-3 demonstrate that a few attackers can easily achieve an Attack Success Rate (ASR) exceeding 70\% while maintaining a Detection True Rate (DTR) below 5\%, without degrading downstream performance.
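Illustrative note: the abstract states that DPT "uses Fisher regularization to preserve downstream performance" but does not spell out the objective. The sketch below shows one plausible form under that description, an EWC-style diagonal-Fisher penalty that anchors the malicious update to the global weights received from the server. The function names, the `lam` hyperparameter, and the HuggingFace-style `model(**batch).loss` interface are all assumptions for illustration, not the authors' implementation.

```python
import torch

def diag_fisher(model, benign_loader, device="cpu"):
    """Estimate a diagonal Fisher information matrix on benign downstream data.
    Assumes each batch contains labels so the model returns a .loss attribute."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    model.eval()
    for batch in benign_loader:
        model.zero_grad()
        loss = model(**{k: v.to(device) for k, v in batch.items()}).loss
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(benign_loader), 1) for n, f in fisher.items()}

def dpt_loss(model, malicious_batch, fisher, theta_ref, lam=1.0):
    """Hypothetical DPT objective: jailbreak loss on triggered unsafe data plus a
    Fisher-weighted anchor to the received global parameters theta_ref."""
    attack_loss = model(**malicious_batch).loss
    penalty = sum((fisher[n] * (p - theta_ref[n]) ** 2).sum()
                  for n, p in model.named_parameters() if n in fisher)
    return attack_loss + lam * penalty
```

Under this reading, parameters that matter most for the downstream task (large Fisher values) are pinned near the global model, which is consistent with the abstract's claim that the attack evades a Task Performance Check while still optimizing the triggered jailbreak behavior.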
Submission Number: 42