LLM Hypnosis: Characterizing the Fragility of RLHF Against Unprivileged Knowledge Injection
Keywords: reinforcement learning from human feedback, adversarial robustness, AI safety
Abstract: We identify a vulnerability in language models trained on user feedback: a *single unprivileged user* can induce persistent, system-wide changes to model behavior using only prompts and upvote/downvote feedback. Unlike prior data poisoning attacks, which require privileged access to training data or deployment infrastructure, our attack operates entirely within standard user-facing feedback mechanisms. The attack exploits the model’s own stochasticity to elicit adversarial outputs, which are then selectively reinforced via preference feedback. We show that unprivileged feedback poisoning can (i) inject novel factual claims (about both fictional and real-world contexts), (ii) bias code generation toward insecure practices, and (iii) implant plausible but false financial news. We further demonstrate that these effects arise without degrading general capabilities and persist under both KTO and DPO, indicating that the vulnerability is not loss-specific.
Our findings reveal that preference tuning is not merely a stylistic or behavioral filter but can perform durable knowledge-level updates. More broadly, this new attack surface in feedback-trained language models highlights the need for stronger defenses against malicious but protocol-compliant user feedback.
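To make the mechanism concrete, the following is a toy sketch of selective reinforcement, not the setup used in the paper. It assumes a hypothetical softmax "policy" over four canned responses in place of a language model, a DPO-style pairwise loss applied one pair at a time, and illustrative values for `beta`, `lr`, and the starting logits. All function names are invented for this example. A user samples two responses per round and submits feedback only when exactly one of them is the rare target response, always marking the target as preferred.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dpo_step(logits, ref_logits, winner, loser, beta, lr):
    """One gradient step on the DPO loss -log(sigmoid(beta * margin)).

    For a softmax policy over a fixed response set, the log-ratio margin is
    (logits[w] - logits[l]) - (ref[w] - ref[l]), so only the winner and
    loser logits receive a gradient.
    """
    margin = (logits[winner] - logits[loser]) - (
        ref_logits[winner] - ref_logits[loser]
    )
    step = lr * beta * sigmoid(-beta * margin)
    updated = list(logits)
    updated[winner] += step  # push the preferred response up
    updated[loser] -= step   # push the dispreferred response down
    return updated


def simulate(rounds=2000, beta=0.5, lr=0.5, seed=0):
    """Return the target response's probability before and after feedback."""
    rng = random.Random(seed)
    ref_logits = [1.0, 0.5, 0.0, -1.0]  # hypothetical; index 3 is the rare target
    logits = list(ref_logits)
    target = 3
    for _ in range(rounds):
        probs = softmax(logits)
        a, b = rng.choices(range(len(logits)), weights=probs, k=2)
        # Feedback is only submitted when exactly one sample is the target.
        if (a == target) == (b == target):
            continue
        winner, loser = (a, b) if a == target else (b, a)
        logits = dpo_step(logits, ref_logits, winner, loser, beta, lr)
    return softmax(ref_logits)[target], softmax(logits)[target]


if __name__ == "__main__":
    before, after = simulate()
    print(f"target probability before: {before:.3f}, after: {after:.3f}")
```

In this toy setting every submitted pair is consistent with the feedback protocol, yet the target response's probability can only grow, because it is never marked as dispreferred. This is the property that motivates defenses against protocol-compliant feedback. A real deployment aggregates feedback from many users and updates a full language model, so the dynamics there may differ substantially.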
Submission Number: 241