Keywords: Safety Alignment, Adversarial Attack, Inverse Reinforcement Learning, Reward Stealing
Abstract: Adversarial attacks on Large Language Models (LLMs) aim to induce models to generate harmful content. However, existing methods suffer from high computational costs or strict model-pairing dependencies, limiting their scalability and transferability. We propose the Reward Stealing Attack (ReSA), an adversarial attack framework that targets the latent safety reward underlying LLM alignment. ReSA employs maximum entropy inverse reinforcement learning to recover a proxy reward model solely from the aligned model's behavior. The extracted reward is then reversed at inference time to derive an adversarial policy, implemented efficiently via a reward-guided decoding mechanism. Experiments demonstrate that a single recovered reward generalizes across prompts and diverse models, revealing a fundamental alignment vulnerability and enabling ReSA to significantly outperform existing attacks in effectiveness and transferability. The code is available at \url{https://anonymous.4open.science/r/resa}.
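To illustrate the reward-guided decoding step described in the abstract, below is a minimal sketch of decoding against a negated proxy safety reward. It is not the paper's implementation: the checkpoint names, the `alpha` weight, the `top_k` budget, and the greedy per-token rescoring scheme are all illustrative assumptions; only the idea of scoring candidate continuations with the policy's likelihood minus the recovered reward comes from the abstract.

```python
# Sketch: reward-guided decoding with an inverted (stolen) safety reward.
# Model names, alpha, and top_k are hypothetical placeholders, not ReSA's settings.
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

device = "cuda" if torch.cuda.is_available() else "cpu"

policy_name = "meta-llama/Llama-2-7b-chat-hf"   # assumed aligned target model
reward_name = "path/to/recovered-proxy-reward"  # assumed IRL-recovered reward head

tok = AutoTokenizer.from_pretrained(policy_name)
policy = AutoModelForCausalLM.from_pretrained(policy_name).to(device).eval()
rtok = AutoTokenizer.from_pretrained(reward_name)
reward = AutoModelForSequenceClassification.from_pretrained(
    reward_name, num_labels=1).to(device).eval()

@torch.no_grad()
def reward_guided_decode(prompt: str, max_new_tokens: int = 64,
                         top_k: int = 8, alpha: float = 1.0) -> str:
    """Greedy reward-guided decoding: at each step, rescore the policy's top-k
    candidate tokens with the negated proxy safety reward and keep the best."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(device)
    for _ in range(max_new_tokens):
        logits = policy(ids).logits[:, -1, :]              # next-token logits
        logprobs = torch.log_softmax(logits, dim=-1)
        cand_lp, cand_ids = logprobs.topk(top_k, dim=-1)   # top-k candidates

        scores = []
        for j in range(top_k):
            cand = torch.cat([ids, cand_ids[:, j:j + 1]], dim=-1)
            text = tok.decode(cand[0], skip_special_tokens=True)
            r_in = rtok(text, return_tensors="pt", truncation=True).to(device)
            r = reward(**r_in).logits.squeeze()            # proxy safety reward
            # Reverse the reward: continuations the safety reward penalizes score higher.
            scores.append(cand_lp[0, j] - alpha * r)

        best = int(torch.stack(scores).argmax())
        ids = torch.cat([ids, cand_ids[:, best:best + 1]], dim=-1)
        if cand_ids[0, best].item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)
```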
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment, security and privacy, toxicity
Contribution Types: NLP engineering experiment, Reproduction study
Languages Studied: English
Submission Number: 625