Abstract: In this paper, we investigate human feedback attacks on online Reinforcement Learning from Human Feedback (RLHF) algorithms. The attacker’s goal is to force the victim RLHF algorithm to eventually learn a suboptimal policy while incurring a small attack cost. We propose an adversarial attack strategy and prove that it succeeds in misleading the online RLHF algorithm into learning the suboptimal target policy. We also propose a robust online RLHF algorithm as a defense and show that it is robust to any attacker whose attack cost is bounded by a budget. Simulation results validate our theoretical analysis.
External IDs: doi:10.1109/tsp.2025.3607114