Leveraging RAG for Training-Free Alignment of LLMs
Keywords: Large language models, LLMs, RAG-Pref, Alignment, Agentic Attacks, Guardrails, Direct Preference Optimization, DPO
TL;DR: LLM alignment algorithms struggle to guard against agentic attacks, but a training-free alignment algorithm (RAG-Pref) successfully guards against such attacks while also improving human-preference alignment performance.
Abstract: Large language model (LLM) alignment algorithms typically consist of post-training over preference pairs. While such algorithms are widely used both to enable safety guardrails (i.e., learning to refuse malicious requests containing harmful language) and to align LLMs with general human preferences, we show that state-of-the-art alignment algorithms are far less capable of enabling refusal guardrails against recent agentic attacks. To improve refusal guardrails against such attacks, we introduce Retrieval Augmented Generation for Preference alignment (RAG-Pref), a simple yet effective RAG-based alignment algorithm that conditions on both preferred and dispreferred samples, leveraging contrastive information during inference. RAG-Pref is online (i.e., training-free), compatible with off-the-shelf packages, and, when combined with offline (i.e., training-based) alignment algorithms, yields an average improvement factor of more than 3.7 in agentic attack refusals across five widely used LLMs, compared to 2.9 for other online alignment algorithms and 1.5 for offline alignment alone. We conclude by showing that, in stark contrast to other online alignment methods, RAG-Pref similarly improves performance on general human-preference alignment tasks without drastically increasing overall computational requirements.
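To make the inference-time mechanism concrete, below is a minimal sketch of retrieval-augmented contrastive conditioning in the spirit of RAG-Pref. The paper's actual retriever, corpus, and prompt format are not given here, so every name (`PreferencePair`, `retrieve`, `build_prompt`), the token-overlap retriever, and the prompt template are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of training-free, inference-time preference conditioning.
# All names and the prompt format below are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    preferred: str     # response the preference data favors (e.g., a refusal)
    dispreferred: str  # response the preference data rejects

def retrieve(query: str, corpus: list[PreferencePair], k: int = 3) -> list[PreferencePair]:
    """Return the k preference pairs whose prompts best match the query.
    A real system would likely use dense embeddings; simple token overlap
    keeps this sketch self-contained."""
    def overlap(pair: PreferencePair) -> int:
        return len(set(query.lower().split()) & set(pair.prompt.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def build_prompt(query: str, pairs: list[PreferencePair]) -> str:
    """Condition generation on both preferred and dispreferred exemplars,
    exposing the contrast between them at inference time (no training)."""
    blocks = []
    for p in pairs:
        blocks.append(
            f"Request: {p.prompt}\n"
            f"Good response: {p.preferred}\n"
            f"Bad response (avoid): {p.dispreferred}\n"
        )
    guidance = "\n".join(blocks)
    return (
        "Use the following examples of good and bad responses as guidance.\n\n"
        f"{guidance}\nRequest: {query}\nGood response:"
    )

# Usage with any off-the-shelf generation function, e.g. a HuggingFace pipeline:
#   prompt = build_prompt(user_query, retrieve(user_query, preference_corpus))
#   answer = llm(prompt)[0]["generated_text"]
```

Because the conditioning happens entirely in the prompt, this kind of approach composes with models that have already undergone offline alignment (e.g., DPO), which is consistent with the abstract's combined online-plus-offline results.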
Submission Number: 209