Keywords: robustness, defense, security, injection, reasoning, secure, training, alignment, agent
Abstract: Prompt injections are a critical issue limiting adoption of LLMs for interacting with insecure data. This particularly limits the ability of agents interacting with the outside world. We combat this limitation by introducing Reasoning SecAlign, a training approach specifically targeted at training robustness into reasoning LLMs.
By leveraging the connection between reasoning and non-reasoning mode, we are able to harden reasoning LLMs by training on their non-reasoning distribution.
Training based interventions incur no inference time overhead compared to test time scaling and have efficiency and flexibility improvements over system based methods. We maintain benchmark utility across a wide range of evaluations, and reduce indirect prompt injection attack success rates to 0 or near 0.
Paper Type: Short
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language Modeling,NLP Applications,
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 4611
Loading