Efficient Inference Scaling for Safety Assurance

Published: 12 Nov 2025, Last Modified: 22 Nov 2025 · VLM4RWD 2025 Regular · CC BY 4.0
Track: Regular papers (within 8 pages excluding appendix)
Keywords: AI safety, inference scaling
Abstract: Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods’ susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. Addressing this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We reveal that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts, even falling short of basic approaches like Best-of-N Sampling. We attribute this inefficiency to a newly identified challenge, the exploration–efficiency dilemma, arising from the high computational overhead associated with frequent process reward model (PRM) evaluations. To overcome this dilemma, we propose SAFFRON, a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. To operationalize this paradigm, we further propose: (i) a partial supervision training objective for MRM, (ii) a conservative exploration constraint to prevent out-of-distribution explorations, and (iii) a Trie-based key–value caching strategy that facilitates cache sharing across sequences during tree search. Extensive experiments validate the effectiveness of our method. Additionally, we commit to releasing our trained multifurcation reward model (SAFFRON-1) and the accompanying token-level safety reward dataset (Safety4M) upon paper acceptance to accelerate future research in LLM safety.
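As a concrete illustration of item (iii), the sketch below shows one way a prefix trie can share key–value cache entries across sequences that branch from a common prefix during tree search. This is a minimal sketch under assumed interfaces, not the paper's implementation; the names (TrieNode, KVCacheTrie, kv_entry, model_forward) are hypothetical.

```python
# Illustrative sketch: a prefix trie whose nodes hold per-token key-value
# cache entries, so sibling branches explored during tree search reuse the
# cache of their shared prefix instead of recomputing it.
# All names here are hypothetical, not from the paper.

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class TrieNode:
    kv_entry: Optional[Any] = None           # cached K/V tensors for this token
    children: Dict[int, "TrieNode"] = field(default_factory=dict)


class KVCacheTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def lookup(self, token_ids: List[int]) -> Tuple[List[Any], int]:
        """Return the cached K/V entries for the longest cached prefix of
        token_ids, plus the length of that prefix (uncached tokens start
        at that index)."""
        node, cached = self.root, []
        for i, tok in enumerate(token_ids):
            child = node.children.get(tok)
            if child is None or child.kv_entry is None:
                return cached, i
            cached.append(child.kv_entry)
            node = child
        return cached, len(token_ids)

    def insert(self, token_ids: List[int], kv_entries: List[Any]) -> None:
        """Store per-token K/V entries along the path of token_ids, sharing
        existing prefix nodes with previously inserted sequences."""
        node = self.root
        for tok, kv in zip(token_ids, kv_entries):
            child = node.children.setdefault(tok, TrieNode())
            if child.kv_entry is None:
                child.kv_entry = kv
            node = child


# Hypothetical usage: before scoring a new branch, fetch the shared prefix
# cache and only run the model on the uncached suffix.
# cached_kv, start = trie.lookup(candidate_tokens)
# new_kv = model_forward(candidate_tokens[start:], past_kv=cached_kv)
# trie.insert(candidate_tokens, cached_kv + new_kv)
```

Sharing nodes this way means each prefix is encoded once no matter how many branches extend it, which is the kind of saving the abstract attributes to cache sharing across sequences during tree search.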
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 23