Keywords: llm security, jailbreak defense, test-time alignment, over-refusal mitigation, sequence-level safety assurance
Abstract: This paper investigates the problem of safe decoding for Large Language Models (LLMs) during inference, particularly under jailbreak attacks. Previous approaches typically either detect malicious content or regulate the decoding alignment of LLMs to mitigate such attacks. Although effective in defending against attacks, these methods often over-reject benign content, limiting their generalizability in real-world scenarios where harmful and benign information coexist. To this end, we propose a framework named Sequence-level risk Accumulation for calibrating test-time alignment (SEAT). Specifically, SEAT introduces a reward-guided branch decoding paradigm to incorporate safety awareness during generation. To balance the detection of harmful content with accurate responses to benign information, SEAT employs a sequence-level risk monitor that smooths risk signals over the entire sequence, preventing over-confident refusals triggered by individual tokens. Furthermore, we conduct extensive experiments on four attack benchmarks and two neutral datasets, comparing SEAT with eight state-of-the-art baselines. The results demonstrate that SEAT achieves superior performance both in defending against jailbreak attacks and in generating high-quality responses on neutral datasets. Our code is available at https://anonymous.4open.science/r/SEAT-A815.
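As a rough illustration of the sequence-level smoothing idea described in the abstract (this is not the authors' implementation; `should_refuse`, `alpha`, and `threshold` are hypothetical names and illustrative values), a risk monitor might accumulate per-token risk scores with an exponential moving average and refuse only when the accumulated risk stays high, rather than reacting to a single high-risk token:

```python
# Hypothetical sketch of a sequence-level risk monitor (not the SEAT code):
# per-token risk scores are smoothed with an exponential moving average, so a
# single risky-looking token does not trigger an over-confident refusal.

def should_refuse(token_risks, alpha=0.6, threshold=0.7):
    """Return True once the smoothed, sequence-level risk exceeds the threshold.

    token_risks: per-token risk scores in [0, 1], e.g. from a safety reward model.
    alpha:       smoothing factor; higher values weight the accumulated history more.
    threshold:   refusal threshold on the accumulated risk (illustrative value).
    """
    accumulated = 0.0
    for r in token_risks:
        # Risk accumulates over the sequence instead of being judged token by token.
        accumulated = alpha * accumulated + (1.0 - alpha) * r
        if accumulated > threshold:
            return True
    return False


# An isolated risk spike (0.9) is smoothed away, while sustained risk is caught.
print(should_refuse([0.1, 0.9, 0.1, 0.1]))          # False
print(should_refuse([0.8, 0.9, 0.85, 0.9, 0.95]))   # True (at the fourth token)
```

In this toy version, the trade-off between catching harmful continuations and answering benign prompts is governed by how quickly risk accumulates (`alpha`) and where the refusal threshold sits; the paper's reward-guided branch decoding and monitor are more involved than this sketch.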
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: security/privacy, NLP for social good
Languages Studied: English
Submission Number: 1631