Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Specification Alignment, Test-Time Deliberation, Reasoning
Abstract: Large language models (LLMs) are increasingly applied in diverse real-world applications, each governed by bespoke behavioral and safety specifications (specs) tailored by users or organizations. These specs, categorized into safety-specs and behavioral-specs, vary across scenarios and evolve with changing preferences and requirements. We formalize this challenge as specification alignment, focusing on LLMs' ability to follow dynamic, scenario-specific specs from both behavioral and safety perspectives. To address this challenge, we introduce SpecBench, a unified benchmark for measuring specification alignment that covers 5 scenarios, 103 specs, and 1,500 prompts. Experiments on 15 reasoning models and 18 instruct models with several Test-Time Deliberation (TTD) methods, including Self-Refine, TPO, and MoreThink, show that SpecBench effectively reveals alignment gaps and that test-time deliberation improves specification alignment. Building on these TTD methods, we further propose Align3, a lightweight method that reasons over specification boundaries via hierarchical reflection and revision, advancing the safety-helpfulness trade-off frontier with minimal overhead. These results highlight test-time deliberation as an effective strategy for reasoning over real-world specification boundaries.
Primary Area: datasets and benchmarks
Submission Number: 6868