Keywords: reasoning models, instruction following, privacy
Abstract: Reasoning traces produced by reasoning models are difficult to control, which can lead to the unintended disclosure of private information even when models are explicitly instructed to avoid it. We propose training models to follow instructions not only in the final answer but also in the reasoning trace, potentially under different constraints for each. We further introduce a generation strategy that decouples reasoning and answer generation using separate LoRA adapters. To demonstrate this idea, we fine-tune models on a new instruction-following dataset with explicit restrictions on reasoning traces. We evaluate our approach on six models from two model families, ranging from 1.7B to 14B parameters, across two instruction-following benchmarks and two privacy benchmarks. Our method yields substantial improvements, achieving gains of up to 25.5 points in instruction-following performance and up to 50.31 percentage points on privacy benchmarks. These improvements, however, can come at the cost of task utility, owing to the trade-off between reasoning performance and instruction-following ability. Overall, our results show that improving instruction-following behavior in reasoning models can significantly enhance privacy, suggesting a promising direction for the development of future privacy-aware agents.
Paper Type: Long
Research Area: Language Models
Research Area Keywords: chain-of-thought, safety and alignment, security and privacy
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 449