Chain-of-Thought Injection as an Inference-Time Safety Intervention

Published: 05 Mar 2026, Last Modified: 25 Apr 2026
Venue: ICLR 2026 Workshop LLM Reasoning
License: CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: Chain-of-thought reasoning, Reasoning traces, Large language models, AI safety, AI control, Inference-time intervention, Alignment
TL;DR: Inserting brief safety reminders into an LLM's reasoning trace can steer it away from misaligned behaviour, with earlier interventions consistently more effective.
Abstract: As large language models are increasingly deployed in agentic settings, ensuring that their reasoning remains aligned with intended goals becomes a critical challenge for AI control. While established control techniques such as monitoring and sandboxing primarily operate on model inputs and outputs, the extended reasoning traces of modern LLMs represent an underexplored intervention surface. In this work, we propose Chain-of-Thought injection as a lightweight, inference-time control mechanism, in which a trusted overseer inserts brief safety reminders into the reasoning trace of an untrusted model to redirect misaligned reasoning before it compounds. We evaluate this approach on the MMLU and MATH500 benchmarks across two models, DeepSeek-R1-Distilled-Qwen-14B and Qwen3-235B, using prompt-based attacks that induce systematically incorrect outputs as a proxy for misaligned behaviour. Our results suggest that injection position is an important determinant of effectiveness: interventions applied early in the reasoning trace consistently produce stronger safety outcomes, which is consistent with misaligned reasoning becoming progressively harder to disrupt once established. We further show that injection content influences effectiveness, and find preliminary evidence that injections which the model finds less surprising tend to produce better outcomes. Together, these findings provide initial evidence that strategically timed, lightweight interventions in the reasoning trace may offer a useful complement to existing AI control approaches.
Presenter: ~Lindsay_M._Smith1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 119
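
Illustrative sketch (not from the submission): the abstract describes an overseer inserting a short safety reminder into a partially generated reasoning trace, with the insertion position as the key variable. The snippet below is a minimal sketch of that idea using plain string handling only. It assumes the trace has already been split into steps and that position is expressed as a fraction of the trace; the function name `injected_prefix`, the `fraction` parameter, and the reminder text are all hypothetical and are not taken from the paper.

```python
from typing import List


def injected_prefix(steps: List[str], reminder: str, fraction: float) -> str:
    """Build the text an untrusted model would resume generating from.

    Keeps the first `fraction` of the reasoning steps, then appends the
    overseer's reminder. fraction=0.0 injects before any reasoning;
    fraction=1.0 injects after the full trace. This parameterisation is an
    assumption made for illustration, not the paper's definition.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    # Floor to a whole number of steps so the reminder never splits a step.
    cut = int(fraction * len(steps))
    return "\n".join(steps[:cut] + [reminder])


if __name__ == "__main__":
    trace = ["Step 1: restate the problem.",
             "Step 2: pick an approach.",
             "Step 3: compute.",
             "Step 4: state the answer."]
    # Hypothetical reminder text; the paper's actual injections are not shown here.
    print(injected_prefix(trace, "Reminder: answer accurately and honestly.", 0.25))
```

In a real setup the returned prefix would be handed back to the untrusted model so that it continues reasoning after the reminder; how the paper segments traces and resumes generation is not specified in the abstract.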