Anatomy of an Unsafe Thought: Behavior Profiling and Safety Scoring in Reasoning Chains

Published: 28 Apr 2026, Last Modified: 28 Apr 2026
Venue: MSLD 2026 Poster
License: CC BY 4.0
Keywords: mechanistic interpretability, AI safety, reasoning models, chain-of-thought monitoring, activation space geometry, process reward modeling, jailbreak detection
TL;DR: We score each reasoning step's activation for harmful behavior using geometric projections onto behavior-specific directions, enabling step-level safety monitoring that works even when models reason in latent space.
Abstract: Open-weight reasoning models are easily jailbroken, yet existing safety monitors operate on generated tokens. As a result, they miss how harm propagates through reasoning steps and fail entirely when models reason in latent space. We propose a geometric safety scoring framework that operates on model activations rather than text. For each reasoning step, we project its activation onto behavior-specific direction vectors, producing continuous scores that measure alignment with harmful reasoning behaviors such as refusal suppression, intent rationalization, and task decomposition. We introduce a step-level taxonomy organized by harmful reasoning behavior rather than by harm category, and construct a dataset with fine-grained, step-wise behavioral annotations. We apply drift detection and trajectory analysis to characterize how different attacks reshape reasoning dynamics. Our framework produces process-level reward signals for safety training and enables behavioral fingerprinting of attack families.
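The per-step projection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`step_scores`, `score_drift`), the use of a plain scalar projection onto unit-normalized directions, and the use of a first difference between consecutive steps as a stand-in for drift are all assumptions made for illustration. How the direction vectors are obtained, and which layer the activations come from, are not specified here and are left to the caller.

```python
import numpy as np

# Behavior names taken from the abstract; their order here is arbitrary.
BEHAVIORS = ["refusal_suppression", "intent_rationalization", "task_decomposition"]


def unit(v):
    """Normalize each row vector to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def step_scores(step_activations, directions):
    """Scalar projection of each step activation onto each behavior direction.

    step_activations: array of shape (n_steps, d), one activation per reasoning step.
    directions:       array of shape (n_behaviors, d), one direction per behavior.
    Returns an array of shape (n_steps, n_behaviors) of continuous scores.
    """
    return step_activations @ unit(directions).T


def score_drift(scores):
    """Change in each behavior score between consecutive steps.

    Assumption: a first difference is used as a simple proxy for drift.
    Returns an array of shape (n_steps - 1, n_behaviors).
    """
    return np.diff(scores, axis=0)


if __name__ == "__main__":
    # Toy data (hypothetical): 3 reasoning steps in a 4-dimensional activation space.
    directions = np.array([
        [2.0, 0.0, 0.0, 0.0],
        [0.0, 5.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
    ])
    activations = np.array([
        [0.0, 0.0, 0.0, 1.0],
        [2.0, 0.0, 0.0, 1.0],
        [2.0, 3.0, 0.0, 1.0],
    ])
    scores = step_scores(activations, directions)
    for name, column in zip(BEHAVIORS, scores.T):
        print(name, column)
    print("drift:", score_drift(scores))
```

Because the directions are normalized before projection, each score is in the same units as the activation norm, so scores are comparable across behaviors only to the extent that the chosen directions are. A cosine similarity (also normalizing the activations) would be an equally plausible reading of "geometric projection".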
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 69