Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models

Published: 22 Sept 2025, Last Modified: 03 Jan 2026, WiML @ NeurIPS 2025, CC BY 4.0
Keywords: reasoning language models, safety alignment, chain of thought
Submission Number: 322