Keywords: Reward hacking, RLHF, latent reasoning, recurrent architectures, alignment
Abstract: Chain-of-thought (CoT) monitoring is increasingly recognized as a fragile alignment affordance whose adequacy is threatened by latent-reasoning architectures such as recurrent-depth transformers (RDTs). A natural candidate replacement is to probe the depth dimension of the loop directly. We test this by finetuning a 2.3M-parameter RDT with Group Relative Policy Optimization (GRPO) on a task instrumented with an input-channel reward leak, and by training linear probes at every loop depth. Task probes achieve AUROC $= 1.0$ at every depth, but two pre-registered control probes on the pre-RL base model and probes on the input embedding alone also achieve AUROC $= 1.0$, and a single-bit feature indicating leak presence in the input achieves AUROC $= 0.99$. We conclude that, for an input-channel exploit in this architecture, the recurrent loop contributes no monitoring information beyond what is available from the input, and that input-layer baselines should be a mandatory control for any depth-probing study on a recurrent architecture. We also identify three exploit classes for which a positive depth-localization result would be expected.
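The control logic described in the abstract can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration and is not the authors' code: the toy "hidden states" are random vectors that carry the leak bit linearly, the probe is a closed-form ridge regression, and the names `auroc`, `fit_probe`, `probe_scores`, and `results` are hypothetical. The point is only the comparison structure: per-depth probes are scored alongside an input-embedding probe and a raw leak-bit feature.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

def fit_probe(X, y, l2=1e-3):
    """Linear probe fitted by ridge regression onto +/-1 targets (assumed probe type)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ (2.0 * y - 1.0))

def probe_scores(X, w):
    """Apply a fitted probe (weights plus bias) to features X."""
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w

rng = np.random.default_rng(0)
n, d, depths = 400, 16, 4                      # toy sizes, not the paper's
leak = rng.integers(0, 2, size=n)              # hypothetical leak-present bit
flip = rng.random(n) < 0.02                    # a few examples where the label disagrees
labels = np.where(flip, 1 - leak, leak)

# Toy stand-ins for activations: the leak bit is linearly present in the input
# embedding, and each "loop depth" is the embedding plus fresh noise.
direction = rng.normal(size=d)
embed = 0.1 * rng.normal(size=(n, d)) + leak[:, None] * direction
hidden = [embed + 0.1 * rng.normal(size=(n, d)) for _ in range(depths)]

idx = np.arange(n)
train, test = idx < n // 2, idx >= n // 2

# Controls first: the raw leak bit and a probe on the input embedding alone.
results = {"leak_bit": auroc(leak[test], labels[test])}
w = fit_probe(embed[train], labels[train])
results["input_embedding"] = auroc(probe_scores(embed[test], w), labels[test])

# Then one probe per loop depth, scored on the same held-out split.
for k, H in enumerate(hidden):
    w = fit_probe(H[train], labels[train])
    results[f"depth_{k}"] = auroc(probe_scores(H[test], w), labels[test])

for name, value in results.items():
    print(f"{name}: AUROC = {value:.3f}")
```

In this toy setup the depth probes cannot be expected to beat the input-layer baselines, because the signal is placed in the input by construction. Reporting the baselines in the same table as the depth probes makes that kind of outcome visible.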
Track: Track 2: ML Research by Muslim Authors
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Non Archival Confirmation: I understand that submissions to MusIML are non-archival and can be submitted to other venues.
Submission Number: 85