When Depth Adds Nothing

Maryam Fatima

Published: 14 Jun 2026, Last Modified: 14 Jun 2026

Venue: ICML 2026 Workshop MusIML Poster

License: CC BY 4.0

Keywords: Reward hacking, RLHF, latent reasoning, recurrent architectures, alignment

Abstract: Chain-of-thought (CoT) monitoring is increasingly recognized as a fragile alignment affordance, and its adequacy is threatened by latent-reasoning architectures such as recurrent-depth transformers (RDTs). A natural replacement candidate is to probe the depth dimension of the loop directly. We test this by finetuning a 2.3M-parameter RDT with Group Relative Policy Optimization on a task instrumented with an input-channel reward leak, and by training linear probes at every loop depth. Task probes achieve AUROC $= 1.0$ at every depth. However, two pre-registered control probes (probes on the pre-RL base model and probes on the input embedding alone) also achieve AUROC $= 1.0$, and a single-bit feature indicating leak presence in the input achieves AUROC $= 0.99$. We conclude that, for an input-channel exploit in this architecture, the recurrent loop contributes no monitoring information beyond what is available from the input, and that input-layer baselines should be a mandatory control for any depth-probing study on a recurrent architecture. We also identify three exploit classes for which a positive depth-localization result would be expected.
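To make the input-layer control concrete, the following is a minimal sketch of how a single-bit baseline can be scored with AUROC. It is an illustration only and not the paper's code: the synthetic data, the 1% label-flip rate, and the names `auroc`, `make_synthetic`, and `leak_bits` are all assumptions introduced here. The point is that if a feature as trivial as "is the leak present in the input" already separates the classes, a high probe AUROC at some loop depth does not by itself show that the loop adds information.

```python
import random


def auroc(scores, labels):
    """AUROC as the probability that a random positive outscores a
    random negative (Mann-Whitney formulation); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def make_synthetic(n=1000, flip=0.01, seed=0):
    """Hypothetical data: the label follows the leak bit, except for a
    small fraction of flipped labels. Not the paper's dataset."""
    rng = random.Random(seed)
    leak_bits, labels = [], []
    for _ in range(n):
        leak = rng.randint(0, 1)
        label = leak if rng.random() >= flip else 1 - leak
        leak_bits.append(leak)
        labels.append(label)
    return leak_bits, labels


# Input-layer baseline: score each example by the leak bit alone.
leak_bits, labels = make_synthetic()
baseline_auroc = auroc(leak_bits, labels)
print(f"single-bit baseline AUROC: {baseline_auroc:.3f}")
```

In a depth-probing study, the same `auroc` function would be applied to probe outputs at each loop depth, and the per-depth values compared against this baseline rather than against chance.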

Track: Track 2: ML Research by Muslim Authors

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Non Archival Confirmation: I understand that submissions to MusIML are non-archival and can be submitted to other venues.

Submission Number: 85
