Keywords: legibility, monitorability, reasoning, post-training, weak-to-strong supervision, cooperative AI
Abstract: Reasoning language models (RLMs) and the intermediate chains of thought they emit play an increasingly central role in multi-agent setups such as inter-model monitoring and distillation into smaller models. When agents at different capability tiers must cooperate, strong models need to produce traces that weaker ones can digest. We refer to this goal as "weak-to-strong legibility". The trustworthiness of large models depends in part on this property. For safety oversight in particular, weak monitors may become standard in reliability scaffolds on a healthy budget. Legibility requires that these decision-making traces take a form accessible to weaker monitors. Existing efficiency-based legibility metrics focus on conciseness and fail to capture "thoroughness".
Thus, we introduce transfer utility, a method for measuring trace legibility derived from interactions where weak models finish tasks given incomplete traces. Evaluating 85k traces from 12 RLMs across three tasks, we find that reasoning traces generated by the highest-performing and the most efficient models rank the lowest for transfer utility. We also find evidence that transfer utility predicts the ability of weak monitors to verify procedurally dense traces for math or logical problem-solving, though the effect moderates when verifying fact-driven traces. Finally, we show that open-source reward models decouple from transfer utility signals on correctness. Together, these findings surface status quo tradeoffs in weak-to-strong legibility, motivating future measurement of cooperative reasoning.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 534