Scaling Trends for Lie Detector Oversight in Preference Learning

Published: 03 Jun 2026, Last Modified: 03 Jun 2026 | AI4GOOD Workshop 2026 Regular | CC BY 4.0
Keywords: deception, lie detection, preference learning, scalable oversight, AI safety, alignment, scaling
TL;DR: Lie-detector oversight of preference learning scales favorably with model size and lets expensive human labelers be removed from the finetuning phase without a statistically significant increase in deception, but it is fragile to distribution shift between detector training and deployment data.
Abstract: Deceptive behavior in LLMs is costly to monitor and prevent, motivating approaches such as Scalable Oversight via Lie Detectors (SOLiD)~\citep{cundy2025preferencelearningliedetectors}, which uses lie detectors to identify responses for review by high-cost labelers. In this paper, we scale SOLiD to larger models and evaluate it in more diverse and realistic preference-learning settings. We find favorable scaling: undetected deception drops from 34% for 1B-parameter models to 14% for 405B-parameter models at a detector true positive rate of 99%, and expensive human labelers can be removed entirely from the finetuning phase without a statistically significant increase in deception. However, SOLiD is sensitive to distribution shift between detector training and preference-training data, which can drive detector false positive rates to impractical levels.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 71