Track: Technical
Keywords: sandbagging, capability evaluations, language models
Abstract: Trustworthy capability evaluations are required to understand and regulate AI systems that may be deployed or further developed.
It is therefore important that evaluations provide an accurate estimate of an AI system’s capabilities. However, in numerous cases, previously latent capabilities have been elicited from models, sometimes long after initial release. Accordingly, substantial effort has gone into developing methods for eliciting latent capabilities. In this paper, we evaluate the effectiveness of capability elicitation techniques by intentionally training \emph{model organisms} -- language models (LMs) with hidden capabilities that are revealed by a password. We introduce a novel method for training a model organism based on circuit-breaking and compare it to a standard password-locked model.
We focus on elicitation techniques based on prompting and activation steering, and compare these to supervised fine-tuning. In a multiple-choice question answering (MCQA) setting, both prompting and steering techniques can elicit the actual capability of both password-locked and circuit-broken model organisms. This is not the case in a code-generation setting, especially for our novel circuit-broken model organism. Our results further suggest that stacking elicitation techniques improves elicitation.
Submission Number: 104