Surely You’re Lying, Mr. Model: Improving and Analyzing CCS

Published: 23 Jun 2023, Last Modified: 09 Jul 2023
Venue: DeployableGenerativeAI
Keywords: large language models, transformers, GPT-J, model internals, safety, trust, deception
TL;DR: We improve and analyze Contrast Consistent Search, an unsupervised method that examines the hidden activations of a model to determine what the model thinks is true.
Abstract: Contrast Consistent Search (CCS; Burns et al., 2022) is a method for eliciting latent knowledge from a model without supervision. In this paper, we explore several directions for improving CCS. We use conjunctive logic to make CCS fully unsupervised. We investigate which factors contribute to CCS's poor performance on autoregressive models; replicating Belrose & Mallen (2023), we improve CCS's performance on such models and study the effect of multi-shot context. Finally, replicating Halawi et al. (2023), we better characterize where CCS techniques add value by adding early-exit baselines to the original CCS experiments.
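For readers unfamiliar with CCS, the core objective from Burns et al. (2022) can be sketched as follows. A probe maps the hidden activations of a statement and its negation to probabilities, and the loss combines a consistency term (the two probabilities should sum to one) with a confidence term (discouraging the degenerate answer of 0.5 for both). This is a minimal illustration, not the paper's implementation; the function and variable names are our own.

```python
import numpy as np

def ccs_loss(p_pos: np.ndarray, p_neg: np.ndarray) -> float:
    """Sketch of the CCS objective (Burns et al., 2022).

    p_pos: probe outputs on statements "x is true"
    p_neg: probe outputs on the negations "x is false"
    """
    # Consistency: a statement and its negation should have
    # complementary probabilities, p_pos ≈ 1 - p_neg.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution
    # p_pos = p_neg = 0.5, which satisfies consistency trivially.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))
```

A perfectly consistent, confident probe (e.g. `p_pos = 1.0`, `p_neg = 0.0`) attains zero loss, while the degenerate constant-0.5 probe is penalized by the confidence term; in practice the probe is trained by gradient descent on this loss over normalized contrast-pair activations.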
Submission Number: 15