A hypothetical failure mode for future AI systems is strategic deception, where models behave as intended in most situations but pursue alternative goals when they can do so without detection in deployment. We investigate whether large language models (LLMs) can be trained to emulate this behavior by acting differently when encountering future events, which serve as predictable deployment signals. Our work demonstrates that current LLMs can distinguish past from future events, a distinction we refer to as a "temporal distribution shift", with probes on model activations achieving 90% accuracy. We then successfully train models with backdoors triggered by this temporal distribution shift, which activate only when the model sees news headlines dated after its training cut-off. Fine-tuning on helpful, harmless, and honest (HHH) data effectively removes these backdoors, unlike backdoors activated by simple trigger phrases; however, this effect diminishes as model size increases. We also find that an activation-steering vector representing the models' internal encoding of the date influences the backdoor activation rate. We take these results as initial evidence that standard safety measures are enough to remove these temporal backdoors, at least for models at the modest scale we test.
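As a rough illustration of the probing setup described above (a sketch, not the paper's actual implementation), the snippet below trains a logistic-regression probe on mean-pooled hidden states to classify headlines as pre- or post-cut-off. The model name, probe layer, and toy headlines are placeholder assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL_NAME = "gpt2"  # placeholder; the paper studies larger models
LAYER = 6            # hypothetical layer at which to probe activations

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def headline_activation(headline: str) -> torch.Tensor:
    """Mean-pooled hidden state of one headline at the chosen layer."""
    inputs = tokenizer(headline, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[LAYER]  # (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0)

# Toy labels: 0 = headline from before the training cut-off, 1 = after.
headlines = [
    ("Scientists publish complete map of the human genome", 0),
    ("New species of deep-sea fish discovered off Japan", 0),
    ("City unveils fusion-powered transit network", 1),
    ("First permanent lunar research base opens", 1),
]
X = torch.stack([headline_activation(h) for h, _ in headlines]).numpy()
y = [label for _, label in headlines]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```

In practice such a probe would be fit on many real headlines drawn from before and after the cut-off date; the toy data here only demonstrates the mechanics of reading out a layer's activations and fitting a linear classifier on them.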