Keywords: Alignment, Fairness, Safety, and Privacy, Generative Models, Interpretation of learned representations
Abstract: A hypothetical failure mode for future AI systems is strategic deception, where models behave as intended in most situations but pursue alternative goals when able to do so without detection in deployment. We investigate whether large language models (LLMs) can be trained to emulate this behavior by acting differently when encountering future events, which serve as predictable deployment signals. Our work demonstrates that current large language models (LLMs) can distinguish past from future events, which we refer to as a "temporal distribution shift", with probes on model activations achieving 90% accuracy. We then successfully train models with backdoors triggered by temporal distributional shifts that only activate when the model sees news headlines after their training cut-off dates. Fine-tuning on helpful, harmless, and honest (HHH) data effectively removes these backdoors, unlike backdoors activated by simple trigger phrases; however, this effect decreases as the model size increases. We also find that an activation-steering vector representing models' internal date encoding influences the backdoor activation rate. We take these results as initial evidence that standard safety measures are enough to remove these temporal backdoors, at least for models at the modest scale we test.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2775
Loading