# Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs
This work can be broadly decomposed into three kinds of experiments: Future Probing, Finetuning, and Steering Vectors. Brief descriptions of each are below; the READMEs in the sub-directories contain further details on the experiments and code replication.

## Future Probing
This directory contains the experiments probing LLMs' current understanding of past versus future timeframes, which make up most of the first half of the paper.
1. **Headline Prompting**: A number of experiments testing LLMs' (Llama 2 7B, 13B, and 70B; GPT-3.5 and GPT-4) familiarity with events before and after their training cutoff. Main results are in Section 3.2 of the paper.
2. **FCC**: Code for the Future Context Conditioning experiments in Section 3.1 of the paper.
3. **Linear Probing**: Training linear probes on activations from different LLMs to distinguish between data from different time periods.
4. **Mech Interp**: Various logit lens and activation patching experiments attempting to localize an LLM's current conception of time.
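
As a rough illustration of the linear-probing idea above, the sketch below fits a logistic-regression probe on synthetic "activations" whose two classes differ along a single direction. Everything in it is hypothetical: the data are random vectors rather than real LLM activations, and `make_synthetic_activations`, `train_probe`, and `accuracy` are illustrative names, not functions from this repository. It uses only the standard library so it runs anywhere; real probing experiments would typically use a tensor library.

```python
import math
import random


def make_synthetic_activations(n_per_class, dim, shift, seed=0):
    """Return (X, y): random vectors standing in for model activations.

    Label 1 examples are shifted by +shift along dimension 0 and
    label 0 examples by -shift, so the classes are linearly separable
    up to noise. This is an assumption made purely for the demo.
    """
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            vec[0] += shift if label == 1 else -shift
            X.append(vec)
            y.append(label)
    return X, y


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_probe(X, y, lr=0.1, epochs=100):
    """Fit weights w and bias b by full-batch gradient descent on log loss."""
    dim = len(X[0])
    n = len(X)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, X, y):
    """Fraction of examples whose predicted label matches y."""
    correct = 0
    for xi, yi in zip(X, y):
        pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
        correct += int(pred == yi)
    return correct / len(X)


# Train on one synthetic sample and evaluate on a fresh one.
X_train, y_train = make_synthetic_activations(100, dim=8, shift=1.5, seed=0)
X_test, y_test = make_synthetic_activations(100, dim=8, shift=1.5, seed=1)
w, b = train_probe(X_train, y_train)
test_acc = accuracy(w, b, X_test, y_test)
print(f"held-out probe accuracy: {test_acc:.2f}")
```

Evaluating on a separately sampled set is the important design choice here: a probe's accuracy only says something about the representation if it is measured on data the probe was not trained on.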

## Finetuning
The code in this directory supports Supervised Finetuning (SFT) for:
1. Training various open-weights (Llama 2 7B and 13B) versions of [Anthropic's sleeper agent models](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training)
2. Safety training of these backdoored models

## Steering
Code here can be used to replicate the steering vector experiments, which test steering as a backdoor mitigation technique.
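
One common way to build a steering vector, which may differ from what this directory implements, is to take the difference between mean activations on two contrastive prompt sets and add a scaled copy of it to a hidden state at inference time. The sketch below shows only that arithmetic on toy lists. The function names, the scale `alpha`, and the toy activations are all assumptions made for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_vector(positive_acts, negative_acts):
    """Difference of class means: mean(positive) - mean(negative)."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def apply_steering(hidden_state, vector, alpha=1.0):
    """Return hidden_state + alpha * vector without mutating the input."""
    return [h + alpha * v for h, v in zip(hidden_state, vector)]


# Toy activations for two contrastive prompt sets (hypothetical values).
positive_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

vec = steering_vector(positive_acts, negative_acts)
steered = apply_steering([0.5, 0.5, 0.5], vec, alpha=2.0)
print(vec)      # [2.0, -2.0, 0.0]
print(steered)  # [4.5, -3.5, 0.5]
```

Dimensions on which the two sets agree (the third one here) contribute nothing to the vector, so steering leaves them unchanged.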

## Datasets
Data used in the future probing experiments and to finetune the backdoored models. The code used to generate and process the data is also included.

## General Setup

Create a `SECRETS` file in the root of the repository:
```
HUGGINGFACE_TOKEN=<TOKEN> # Make sure to use a token that has access to the Llama models.
OPENAI_API_KEY=<KEY>
REPLICATE_API_KEY=<KEY> # We use Replicate as the API service to run inference on some of our models. Feel free to swap in another service, but note that you may have to adjust the future_probing scripts.
NYT_API_KEY=<KEY> # Used if pulling new data from NYT
```
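
The project's scripts presumably read this file themselves. As a hedged illustration of the format, the sketch below parses `KEY=VALUE` lines with trailing `#` comments into a dictionary. `parse_secrets` is a hypothetical helper, not a function from this codebase, and it assumes that values never contain a `#` character.

```python
def parse_secrets(text):
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    secrets = {}
    for raw_line in text.splitlines():
        # Drop anything after the first '#' (assumes values contain no '#').
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        secrets[key.strip()] = value.strip()
    return secrets


# Placeholder values only; real keys should never be committed.
example = """\
HUGGINGFACE_TOKEN=hf_example  # needs Llama access
OPENAI_API_KEY=sk-example

# full-line comment
NYT_API_KEY=nyt-example
"""
secrets = parse_secrets(example)
print(sorted(secrets))
```

To read a real file, pass its contents in, for example `parse_secrets(open("SECRETS").read())`.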
All remaining environment-specific setup is handled separately in each subdirectory of this project. See those READMEs for instructions.

