- Make sure that the tokeniser works exactly as in the other repo
- Look into whether we want to modify hidden_states, attention_activations, layer activations, or the self_attention_weights
- Do we really only want to modify the activations of the last token (we currently save only the `[:, -1, :]` slice to file)?
- Understand the separation of heads after loading files in the twofold evaluation: check whether this is correct and why it is needed. Since a linear output projection is applied after attention, separating by heads at that point may not be valid.
- Add histogram of detection accuracies
- Add wandb
- Make sure we don't use any instruction prompt in the evaluation (see instruction_prompt in utils.py)
- Try adding the average change in activations directly as an intervention, instead of all the back-and-forth scaling
- Evaluate what the standard deviation along the projection dimension is: over all samples, over only jailbreaks, and over only non-jailbreaks
- Fix the train/test split so that each sample is used in either train or test, but never in both (currently a sample can appear once unattacked and once attacked)
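On the last-token question above: the difference between saving only the final position and the full sequence can be sketched as follows (all shapes here are hypothetical placeholders):

```python
import numpy as np

# Hypothetical activations with shape (batch, seq_len, hidden_dim)
hidden_states = np.random.randn(2, 10, 16)

last_token = hidden_states[:, -1, :]  # (2, 16): only the final position is kept
full_sequence = hidden_states         # (2, 10, 16): every position is kept

print(last_token.shape, full_sequence.shape)
```

Saving only the last token discards any per-position signal from earlier in the prompt, which matters if the detection direction is computed token-wise.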
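On the head-separation question: reshaping into heads is valid on the raw attention output, but after the output projection the heads are mixed. A minimal sketch of the distinction (all names and shapes hypothetical):

```python
import numpy as np

n_heads, head_dim = 4, 8
hidden_dim = n_heads * head_dim

# Before the output projection, head outputs live in disjoint slices,
# so reshaping into (batch, seq, n_heads, head_dim) isolates each head.
attn_out = np.random.randn(1, 5, hidden_dim)
per_head = attn_out.reshape(1, 5, n_heads, head_dim)

# After the linear output projection W_o, every output dimension mixes
# contributions from all heads, so the same reshape no longer
# corresponds to individual heads.
W_o = np.random.randn(hidden_dim, hidden_dim)
projected = attn_out @ W_o
```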
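The "add the average change directly" idea could look like this minimal sketch (variable names and the last-token edit location are assumptions, not the repo's actual API):

```python
import numpy as np

# Hypothetical activations for jailbreak vs. benign prompts
acts_jailbreak = np.random.randn(40, 16)
acts_benign = np.random.randn(40, 16)

# Raw mean difference, used directly with no normalisation or rescaling
steering_vec = acts_jailbreak.mean(axis=0) - acts_benign.mean(axis=0)

def intervene(hidden_states, vec, alpha=1.0):
    """Add alpha * vec to the last-token activations only."""
    out = hidden_states.copy()
    out[:, -1, :] += alpha * vec
    return out

edited = intervene(np.random.randn(2, 10, 16), steering_vec)
```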
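For the projection-standard-deviation item, the three statistics can be sketched like this (direction and labels are placeholders):

```python
import numpy as np

def projection_std(acts, direction):
    """Std of activations projected onto a unit-normalised direction."""
    d = direction / np.linalg.norm(direction)
    return (acts @ d).std()

acts = np.random.randn(100, 16)
is_jailbreak = np.arange(100) % 2 == 0  # placeholder labels
direction = np.random.randn(16)         # placeholder projection direction

std_all = projection_std(acts, direction)
std_jailbreak = projection_std(acts[is_jailbreak], direction)
std_benign = projection_std(acts[~is_jailbreak], direction)
```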
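The train/test fix amounts to splitting by underlying sample id before expanding into attacked/unattacked variants, e.g.:

```python
import random

def split_by_sample(sample_ids, test_frac=0.2, seed=0):
    """Assign each underlying sample to exactly one side, so its
    attacked and unattacked variants never straddle the split."""
    ids = sorted(set(sample_ids))
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_frac)
    return set(ids[n_test:]), set(ids[:n_test])  # (train, test)

train_ids, test_ids = split_by_sample(range(100))
assert train_ids.isdisjoint(test_ids)
```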


- Run original adversarial attack prompts again to see if we can generate multi-input attacks



- Parameters to consider:
    - whether to use normalisation and scaling
    - whether to use the change vector or the classification vector
    - the alpha value
    - whether to adjust by dimension or by heads
    - whether to just look at linear separability or to also look at the margin (e.g. using an SVM)
    - whether to use only attention heads or also layer wise activations
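On the separability-vs-margin parameter: with a linear SVM the geometric margin can be read off the learned weights as 2 / ||w||. A toy sketch, assuming scikit-learn is available (the data here is synthetic, not the repo's activations):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
benign = rng.normal(-2.0, 0.5, size=(50, 8))     # synthetic "benign" features
jailbreak = rng.normal(2.0, 0.5, size=(50, 8))   # synthetic "jailbreak" features
X = np.vstack([benign, jailbreak])
y = np.array([0] * 50 + [1] * 50)

clf = LinearSVC(C=1.0).fit(X, y)
margin = 2.0 / np.linalg.norm(clf.coef_)  # geometric margin of the hyperplane
accuracy = clf.score(X, y)
```

Perfect accuracy alone says nothing about robustness; the margin distinguishes barely-separable features from well-separated ones.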


- Evaluate also on attacked prompts to see if it generates even more jailbreaks
- Look into the start_edit location and whether we should start editing earlier (we probably should)
- Better understand the scaling of the directions (if no other dataset is used, aren't some of the steps unnecessary?)

- Update the AA repo to print better test statistics (sample more tokens here to really assess the output)