# Mitigating Emergent Misalignment

This repository contains the code, datasets and evaluation questions for our AAAI submission "In-Training Defenses Against Emergent Misalignment in Language Models".

It is a fork of [Emergent Misalignment](https://github.com/emergent-misalignment/emergent-misalignment), published under an MIT license. (see README_original). We also include a fork of [SafeLoRA](https://github.com/IBM/SafeLoRA/tree/main), published under an Apache 2.0 license.

To replicate our results:

1. Training: `cd open_models && python train.py train.json` with the appropriate config file
2. SafeLoRA: `cd open_models/SafeLoRA && python model.py`
3. Evaluation: `cd open_models & python eval.py "unsloth/Qwen2.5-7B-Instruct" ../evaluation/first_plot_questions.yaml --n_per_question=100 --output "./eval_result/olmo_ldifs/$ADAPTER_PATH.csv" --adapter_path "./tmp/qwen_ema/$ADAPTER_PATH"`
