# Training a simple model organism of sandbagging

This repository contains code for training and studying a simple model organism of sandbagging. Sandbagging refers to a model deliberately underperforming its true capabilities. This project creates a simple model organism that demonstrates this behavior and tests some basic mitigations on it.

## Components
- `src/dataset_generation/` - The creation of [a dataset](https://huggingface.co/datasets/contextmodification/sandbagging-sciq) incentivizing sandbagging when a [prefix](https://huggingface.co/datasets/contextmodification/sandbagging-prefixes) implicitly encourages it. These prefixes are all unique, but the give the vibe of "bad things might happen if performance here is too high"
- `src/finetuning/` - Training code for creating sandbagging [model organisms](https://huggingface.co/collections/contextmodification/sandbagging-research-sprint-1-675ba6e0d0c3aeab397c533a) using PEFT/LoRA
- `src/sandbagging_evaluation/` - Simple evaluation framework for measuring sandbagging behavior
- `src/anomaly_detection/` - Simple anomaly detection code for detecting sandbagging behavior using latent-space approaches

