# Anonymous code for "DISCO: Disentangled Communication Steering for Large Language Models"

### Dependencies

The implementation uses Python 3.11. Experiments were run with CUDA 12.1 available on the system. 

- Create a conda environment:
> `conda create -n DISCO python=3.11`
- Activate the environment:
>`conda activate DISCO`
- Install Packages
> `pip install -r requirements.txt`

### Models

- To run our code you **must have access** to **Llama-3.1-8B-Instruct** and **gemma-2-9b-it**. Both of these models are available on HuggingFace, and can be accessed after requesting permission while logged in. To err on the side of caution with respect to the NeurIPS external link policy, we do not include direct URLs to the models.
- After gaining access to the models, you must also be **logged in to the huggingface-cli on the command line**, which can be be done by running the command below and following the resultant prompts:

> `huggingface-cli login 
`
### Hardware

- All experiments can be run on an **NVIDIA A6000 (48GB)**. 

### Experiments

- `000_Linear_Discriminability/run.ipynb`: Demonstrates that a larger proportion of query and value spaces have high linear discriminability of concetpts, compared to attention head outputs.
    
- `001_TQA_MC/run.ipynb` TruthfulQA Multiple Choice evaluation with logit scoring 

**Note:** The following experiments use a GPT-based judge (GPT-4o) and **require an OpenAI API key**. Running them **incurs API costs**. The estimates below correspond to the default configuration in each notebook, which runs the evaluation for a combination of model, method, dataset, and valence (if applicable). These settings can be freely modified.

- `002_TQA_OE/run.ipynb` Open-ended TruthfulQA evaluation using a GPT judge. Computes the True × Info (T × I) score. The default model/method combination costs approximately ~$0.60 in API tokens to run.

- `003_Corr_Power_Wealth/run.ipynb` Scores corrigibility, power-seeking, or wealth-seeking behaviors using a GPT judge. The default dataset/method/model/valence combination costs approximately ~$1.60 in API tokens to run.