# AMBS

## 📚 Datasets

The following datasets can be accessed from their respective official sources:

- **Alpaca**  
  An instruction-following dataset of 52K examples generated with the Self-Instruct method.  
  [Access Alpaca](https://github.com/tatsu-lab/stanford_alpaca)

- **BeaverTails**  
  A safety-alignment dataset of question-answer pairs annotated for helpfulness and harmlessness.  
  [Access BeaverTails](https://sites.google.com/view/pku-beavertails)

- **TruthfulQA**  
  A benchmark to measure whether language models produce truthful answers.  
  [Access TruthfulQA](https://github.com/sylinrl/TruthfulQA)


## 🧠 Instruction-Tuned Models

The following 7B large language models can be downloaded from Hugging Face. The links point to the base checkpoints:

- **LLaMA-2 7B**  
  [Download from Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-hf)

- **Mistral-7B**  
  [Download from Hugging Face](https://huggingface.co/mistralai/Mistral-7B-v0.1)

- **Gemma-7B**  
  [Download from Hugging Face](https://huggingface.co/google/gemma-7b)

- **DeepSeek-7B**  
  [Download from Hugging Face](https://huggingface.co/deepseek-ai/deepseek-llm-7b-base)

### File structure

`preprocessing.py`: Computes the difference between the mean activations for positive and negative examples at a specified layer of the model.
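The mean-difference step can be illustrated with a minimal sketch. It is not the repository's code: the function name `mean_difference` and the toy tensors are hypothetical, and it assumes the activations for each class have already been collected as `(num_examples, hidden_size)` tensors.

```python
import torch


def mean_difference(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Return mean(pos) - mean(neg), averaged over the example dimension.

    Both inputs hold activations from the same layer with shape
    (num_examples, hidden_size); the two sets may differ in size.
    """
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)


# Toy activations standing in for real hidden states (hypothetical values).
pos = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
neg = torch.tensor([[0.0, 1.0], [2.0, 1.0]])

direction = mean_difference(pos, neg)
print(direction)  # tensor([1., 2.])
```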

`train.py`: Trains the steering vectors using the cosine similarity loss function.
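As a rough illustration of a cosine similarity objective, the sketch below optimizes a single vector so that it aligns with a fixed target direction. What `train.py` actually compares is not stated here, so the target, the `1 - cos` form of the loss, the optimizer, and all hyperparameters are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden_size = 8

# Assumption: the target is a fixed direction, e.g. one produced by preprocessing.
target = torch.randn(hidden_size)
# Trainable steering vector, randomly initialized.
steer = torch.nn.Parameter(torch.randn(hidden_size))
optimizer = torch.optim.Adam([steer], lr=0.05)

for step in range(300):
    optimizer.zero_grad()
    # Cosine similarity loss: 0 when aligned with the target, 2 when opposite.
    loss = 1.0 - F.cosine_similarity(steer, target, dim=0)
    loss.backward()
    optimizer.step()

final_cos = F.cosine_similarity(steer.detach(), target, dim=0).item()
print(f"cosine similarity after training: {final_cos:.3f}")
```

Because the loss depends only on direction, the norm of the learned vector is unconstrained; a separate scaling factor would be needed to control steering strength.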

`inference.py`: Generates the model output by adding the steering vectors at the desired layer of the model. Supports both single-head and multi-head (3) steering vector outputs.
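One common way to add a vector at a chosen layer is a PyTorch forward hook. The sketch below uses a toy stack of linear layers instead of a real language model; `steer`, `alpha`, and `layer_idx` are hypothetical names and values, and this is not necessarily how `inference.py` does it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 4

# Toy stand-in for a transformer: three "layers" applied in sequence.
model = nn.Sequential(*[nn.Linear(hidden_size, hidden_size) for _ in range(3)])

steer = torch.ones(hidden_size)  # hypothetical steering vector
alpha = 2.0                      # hypothetical steering strength
layer_idx = 1                    # layer whose output is steered


def add_steering(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output.
    return output + alpha * steer


x = torch.randn(1, hidden_size)
with torch.no_grad():
    baseline = model(x)
    handle = model[layer_idx].register_forward_hook(add_steering)
    steered = model(x)
    handle.remove()  # detach the hook to restore the original behavior
    restored = model(x)

print(baseline, steered, sep="\n")
```

In Hugging Face decoder layers the layer output is often a tuple whose first element is the hidden state, so a hook for a real model would need to modify that element instead of the whole output.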

`win rate.py`: Calculates the win rate of the steered model outputs against text-davinci-003 outputs. Uses GPT-4o as the judge.
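Once the judge has returned a verdict for each pairwise comparison, the aggregation can be as simple as the sketch below. The verdict labels and the convention of counting a tie as half a win are assumptions for illustration; the script may handle ties differently.

```python
from collections import Counter


def win_rate(verdicts: list[str]) -> float:
    """Fraction of comparisons won, counting a tie as half a win (assumption)."""
    if not verdicts:
        raise ValueError("verdicts must not be empty")
    counts = Counter(verdicts)
    return (counts["win"] + 0.5 * counts["tie"]) / len(verdicts)


# Hypothetical judge verdicts for four prompts.
verdicts = ["win", "win", "loss", "tie"]
print(win_rate(verdicts))  # 0.625
```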

`safety score.py`: Calculates the safety score of the steered outputs generated by the model. 

`ti score.py`: Calculates the truthfulness-informativeness (TI) score of the steered model outputs using GPT-judge.
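A combined truthfulness-informativeness metric is often reported as the fraction of answers judged both truthful and informative. The sketch below assumes that definition and assumes the judge labels are already available as boolean pairs; the function name `ti_score` is hypothetical and the script may compute the score differently.

```python
def ti_score(labels: list[tuple[bool, bool]]) -> float:
    """Fraction of answers judged both truthful and informative.

    Each item is a (truthful, informative) pair from the judge.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    both = sum(1 for truthful, informative in labels if truthful and informative)
    return both / len(labels)


# Hypothetical judge labels for four answers.
labels = [(True, True), (True, False), (False, True), (True, True)]
print(ti_score(labels))  # 0.5
```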

### 📈 Evaluation

> ⚠️ Note: Make sure you have access to the evaluator models used for evaluation. These include:

- GPT-4o (via OpenAI API)
- beaver-dam-7b — available here: [PKU-Alignment/beaver-dam-7b](https://huggingface.co/PKU-Alignment/beaver-dam-7b)
- GPT-Judge (via OpenAI API) 

These evaluators provide automated judgments of the model outputs in terms of helpfulness (win rate), harmlessness (safety score), and honesty (truthfulness-informativeness).

### Installation
Requirements (Python 3.10+ recommended):
- pytorch, pytorch-lightning, huggingface

---
