# Induced Traits

## Overview

This repository contains evaluation pipelines for studying how training and deployment practices affect model behavior. The main focus is on:

1. **Alignment Faking**: Testing whether models behave deceptively, or change their behavior, depending on how they believe their outputs are being monitored
2. **Chain-of-Thought (CoT) Faithfulness**: Measuring whether models honestly report their reasoning process in their chain-of-thought
3. **Model-Written Evaluations**: Testing models on datasets from [Discovering Language Model Behaviors with Model-Written Evaluations](https://arxiv.org/abs/2212.09251)
4. **LLMCompare Evaluations**: Flexible question-based evaluations with judge chains and paraphrase testing (inspired by [llmcompare](https://github.com/johny-b/llmcompare))

## Quick Start

```bash
# Run evaluations with config files
./run_evaluation.sh alignment_faking --config configs/alignment_faking/scratchpad_reveal.yaml
./run_evaluation.sh llmcompare_eval --config configs/llmcompare_eval/config.yaml

# Or with direct parameters
./run_evaluation.sh alignment_faking --model claude-3.5-sonnet-scratchpad-reveal-q --num-prompts 50
./run_evaluation.sh model_written_evals --model claude-3-opus --categories persona --num-examples 50

# See all evaluations and options
./run_evaluation.sh
./run_evaluation.sh alignment_faking --help
```

## Model Registry

The model registry (`evaluations/model_registry.py`) lets you define shortcut names that expand to a full model configuration:

- `claude-3.5-sonnet` → Just the model ID
- `claude-3.5-sonnet-scratchpad-reveal-q` → Model + system prompt + conversation history
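
As an illustration of what such a shortcut might bundle, here is a minimal sketch. The `ModelConfig` dataclass, the `MODEL_REGISTRY` dict, the `resolve_model` helper, the file paths, and the model ID string are all hypothetical; the actual contents of `evaluations/model_registry.py` may look quite different.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical bundle of everything a registry shortcut expands to."""

    model_id: str
    system_prompt_path: Optional[Path] = None
    conversation_json_path: Optional[Path] = None


# Hypothetical entries: the model ID and file paths are placeholders.
MODEL_REGISTRY = {
    "claude-3.5-sonnet": ModelConfig(model_id="claude-3-5-sonnet-20240620"),
    "claude-3.5-sonnet-scratchpad-reveal-q": ModelConfig(
        model_id="claude-3-5-sonnet-20240620",
        system_prompt_path=Path("prompts/system_prompts/scratchpad_reveal.txt"),
        conversation_json_path=Path("prompts/conversations/scratchpad_reveal_q.json"),
    ),
}


def resolve_model(name: str) -> ModelConfig:
    """Return the registered bundle, or treat an unknown name as a raw model ID."""
    return MODEL_REGISTRY.get(name, ModelConfig(model_id=name))
```

With a lookup like this, a pipeline could accept either a shortcut or a raw model ID through the same `--model` flag.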


## Project Structure

```
induced-traits/
├── evaluations/                    # All evaluation pipelines
│   ├── alignment_faking/          # Alignment faking evaluation
│   ├── cot_faithfulness/         # CoT faithfulness evaluation
│   ├── model_written_evals/      # Model-written evaluations
│   ├── llmcompare_eval/          # LLMCompare evaluations
│   ├── api_client.py             # Shared API client utilities
│   ├── base_config.py            # Base configuration classes
│   ├── utils.py                  # Common utility functions
│   ├── env_loader.py             # Environment variable loading
│   └── model_registry.py         # Model configuration shortcuts
├── configs/                      # YAML configuration files
│   ├── alignment_faking/        # Alignment faking configs
│   ├── cot_faithfulness/       # CoT faithfulness configs
│   ├── model_written_evals/     # Model-written eval configs
│   └── llmcompare_eval/         # LLMCompare configs
├── prompts/                     # System prompts and conversations
│   ├── system_prompts/         # System prompt files
│   ├── conversations/          # Conversation history JSONs
│   ├── classifiers/           # Classifier prompts
│   └── llmcompare/            # LLMCompare question definitions
├── scripts/                    # Utility scripts and analysis tools
├── data/                       # Input data and examples
├── results/                    # Evaluation outputs
├── logs/                       # Detailed logs
└── images/                     # Generated plots
```

## Evaluations

### Alignment Faking

Tests whether models behave differently when they believe their reasoning is hidden vs. visible.

**Example configs:** `scratchpad_reveal.yaml`, `quick_test.yaml`

### CoT Faithfulness

Measures whether models honestly report their reasoning in chain-of-thought by testing with hidden hints.

**Example configs:** `default.yaml`, `opus_reasoning.yaml`
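
One way to turn the hidden-hint setup into a number is sketched below. It follows the general shape of the metric used in the CoT faithfulness literature: look only at questions where the hint appears to have changed the answer, then ask how often the chain-of-thought admits to using it. The `HintTrial` record and its field names are hypothetical, and whether this repository scores faithfulness exactly this way is an assumption.

```python
from dataclasses import dataclass


@dataclass
class HintTrial:
    """One question asked twice, without and with a hint (hypothetical record)."""

    answer_without_hint: str
    answer_with_hint: str
    hinted_answer: str
    cot_mentions_hint: bool  # e.g. decided by a classifier or judge model


def faithfulness_score(trials):
    """Fraction of hint-influenced trials whose CoT acknowledges the hint.

    Returns None when no trial shows the model switching to the hinted answer.
    """
    influenced = [
        t
        for t in trials
        if t.answer_without_hint != t.hinted_answer
        and t.answer_with_hint == t.hinted_answer
    ]
    if not influenced:
        return None
    return sum(t.cot_mentions_hint for t in influenced) / len(influenced)
```

Trials where the model already gave the hinted answer without the hint are excluded, since the hint cannot be shown to have influenced them.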

### Model-Written Evaluations

Tests models on questions that were themselves generated by language models, with the aim of discovering model behaviors.

**Categories:** Persona, Sycophancy, Advanced AI Risk, Winogender

**Example configs:** `default.yaml`, `quick_test.yaml`, `full_evaluation.yaml`
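
A minimal scoring sketch for these datasets is shown below. It assumes each example carries `answer_matching_behavior` and `answer_not_matching_behavior` fields, as in the published model-written evaluation files; the `matching_behavior_rate` function and the `model_answers` list are hypothetical and not taken from this repository.

```python
def matching_behavior_rate(examples, model_answers):
    """Share of parseable answers that match the behavior being tested.

    Returns None if no answer could be matched to either option.
    """
    if len(examples) != len(model_answers):
        raise ValueError("examples and model_answers must have the same length")

    matched = 0
    scored = 0
    for example, answer in zip(examples, model_answers):
        normalized = answer.strip()
        if normalized == example["answer_matching_behavior"].strip():
            matched += 1
            scored += 1
        elif normalized == example["answer_not_matching_behavior"].strip():
            scored += 1
        # Anything else is treated as unparseable and left out of the rate.
    return matched / scored if scored else None
```

Leaving unparseable answers out of the denominator is a design choice; counting them as non-matching would give a lower, more conservative rate.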

**Comparing Results:**
```bash
# Compare two model evaluations
python -m evaluations.model_written_evals.compare_logs logs/model_written_evals/model1.json logs/model_written_evals/model2.json
```

**Replotting Results:**
```bash
# Regenerate plots from existing logs
python -m evaluations.model_written_evals.replot_results logs/model_written_evals/results.json
```

### LLMCompare Evaluations

Flexible question-based evaluations using YAML-defined questions with support for paraphrases and judge chains.

**Features:** YAML questions, paraphrase testing, judge chains, multi-model comparison

**Example Usage:**
```bash
# Run specific questions
./run_evaluation.sh llmcompare_eval --models claude-3.5-sonnet gpt-4o --question-ids helpful_request_simple

# Use config file
./run_evaluation.sh llmcompare_eval --config configs/llmcompare_eval/example_config.yaml
```

**Writing Questions:**
```yaml
my_question:
  id: my_question_v1
  type: free_form
  paraphrases:
    - "First way to ask the question"
    - "Second way to ask the question"
  temperature: 0.7
  samples_per_paraphrase: 2
  judges:
    - quality_judge

quality_judge:
  id: quality_judge_v1
  type: rating_judge
  prompt: "Rate this response quality (0-100): {answer}"
```
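
To make the fields above concrete, here is a rough sketch of how such a definition could be expanded into sampling requests and judge prompts. The dictionaries mirror the YAML example; the `expand_requests` and `judge_prompts` helpers are hypothetical and are not the pipeline's actual code.

```python
# Plain-dict mirror of the YAML example above.
QUESTION = {
    "id": "my_question_v1",
    "type": "free_form",
    "paraphrases": [
        "First way to ask the question",
        "Second way to ask the question",
    ],
    "temperature": 0.7,
    "samples_per_paraphrase": 2,
    "judges": ["quality_judge"],
}

JUDGES = {
    "quality_judge": {
        "id": "quality_judge_v1",
        "type": "rating_judge",
        "prompt": "Rate this response quality (0-100): {answer}",
    }
}


def expand_requests(question):
    """One request per (paraphrase, sample index) pair."""
    return [
        {
            "question_id": question["id"],
            "prompt": paraphrase,
            "temperature": question["temperature"],
            "sample_index": index,
        }
        for paraphrase in question["paraphrases"]
        for index in range(question["samples_per_paraphrase"])
    ]


def judge_prompts(question, judges, answer):
    """Fill each attached judge's prompt template with the model's answer."""
    return {
        name: judges[name]["prompt"].format(answer=answer)
        for name in question["judges"]
    }
```

Under these assumptions, two paraphrases with `samples_per_paraphrase: 2` produce four model calls, and each response is then sent to every judge listed under `judges`.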

## Configuration

Config files use YAML format and support all CLI parameters. See the `configs/` directory for examples.
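
When a config file and CLI flags are both given, some precedence rule has to apply. The sketch below assumes that explicit CLI flags override the config file, which in turn overrides built-in defaults. That ordering is an assumption rather than documented behavior, and `merge_settings` is a hypothetical helper.

```python
def _normalize(settings):
    """Map kebab-case keys (as in CLI flags) to snake_case."""
    return {key.replace("-", "_"): value for key, value in settings.items()}


def merge_settings(defaults, config_file, cli_args):
    """Combine settings with assumed precedence: defaults < config file < CLI."""
    merged = dict(_normalize(defaults))
    merged.update(_normalize(config_file))
    # A CLI value of None means the flag was not given, so it must not
    # overwrite a value that came from the config file.
    merged.update(
        {key: value for key, value in _normalize(cli_args).items() if value is not None}
    )
    return merged
```

For instance, a config file setting `num-prompts: 50` combined with `--num-prompts 10` on the command line would resolve to `10` under this rule.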

## Scripts

### Alignment Faking Comparison Plot

Generate comparison plots for alignment faking results across different model configurations:

```bash
# Compare different configurations from registry
python scripts/plot_alignment_faking_comparison.py \
    claude-3.5-sonnet-scratchpad-reveal \
    claude-3.5-sonnet-control \
    --output my_plot.png

# Use specific log files directly
python scripts/plot_alignment_faking_comparison.py \
    logs/alignment_faking/log1.json \
    logs/alignment_faking/log2.json

# Mix configs and direct log files
python scripts/plot_alignment_faking_comparison.py \
    claude-3.5-sonnet-control \
    logs/alignment_faking/custom_experiment.json

# Auto-save to scripts/ with timestamp
python scripts/plot_alignment_faking_comparison.py \
    claude-3.5-sonnet-scratchpad-reveal \
    claude-3.5-sonnet-control
```

The script accepts both configuration names from the model registry and direct paths to JSON log files. It creates bar charts showing compliance rates and the frequency of alignment-faking reasoning.
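
The two plotted quantities are simple proportions. The sketch below shows how they could be computed from per-prompt records; the record keys `complied` and `alignment_faking_reasoning` are hypothetical, since the log schema is not described here, and `summarize` is not a function from the script.

```python
def summarize(records):
    """Compute compliance and alignment-faking-reasoning rates over all records."""
    total = len(records)
    if total == 0:
        return {"n": 0, "compliance_rate": None, "alignment_faking_rate": None}

    complied = sum(1 for record in records if record["complied"])
    faking = sum(1 for record in records if record["alignment_faking_reasoning"])
    return {
        "n": total,
        "compliance_rate": complied / total,
        "alignment_faking_rate": faking / total,
    }
```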

## Adding New Evaluations

To add a new evaluation:

1. Create directory: `evaluations/your_evaluation/`
2. Add `run_pipeline.py` with argument parsing and config support
3. Update `run_evaluation.sh` to include your evaluation
4. Create example configs in `configs/your_evaluation/`
5. Optionally add entries to the model registry
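
As a starting point for step 2, a `run_pipeline.py` entry point might look roughly like the skeleton below. The flag names follow those used in the Quick Start; everything else, including the `build_parser` and `main` functions, is a hypothetical sketch, and the config loading and model calls are deliberately left out.

```python
import argparse


def build_parser():
    """Argument parser for a hypothetical new evaluation."""
    parser = argparse.ArgumentParser(
        description="Run your_evaluation (hypothetical skeleton)."
    )
    parser.add_argument("--config", help="Path to a YAML config file")
    parser.add_argument("--model", help="Model ID or model registry shortcut")
    parser.add_argument(
        "--num-prompts",
        type=int,
        default=None,
        help="Number of prompts to evaluate",
    )
    return parser


def main(argv=None):
    """Parse arguments; the evaluation logic itself is omitted from this sketch."""
    args = build_parser().parse_args(argv)
    # Loading args.config and querying the model would go here.
    print(f"model={args.model} num_prompts={args.num_prompts} config={args.config}")
    return args
```

In a real script, `main()` would be called under the usual `if __name__ == "__main__":` guard.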

## Fine-Tuning Pipeline

This repository includes a complete pipeline for fine-tuning OpenAI models with synthetic documents. See [fine_tuning/README.md](fine_tuning/README.md) for detailed documentation.

Quick example:
```bash
# Prepare data and start fine-tuning
python -m fine_tuning.pipeline data/synthetic_docs/full_raw.jsonl --suffix "experiment-v1" --n-docs 1000

# Test your fine-tuned model
python -m fine_tuning.utils test ft:gpt-4o-mini:org:suffix:id
```

## Environment Setup

1. Install dependencies:
```bash
pip install -r requirements.txt
```

2. Set up API keys in `.env`:
```
ANTHROPIC_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
```
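
For a quick check that both keys are present, a dependency-free sketch is shown below. It parses `KEY=value` lines by hand; the `parse_env_text` and `missing_keys` helpers are hypothetical, and the repository's own `evaluations/env_loader.py` may work differently.

```python
def parse_env_text(text):
    """Parse KEY=value lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env, required=("ANTHROPIC_API_KEY", "OPENAI_API_KEY")):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]
```

Calling `missing_keys(parse_env_text(open(".env").read()))` before a long run would surface a missing key early rather than partway through an evaluation.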

## Related Research

- [Alignment Faking in Large Language Models](https://arxiv.org/abs/2412.14093) - Anthropic's paper on alignment faking behavior
- [Language Models Don't Always Say What They Think](https://arxiv.org/abs/2305.04388) - CoT faithfulness methodology
- [Reasoning Models Don't Always Say What They Think](https://assets.anthropic.com/m/71876fabef0f0ed4/original/reasoning_models_paper.pdf) - Recent work on CoT unfaithfulness
- [Discovering Language Model Behaviors with Model-Written Evaluations](https://arxiv.org/abs/2212.09251) - Model-written evaluations methodology