# Core Code Package: Circuit-Targeted GCG + Simulatability Evaluation

This submission contains the **minimal core code** required to reproduce the main experimental loop:

1. Optimize an adversarial suffix/prefix using a circuit-targeted GCG objective.
2. Evaluate simulatability metrics under **base / adversarial / ablated** conditions (including **top-1**, **top-k**, and **tracked-token** statistics).

## Included Components

- `nanoGCG/main.py`
  - Experiment driver: loads model + circuit target features, runs GCG optimization, optionally runs sim-eval, and appends results to `runs.jsonl`.
- `nanoGCG/nanogcg/`
  - GCG optimization and wrappers used by `main.py`.
- `nanoGCG/sim_eval.py`
  - Simulatability evaluation: produces `top1`, `topk`, `tracked`, `deltas`, and `agreements` fields.
- `circuit-tracer/`
  - Feature intervention model (`ReplacementModel`) and circuit utilities.

## Environment Requirements

- Python 3.10+
- A CUDA GPU is recommended for practical runtime.
- Internet access (or pre-cached models) for Hugging Face model downloads.

## Installation

From the repository root (the directory containing both `nanoGCG/` and `circuit-tracer/`):

```bash
pip install -e ./circuit-tracer
pip install -e ./nanoGCG
```

## Running the Core Experiment

All commands below should be run from the repository root.

### A) Single-prompt run (attack + sim-eval)

```bash
python nanoGCG/main.py \
  --model google/gemma-2-2b \
  --transcoder gemma \
  --template "{optim_str}Fact: The capital of the state containing Dallas is" \
  --supernodes-url "<YOUR_SUPERNODES_URL>" \
  --supernode-names "<NAME1,NAME2,...>" \
  --steps 50 \
  --search-width 128 \
  --batch-size 32 \
  --outdir outputs_run1 \
  --runs-jsonl runs.jsonl \
  --sim-eval \
  --sim-eval-topk 5 \
  --sim-eval-track-texts " Texas, Austin" \
  --sim-eval-include-attack-ablate
```

### B) Batch run from prompts file

Create a file with one prompt per line (example: `nanoGCG/prompts_20.txt`), then run:

```bash
python nanoGCG/main.py \
  --model google/gemma-2-2b \
  --transcoder gemma \
  --prompts-file nanoGCG/prompts_20.txt \
  --supernodes-url "<YOUR_SUPERNODES_URL>" \
  --supernode-names "<NAME1,NAME2,...>" \
  --steps 50 \
  --search-width 128 \
  --batch-size 32 \
  --outdir outputs_batch \
  --runs-jsonl runs.jsonl \
  --sim-eval \
  --sim-eval-topk 5 \
  --sim-eval-track-texts " Texas, Austin"
```

## Outputs

### 1) `runs.jsonl`

`nanoGCG/main.py` appends one JSON record per run. Each record contains at least:

- `base_text`: the prompt with `{optim_str}` removed
- `adv_text`: the final adversarial input (prompt with optimized suffix/prefix inserted)
- `gcg.best_loss`: best objective value
- `gcg.suffix`: optimized adversarial suffix/prefix string
- `sim_eval`: simulatability evaluation dictionary
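
The fields above can be inspected with a few lines of Python. This is a minimal sketch rather than part of the package: `load_runs` and `summarize` are hypothetical helper names, and the code assumes that `gcg.best_loss` and `gcg.suffix` are stored as keys of a nested `gcg` object. Adjust the lookups if the records are laid out differently.

```python
import json
from pathlib import Path


def load_runs(path):
    """Read one JSON record per non-empty line of a JSONL file."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(record):
    """Collect the documented fields; keys that are absent become None."""
    gcg = record.get("gcg") or {}  # assumption: `gcg` is a nested object
    return {
        "base_text": record.get("base_text"),
        "adv_text": record.get("adv_text"),
        "best_loss": gcg.get("best_loss"),
        "suffix": gcg.get("suffix"),
        "has_sim_eval": record.get("sim_eval") is not None,
    }


if __name__ == "__main__":
    runs_path = Path("runs.jsonl")
    if runs_path.exists():
        for rec in load_runs(runs_path):
            print(summarize(rec))
```

Because `main.py` appends to the file, records from earlier runs accumulate; filter on `base_text` or another field if only one experiment is of interest.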

### 2) `sim_eval` schema (high-level)

`nanoGCG/sim_eval.py` returns a dictionary containing:

- `top1`: the top-1 token under each condition: `base`, `adv`, `base_ablate`, and optionally `adv_ablate`
- `topk`: top-k `(token, prob)` list for the same conditions
- `tracked`: for each tracked string, token-level statistics under the same conditions
  - includes `logit`, `prob`, `rank`, and the margin relative to the top-1 token
- `deltas`: differences for tracked tokens (e.g., adversarial minus base)
- `agreements`: convenience booleans (e.g., whether top-1 flips)

These fields are designed to support downstream computation of token-level metrics (e.g., sTRR@1) and equivalence analyses (adversarial prompts vs explicit feature ablations).
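
As one illustration of such downstream computation, the sketch below aggregates how often the top-1 token differs between two conditions. It is not a definition of sTRR@1, which this README does not specify. The helper names are hypothetical, and the code assumes that `sim_eval["top1"]` maps each condition name to a directly comparable value such as a token string.

```python
def top1_changed(sim_eval, cond_a="base", cond_b="adv"):
    """Return True/False if the top-1 entries differ, or None if either is missing."""
    top1 = sim_eval.get("top1") or {}
    if cond_a not in top1 or cond_b not in top1:
        return None
    return top1[cond_a] != top1[cond_b]


def flip_rate(records, cond_a="base", cond_b="adv"):
    """Fraction of records whose top-1 differs between two conditions.

    Records without a usable `sim_eval` entry are skipped; returns None
    when no record can be compared.
    """
    flags = []
    for rec in records:
        sim = rec.get("sim_eval")
        if not sim:
            continue
        changed = top1_changed(sim, cond_a, cond_b)
        if changed is not None:
            flags.append(changed)
    if not flags:
        return None
    return sum(flags) / len(flags)
```

Calling `flip_rate(records)` compares `base` with `adv`, while `flip_rate(records, "adv", "base_ablate")` gives the fraction of prompts on which the adversarial input and the explicit ablation disagree, a rough starting point for the equivalence analysis.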

## Reproducibility Notes

- Tokenization is sensitive to whitespace: a tracked string may need a leading space (e.g., `" Texas"` rather than `"Texas"`) to match the token the model actually predicts.
- Use fixed seeds and consistent hardware when comparing runs.
- When changing model checkpoints, ensure `--transcoder` remains compatible with `circuit-tracer`.
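
For the fixed-seed recommendation, a helper along the following lines is one option. It is a sketch, not code from this package: `set_seed` is a hypothetical name, and whether `main.py` already seeds its generators or exposes a seed option is not stated here. NumPy and PyTorch are seeded only if they are installed.

```python
import random


def set_seed(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)  # seed every visible GPU
    except ImportError:
        pass  # PyTorch not installed; nothing to seed
```

Seeding alone does not guarantee bit-identical results on a GPU, since some kernels are nondeterministic, which is why consistent hardware is also recommended.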

## Packaging

### Create a `.tar.gz` (recommended)

PowerShell (from repo root):

```powershell
tar -czf submission_core.tar.gz `
  nanoGCG/main.py `
  nanoGCG/sim_eval.py `
  nanoGCG/nanogcg `
  circuit-tracer
```
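
Outside PowerShell, the same archive can be produced with Python's standard `tarfile` module. The following is a minimal sketch: `make_core_archive` and `CORE_PATHS` are hypothetical names, and the path list simply mirrors the command above.

```python
import tarfile
from pathlib import Path

# Paths relative to the repository root, mirroring the tar command above.
CORE_PATHS = [
    "nanoGCG/main.py",
    "nanoGCG/sim_eval.py",
    "nanoGCG/nanogcg",
    "circuit-tracer",
]


def make_core_archive(repo_root, out_name="submission_core.tar.gz", paths=CORE_PATHS):
    """Write a gzip-compressed tarball of `paths`, keeping their relative layout."""
    root = Path(repo_root)
    missing = [p for p in paths if not (root / p).exists()]
    if missing:
        raise FileNotFoundError(f"missing paths: {missing}")
    out_path = root / out_name
    with tarfile.open(out_path, "w:gz") as tar:
        for p in paths:
            # Directories are added recursively; arcname keeps the relative path.
            tar.add(root / p, arcname=p)
    return out_path
```

Run from the repository root, `make_core_archive(".")` writes `submission_core.tar.gz` next to the two subprojects.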

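### Create a `.zip`

PowerShell (from repo root). Note that `Compress-Archive` stores individually listed files at the archive root rather than under their original directories, so prefer the `.tar.gz` above when the directory layout must be preserved: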

```powershell
Compress-Archive -Path `
  nanoGCG/main.py,`
  nanoGCG/sim_eval.py,`
  nanoGCG/nanogcg,`
  circuit-tracer `
  -DestinationPath submission_core.zip -Force
```

## License

See the `LICENSE` files in the respective subprojects.
