# HMNS: Head-Masked Nullspace Steering — README

This repository contains a reference implementation of **Head-Masked Nullspace Steering (HMNS)**, a mechanism-grounded jailbreak method that (1) attributes causal influence to attention heads via masked ablations, (2) **masks** the out-projection of the most causal heads, and (3) injects a small **nullspace-constrained** steering vector at inference time.

---

## 1) Quick start

### Requirements

* Python 3.9+
* PyTorch ≥ 2.2 (CUDA recommended)
* Hugging Face `transformers` ≥ 4.41, `accelerate`, `huggingface_hub`
* `tqdm`, `numpy`, `pandas`, `scipy` (optional for analysis)

```bash
python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate
pip install -U pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121  # pick your CUDA
pip install transformers==4.41.0 accelerate huggingface_hub tqdm numpy pandas scipy
```

### Models (HF)

The code supports open-weight models via Hugging Face:

* Meta Llama-2-7B-Chat: `meta-llama/Llama-2-7b-chat-hf`
* Microsoft Phi-3-Medium-4K-Instruct: `microsoft/Phi-3-medium-4k-instruct`
* Meta Llama-3.1-70B: `meta-llama/Llama-3.1-70B`

> You must accept any model licenses on huggingface.co with your account and configure a token:
>
> ```bash
> huggingface-cli login
> ```
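
Aliases map to the HF repo IDs above. As a minimal sketch of the kind of lookup `models.py` performs (the table and function names here are our illustration, not the repo's actual API):

```python
# Hypothetical alias table mirroring the supported models above.
# The dict and function names are illustrative, not the repo's API.
MODEL_ALIASES = {
    "llama2-7b-chat": "meta-llama/Llama-2-7b-chat-hf",
    "phi3-medium-4k": "microsoft/Phi-3-medium-4k-instruct",
    "llama3.1-70b": "meta-llama/Llama-3.1-70B",
}

def resolve_alias(alias: str) -> str:
    """Return the Hugging Face repo ID for a CLI model alias."""
    try:
        return MODEL_ALIASES[alias]
    except KeyError:
        raise ValueError(f"unknown model alias: {alias!r}") from None
```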

---

## 2) Repository layout

```
.
├── main.py                 # entry point: loads model/data, runs HMNS, saves results
├── models.py               # model aliases, caching, tokenizer+model loader
├── method.py               # HMNS core: attribution, masking, nullspace steering
├── datasets.py             # dataset loader + fixed analysis/dev/test split utilities
├── runs/                   # outputs (JSONL per-example + summary.json)
└── README.md               # this file
```

---

## 3) Running HMNS

### Minimal run

```bash
python main.py \
  --model_alias llama2-7b-chat \
  --split test \
  --out_dir runs/llama2-7b_test
```

### Common options

* `--model_alias {llama2-7b-chat, phi3-medium-4k, llama3.1-70b}`
* `--split {analysis, dev, test}` (fixed three-way split for consistency)
* `--limit N` (use first N items for a smoke test)
* `--dtype {bfloat16,float16,float32}` (default: bfloat16)
* `--device_map auto` (shards the model across available GPUs)
* `--topk_heads 10` (global top-K heads per loop)
* `--attempts 3` (closed-loop attribution→steer attempts)
* `--alpha_base 0.25 --alpha_step 0.10` (steer schedule: $\alpha_t=\alpha_{\text{base}}(1+ \alpha_{\text{step}}\,(t-1))$)
* `--prefer_hf` (pull benchmarks from HF Datasets if integrated)
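
The steer schedule above is a simple linear ramp in the attempt index; as a sketch (defaults mirror the CLI flags):

```python
def alpha_t(t: int, alpha_base: float = 0.25, alpha_step: float = 0.10) -> float:
    """Steering scale at attempt t (1-indexed):
    alpha_base * (1 + alpha_step * (t - 1))."""
    return alpha_base * (1.0 + alpha_step * (t - 1))
```

With the defaults, attempts 1, 2, 3 use scales 0.25, 0.275, 0.30.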

Example (Phi-3, dev split, 200 items):

```bash
python main.py \
  --model_alias phi3-medium-4k \
  --split dev \
  --limit 200 \
  --out_dir runs/phi3_dev
```

---

## 4) What HMNS runs under the hood

* **Attribution**: for each input and loop attempt, HMNS masks one head at a time (its out-projection slice) and measures the distributional impact (KL divergence or a cheaper proxy) to rank heads. Shortlisting may use a fast **proxy pre-selection** before exact KL.
* **Masking**: for the selected global top-K heads, their out-projection columns are zeroed **only** for the current forward (restored afterward).
* **Nullspace steering**: per intervened layer $\ell$, concatenate masked head columns to form $M_\ell$, compute a thin-QR projector in float32, sample $u_\ell \in \mathrm{span}(M_\ell)^\perp$, and inject $\delta_\ell=\alpha\,\mathrm{RMS}(a_\ell)\,u_\ell$ at the final token (post-attention).
* **Closed loop**: repeat attribution→mask+steer→decode up to `--attempts` times or until success (if you include a grader).
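
The nullspace-steering step can be sketched as follows. Function names are our illustration rather than the repo's `method.py` API, but the math follows the description above: thin QR in float32, direction sampled orthogonal to the span of the masked heads' columns, scaled by the activation RMS.

```python
import torch

def nullspace_direction(M: torch.Tensor, generator=None) -> torch.Tensor:
    """Sample a unit vector orthogonal to span(M), where M (d x k)
    stacks the masked heads' out-projection columns."""
    Q, _ = torch.linalg.qr(M.float(), mode="reduced")  # thin QR in float32
    v = torch.randn(M.shape[0], generator=generator)
    v = v - Q @ (Q.T @ v)                              # project out span(M)
    return v / v.norm()

def steering_vector(M: torch.Tensor, a: torch.Tensor, alpha: float,
                    generator=None) -> torch.Tensor:
    """delta = alpha * RMS(a) * u with u in span(M)^perp, where a is the
    post-attention activation at the final token."""
    u = nullspace_direction(M, generator)
    rms = a.float().pow(2).mean().sqrt()
    return alpha * rms * u
```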

---

## 5) Outputs

After a run, you’ll find in `runs/<tag>/`:

* `<model_alias>_<split>.jsonl` — per-example records:

  ```json
  {
    "id": "advbench_001",
    "source": "AdvBench",
    "label": 1,
    "prompt": "...",
    "success": true,
    "attempts": 2,
    "ipc": 32,
    "latency_s": 6.3,
    "flops": 0.53,
    "output": "..."
  }
  ```

  (`attempts` is the ACQ count; `ipc` counts internal forward-equivalent passes.)
* `<model_alias>_<split>_summary.json` — aggregate metrics:

  * **ASR** (success rate), **ACQ** (avg attempts), **IPC** (internal passes),
    **FLOPs** (per success), **Latency** (s).
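
A quick way to inspect the per-example records, e.g. to recompute ASR from the JSONL (an illustrative helper, not part of the repo):

```python
import json

def load_records(path: str) -> list[dict]:
    """Read per-example JSONL records (one JSON object per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def asr(records: list[dict]) -> float:
    """Attack success rate: fraction of records with success == true."""
    return sum(r["success"] for r in records) / len(records)
```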

---

## 6) Hardware notes

* **7B & 14B**: single A100-80GB (bfloat16).
* **70B**: two A100-80GB with `--device_map auto` (model sharded across GPUs).
* For single-GPU fallbacks on very large models (not our primary setup), consider HF 4/8-bit + offload and switch to **activation-slice masking** (steering hook remains post-attention).

---

## 7) Reproducibility

* Determinism: `torch.manual_seed(0)`; CUDA TF32 matmul enabled where available.
* KV cache is **disabled** during attribution and steered decoding to reflect dynamic masking correctly.
* The fixed **analysis/dev/test** split is created in `datasets.py` (see docstrings).
* HMNS configuration is captured in `HMNSConfig` (returned values include success, attempts/ACQ, IPC, latency, FLOPs).
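
A minimal sketch of the seeding/TF32 setup described above (the helper name is ours):

```python
import torch

def setup_run(seed: int = 0) -> None:
    """Seed PyTorch and enable TF32 matmuls where available,
    matching the reproducibility settings described above."""
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.backends.cuda.matmul.allow_tf32 = True
        torch.backends.cudnn.allow_tf32 = True
```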

---

## 8) Troubleshooting

* **Out-of-memory**: reduce sequence length, set `--limit`, lower `--attempts`, or cap `--max_layers`. For 70B, ensure multi-GPU or quantization/offload.
* **Slow QR**: keep `topk_heads` modest (e.g., 10) and ensure QR is run in float32 (already set in code).
* **Masked-weight state leaks**: the code wraps masking in a context manager to restore weights after each probe. If you modify internals, keep this pattern.
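
The restore-after-probe pattern mentioned above can be written as a context manager; a minimal sketch (the repo's actual implementation may differ):

```python
import contextlib
import torch

@contextlib.contextmanager
def masked_out_proj(linear: torch.nn.Linear, cols: list[int]):
    """Zero the given out-projection columns for the duration of the
    block, restoring the original weights on exit (even on error)."""
    saved = linear.weight.data[:, cols].clone()
    try:
        linear.weight.data[:, cols] = 0.0
        yield linear
    finally:
        linear.weight.data[:, cols] = saved
```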

---

## 9) Citation

If you use HMNS, please cite the paper (BibTeX forthcoming).

---

## 10) License

Add your license here (e.g., Apache-2.0/MIT). Ensure you comply with third-party model licenses when downloading from Hugging Face.

---

## 11) Ethical use

This code is released **for research purposes only** to study defense-aware, mechanism-grounded model behavior. Do not use it to cause harm or violate platform policies.
