### Probing and Hidden‑State Editing for Language Models

This repository contains two standalone Python scripts for analyzing and steering language models via their hidden states:

- probing.py — End‑to‑end benchmarking of hidden‑state interventions on causal LMs (HF Transformers) using LGD-style CSVs or the CausalGym dataset. Trains probes, applies interventions (GBI, INLP, AlterRep, HDMI, Null), and reports accuracy and TV‑based metrics.
- text-editing.py — Hidden‑state steering and differentiable decoding utilities for LLaMA‑architecture models, including a counterfactual generator (HDMI) that optimizes the hidden state step‑by‑step to follow an edited target text.

Both scripts run locally and use Hugging Face Transformers.

#### Features

- Train linear/MLP probes on LM hidden states.
- Apply interventions: GBI (FGSM/PGD), INLP, AlterRep, Nullspace erasure, HDMI (head‑driven hidden edits).
- Evaluate with task accuracy and TV‑based completeness/selectivity/reliability metrics.
- Generate counterfactual text by steering hidden states toward target tokens using a soft, differentiable rollout (HDMI).


---

### Requirements

- Python 3.9+
- A GPU is recommended (CUDA or Apple Silicon/MPS)
- Packages:
  - torch
  - transformers
  - numpy
  - pandas
  - sentencepiece (needed by many LLaMA‑style tokenizers)
  - accelerate (optional, recommended)
  - datasets (only if loading CausalGym from Hugging Face)

Install:

```bash
pip install torch transformers numpy pandas sentencepiece accelerate datasets
```

If you plan to use gated Hugging Face models (e.g., Llama‑3), set an access token:

- Environment: export HUGGINGFACE_HUB_TOKEN=your_token
- Or pass --hf-token your_token (text-editing.py)

---

### File: probing.py

#### Overview

A probing and intervention benchmark for causal LMs:

- Datasets:
  - LGD CSV with columns: text, verb_sg, verb_pl, Zc, Ze
  - CausalGym (Hugging Face “aryaman/causalgym”) or a local JSON directory with train.json/dev.json/test.json
    - For local JSON, each record should contain: base, src, base_label, src_label, base_type, src_type, task
- Probes:
  - Interventional Zc probe trained on an interventional split
  - Validation probes vZc (and optional vZe) trained on a disjoint validation‑probe split
- Interventions:
  - gbi: gradient‑based intervention (FGSM/PGD; L∞/L2)
  - inlp: iterative nullspace projection
  - alterrep: row‑space push along INLP directions
  - null: simple nullspace erasure via the probe’s final linear layer
  - hdmi: head‑driven hidden edit toward a target token
- Metrics:
  - baseline_task_acc, after_task_acc, delta_task_acc
  - completeness (TV closeness to a goal distribution)
  - selectivity (invariance of auxiliary Ze if available)
  - reliability (harmonic mean of completeness and selectivity)
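
As a rough illustration of how these metrics relate, the sketch below assumes that "TV closeness" means one minus the total-variation distance between the post-intervention distribution and the goal distribution. The function names are hypothetical, and probing.py may aggregate the quantities differently.

```python
def tv_distance(p, q):
    """Total-variation distance between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same support size")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def completeness(after, goal):
    # Assumption: closeness is defined as 1 - TV(after, goal).
    return 1.0 - tv_distance(after, goal)


def reliability(comp, sel):
    # Harmonic mean of completeness and selectivity.
    if comp + sel == 0:
        return 0.0
    return 2.0 * comp * sel / (comp + sel)


# Toy two-class example: the intervention moved 70% of the mass to the goal class.
after = [0.7, 0.3]
goal = [1.0, 0.0]
c = completeness(after, goal)
print(round(c, 4), round(reliability(c, 0.9), 4))
```

The harmonic mean penalizes an intervention that scores well on only one of the two axes, which is presumably why it is used to combine them.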

Internals (at a glance): Extract last‑token hidden states at a chosen layer, train the interventional probe and validation probes, then for each test sample compute baseline log‑likelihoods for gold vs. alt continuations. Apply the chosen intervention to the prompt hidden state, recompute the first‑step term, and aggregate metrics.
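
The gold-versus-alt comparison can be pictured with the following sketch. It assumes a sample counts as correct when the gold continuation has the higher log-likelihood; the helper name and the toy numbers are illustrative and not taken from probing.py.

```python
def task_accuracy(pairs):
    """Fraction of samples where the gold continuation outscores the alt one.

    `pairs` is a list of (gold_loglik, alt_loglik) tuples.
    """
    if not pairs:
        raise ValueError("need at least one sample")
    correct = sum(1 for gold, alt in pairs if gold > alt)
    return correct / len(pairs)


# Made-up log-likelihoods before and after an intervention.
baseline = [(-2.1, -1.8), (-0.9, -1.5), (-3.0, -2.7), (-1.2, -1.9)]
after = [(-1.7, -1.8), (-0.9, -1.5), (-3.0, -2.7), (-1.1, -2.0)]

baseline_acc = task_accuracy(baseline)
after_acc = task_accuracy(after)
delta = after_acc - baseline_acc
print(baseline_acc, after_acc, delta)
```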

#### Quick start

- CausalGym via Hugging Face:

```bash
python probing.py \
  --dataset_type causalgym \
  --model_name meta-llama/Meta-Llama-3-8B-Instruct \
  --cg_tasks agr_sv_num_subj-relc \
  --intervention gbi \
  --gbi_attack pgd --gbi_norm linf \
  --epsilon 10 --pgd_steps 40 \
  --layer_idx -1 \
  --max_samples 200
```

- LGD CSV:

```bash
python probing.py \
  --dataset_type lgd \
  --data_csv ./lgd_equiv_sva.csv \
  --model_name meta-llama/Meta-Llama-3-8B-Instruct \
  --intervention alterrep \
  --inlp_rank 16 --alterrep_alpha 0.1 \
  --layer_idx -1
```

Tip: Use --device auto (default) to pick CUDA, MPS, or CPU automatically.

#### Common arguments

- Dataset/model:
  - --dataset_type [lgd|causalgym]
  - --data_csv path/to.csv (LGD)
  - --cg_local_json_dir /path/to/json/dir (CausalGym local)
  - --cg_tasks task1 task2 ... (optional filter)
  - --model_name EleutherAI/pythia-70m
  - --device [auto|cuda|cpu|mps]
  - --layer_idx -1 (final; negative indices allowed)
  - --max_samples N (subsample for speed)
- Probes:
  - --probe_epochs, --probe_lr, --probe_wd, --probe_hidden, --probe_batch_size
  - --valprobe_epochs, --valprobe_lr, --valprobe_wd, --valprobe_batch_size
  - --valprobe_hidden_grid 0 64 256 512
- Interventions:
  - --intervention [none|gbi|inlp|alterrep|null|hdmi]
  - GBI: --epsilon, --pgd_steps, --gbi_attack [fgsm|pgd], --gbi_norm [linf|l2], --gbi_step_size
  - INLP: --inlp_rank, --inlp_epochs, --inlp_lr, --inlp_wd, --inlp_batch_size
  - AlterRep: --alterrep_alpha
  - Null: --erase_strength
  - HDMI: --hdmi_alpha, --hdmi_inner_steps, --hdmi_use_margin, --hdmi_normalize_grad, --hdmi_grad_clip_norm
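
To make the roles of the GBI arguments concrete, here is a minimal pure-Python sketch of PGD under an L∞ constraint against a logistic probe. It is an assumption-laden illustration: the real script presumably uses autograd and may target MLP probes, and the names `pgd_linf`, `step_size`, and `steps` are stand-ins for --gbi_step_size and --pgd_steps rather than the script's own identifiers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def probe_logit(w, b, h):
    return sum(wi * hi for wi, hi in zip(w, h)) + b


def sign(x):
    return (x > 0) - (x < 0)


def pgd_linf(h0, w, b, target, epsilon, step_size, steps):
    """Move h toward the probe's `target` class inside an L-inf ball around h0."""
    h = list(h0)
    for _ in range(steps):
        p = sigmoid(probe_logit(w, b, h))
        # Gradient of binary cross-entropy w.r.t. h for a logistic probe.
        grad = [(p - target) * wi for wi in w]
        # Signed descent step; FGSM is the single-step case with step_size = epsilon.
        h = [hi - step_size * sign(g) for hi, g in zip(h, grad)]
        # Project back onto the L-inf ball of radius epsilon around h0.
        h = [min(max(hi, h0i - epsilon), h0i + epsilon) for hi, h0i in zip(h, h0)]
    return h


h0 = [0.5, -0.2, 0.1]
w, b = [1.0, -2.0, 0.5], 0.0
h_adv = pgd_linf(h0, w, b, target=0.0, epsilon=0.3, step_size=0.1, steps=5)
print(probe_logit(w, b, h0), probe_logit(w, b, h_adv))
```

A larger epsilon allows a bigger edit to the hidden state, while more steps with a smaller step size trace the constraint boundary more carefully.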

See all options:

```bash
python probing.py -h
```

#### Output

A JSON summary like:

```json
{
  "N": 120,
  "baseline_task_acc": 0.5083,
  "after_task_acc": 0.5833,
  "delta_task_acc": 0.075,
  "completeness": 0.6124,
  "selectivity": 0.9341,
  "reliability": 0.7392,
  "intervention": "gbi",
  "epsilon": 0.112,
  "pgd_steps": 40,
  "gbi_attack": "pgd",
  "gbi_norm": "linf",
  "layer_idx": -1,
  "model": "EleutherAI/pythia-70m",
  "dataset_type": "causalgym",
  "cg_tasks": ["agr_sv_num_subj-relc", "agr_sv_num_obj-relc"]
}
```

Notes:

- LGD CSV must contain: text, verb_sg, verb_pl, Zc, Ze. Labels are normalized internally.
- For CausalGym, Ze is computed via a preposition-family heuristic (NONE/OF/IN/WITH_OR_BY/OTHER). If Ze is unavailable or imbalanced, selectivity defaults to 1.0, so reliability is determined by completeness alone.
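
Before a long run it can help to confirm that an LGD CSV has the required header. The check below is a small standalone sketch using only the standard library; `missing_lgd_columns` is a hypothetical helper, not part of probing.py, and the sample row uses invented label values.

```python
import csv
import io

REQUIRED_COLUMNS = ("text", "verb_sg", "verb_pl", "Zc", "Ze")


def missing_lgd_columns(csv_file):
    """Return the required LGD columns absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Toy in-memory CSV; for a real file use open(path, newline="") instead.
sample = io.StringIO(
    "text,verb_sg,verb_pl,Zc,Ze\n"
    "The keys to the cabinet,is,are,pl,OF\n"  # illustrative values only
)
print(missing_lgd_columns(sample))
```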

---

### File: text-editing.py

#### Overview

Hidden‑state steering and differentiable decoding utilities for LLaMA‑architecture models:

- Differentiable decoding (“soft relaxation”): Advances the model with an expected input embedding E[y] from softened logits, enabling gradient‑based hidden edits on the fly.
- Counterfactual generation (HDMI): Given factual and edited texts, it aligns positions where the sequences differ and, at each decode step, optimizes the current hidden state to increase future target token scores (optionally decreasing source) across a short soft rollout.
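
The soft relaxation can be illustrated without any model. The sketch below computes an expected embedding E[y] as the softmax-weighted average of embedding rows, using plain Python lists and a toy three-token vocabulary. In the actual script this is presumably a tensor operation over the model's embedding matrix, and the function names here are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_embedding(logits, embedding_table, temperature=1.0):
    """E[y] = sum over v of softmax(logits / tau)[v] * embedding_table[v]."""
    probs = softmax(logits, temperature)
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_table))
            for d in range(dim)]


# Toy vocabulary of three tokens with 2-d embeddings.
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [2.0, 0.0, 0.0]
soft = expected_embedding(logits, table, temperature=1.0)
sharp = expected_embedding(logits, table, temperature=0.05)
print(soft, sharp)
```

As the temperature shrinks, the expected embedding approaches the embedding of the argmax token, so the soft rollout approximates hard decoding while staying differentiable.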

The default main() demonstrates the HDMI approach and prints the generated counterfactual text.

#### Quick start

```bash
python text-editing.py \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct \
  --prompt "Tell me a story" \
  --factual-text "Tell me a story about a girl who loves the sun." \
  --edited-text "Tell me a story about an owl who loves the sun." \
  --alpha 50 \
  --f-reg 0.2 \
  --max-new-tokens 200
```

If the primary model is gated or fails to load, the script automatically falls back to the model given by --fallback.

Useful flags:

- --model-id, --fallback, --hf-token
- --device [cuda|cpu] (auto if omitted)
- --prompt, --factual-text, --edited-text
- --alpha (outer step scale for hidden edits)
- --f-reg (adds a factual score term at the current step)
- --max-new-tokens

For small‑memory tests:

```bash
python text-editing.py --factual-text "Today, he was upset and left the room" --edited-text "Today, he were upset" --prompt "Today" --f-reg 0.3
```

#### What’s inside

- LLaMA helpers:
  - llama_step_llm: One step with hard tokens; returns logits, last hidden, cache.
  - _llama_step_inputs_embeds: One step with an expected embedding (soft update).
- Soft relaxation:
  - _softmax_expected_embedding: y = softmax(logits/τ) and expected embedding E[y] without duplicating embedding weights in float32.
  - generate_factual_soft: Advances with E[y]; displays either greedy or sampled tokens.
- Counterfactual HDMI:
  - generate_counterfactual_hdmi: At each step, builds a short differentiable rollout from the current hidden state, accumulates a future objective over edited positions, computes ∂J/∂h, updates h, then advances with E[y].
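
One way to picture the "edited positions" is as the spans where the factual and edited token sequences disagree. The sketch below finds them with the standard library's `difflib.SequenceMatcher` on whitespace tokens. This is an assumption for illustration only: the script works on model tokenizer IDs and may align sequences differently, and `edited_spans` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def edited_spans(factual_tokens, edited_tokens):
    """Return (tag, factual_slice, edited_slice) for every non-matching span."""
    matcher = SequenceMatcher(a=factual_tokens, b=edited_tokens, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((tag, factual_tokens[i1:i2], edited_tokens[j1:j2]))
    return spans


factual = "Tell me a story about a girl who loves the sun .".split()
edited = "Tell me a story about an owl who loves the sun .".split()
print(edited_spans(factual, edited))
```

Only the tokens inside such spans would contribute target (and optionally source) terms to the future objective; matching positions are left alone.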

Tips:

- Newer Transformers introduce cache classes; the script includes compatibility helpers and fallbacks.
- If you hit OOM, reduce --max-new-tokens or use a smaller model.


---

### Reproducibility

- Both scripts set seeds where relevant, but full determinism on GPU can depend on hardware/driver.
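
For reference, a typical seeding helper looks like the sketch below. It is not copied from either script; `set_seed` is a hypothetical name, and NumPy and PyTorch are seeded only when they are installed.

```python
import random


def set_seed(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


set_seed(123)
first = random.random()
set_seed(123)
second = random.random()
print(first == second)
```

Seeding makes the random streams repeatable, but some GPU kernels are nondeterministic regardless, so small run-to-run differences can remain.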



