# MICROSACCADE INSPIRED PROBING: POSITIONAL ENCODING PERTURBATIONS REVEAL LLM MISBEHAVIOURS

<embed src="figures/technique.pdf" type="application/pdf" width="100%" height="600px" />


## 🔧 Environment Setup

To reproduce the experiments, set up the environment from the directory that contains your `CodeDemeanor` clone:

```bash
conda create -n demeanor python=3.10 -y
conda activate demeanor  # activate the new env so the pip installs below land in it
cd CodeDemeanor
git submodule update --init --recursive
pip install -e .
pip install -e transformers
pip install -e SWE-bench
pip install -e mini-swe-agent
```

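These commands:

* Create a clean Python 3.10 environment named `demeanor`.
* Initialize the Git submodules recursively.
* Install the repo and its bundled `transformers`, `SWE-bench`, and `mini-swe-agent` checkouts in *editable* mode, so local changes are picked up immediately.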

---

## ▶️ Running Experiments

Run a probing script with:

```bash
python examples/prober.py --config_path examples/config.yaml
```

### Config file schema

```yaml
# samples_path: "artifacts/backdoor/sleeper_agent.jsonl"
model_name: "meta-llama/Llama-3.2-3B-Instruct"
hidden_layers: [0, 14, 27]
heads: [0, 12, 23]
combine_causal_effects: true
batch_size: 4
max_samples: 1000
results_dir: "results/backdoor/sleeper_agent/"
use_microsaccade_intervention: true
# use_layer_intervention: true
```
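The `hidden_layers` and `heads` values in the example look like the first, middle, and last zero-based indices of a model with 28 decoder layers and 24 attention heads, which is our understanding of the Llama-3.2-3B-Instruct configuration. The repository does not state how these indices were chosen, so the following is only a sketch under that assumption. The helper `first_middle_last` is hypothetical and not part of the codebase; it may be handy when adapting the config to a model of a different size.

```python
def first_middle_last(count):
    """Return the first, middle, and last zero-based indices for `count` items.

    Hypothetical helper: the first/middle/last pattern is an assumption
    inferred from the example config, not documented behaviour.
    """
    if count < 1:
        raise ValueError("count must be a positive integer")
    # A set removes duplicates for very small counts (e.g. count == 1).
    return sorted({0, count // 2, count - 1})


# Assumed sizes for Llama-3.2-3B-Instruct: 28 layers, 24 attention heads.
hidden_layers = first_middle_last(28)
heads = first_middle_last(24)
print(hidden_layers)  # [0, 14, 27]
print(heads)          # [0, 12, 23]
```

For another model, read the layer and head counts from that model's own configuration rather than reusing these numbers.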

**Key flags:**

* `use_microsaccade_intervention`: Enables **MIP** (microsaccade-inspired probing).
* `use_layer_intervention`: Enables **baseline layer interventions** instead.
* Optional flags for **ablations**:

  * `use_gaussian_noise`
  * `use_random_noise`
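Because `use_layer_intervention` is described as an alternative to `use_microsaccade_intervention`, it can help to check a config before launching a long run. The sketch below assumes the two flags are mutually exclusive and that exactly one should be set; how `examples/prober.py` actually handles a config with both or neither is not documented here. The function `select_intervention` is a hypothetical helper, and the config is shown as a plain dictionary, as it might look after loading the YAML file.

```python
def select_intervention(config):
    """Return which intervention mode a config dictionary selects.

    Assumption: exactly one of the two mode flags should be enabled.
    This mirrors the README wording, not a documented rule of prober.py.
    """
    microsaccade = bool(config.get("use_microsaccade_intervention", False))
    layer = bool(config.get("use_layer_intervention", False))
    if microsaccade and layer:
        raise ValueError("enable only one of the two intervention flags")
    if not (microsaccade or layer):
        raise ValueError("enable one of the two intervention flags")
    return "microsaccade" if microsaccade else "layer"


# Values copied from the example config above.
example_config = {
    "model_name": "meta-llama/Llama-3.2-3B-Instruct",
    "hidden_layers": [0, 14, 27],
    "heads": [0, 12, 23],
    "use_microsaccade_intervention": True,
}
print(select_intervention(example_config))  # microsaccade
```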

---

## 📊 Summarising Results

After running experiments, aggregate results with:

```bash
python results/summarise_results.py
```

---
