# SinkProbe

Code for reproducing the experiments in "Attention Sinks as Internal Signals
for Hallucination Detection in Large Language Models" (ICML 2026 submission).

## Overview

SinkProbe is a hallucination detection method based on attention sink scores.
It computes per-token sink scores from attention maps, keeps the top-k values
per head as features, and trains a logistic regression probe on them to classify
hallucinated vs. correct outputs.
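
The feature construction can be sketched in a few lines of plain Python. This is an illustration only: the sink score used here (the average attention a key token receives from the queries that can see it) is an assumption, and the paper's definition (Eq. 1-2, implemented in `hallucinations/features/sink_scores.py`) may differ. The function names are hypothetical.

```python
import heapq


def sink_scores(attn):
    """Average attention each key token receives from the queries that can see it.

    `attn` is one head's causal attention map as a list of rows:
    attn[i][j] is the weight query i puts on key j (zero for j > i).
    NOTE: this definition is an assumption, not the paper's Eq. 1-2.
    """
    n = len(attn)
    scores = []
    for j in range(n):
        received = [attn[i][j] for i in range(j, n)]
        scores.append(sum(received) / len(received))
    return scores


def top_k_features(attn_per_head, k):
    """Concatenate the k largest sink scores of every head into one vector."""
    features = []
    for attn in attn_per_head:
        features.extend(heapq.nlargest(k, sink_scores(attn)))
    return features


# Two toy heads over a 3-token sequence (rows sum to 1, causal mask applied).
head_a = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.1, 0.1]]
head_b = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]
features = top_k_features([head_a, head_b], k=2)
```

The resulting vector has `k` entries per head and would be the input to the logistic regression probe.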

## Requirements

- Python 3.13+
- [uv](https://docs.astral.sh/uv/) package manager
- GPU with CUDA 12.4+ (for LLM inference)

Install uv:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

## Installation

**CPU (for probe training/evaluation):**

```bash
make install_cpu
```

**GPU (for LLM inference + feature extraction):**

```bash
make install_gpu
```

## Reproducing Paper Results

### 1. Generate LLM predictions and attention maps

```bash
uv run python scripts/dataset/generate_activations.py \
    llm=<llm> dataset=<dataset> prompt=<prompt> \
    generation_config=<generation_config> \
    results_dir=<path> random_seed=42
```

Configuration files for each LLM are located in `config/llm/`, dataset configs
in `config/dataset/`, and prompt configs in `config/prompt/qa/`.
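
To run several LLM and dataset combinations, the command above can be generated programmatically. The sketch below only builds and prints the command lines. It assumes that each `key=value` override takes the config file stem (for example `llama_3.2_3b_instruct` for `config/llm/llama_3.2_3b_instruct.yaml`); the dataset names and the `results/` layout are hypothetical placeholders.

```python
import itertools
import shlex


def build_command(llm, dataset, prompt, generation_config, results_dir, seed=42):
    """Return the argv list for one generate_activations.py run."""
    return [
        "uv", "run", "python", "scripts/dataset/generate_activations.py",
        f"llm={llm}",
        f"dataset={dataset}",
        f"prompt={prompt}",
        f"generation_config={generation_config}",
        f"results_dir={results_dir}",
        f"random_seed={seed}",
    ]


# Assumption: override values are the file stems under config/llm/.
llms = ["llama_3.2_3b_instruct", "llama_3.1_8b_instruct"]
# Hypothetical dataset config names; check config/dataset/ for the real ones.
datasets = ["dataset_a", "dataset_b"]

commands = [
    build_command(llm, dataset, "<prompt>", "<generation_config>",
                  f"results/{llm}/{dataset}")  # hypothetical output layout
    for llm, dataset in itertools.product(llms, datasets)
]
for argv in commands:
    print(shlex.join(argv))
```

Each `argv` list could be passed to `subprocess.run(argv, check=True)` once the placeholders are replaced with real config names.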

### 2. Compute internal state features

This step computes attention-based metrics (sink scores, eigenvalues, Lookback
Lens, and MTopDiv) from the stored attention maps:

```bash
uv run python scripts/features/compute_internal_states_metrics.py <dataset_dir>
```
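
As an illustration of one of these metrics, a Lookback Lens style ratio compares the attention a generated token pays to the prompt (context) with the attention it pays to previously generated tokens. The sketch below is a minimal version for a single head and a single decoding step, based on the commonly cited definition. It is not the code in `hallucinations/features/lookback_lens.py`, and the function name is hypothetical.

```python
def lookback_ratio(attn_row, num_context_tokens):
    """Share of (mean) attention that falls on the context for one decoding step.

    `attn_row` holds one head's attention weights from the current token to
    every earlier token; the first `num_context_tokens` entries belong to the
    prompt, the rest to tokens generated so far. Requires a non-empty context.
    """
    context = attn_row[:num_context_tokens]
    generated = attn_row[num_context_tokens:]
    mean_context = sum(context) / len(context)
    # With no generated tokens yet, all attention is on the context.
    mean_generated = sum(generated) / len(generated) if generated else 0.0
    return mean_context / (mean_context + mean_generated)


# Three context tokens followed by two generated tokens.
ratio = lookback_ratio([0.4, 0.2, 0.2, 0.1, 0.1], num_context_tokens=3)
```

Collecting this ratio for every head and decoding step gives the kind of per-example feature matrix that a probe can be trained on.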

### 3. Generate hallucination labels

#### Run LLM-as-judge evaluation

Required for all datasets except GSM8K. Use the
`llm_as_judge/qa_orgad_et_al_eval` prompt for QA datasets and
`llm_as_judge/umwp_eval` for UMWP:

```bash
uv run python scripts/eval/llm_as_judge.py \
    llm_api=gpt_4.1 \
    prompt=llm_as_judge/<prompt> \
    answers_file=<path/to/answers.json>
```

#### Generate labels

For QA datasets (NQ-Open, TriviaQA, SQuADv2, TruthfulQA, HaluEvalQA):

```bash
uv run python scripts/eval/compute_qa_metrics.py --answers-file <path>
uv run python scripts/dataset/generate_labels.py \
    --dataset-dir <path> \
    --llm-judge-prompt <prompt> --llm-judge-llm <llm>
```
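
For reference, the SQuAD-style token F1 listed under `metrics/` is conventionally computed as below. This is a sketch of the standard metric, not the repository's implementation, and it says nothing about how `generate_labels.py` combines the score with the judge's verdict.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def squad_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = squad_f1("The Eiffel Tower in Paris", "Eiffel Tower")
```

Here the prediction contains both reference tokens plus two extra ones, so precision is 0.5, recall is 1.0, and the F1 is 2/3.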

For UMWP (uses LLM-as-judge labels only):

```bash
uv run python scripts/dataset/generate_labels_llm_judge.py --dataset-dir <path>
```

For GSM8K (uses exact answer matching, no LLM judge needed):

```bash
uv run python scripts/dataset/generate_labels_gsm8k.py --dataset-dir <path>
```
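
Exact answer matching on GSM8K usually means comparing the final number in the model output with the reference answer, which GSM8K stores after a `####` marker. The sketch below shows one plausible way to do this. Taking the last number in the generation is an assumed heuristic, and `generate_labels_gsm8k.py` may extract answers differently.

```python
import re

# An optional sign, digits with optional thousands separators, optional decimals.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution):
    """GSM8K reference solutions end with '#### <answer>'."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(generation):
    """Assumed heuristic: the last number in the output is the final answer."""
    matches = NUMBER.findall(generation)
    return matches[-1].replace(",", "") if matches else None


def is_hallucination(generation, solution):
    """Label an output as hallucinated when its final number is wrong or missing."""
    predicted = predicted_answer(generation)
    if predicted is None:
        return True
    return float(predicted) != float(reference_answer(solution))


label = is_hallucination("16 - 3 - 4 = 9 eggs, so she makes $18.", "...\n#### 18")
```

Comparing as floats treats `18`, `18.0`, and `1,250.00` vs. `1250` as matches, which avoids penalizing formatting differences.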

### 4. Train hallucination probes (Table 1)

This step trains logistic regression probes with 5-fold cross-validation for all
methods (SinkProbe, AttnEigval, LapEigval, MTopDiv, LookbackLens):

```bash
uv run python scripts/probes/study_probes.py <dataset_dir>
```
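
The 5-fold protocol can be pictured with the dependency-free splitter below. It is a sketch only: whether `study_probes.py` stratifies folds by label, and which seed it uses, are assumptions, and the function name is hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=42):
    """Yield (train_idx, test_idx) pairs with similar class balance per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's shuffled indices round-robin across the folds.
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)

    for k in range(n_folds):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, test_idx


# 20 correct (0) and 10 hallucinated (1) examples.
labels = [0] * 20 + [1] * 10
splits = list(stratified_folds(labels))
```

A probe would be fitted on each `train_idx` and scored on the matching `test_idx`, with the five scores averaged into one number per method.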

### 5. AttentionScore unsupervised baseline

```bash
uv run python scripts/probes/probe_attn_score.py --dataset-dir <path>
```
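
An unsupervised baseline ranks examples by a single scalar score, so it can be evaluated without fitting any parameters. The sketch below computes AUROC for such a ranking. Both the choice of AUROC as the metric and the convention that a higher score means "hallucinated" are assumptions here, not details taken from `probe_attn_score.py`.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. `labels` uses 1 for hallucinated, 0 for correct.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Three of the four (positive, negative) pairs are ordered correctly.
value = auroc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

The pairwise loop is quadratic in the number of examples. It is fine for illustration, while a rank-based implementation is preferable for large datasets.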

### 6. Attention score probe study

This step trains both naive (unsupervised) and logistic regression probes on
attention score and attention log-det features with 5-fold cross-validation:

```bash
uv run python scripts/probes/study_attn_score_probes.py <dataset_dir>
```
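
Under a causal mask each attention map is lower triangular, so its eigenvalues are its diagonal entries and its log-determinant is the sum of their logarithms. The sketch below computes that quantity for one head. Any normalization or aggregation across heads and layers in `hallucinations/features/attn_feats.py` is not shown, and the `eps` clamp is an assumption added to avoid `log(0)`.

```python
import math


def attention_log_det(attn, eps=1e-12):
    """Log-determinant of a lower-triangular attention map.

    For a triangular matrix the determinant is the product of the diagonal,
    so the log-determinant is the sum of the log self-attention weights.
    """
    return sum(math.log(max(attn[i][i], eps)) for i in range(len(attn)))


attn = [
    [1.00, 0.00, 0.00],
    [0.50, 0.50, 0.00],
    [0.25, 0.25, 0.50],
]
log_det = attention_log_det(attn)  # log(1) + log(0.5) + log(0.5)
```

Because only the diagonal is needed, this feature can be computed without an eigendecomposition, even for long sequences.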

## Project Structure

```text
hallucinations/          Core library
  features/              Feature computation
    sink_scores.py       Sink score computation (Eq. 1-2 in paper)
    laplacian.py         Laplacian eigenvalue features
    attn_feats.py        Attention eigenvalue and log-det features
    lookback_lens.py     Lookback lens features
    mtopdiv.py           MTopDiv (topological divergence) features
    attention_weights.py Attention metric computation pipeline
  data/                  Dataset loading and formatting
  llm/                   LLM inference and activation storage
  probe_models/          Logistic regression probes with cross-validation
  metrics/               QA evaluation metrics (SQuAD F1, ROUGE-L)
scripts/                 Experiment scripts for each pipeline stage
config/                  YAML configurations for datasets, LLMs, and prompts
tests/                   Unit tests
```

## Supported LLMs

| Model | Config |
|-------|--------|
| Llama-3.2-3B-Instruct | `config/llm/llama_3.2_3b_instruct.yaml` |
| Phi-3.5-mini-instruct | `config/llm/phi_3.5_mini_instruct.yaml` |
| Llama-3.1-8B-Instruct | `config/llm/llama_3.1_8b_instruct.yaml` |
| Mistral-Nemo-Instruct-2407 | `config/llm/mistral_nemo_2407.yaml` |

## Supported Datasets

GSM8K, UMWP, TruthfulQA, TriviaQA, NQ-Open, SQuADv2, HaluEvalQA

Dataset configurations are in `config/dataset/`.

## Tests

```bash
make test
```

## Development

```bash
make install_cpu_dev  # Install with dev dependencies
make quality          # Run linters and type checker
make fix              # Auto-fix formatting issues
```
