# Stubborn Hallucinations (stubb_hallu)

This repository contains the code for analyzing "Stubborn Hallucinations" in Large Language Models (LLMs). The project focuses on detecting hallucinations that persist under input perturbations (hence "stubborn") and on distinguishing them from ordinary model uncertainty, using gradient-based curvature analysis alongside other uncertainty measures.

## Overview

Stubborn hallucinations occur when an LLM confidently and consistently generates incorrect information, even when the input context is slightly perturbed. This project implements a **Gradient-Based Curvature** method to detect such hallucinations by analyzing how sensitive the model's gradients are to input perturbations.
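
As a rough illustration of the idea only (the repository's actual logic lives in `uncertainty/gradient_utils.py`), the sketch below takes a toy scalar loss, perturbs the input with random noise of norm `epsilon`, and reports how much the gradient changes per unit of perturbation. The names `numerical_grad` and `curvature_score`, the toy losses, and the finite-difference gradient are all assumptions made for this example, not the project's API.

```python
import math
import random


def numerical_grad(loss, x, h=1e-5):
    """Central finite-difference gradient of a scalar loss at point x (a list)."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((loss(up) - loss(down)) / (2 * h))
    return grad


def curvature_score(loss, x, epsilon=0.1, num_perturbations=8, seed=0):
    """Average gradient change per unit of input noise (hypothetical score)."""
    rng = random.Random(seed)
    base = numerical_grad(loss, x)
    total = 0.0
    for _ in range(num_perturbations):
        # Draw a random direction and rescale it to norm epsilon.
        noise = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(n * n for n in noise)) or 1.0
        perturbed = [xi + epsilon * n / norm for xi, n in zip(x, noise)]
        grad = numerical_grad(loss, perturbed)
        diff = math.sqrt(sum((g - b) ** 2 for g, b in zip(grad, base)))
        total += diff / epsilon
    return total / num_perturbations


def flat_loss(x):
    """Toy loss with a gently curved landscape."""
    return 0.05 * sum(v * v for v in x)


def sharp_loss(x):
    """Toy loss with a steeply curved landscape."""
    return 5.0 * sum(v * v for v in x)


point = [1.0, -2.0, 0.5]
flat_score = curvature_score(flat_loss, point)
sharp_score = curvature_score(sharp_loss, point)
```

In this toy setting the score is larger where the loss bends more sharply around the input. How the project maps such a score to a "stubborn" or "uncertain" verdict is defined by its own code, not by this sketch.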

We compare this method against standard uncertainty baselines like `p_true`, Predictive Entropy, and Semantic Entropy.
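
For readers unfamiliar with the entropy baselines, here is a minimal sketch of how they are commonly estimated from sampled answers. It assumes each sampled answer is a distinct string with a known sequence log-probability, and that meaning clusters are already given as integer ids (in practice they would come from an entailment model). The function names are hypothetical and are not taken from `compute_uncertainty_measures.py`.

```python
import math
from collections import defaultdict


def predictive_entropy(log_probs):
    """Monte Carlo estimate: mean negative log-likelihood of sampled answers."""
    return -sum(log_probs) / len(log_probs)


def semantic_entropy(log_probs, cluster_ids):
    """Entropy over meaning clusters after pooling probability mass per cluster."""
    mass = defaultdict(float)
    for lp, cid in zip(log_probs, cluster_ids):
        mass[cid] += math.exp(lp)
    total = sum(mass.values())
    # Normalize the pooled mass so the cluster probabilities sum to one.
    return -sum((m / total) * math.log(m / total) for m in mass.values())


# Four sampled answers with equal likelihood; three of them share a meaning.
log_probs = [math.log(0.2)] * 4
pe = predictive_entropy(log_probs)
se_mixed = semantic_entropy(log_probs, [0, 0, 0, 1])
se_single = semantic_entropy(log_probs, [0, 0, 0, 0])
```

When all samples fall into one meaning cluster the semantic entropy drops to zero, even though the token-level predictive entropy stays the same.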

### Key Features
*   **Gradient-Based Detection**: Measures the "curvature" of the loss landscape to identify stubborn errors.
*   **Input Perturbations**: Supports both template-based perturbations and neural paraphrasing (using Pegasus).
*   **Key Phrase Masking**: Optionally focuses the analysis on specific NER-detected entities in the answer.
*   **Uncertainty Baselines**: Implements `p_true`, Predictive Entropy, and Semantic Entropy.
*   **Flexible Model Support**: Supports Llama-2, Llama-3, Falcon, Mistral, and others via HuggingFace.

## Installation

1.  **Create Environment**
    It is recommended to use Conda.
    ```bash
    conda env create -f environment.yaml
    conda activate stubb_hallu
    ```
    *Note: Requires Python 3.11 and PyTorch with CUDA support.*

2.  **Setup Environment Variables**
    Set your HuggingFace token and any required paths before running. Depending on your setup, you may need to edit `run.py`, export environment variables, or configure a `wandb` API key.
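
A quick pre-flight check can catch missing credentials before a long run. The sketch below assumes the conventional variable names `HF_TOKEN` (read by `huggingface_hub`) and `WANDB_API_KEY` (read by `wandb`). Whether `run.py` needs both, or reads other variables as well, is an assumption to verify against the script. The helper `missing_env_vars` is hypothetical.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Assumed variable names; adjust to whatever your setup actually requires.
missing = missing_env_vars(["HF_TOKEN", "WANDB_API_KEY"])
if missing:
    print("Set these before running run.py:", ", ".join(missing))
```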

## Project Structure

*   `run.py`: Main entry point for experiments. Handles answer generation, gradient computation, and perturbation loops.
*   `compute_uncertainty_measures.py`: Script to compute standard uncertainty baselines (entropy, p_true, etc.) post-generation.
*   `analyze_results.py`: Tools for analyzing and plotting results (AUROC, etc.).
*   `uncertainty/`: Core package containing:
    *   `models/`: HuggingFace model wrappers.
    *   `utils/`: Utilities for logging, metrics (SQuAD, BERTScore), and data handling.
    *   `gradient_utils.py`: Logic for computing gradients and curvature scores.
*   `config/`: Configuration files (e.g., `perturbations.json`).
*   `stubb_dataset/`: Directory for datasets.
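
`analyze_results.py` is listed above as reporting AUROC. As a reminder of what that number means for a detector, AUROC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below is illustrative only: the function name and the label convention (1 = hallucination, 0 = correct) are assumptions, not taken from the script.

```python
def auroc(scores, labels):
    """Probability that a positive (label 1) outscores a negative (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Detector scores for two hallucinated and two correct answers.
example = auroc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

The pairwise loop is quadratic in the number of examples, which is fine for a few hundred samples. A rank-based implementation is preferable for larger evaluations.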

## Usage

### Running an Experiment

The primary script is `run.py`. You can run a gradient-based analysis or a baseline generation.

#### 1. Gradient-Based Analysis
Run the model to generate answers and compute curvature scores using gradients.

```bash
python run.py \
    --method gradient \
    --model_name Meta-Llama-3-8B \
    --gradient_target last_transformer_block \
    --use_embedding_noise_perturbation \
    --embedding_noise_epsilon 0.1 \
    --dataset squad \
    --num_perturbations 1 \
    --metric bertscore \
    --num_samples 400 \
    --use_key_phrase_masking \
    --no-get_training_set_generations
```

**Key Arguments:**
*   `--method`: `gradient` (for curvature) or `baseline` (for standard uncertainty).
*   `--gradient_target`: layer to compute gradients on (e.g., `lm_head`, `last_transformer_block`).
*   `--use_paraphrase_perturbation`: uses a paraphraser model instead of fixed templates.
*   `--use_key_phrase_masking`: masks non-entity tokens in the target to focus gradients on key phrases.
*   `--num_few_shot`: number of few-shot examples (default: 5).
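
To make the key phrase masking option more concrete, one common way to restrict a loss to selected tokens is to replace every other label with `-100`, the default `ignore_index` of `torch.nn.CrossEntropyLoss`. The sketch below assumes that token character offsets and entity character spans are available. The function name and inputs are hypothetical, and `run.py` may implement the masking differently.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_non_entity_labels(token_ids, token_offsets, entity_spans):
    """Keep labels only for tokens that overlap an entity character span."""
    labels = []
    for tok_id, (start, end) in zip(token_ids, token_offsets):
        in_entity = any(start < e_end and end > e_start
                        for e_start, e_end in entity_spans)
        labels.append(tok_id if in_entity else IGNORE_INDEX)
    return labels


# Answer "born in Paris" with made-up token ids and character offsets.
token_ids = [101, 102, 103]
token_offsets = [(0, 4), (5, 7), (8, 13)]
entity_spans = [(8, 13)]  # "Paris", as an NER model might report it
labels = mask_non_entity_labels(token_ids, token_offsets, entity_spans)
```

With labels built this way, only the entity tokens contribute to the loss, so gradients reflect the key phrase and not the surrounding filler words.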

#### 2. Baseline Uncertainty Computation
To run the standard uncertainty metrics (Semantic Entropy, `p_true`):

```bash
python run.py \
    --model_name "meta-llama/Llama-2-7b-chat-hf" \
    --dataset "trivia_qa" \
    --method "baseline" \
    --compute_uncertainties
```

This generates answers and then automatically runs `compute_uncertainty_measures.py`.

### Metrics & Evaluation

Three metrics are available for scoring generated answers against the references:
*   `squad`: SQuAD-based F1 (default).
*   `bertscore`: Semantic similarity using BERTScore.
*   `llm`: LLM-based correctness check (entailment).

Add `--metric bertscore` to your run command to use BERTScore.
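
For reference, SQuAD-style F1 is a token-overlap score computed after light text normalization. The sketch below follows the usual recipe (lowercase, strip punctuation and English articles, then compute token-level precision and recall). It is an illustration under those assumptions, and the helper names are hypothetical. The repository's `squad` metric may rely on a library implementation with slightly different details.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def squad_f1(prediction, reference):
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = squad_f1("in Paris, France", "Paris")
```

Because the score depends on exact token overlap, a correct paraphrase can score low, which is one reason to prefer `bertscore` or `llm` for free-form answers.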

## License

[Insert License Here]

## Citation

[Insert Citation Here]
