# Supplementary Material: Analysing Trustworthiness in Small Language Models - A Structural Overview of Hallucination Rates

This supplementary material, intended exclusively for reviewers, provides the complete pipeline for generating and analyzing repeated model responses used in the paper.

## Prerequisites

- **Python**: 3.8 or higher
- **GPU**: CUDA-compatible GPU with at least 16GB VRAM (24GB recommended for larger models)
- **Operating System**: Linux (tested on Ubuntu 20.04+)
- **Hugging Face Account**: Required for accessing gated models

## Installation

1. Install required Python packages:
```bash
pip install transformers torch pandas numpy scipy scikit-learn matplotlib openpyxl tqdm bitsandbytes accelerate anthropic pyarrow nltk sentence-transformers umap-learn networkx
```

2. Add your Hugging Face access token to the `access_token` field of `config.json`.

3. Add your Anthropic API token for Claude access to the configuration (required for hallucination labelling with `detector.py`).


## Configuration Files

- **prompts.xlsx**: Excel file containing prompts and their IDs. Each prompt is repeated for years 2020 and 2022 with the same ID. Structure is easily extensible for additional prompts.

- **model_ids.json**: Maps Hugging Face model names to numerical IDs used in the dataset. Model names must match exactly as they appear on Hugging Face.

- **config.json**: Main configuration file containing:
  - Access token for Hugging Face API
  - Model parameters (padding, quantization settings)
  - Generation parameters (temperature, max_tokens, sampling settings)
  - Prompt template for formatting questions
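
Before launching a run, the two lookups described above can be checked with a few lines of Python. This is a minimal sketch, not part of the shipped scripts: it assumes `model_ids.json` is a flat JSON object mapping model names to IDs and that `access_token` is a top-level key in `config.json`. The helper names `load_model_id` and `load_access_token` are hypothetical.

```python
import json
from pathlib import Path


def load_model_id(model_name, model_ids_path="model_ids.json"):
    """Return the numerical ID for a Hugging Face model name (exact match)."""
    # Assumption: the file is a flat object such as {"org/model": 3, ...}.
    model_ids = json.loads(Path(model_ids_path).read_text(encoding="utf-8"))
    if model_name not in model_ids:
        raise KeyError(f"{model_name!r} is not listed in {model_ids_path}")
    return model_ids[model_name]


def load_access_token(config_path="config.json"):
    """Return the Hugging Face token stored in the `access_token` field."""
    # Assumption: `access_token` sits at the top level of config.json.
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    token = config.get("access_token", "")
    if not token:
        raise ValueError(f"`access_token` is empty or missing in {config_path}")
    return token
```

Because the lookup is an exact string match, a name that differs only in letter case is rejected.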

## Generation Scripts

- **run_generation**: Shell script that orchestrates the response generation process. Reads the configuration files and dispatches to either `generate_responses.py` or `generate_pipeline.py`, depending on the model.

- **generate_responses.py**: Core generation module that handles model loading with 4-bit quantization and response generation. Supports most models in the study.

- **generate_pipeline.py**: Alternative generation script for specific models (phi-4, Qwen2.5-14B) that require different handling.

Run the generation script with the model name and number of responses per prompt:

```bash
./run_generation "google/gemma-2-9b" 50
```

The model name must exactly match the entry in `model_ids.json`.

For background execution with logging (recommended):
```bash
nohup ./run_generation "google/gemma-2-9b" 50 > gemma.log 2>&1 &
```

For each prompt in `prompts.xlsx`, the script generates N responses (specified in the command) and appends them incrementally to the output file. This means:
- Each prompt generates N lines in the output file
- Responses are saved immediately after generation (not batched)
- Each unique prompt_id has multiple records, with `response_index` values 0, 1, 2, ..., N-1
- Generation can be safely interrupted and resumed, because completed responses are already on disk
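
The append-and-resume behaviour listed above can be sketched as follows. This illustrates the general technique and is not the code in `generate_responses.py`. The helper names are hypothetical, and the field names `prompt_id`, `year`, and `response_index` are assumed to match the output schema.

```python
import json
import os


def completed_keys(path):
    """Collect (prompt_id, year, response_index) for records already on disk."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                rec = json.loads(line)
                done.add((rec["prompt_id"], rec.get("year"), rec["response_index"]))
    return done


def append_record(path, record):
    """Append one response as a single JSON line and flush it to disk."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def pending(prompts, num_responses, done):
    """Yield the (prompt_id, year, response_index) triples still to generate."""
    for prompt_id, year in prompts:
        for idx in range(num_responses):
            if (prompt_id, year, idx) not in done:
                yield prompt_id, year, idx
```

On restart, a driver would call `completed_keys` once and then generate only the triples returned by `pending`.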

Generated responses are saved to `responses/<model_name>/dataset.json` in **JSON Lines format** (newline-delimited JSON), despite the `.json` extension. Each line is a complete JSON object representing one generated response, with the following fields:

- ***model_id***: Numerical ID from `model_ids.json` identifying the model
- ***prompt_id***: ID from `prompts.xlsx` identifying the original question
- ***year***: Year variant (2020, 2022, or null if not applicable)
- ***response_index***: Sequential index for this response (0 to num_responses-1)
- ***response***: The text generated by the model
- ***prompt***: The complete question text sent to the model
- ***temperature***: Sampling temperature used during generation
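
Since the file holds one JSON object per line, it has to be read line by line rather than with a single `json.load` call. The sketch below shows one way to do that and to count the responses stored per prompt. The function names are hypothetical, and the field names are taken from the list above.

```python
import json
from collections import Counter


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def responses_per_prompt(path):
    """Count generated responses for each (prompt_id, year) pair."""
    return Counter((rec["prompt_id"], rec.get("year")) for rec in read_jsonl(path))
```

With pandas installed, `pandas.read_json(path, lines=True)` loads the same file into a DataFrame.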

## Labelling and Embedding Scripts

- **analysis.py**: Main analysis pipeline that reads all generated responses from the `responses/` folder and orchestrates the hallucination labelling process. Calls `detector.py` to evaluate responses and generates statistical summaries and visualizations of the results.

- **detector.py**: Hallucination labelling module that uses Claude Sonnet 4.5 (`claude-sonnet-4-5-20250929`) via the Anthropic API to evaluate response accuracy. Processes responses in batches of 10,000 and saves the results as Parquet files in the `dataset/` folder. **Requires an Anthropic API token** to be added to the configuration.

- **embeddings.py**: Generates stemmed versions of the responses, embeds both the original and the stemmed responses, and builds the final dataset, **icml26_dataset.parquet**.
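
The batching of 10,000 responses per Parquet file mentioned above can be illustrated with a small chunking helper. This is a sketch under stated assumptions, not the logic of `detector.py`: the function names and the file naming scheme are hypothetical, since the README does not specify how the batch files are named.

```python
from itertools import islice

BATCH_SIZE = 10_000  # batch size stated in the detector.py description


def batched(records, size=BATCH_SIZE):
    """Yield consecutive lists of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def batch_filename(index, folder="dataset"):
    """Hypothetical naming scheme for the Parquet file holding batch `index`."""
    return f"{folder}/batch_{index:04d}.parquet"
```

Each chunk could then be written with `pandas.DataFrame(chunk).to_parquet(...)`, which relies on `pyarrow` from the installation step.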


## Analysis Scripts

This directory contains the complete analysis pipeline and visualization notebooks used to generate the results and figures in the paper. All notebooks and scripts rely on the shared library module for core functionality.

- **HALL_lib.py**: Central library module containing all core analysis functions, classes, and utilities used across all notebooks. Includes:
  - Data loading and preprocessing utilities (e.g., `loadParquet`, `split_by_label`, `extract_prompt_data`)
  - Structural analysis functions for computing distance distributions, Wasserstein distances, and Fisher discriminant directions
  - Projection classes for dimensionality reduction: `FisherProjection`, `WhitenedPCAProjection`, `RandomProjection`, `SupervisedUMAPProjection`
  - Label propagation framework with `WassersteinLabelPropagator` class for hallucination detection based on Wasserstein distance to training distributions
  - Evaluation utilities including `LabelPropagationEvaluator` for computing classification metrics and margins
  - Experiment runners for systematic studies across models and prompts with caching support
  - Plotting utilities for model ordering and prompt selection
  - Statistical aggregation functions for metrics across prompts, training fractions, and regularization parameters

- **StructuralAnalysis.ipynb**: Analyzes the geometric structure of response embeddings in genuine vs. hallucinated classes. Computes intra-class and inter-class distance distributions, Wasserstein distances (W(GG,HH)), and Fisher projections. Generates visualizations comparing original embedding space with Fisher-projected space, including null model comparisons through permutation testing. Creates heatmaps and statistical summaries of structural separability across all model-prompt pairs.

- **descriptorsAnalysis.ipynb**: Examines structural descriptors and geometric properties of the embedding spaces. Analyzes how well different prompts separate genuine from hallucinated responses using Wasserstein distance metrics. Generates summary statistics and visualizations of descriptor distributions across models and prompts.

- **ProjectorsAnalysis.ipynb**: Comparative evaluation of different dimensionality reduction methods for hallucination detection. Tests Fisher projection, Whitened PCA, Random Projection, and Supervised UMAP across varying numbers of components (1-15). Evaluates classification performance using the Wasserstein label propagation framework. Provides systematic comparison of projection methods to determine optimal approaches for different embedding dimensions.

- **LabelPropagationAnalysis.ipynb**: Comprehensive evaluation of the Wasserstein label propagation detection method. Runs label propagation experiments across all model-prompt pairs using fixed test sets and stratified splits. Computes per-prompt and aggregated classification metrics (accuracy, F1, precision, recall). Generates heatmaps showing prompt-level performance for each model, revealing the types of questions for which hallucinations are easier or harder to detect.

- **LambdaSensitivityAnalysis.ipynb**: Systematic sensitivity analysis of the Fisher regularization parameter λ. Tests λ values over logarithmic and fine-grained linear grids (10⁻⁴ to 10²) to capture the transition from unstable Fisher directions to over-regularized projections. Evaluates classification performance as a function of λ across all model-prompt combinations. Generates plots showing optimal regularization regimes and validates the choice of λ=1.2 used throughout the study.

- **TrainingSizeAnalysis.ipynb**: Studies the impact of training set size on hallucination detection performance. Evaluates label propagation using training fractions from 5% to 100% in 5% increments. Generates learning curves showing F1 score vs. number of training samples for each model. Analyzes how detection accuracy scales with available data and determines minimum viable training set sizes.

- **Pipeline.ipynb**: Main visualization and figure generation pipeline. Orchestrates the creation of publication-quality figures combining structural analysis, projection comparisons, and detection performance results. Includes t-SNE visualizations, distance distribution plots, network graphs showing response similarity, and comprehensive multi-panel figures. Handles LaTeX formatting and consistent styling across all plots for the paper.

- **megaNotebook.ipynb**: Master notebook that combines all analysis components in a single workflow, from data loading through final figure generation. Includes the structural analysis, descriptor computation, projection evaluations, label propagation experiments, and visualization code, so that all paper results and figures can be reproduced in one execution.
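
The structural quantities named in the notebook descriptions above can be sketched in a few lines of NumPy: intra-class distance distributions, the Wasserstein distance W(GG,HH) between them, and a Fisher direction regularized by λ. This is a minimal illustration, not the implementation in `HALL_lib.py`. It assumes Euclidean distances, a quantile-based approximation of the 1-D Wasserstein distance, and a regularizer of the form S_w + λI. All function names are hypothetical.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distances between all distinct pairs of rows in X.

    Builds an (n, n, d) array, so it is only suitable for small n.
    """
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return dist[np.triu_indices(len(X), k=1)]


def wasserstein_1d(a, b, n_quantiles=200):
    """Approximate 1-Wasserstein distance between two 1-D samples.

    Averages the absolute difference of the two empirical quantile
    functions on a shared midpoint grid.
    """
    q = (np.arange(n_quantiles) + 0.5) / n_quantiles
    return float(np.mean(np.abs(np.quantile(a, q) - np.quantile(b, q))))


def fisher_direction(G, H, lam=1.2):
    """Unit-norm Fisher direction between classes G and H.

    Assumption: regularization adds lam * I to the within-class scatter.
    """
    mu_g, mu_h = G.mean(axis=0), H.mean(axis=0)
    Sw = (G - mu_g).T @ (G - mu_g) + (H - mu_h).T @ (H - mu_h)
    w = np.linalg.solve(Sw + lam * np.eye(Sw.shape[0]), mu_g - mu_h)
    return w / np.linalg.norm(w)
```

Under these assumptions, W(GG,HH) for one model-prompt pair would be `wasserstein_1d(pairwise_distances(G), pairwise_distances(H))`, where `G` and `H` hold the embeddings of genuine and hallucinated responses.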


## Workflow Example

1. **Setup**: Ensure your Hugging Face token is in `config.json`
2. **Configure**: Verify model is listed in `model_ids.json`
3. **Generate**: Run `./run_generation "model/name" <num_responses>`
4. **Output**: Responses are saved to `responses/<model_name>/dataset.json`
5. **Analyze**: Run `analysis.py` to process all responses through hallucination labelling
6. **Results**: Detected hallucinations are saved as Parquet files in the `dataset/` folder (batches of 10,000 responses each)
7. **Embedding**: Run `embeddings.py` to stem the responses, generate embeddings, and create the final **icml26_dataset.parquet**
8. **Structural Analysis**: Open `/StructuralAnalysis.ipynb` to compute geometric descriptors, Wasserstein distances, and Fisher projections across all model-prompt pairs
9. **Detection Evaluation**: Use `/LabelPropagationAnalysis.ipynb` to evaluate hallucination detection performance using the Wasserstein label propagation method
10. **Parameter Tuning**: Run `/LambdaSensitivityAnalysis.ipynb` and `/TrainingSizeAnalysis.ipynb` to optimize regularization parameters and determine minimal training requirements
11. **Method Comparison**: Execute `/ProjectorsAnalysis.ipynb` to compare different dimensionality reduction approaches (Fisher, PCA, Random Projection, UMAP)
12. **Figure Generation**: Use `/Pipeline.ipynb` or `/megaNotebook.ipynb` to generate all publication-quality figures and visualizations
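
The evaluation steps in this workflow (stratified training subsets at 5% increments, then accuracy, precision, recall, and F1) can be sketched with the standard library alone. This is an illustration of the general procedure, not the notebooks' code. The function names are hypothetical, and treating label `1` as the positive (hallucinated) class is an assumption.

```python
import random
from collections import defaultdict


def training_fractions():
    """Fractions 0.05, 0.10, ..., 1.00 in 5% steps."""
    return [round(0.05 * k, 2) for k in range(1, 21)]


def stratified_subsample(labels, fraction, seed=0):
    """Pick indices so each class keeps roughly `fraction` of its samples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    chosen = []
    for idxs in by_class.values():
        k = max(1, round(fraction * len(idxs)))  # keep at least one per class
        chosen.extend(rng.sample(idxs, k))
    return sorted(chosen)


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}
```

A learning curve would then evaluate `binary_metrics` on a fixed test set once per value in `training_fractions()`.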


## Troubleshooting

**CUDA Out of Memory**: Reduce batch size or use a smaller model. All models use 4-bit quantization to minimize memory usage.

**Invalid Token Error**: Ensure your Hugging Face token has access to gated models and is correctly set in `config.json`.

**Model Not Found**: Verify the model name exactly matches the Hugging Face repository name and is listed in `model_ids.json`.



