# Reliability Scaling Laws for Quantized Large Language Models

A comprehensive evaluation framework for assessing the reliability of quantized Large Language Models (LLMs).

## Installation

### Conda Environment

```bash
conda env create -f environment.yml
conda activate reliability-scaling-env
```

### Environment Setup

Create a `.env` file in the project root directory with the following environment variables:

```bash
# HuggingFace authentication token (required for model access)
HUGGINGFACE_TOKEN=your_huggingface_token_here

# Base user path for cache and data storage
BASE_USER_PATH=/path/to/your/user/directory

# Data storage path
DATA_PATH=/path/to/your/data/directory

# Project root path
PROJECT_PATH=/path/to/your/project/directory

# HuggingFace Hub cache path
HF_HUB_PATH=/path/to/your/.cache/huggingface/hub/
```
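
Before launching long runs, it can help to confirm that the `.env` file defines every variable listed above. The following is a minimal standard-library sketch, not the repository's own loader (which may rely on a package such as `python-dotenv`); the helper names `parse_env_text` and `missing_vars` are hypothetical.

```python
# Variable names taken from the .env template above.
REQUIRED_VARS = (
    "HUGGINGFACE_TOKEN",
    "BASE_USER_PATH",
    "DATA_PATH",
    "PROJECT_PATH",
    "HF_HUB_PATH",
)


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values, required=REQUIRED_VARS):
    """Return the required names that are absent or empty."""
    return [name for name in required if not values.get(name)]


# Demo on an inline sample; a real check would read the file instead.
sample = "# partial file\nHUGGINGFACE_TOKEN=abc\nDATA_PATH=/data\n"
print(missing_vars(parse_env_text(sample)))
```

To check the real file, pass `pathlib.Path(".env").read_text()` to `parse_env_text` and stop if `missing_vars` returns a non-empty list.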

# Running Reliability Evaluation Experiments

## 1. Dataset Perturbation

Perturb the dataset with character-level or word-level perturbations at a specified intensity:

```bash
python process_datasets.py \
  --dataset_type TRIVIAQA \
  --num_entries 1000 \
  --perturbation_type char_insertion \
  --perturbation_intensity 16 
```

### Parameters

- `--dataset_type`:  
  Dataset used for the evaluation.  
  - `TRIVIAQA`
  - `COQA`
  - `COMMONSENSEQA`

- `--num_entries`:  
  Number of input prompts used for the evaluation (e.g., `1000`)

- `--perturbation_type`:  
  Type of the perturbation applied to the input prompt.  
  **Supported values**:

  **Character-level Perturbations**:
  - `char_insertion`
  - `char_deletion`
  - `char_replacement`
  - `char_swapping`
  - `char_repetition`
  - `char_substitution`
  - `char_insert_noise`
  - `char_LCC`
  - `char_emoji`

  **Word-level Perturbations**:
  - `word_context_aware_insertion`
  - `word_keyword_only`
  - `word_swapping`
  - `word_repeat`
  - `word_internet_slang`
  - `word_phrase_translation`

- `--perturbation_intensity`:  
  Degree of perturbation to apply (e.g., `4`, `16`)
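
To make the perturbation parameters concrete, here is a rough sketch of what a character-insertion perturbation could look like. It is not the code in `process_datasets.py`: the function below is hypothetical, and it assumes the intensity is a percentage of the prompt's character count, which this README does not confirm.

```python
import random
import string


def char_insertion(text, intensity, seed=0):
    """Insert random lowercase letters into `text`.

    ASSUMPTION: `intensity` is read as a percentage of the original
    character count; the repository may define it differently.
    """
    rng = random.Random(seed)  # fixed seed keeps the perturbation reproducible
    n_insert = round(len(text) * intensity / 100)
    chars = list(text)
    for _ in range(n_insert):
        position = rng.randint(0, len(chars))  # inclusive bound, so may append
        chars.insert(position, rng.choice(string.ascii_lowercase))
    return "".join(chars)


prompt = "Who wrote the novel Moby-Dick?"
perturbed = char_insertion(prompt, intensity=16)
print(perturbed)
```

Under this assumption, a higher intensity inserts more characters, while the original prompt always survives as a subsequence of the perturbed one.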

## 2. Reliability Evaluation 

Evaluate reliability on the unperturbed prompts:

```bash
python reliability_eval.py \
  --exp_id "llama3-8b-triviaqa" \
  --model_name "llama3_8b" \
  --dataset_type "TRIVIAQA" \
  --source_type "raw" \
  --num_entries 1000 \
  --generation_strategy "multinomial_sampling" \
  --max_new_tokens 15 \
  --device "cuda"
```

Evaluate reliability on prompts perturbed with character insertion:

```bash
python reliability_eval.py \
  --exp_id "llama3-8b-triviaqa" \
  --model_name "llama3_8b" \
  --dataset_type "TRIVIAQA" \
  --source_type "processed" \
  --num_entries 1000 \
  --generation_strategy "multinomial_sampling" \
  --max_new_tokens 20 \
  --device "cuda" \
  --perturbation_type "char_insertion" \
  --perturbation_intensity 4 
```
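
When comparing several models across perturbation intensities, the command above has to be repeated with different arguments. The sketch below generates those command lines using only the flags shown in this section. The `exp_id` naming scheme and the helper `build_eval_command` are hypothetical, and the sketch assumes the perturbed dataset for each intensity has already been produced by `process_datasets.py`.

```python
import shlex


def build_eval_command(model_name, intensity, perturbation_type="char_insertion"):
    """Return the argv list for one perturbed-prompt evaluation run."""
    # Hypothetical naming scheme; use whatever suits your logging setup.
    exp_id = f"{model_name}-triviaqa-{perturbation_type}-{intensity}"
    return [
        "python", "reliability_eval.py",
        "--exp_id", exp_id,
        "--model_name", model_name,
        "--dataset_type", "TRIVIAQA",
        "--source_type", "processed",
        "--num_entries", "1000",
        "--generation_strategy", "multinomial_sampling",
        "--max_new_tokens", "20",
        "--device", "cuda",
        "--perturbation_type", perturbation_type,
        "--perturbation_intensity", str(intensity),
    ]


models = ["llama32_3b", "llama32_3b_gptq_4bit", "llama32_3b_awq_4bit"]
intensities = [4, 16]

commands = [build_eval_command(m, i) for m in models for i in intensities]
for argv in commands:
    print(shlex.join(argv))  # dry run: print instead of executing
```

To actually launch the runs, each `argv` list can be passed to `subprocess.run(argv, check=True)` in place of the `print` call.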

### Parameters for the Reliability Evaluation

- `--exp_id`:  
  Unique identifier for the experiment. Used for logging and organizing outputs.

- `--model_name`:  
  The base/quantized model to evaluate.  
  **Models**:
  - `llama3_8b`, `llama3_70b`
  - `llama32_1b`, `llama32_3b`
  - Quantized variants:  
    - `llama32_3b_gptq_2bit`, `llama32_3b_gptq_4bit`, `llama32_3b_gptq_8bit`  
    - `llama32_3b_awq_4bit`, `llama32_3b_hqq_2bit`, etc.

- `--source_type`:  
  Source format of the dataset.  
  - `raw`: Original unperturbed data  
  - `processed`: Data with perturbations applied

- `--generation_strategy`:  
  Response generation method.  
  **Options**:  
  - `multinomial_sampling`, `greedy_search`

- `--max_new_tokens`:  
  Maximum number of new tokens to generate (e.g., `15`)

- `--temperature`:  
  Sampling temperature for generation. Higher values introduce more randomness (e.g., `0.7`)
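
The difference between the two generation strategies, and the effect of the temperature, can be illustrated on a toy three-token vocabulary. This is a conceptual sketch in plain Python, not the generation code used by `reliability_eval.py`; all function names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy search: always take the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def multinomial_pick(logits, temperature, rng):
    """Multinomial sampling: draw a token from the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5]  # toy scores for a three-token vocabulary
rng = random.Random(0)
for temperature in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
print("greedy:", greedy_pick(logits))
print("sampled:", multinomial_pick(logits, 0.7, rng))
```

Low temperatures concentrate probability on the top-scoring token, so sampling behaves almost like greedy search. Higher temperatures flatten the distribution, so repeated runs on the same prompt vary more.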

### Additional Parameters
For a complete list of parameters, run any script with the `--help` flag.

All configurable parameters and available options are documented in the `parameters.yaml` file included in this repository. This file contains:

- Default parameter values
- Available dataset types
- Supported model names
- Perturbation types and intensities
- Generation strategies
- Prompt strategies

You can refer to this file when constructing command lines for any of the scripts.


## 3. Perplexity Evaluation

Compute the perplexity of a model on a text dataset (here, the WikiText test split):

```bash
python perplexity_eval.py \
  --exp_id "llama32-1b-wikitext-perplexity-slurm" \
  --model_name "llama32_1b" \
  --dataset_name "wikitext" \
  --split "test" \
  --n_samples 128 \
  --device "cuda" 
```
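
For reference, perplexity is conventionally defined as the exponential of the mean negative log-likelihood per token. The sketch below applies that standard definition to a list of per-token log-probabilities. It is not the implementation in `perplexity_eval.py`, which may differ in how it windows or batches the text.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy example: a model that assigns probability 0.25 to every token.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

A model that spreads its probability evenly over four choices at every step has a perplexity of about 4, so lower values indicate a model that is less "surprised" by the text.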
