# Sensitivity of Small Language Models to Fine-tuning Data Contamination

This repository contains code and data for evaluating how robust small language models (SLMs) are to contaminated fine-tuning data. Contamination is introduced through the transformation patterns found in `Data/train/`: counterfactual, character reversal, word reversal, and irrelevant data. The framework includes implementations for multiple state-of-the-art SLMs and a structured evaluation pipeline.

## Project Structure

```
.
├── Data/
│   ├── data_percentage.py        # Creates data CSVs at different contamination levels
│   ├── test/
│   │   └── D_test.csv
│   └── train/
│       ├── D_ad_cfact.csv        # Counterfactual dataset
│       ├── D_ad_creversal.csv    # Character reversal dataset
│       ├── D_ad_irr.csv          # Irrelevant dataset
│       ├── D_ad_wreversal.csv    # Word reversal dataset
│       └── D_ad.csv              # Base clean dataset
│       # Note: Each dataset also has 25%, 50%, and 75% variants
├── Counterfactual/               # Counterfactual data generation
│   ├── generate_cfact.py         # Generate counterfactual data
│   └── score_cfact.py            # Score and evaluate the generated counterfactual data
├── Gemma3/
│   ├── configs/
│   │   └── config.yaml           # Model configuration parameters
│   ├── metrics.py                # Calculation of specific evaluation metrics for training
│   ├── src/
│   │   ├── dataloader.py         # Data loading and preprocessing
│   │   ├── model_loader.py       # Model initialization and setup
│   │   └── training.py           # Training loop and optimization
│   └── train.py                  # Main training script
├── Llama3.2/
├── OLMo2/
├── Phi4/
├── Qwen2.5/
├── SmolLM2/
├── Visualizations/                # Visualization scripts for generating figures
│   ├── figure2.py                 # Combined plot for model sensitivity
│   ├── figure3.py                 # Heatmap for syntactic tasks (adherence and accuracy)
│   ├── figure3_part2.py           # Heatmap for syntactic tasks (semantic similarity and grammar)
│   ├── figure4.py                 # Heatmap for semantic tasks (adherence and accuracy)
│   └── figure4_part2.py           # Heatmap for semantic tasks (semantic similarity and grammar)
├── agreement.py                   # Analysis of agreement between models and human evaluator
├── evaluation_responses.py        # Evaluation of model responses on different transformations
└── inference.py                   # Inference pipeline for generating model predictions
```
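
The training files implied by the listing can be enumerated with a short helper. This is a minimal sketch, not part of the repository: the function names `expected_train_files` and `missing_files` are hypothetical, and it assumes that only the four contaminated datasets (not `D_ad.csv`) have 25%, 50%, and 75% variants, named on the `D_ad_<variant>_<percentage>.csv` pattern.

```python
from pathlib import Path

# Variant suffixes taken from the Data/train/ listing above.
VARIANTS = ["cfact", "creversal", "irr", "wreversal"]
# Contamination levels mentioned in the listing note.
PERCENTAGES = [25, 50, 75]


def expected_train_files(train_dir="Data/train"):
    """Return the training CSV paths the listing implies (hypothetical helper)."""
    root = Path(train_dir)
    files = [root / "D_ad.csv"]  # base clean dataset
    for variant in VARIANTS:
        files.append(root / f"D_ad_{variant}.csv")
        # Assumption: partial-contamination files follow this naming pattern.
        files.extend(root / f"D_ad_{variant}_{pct}.csv" for pct in PERCENTAGES)
    return files


def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not p.exists()]


if __name__ == "__main__":
    for path in missing_files(expected_train_files()):
        print(f"missing: {path}")
```

Running it from the repository root prints any expected file that is absent, which can be a quick check before training.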

## Setup

Prerequisites:
- Python 3.8 or higher
- pip (Python package installer)
- Virtual environment module (venv)

1. Create and activate a virtual environment, then install the required packages:
   ```bash
   # Create a new virtual environment
   python -m venv venv

   # Activate the virtual environment
   # On Unix or MacOS:
   source venv/bin/activate
   # On Windows:
   # .\venv\Scripts\activate

   # Install required packages
   pip install -r requirements.txt
   ```
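
Before installing, it may help to confirm that the interpreter meets the Python 3.8 prerequisite and that the virtual environment is active. The sketch below uses only the standard library; the helper names `python_ok` and `in_virtualenv` are illustrative and do not come from this repository.

```python
import sys

# Minimum version from the prerequisites list above.
MIN_VERSION = (3, 8)


def python_ok(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def in_virtualenv():
    """Return True when running inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix points at the interpreter it was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Virtual environment active:", in_virtualenv())
```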

## Getting Started

1. Generate and score counterfactual data (already generated and included in `Data/`):
   ```bash
   # Generate counterfactual data
   python Counterfactual/generate_cfact.py
   
   # Score and evaluate the generated counterfactuals
   python Counterfactual/score_cfact.py \
     --orig Data/train/D_ad.csv \
     --cf Data/train/D_ad_cfact.csv \
     --out Data/train/D_ad_cfact_scored_llm.csv
   ```

2. Create datasets with different contamination levels (already generated and included in `Data/train/`):
   ```bash
   python Data/data_percentage.py \
     --input_file Data/train/D_ad_<>.csv \
     --output_dir Data/train \
     --percentages 25 50 75
   ```
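   Replace `<>` in the input path with a dataset variant (`cfact`, `creversal`, `irr`, or `wreversal`). This creates datasets with 25%, 50%, and 75% contamination levels (e.g., `D_ad_cfact_25.csv`, `D_ad_cfact_50.csv`, `D_ad_cfact_75.csv`).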

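3. Choose a model implementation directory (e.g., `Gemma3/`).
4. Configure the model using the respective `config.yaml` (for example, `Gemma3/configs/config.yaml`).
5. Train the model, replacing `ModelName` with the chosen directory: `python ModelName/train.py`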
6. Run inference:
   ```bash
   python inference.py \
     --models_dir path_to_folder_containing_models \
     --csv_path path_to_test_csv \
     --output_dir inference_results \
     --batch-size 8 \
     --max-tokens 1024 \
     --temperature 0.0 \
     --top-p 0.95
   ```
7. Evaluate results:
   ```bash
   python evaluation_responses.py \
     -i path_to_inference_results \
     -o path_to_evaluation_output
   ```
   Note: Evaluation uses a Gemini model and requires the `GOOGLE_API_KEY` environment variable to be set.
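
To make the idea of a contamination level from step 2 concrete, here is a minimal sketch of one way to build a partially contaminated dataset. It does not reproduce `Data/data_percentage.py`: the function `mix_rows`, the row-aligned inputs, and the seeded random selection are all assumptions made for illustration.

```python
import random


def mix_rows(clean_rows, contaminated_rows, percentage, seed=0):
    """Swap `percentage` percent of clean rows for their contaminated versions.

    Assumes both lists are row-aligned, i.e. contaminated_rows[i] is the
    transformed counterpart of clean_rows[i] (an illustrative assumption).
    """
    if len(clean_rows) != len(contaminated_rows):
        raise ValueError("clean and contaminated datasets must be row-aligned")
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")

    total = len(clean_rows)
    n_swap = round(total * percentage / 100)
    # A seeded generator keeps the selection reproducible across runs.
    rng = random.Random(seed)
    swap_idx = set(rng.sample(range(total), n_swap))
    return [
        contaminated_rows[i] if i in swap_idx else clean_rows[i]
        for i in range(total)
    ]


if __name__ == "__main__":
    clean = [f"clean-{i}" for i in range(8)]
    contaminated = [f"contaminated-{i}" for i in range(8)]
    for pct in (25, 50, 75):
        mixed = mix_rows(clean, contaminated, pct)
        n_bad = sum(row.startswith("contaminated") for row in mixed)
        print(f"{pct}%: {n_bad} of {len(mixed)} rows contaminated")
```

Keeping the row order fixed and swapping rows in place means each contamination level differs from the clean dataset only in the selected rows, which makes comparisons across levels easier to interpret.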