# Whisper ASR Evaluation - Reproducibility Package

This package contains the files needed to reproduce the Whisper ASR evaluation experiments across model sizes and prompting strategies.

## 📁 Package Structure

```
reproducibility/
├── decode_scripts/                                          # All evaluation scripts
│   ├── comprehensive_asr_evaluation_improved_medium.py      # Whisper-Medium (high accuracy)
│   ├── comprehensive_asr_evaluation_improved_oracle.py      # Oracle experiment (normalized_truth as prompt)
│   ├── comprehensive_asr_evaluation_improved_base.py        # Whisper-Base (balanced)
│   ├── comprehensive_asr_evaluation_improved_tiny.py        # Whisper-Tiny (fastest)
│   ├── comprehensive_asr_evaluation_improved_adversarial.py # Adversarial prompts
│   └── comprehensive_asr_evaluation_improved.py             # Original Whisper-Small
├── compute_metrics_approx.py                                # Optimized metrics calculator
├── asr_experiment_logger_improved.py                        # Enhanced experiment logger
├── llm_text_normalizer.py                                   # Text normalizer
├── bedrock_claude/                                          # AWS Bedrock client
│   ├── __init__.py
│   ├── client.py
│   ├── config.py
│   ├── exceptions.py
│   └── tools.py
├── requirements.txt                                         # Python dependencies
├── README.md                                                # This file
└── run_test.sh                                              # Quick test script
```

## 🚀 Quick Start

### 1. Environment Setup

```bash
# Create virtual environment
python -m venv whisper_asr_env
source whisper_asr_env/bin/activate  # On Windows: whisper_asr_env\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 2. AWS Credentials Setup

Configure AWS credentials for Bedrock access:

```bash
# Option 1: AWS CLI
aws configure

# Option 2: Environment variables
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1
```

### 3. Dataset Preparation

The scripts expect the dataset at `../final_hf_dataset/agentic_asr_normalized`. You can:

- **Option A**: Place your dataset there
- **Option B**: Use the `--dataset-path` argument to specify a different location
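
To confirm the dataset location before starting a long run, a small pre-flight check can help. The sketch below is illustrative only: `resolve_dataset_path` is a hypothetical helper and not part of this package, and it only checks that the directory exists.

```python
from pathlib import Path

# Default location taken from this README; the helper itself is hypothetical.
DEFAULT_DATASET_PATH = Path("../final_hf_dataset/agentic_asr_normalized")


def resolve_dataset_path(cli_path=None):
    """Return the dataset directory, preferring an explicit --dataset-path value."""
    path = Path(cli_path) if cli_path else DEFAULT_DATASET_PATH
    if not path.is_dir():
        raise FileNotFoundError(
            f"Dataset not found at {path}; pass --dataset-path to point elsewhere."
        )
    return path
```

If the dataset was written with `save_to_disk`, `datasets.load_from_disk(path)` is the usual way to open it; whether the evaluation scripts load it exactly that way is not stated here.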

### 4. Run Experiments

#### Quick Test (10 samples)
```bash
# Test Whisper-Medium
python decode_scripts/comprehensive_asr_evaluation_improved_medium.py --test-run

# Test Oracle experiment
python decode_scripts/comprehensive_asr_evaluation_improved_oracle.py --test-run

# Test other models
python decode_scripts/comprehensive_asr_evaluation_improved_tiny.py --test-run
python decode_scripts/comprehensive_asr_evaluation_improved_base.py --test-run
```

#### Full Evaluation
```bash
# Run full evaluation with different models
python decode_scripts/comprehensive_asr_evaluation_improved_medium.py
python decode_scripts/comprehensive_asr_evaluation_improved_base.py
python decode_scripts/comprehensive_asr_evaluation_improved_tiny.py
python decode_scripts/comprehensive_asr_evaluation_improved_oracle.py
```

#### Custom Configuration
```bash
# Custom dataset path and sample count
python decode_scripts/comprehensive_asr_evaluation_improved_medium.py \
    --dataset-path /path/to/your/dataset \
    --max-samples 1000 \
    --checkpoint-interval 100

# Enable slot-WER (slower but more detailed metrics)
python decode_scripts/comprehensive_asr_evaluation_improved_medium.py \
    --enable-slot-wer \
    --max-samples 500
```

## 🔧 Command Line Options

All scripts support the following options:

| Option | Default | Description |
|--------|---------|-------------|
| `--dataset-path` | `../final_hf_dataset/agentic_asr_normalized` | Path to HuggingFace dataset |
| `--model-name` | Varies by script | Whisper model name |
| `--experiment-name` | Varies by script | Experiment name for logging |
| `--max-samples` | All samples | Maximum number of samples to process |
| `--start-idx` | 0 | Starting sample index |
| `--checkpoint-interval` | 500 | Save checkpoint every N samples |
| `--enable-slot-wer` | False | Enable slot-WER computation (slower) |
| `--disable-auto-resume` | False | Disable automatic resume from checkpoints |
| `--test-run` | False | Run on first 10 samples for testing |
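
The table above can be read as an `argparse` configuration. The following is a minimal sketch of how such a parser might look, not the scripts' actual code: the default model and experiment names are placeholders, and the rule that `--test-run` takes precedence over `--max-samples` is an assumption.

```python
import argparse


def build_parser(default_model="openai/whisper-medium",
                 default_experiment="whisper_medium"):
    """Hypothetical parser mirroring the option table; defaults vary by script."""
    p = argparse.ArgumentParser(description="Whisper ASR evaluation (sketch)")
    p.add_argument("--dataset-path",
                   default="../final_hf_dataset/agentic_asr_normalized")
    p.add_argument("--model-name", default=default_model)
    p.add_argument("--experiment-name", default=default_experiment)
    p.add_argument("--max-samples", type=int, default=None)  # None = all samples
    p.add_argument("--start-idx", type=int, default=0)
    p.add_argument("--checkpoint-interval", type=int, default=500)
    p.add_argument("--enable-slot-wer", action="store_true")
    p.add_argument("--disable-auto-resume", action="store_true")
    p.add_argument("--test-run", action="store_true")
    return p


def effective_max_samples(args):
    """Assumed rule: --test-run caps the run at the first 10 samples."""
    return 10 if args.test_run else args.max_samples


if __name__ == "__main__":
    args = build_parser().parse_args(["--test-run"])
    print(effective_max_samples(args))
```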

## 📊 Model Variants

### 1. Whisper-Medium (`comprehensive_asr_evaluation_improved_medium.py`)
- **Model**: `openai/whisper-medium`
- **Use Case**: High accuracy applications
- **Speed**: Slowest variant; trades speed for accuracy
- **Prompt**: Uses dataset prompts for conditioning

### 2. Whisper-Base (`comprehensive_asr_evaluation_improved_base.py`)
- **Model**: `openai/whisper-base`
- **Use Case**: Balanced accuracy/speed
- **Speed**: Moderate
- **Prompt**: Uses dataset prompts for conditioning

### 3. Whisper-Tiny (`comprehensive_asr_evaluation_improved_tiny.py`)
- **Model**: `openai/whisper-tiny`
- **Use Case**: Fast inference
- **Speed**: Fastest
- **Prompt**: Uses dataset prompts for conditioning

### 4. Oracle Experiment (`comprehensive_asr_evaluation_improved_oracle.py`)
- **Model**: `openai/whisper-small`
- **Use Case**: Upper bound performance analysis
- **Prompt**: Uses `normalized_truth` as perfect context
- **Purpose**: Measures impact of perfect prompting

### 5. Adversarial Prompts (`comprehensive_asr_evaluation_improved_adversarial.py`)
- **Model**: `openai/whisper-small`
- **Use Case**: Robustness testing
- **Prompt**: Uses prompts from the wrong domain to mislead the model
- **Purpose**: Measures impact of misleading context

## ⚡ Performance Optimizations

All scripts include these optimizations:

- **~3x Speed Improvement**: Slot-WER is disabled by default, saving ~8-10 s per sample (turn it on with `--enable-slot-wer`)
- **Auto-Resume**: Resumes from the latest checkpoint after an interruption
- **Greedy Decoding**: Single-beam decoding instead of beam search
- **Optimized Metrics**: Text normalization runs only on the best prediction
- **Robust Error Handling**: Graceful shutdown with state preservation

## 📈 Output Files

Each experiment generates:

```
asr_experiments/
└── {experiment_name}_{timestamp}/
    ├── {experiment_name}_{timestamp}.jsonl          # Raw results
    ├── checkpoints/                                 # Auto-save checkpoints
    │   ├── checkpoint_0001.jsonl
    │   └── ...
    ├── {experiment_name}_FINAL.jsonl               # Final consolidated results
    ├── {experiment_name}_FINAL.csv                 # Summary CSV
    └── final_analytics/                            # Detailed analytics
        ├── overall_performance.json
        ├── domain_breakdown.json
        └── voice_breakdown.json
```
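
Once a run finishes, the consolidated results can be loaded for ad hoc analysis. This is a minimal sketch assuming the `*_FINAL.jsonl` file holds one JSON object per line; `load_final_results` is a hypothetical helper, and the record fields are not documented here, so inspect `records[0].keys()` before relying on any of them.

```python
import json
from pathlib import Path


def load_final_results(experiment_dir):
    """Load every record from the *_FINAL.jsonl file in one experiment directory."""
    final_files = sorted(Path(experiment_dir).glob("*_FINAL.jsonl"))
    if not final_files:
        raise FileNotFoundError(f"No *_FINAL.jsonl file in {experiment_dir}")
    with final_files[0].open(encoding="utf-8") as fh:
        # Skip blank lines so a trailing newline does not break parsing.
        return [json.loads(line) for line in fh if line.strip()]
```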

## 🔍 Monitoring Progress

The scripts provide real-time progress updates:

```
⏱️  Progress: 1500/3200 (46.9%) - 0.8 samples/sec - Avg: 1.2s/sample - ETA: 35.4min [Whisper-Medium]
```
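
The figures in that line are consistent with a simple throughput-based estimate. The sketch below reproduces the percentage and ETA from the example; that the scripts compute the ETA exactly this way is an assumption.

```python
def eta_minutes(done, total, samples_per_sec):
    """Remaining samples divided by throughput, converted to minutes."""
    return (total - done) / samples_per_sec / 60


done, total, rate = 1500, 3200, 0.8
print(f"{done / total:.1%}")                        # 46.9%
print(f"{eta_minutes(done, total, rate):.1f}min")   # 35.4min
```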

## 🛠️ Troubleshooting

### Common Issues

1. **AWS Credentials Error**
   ```bash
   # Verify credentials
   aws sts get-caller-identity
   ```

2. **Dataset Not Found**
   ```bash
   # Use custom path
   python decode_scripts/comprehensive_asr_evaluation_improved_medium.py --dataset-path /your/dataset/path
   ```

3. **CUDA Out of Memory**
   ```bash
   # Use a smaller model or reduce batch processing
   python decode_scripts/comprehensive_asr_evaluation_improved_tiny.py
   ```

4. **Import Errors**
   ```bash
   # Ensure you're in the reproducibility directory
   cd reproducibility
   python decode_scripts/comprehensive_asr_evaluation_improved_medium.py --test-run
   ```

### Resume Interrupted Experiments

All scripts automatically resume from the last checkpoint:

```bash
# Just run the same command again - it will auto-resume
python decode_scripts/comprehensive_asr_evaluation_improved_medium.py
```

To disable auto-resume:
```bash
python decode_scripts/comprehensive_asr_evaluation_improved_medium.py --disable-auto-resume
```
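
For intuition, resuming amounts to finding the last processed sample in the saved checkpoints and continuing from the next index. The sketch below is one way this could work, not the scripts' actual logic: it assumes each line of `checkpoints/checkpoint_*.jsonl` is one processed sample with an integer `sample_idx` field, which is a hypothetical field name.

```python
import json
from pathlib import Path


def find_resume_index(experiment_dir, start_idx=0, auto_resume=True):
    """Return the sample index to continue from, based on saved checkpoints."""
    checkpoint_dir = Path(experiment_dir) / "checkpoints"
    if not auto_resume or not checkpoint_dir.is_dir():
        return start_idx
    last_done = start_idx - 1
    for ckpt in sorted(checkpoint_dir.glob("checkpoint_*.jsonl")):
        with ckpt.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    # "sample_idx" is an assumed field name.
                    last_done = max(last_done, json.loads(line)["sample_idx"])
    return last_done + 1
```

With `auto_resume=False` (the `--disable-auto-resume` case), the function simply returns `start_idx`.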

## 📋 System Requirements

- **Python**: 3.8+
- **GPU**: CUDA-compatible GPU recommended (CPU works but is slower)
- **RAM**: 8GB+ recommended
- **Storage**: 2GB+ for checkpoints and results
- **AWS**: Bedrock access for text normalization

## 🔬 Experiment Design

### Oracle Experiment
The oracle experiment (`comprehensive_asr_evaluation_improved_oracle.py`) uses the `normalized_truth` field as the prompt instead of the original prompt. This provides an upper bound on performance by giving the model perfect context.

### Adversarial Experiment
The adversarial experiment (`comprehensive_asr_evaluation_improved_adversarial.py`) uses prompts from the wrong domain to test model robustness against misleading context.
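
The three prompting strategies differ only in which text is passed to the model as context. The sketch below illustrates that selection under stated assumptions: the field names `prompt` and `domain` are hypothetical (only `normalized_truth` is named in this README), and drawing a random prompt from another domain is one plausible way to build an adversarial prompt, not necessarily the one the script uses.

```python
import random


def choose_prompt(sample, mode="default", all_samples=None, rng=None):
    """Pick the conditioning prompt for one sample."""
    if mode == "oracle":
        # Oracle: the reference transcript itself is used as context.
        return sample["normalized_truth"]
    if mode == "adversarial":
        rng = rng or random.Random(0)
        # Adversarial: borrow a prompt that belongs to a different domain.
        others = [s["prompt"] for s in (all_samples or [])
                  if s["domain"] != sample["domain"]]
        if not others:
            raise ValueError("No prompt from a different domain is available")
        return rng.choice(others)
    # Default: the sample's own dataset prompt.
    return sample["prompt"]
```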

### Performance Comparison
Run multiple model sizes to compare accuracy vs. speed trade-offs:

```bash
# Quick comparison (100 samples each)
for size in tiny base medium; do
    python decode_scripts/comprehensive_asr_evaluation_improved_${size}.py --max-samples 100
done
```

## 📞 Support

For issues or questions:
1. Check the troubleshooting section above
2. Verify all dependencies are installed correctly
3. Ensure AWS credentials are properly configured
4. Test with the `--test-run` flag first

## 📄 License

This package is provided for research and evaluation purposes.
