# OFMU Experiment Reproducibility Guide

This directory contains complete experimental code for reproducing all results
reported in the OFMU paper submitted to ICLR 2025.

## Directory Structure

```
experiments/
├── tofu_experiments.py          # TOFU benchmark experiments (Section 5.2)
├── cifar_experiments.py         # CIFAR-10/100 experiments (Section 5.3)  
├── wmdp_experiments.py          # WMDP benchmark experiments (Appendix A.2)
├── run_all_experiments.py       # Master script to run all experiments
└── README.md                    # This file

configs/
└── experiment_config.yaml       # Configuration file for all experiments

evaluation/
└── evaluator.py                 # Evaluation metrics and analysis tools

analysis/
└── analyze_results.py           # Results analysis and plotting

data/
├── tofu/                        # TOFU dataset files
├── wmdp/                        # WMDP dataset files
└── setup_data.py                # Data download and setup script

results/                         # Generated results (created automatically)
plots/                          # Generated plots (created automatically)
logs/                           # Experiment logs (created automatically)
```

## Quick Start

### 1. Environment Setup

```bash
# Install dependencies
pip install -r requirements.txt

# Setup data
python data/setup_data.py --all
```

### 2. Run Individual Experiments

```bash
# TOFU experiments
python experiments/tofu_experiments.py --model llama2 --forget_scenario forget05 --method ofmu

# CIFAR experiments  
python experiments/cifar_experiments.py --dataset cifar10 --forget_type class --forget_class 0

# WMDP experiments
python experiments/wmdp_experiments.py --model zephyr --domain bio
```

### 3. Run All Experiments

```bash
# Create default configuration
python experiments/run_all_experiments.py --create_default_config

# Run all experiments sequentially
python experiments/run_all_experiments.py --experiment all --config configs/experiment_config.yaml

# Run experiments in parallel (faster, requires more GPU memory)
python experiments/run_all_experiments.py --experiment all --parallel
```

### 4. Analyze Results

```bash
# Generate all plots and tables from the paper
python analysis/analyze_results.py --results_dir results/ --output_dir plots/

# Generate specific analysis
python analysis/analyze_results.py --analysis_type tables  # Just tables
python analysis/analyze_results.py --analysis_type plots   # Just plots
```

## Experiment Details

### TOFU Experiments (Section 5.2)

Tests OFMU on synthetic QA data for entity-level unlearning:

- **Models**: LLaMA-2-7B-chat, LLaMA-3.2-1B-Instruct
- **Scenarios**: forget01 (1%), forget05 (5%), forget10 (10%) 
- **Methods**: OFMU, Gradient Ascent, Gradient Diff, NPO, SimNPO, RMU
- **Metrics**: Forget Quality (FQ), Model Utility (MU), Forget Truth Ratio (FTR)

**Expected Results**: OFMU achieves the best balance across all metrics, with especially strong results in the forget05 and forget10 scenarios.

### CIFAR Experiments (Section 5.3)

Tests OFMU on vision classification tasks:

- **Datasets**: CIFAR-10, CIFAR-100
- **Forget Types**: Class-wise (remove entire class), Random (remove 10% randomly)
- **Methods**: OFMU, Fine-tuning, Gradient Ascent, Fisher Forget, Influence Unlearning
- **Metrics**: Unlearning Accuracy (UA), Retain Accuracy (RA), Test Accuracy (TA), MIA Efficacy

**Expected Results**: OFMU shows consistent performance across both forget types, with particularly strong MIA resistance.
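
The two forget types listed above can be sketched as an index split. This is a minimal, standard-library illustration and not the code in `cifar_experiments.py`: the function name `split_forget_retain` and the `seed` argument are hypothetical, while `forget_class` and `forget_ratio` mirror the config keys of the same name.

```python
import random


def split_forget_retain(labels, forget_type, forget_class=0, forget_ratio=0.1, seed=0):
    """Return (forget_indices, retain_indices) for a list of integer labels."""
    indices = list(range(len(labels)))
    if forget_type == "class":
        # Class-wise: every sample of the chosen class goes to the forget set.
        forget = [i for i in indices if labels[i] == forget_class]
    elif forget_type == "random":
        # Random: a fixed fraction of all samples, drawn without replacement.
        rng = random.Random(seed)
        forget = sorted(rng.sample(indices, int(len(indices) * forget_ratio)))
    else:
        raise ValueError(f"unknown forget_type: {forget_type!r}")
    forget_set = set(forget)
    retain = [i for i in indices if i not in forget_set]
    return forget, retain


# Toy labels: 10 classes with 100 samples each (class-balanced, like CIFAR-10).
labels = [c for c in range(10) for _ in range(100)]
class_forget, class_retain = split_forget_retain(labels, "class", forget_class=0)
rand_forget, rand_retain = split_forget_retain(labels, "random", forget_ratio=0.1)
print(len(class_forget), len(class_retain))
print(len(rand_forget), len(rand_retain))
```

Fixing the seed keeps the random forget set identical across methods, so that any comparison between them uses the same samples.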

### WMDP Experiments (Appendix A.2)

Tests OFMU on safety-critical QA domains:

- **Models**: Zephyr-7B-beta
- **Domains**: Biosecurity, Cybersecurity, Chemistry
- **Methods**: OFMU, Baseline, Fine-tuning, Gradient Ascent
- **Metrics**: QA Accuracy on domain-specific benchmarks

**Expected Results**: OFMU preserves more model utility than the baselines.

## Configuration

The `configs/experiment_config.yaml` file controls all experiment parameters:

```yaml
# Which experiments to run
experiments: ["tofu", "cifar", "wmdp"]

# TOFU settings
tofu:
  models: ["llama2", "llama3"]
  scenarios: ["forget01", "forget05", "forget10"]
  methods: ["ofmu", "gradient_ascent", "npo", "rmu"]
  batch_size: 8
  num_epochs: 5

# CIFAR settings
cifar:
  datasets: ["cifar10", "cifar100"]
  forget_types: ["class", "random"]
  forget_class: 0
  forget_ratio: 0.1

# OFMU hyperparameters
ofmu_params:
  beta: 0.1          # Similarity regularization
  rho_init: 0.01     # Initial penalty parameter
  inner_steps: 5     # Inner maximization steps
  inner_lr: 1e-5     # Inner learning rate
  outer_lr: 1e-5     # Outer learning rate
```
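
To gauge how many runs a configuration implies, the list-valued TOFU settings can be expanded into a grid. The sketch below assumes that each combination of model, scenario, and method is one run; whether `run_all_experiments.py` expands the config this way is not confirmed here. The config is written as a plain dict so the example runs without a YAML parser, and `tofu_runs` is a hypothetical helper name.

```python
from itertools import product

# Mirrors the TOFU section of the YAML above as a plain dict; in practice
# the dict would come from parsing configs/experiment_config.yaml.
config = {
    "tofu": {
        "models": ["llama2", "llama3"],
        "scenarios": ["forget01", "forget05", "forget10"],
        "methods": ["ofmu", "gradient_ascent", "npo", "rmu"],
    }
}


def tofu_runs(cfg):
    """Yield one (model, scenario, method) tuple per assumed TOFU run."""
    tofu = cfg["tofu"]
    yield from product(tofu["models"], tofu["scenarios"], tofu["methods"])


runs = list(tofu_runs(config))
print(len(runs))  # 24 = 2 models x 3 scenarios x 4 methods
for model, scenario, method in runs[:2]:
    # Same flags as the individual TOFU command in Quick Start.
    print(f"--model {model} --forget_scenario {scenario} --method {method}")
```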

## Hardware Requirements

### Minimum Requirements
- **GPU**: 16GB VRAM or more (e.g., RTX 4080 16GB, A100-40GB)
- **RAM**: 32GB system memory
- **Storage**: 100GB free space
- **Time**: ~48 hours for all experiments

### Recommended Setup
- **GPU**: 2x H100-80GB or 2x A100-80GB
- **RAM**: 64GB system memory  
- **Storage**: 500GB SSD
- **Time**: ~12 hours with parallel execution

### GPU Memory Usage by Experiment
- TOFU (LLaMA-2-7B): ~14GB VRAM
- TOFU (LLaMA-3.2-1B): ~6GB VRAM  
- CIFAR (ResNet-18): ~2GB VRAM
- WMDP (Zephyr-7B): ~14GB VRAM
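
As a rough sanity check on the 7B figures, the memory needed for the weights alone can be estimated as parameter count times bytes per parameter. This is a back-of-the-envelope sketch, not a measurement: it assumes a nominal 7 billion parameters stored in float16 and ignores activations, gradients, and optimizer state, so actual usage during unlearning can be higher. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory for the weights alone, in GB (10^9 bytes); 2 bytes = float16."""
    return num_params * bytes_per_param / 1e9


print(weight_memory_gb(7e9))     # 14.0 (nominal 7B parameters in float16)
print(weight_memory_gb(7e9, 4))  # 28.0 (same model in float32)
```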

## Expected Outputs

### Result Files
```
results/
├── ofmu_llama2_forget05.json
├── cifar_cifar10_class.json
├── wmdp_zephyr_bio.json
├── all_experiments_[timestamp].json
└── experiment_summary_[timestamp].txt
```

### Generated Plots  
```
plots/
├── tofu_performance_heatmap_llama2.png
├── cifar_performance_comparison.png
├── overall_performance_scores.png
├── tofu_comparison_table.csv
└── tofu_table.tex
```

### Key Tables and Figures from Paper
- **Table 1**: TOFU results comparison (`tofu_comparison_table.csv`)
- **Table 2**: CIFAR results comparison (`cifar_comparison_table.csv`)
- **Figure 2**: Overall performance scores (`overall_performance_scores.png`)
- **Figure 3**: CIFAR performance comparison (`cifar_performance_comparison.png`)
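
For a quick look at TOFU results without running the full analysis script, the per-run JSON files can be gathered into a CSV. The sketch below is an illustration, not the logic of `analyze_results.py`. It assumes files are named `<method>_<model>_<scenario>.json` (as in `ofmu_llama2_forget05.json`) and contain the `forget_quality` and `model_utility` keys used in the Validation section; the helper names are hypothetical, and the demo values are dummies, not results from the paper.

```python
import csv
import json
import tempfile
from pathlib import Path

FIELDS = ["method", "model", "scenario", "forget_quality", "model_utility"]


def collect_tofu_results(results_dir):
    """Read <method>_<model>_<scenario>.json files into a list of row dicts."""
    rows = []
    for path in sorted(Path(results_dir).glob("*_*_forget*.json")):
        # Split from the right so method names with underscores stay intact.
        method, model, scenario = path.stem.rsplit("_", 2)
        with path.open() as f:
            result = json.load(f)
        rows.append({
            "method": method,
            "model": model,
            "scenario": scenario,
            "forget_quality": result["forget_quality"],
            "model_utility": result["model_utility"],
        })
    return rows


def write_csv(rows, out_path):
    """Write the collected rows as a CSV table with a header line."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Demo on a temporary directory with dummy values (illustration only).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["ofmu_llama2_forget05", "gradient_ascent_llama2_forget05"]:
        dummy = {"forget_quality": 0.5, "model_utility": 0.5}
        (Path(tmp) / f"{name}.json").write_text(json.dumps(dummy))
    rows = collect_tofu_results(tmp)
    write_csv(rows, Path(tmp) / "tofu_comparison_table.csv")
    table_text = (Path(tmp) / "tofu_comparison_table.csv").read_text()
print(table_text)
```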

## Troubleshooting

### Common Issues

1. **Out of GPU Memory**
   ```bash
   # Reduce batch size in configs/experiment_config.yaml:
   #   tofu:
   #     batch_size: 4  # Instead of 8
   
   # Run experiments sequentially
   python experiments/run_all_experiments.py --experiment tofu  # Then cifar, wmdp separately
   ```

2. **Model Download Failures**
   ```bash
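   # Llama-2 weights are gated on Hugging Face: accept the license on the model page, then log in
   huggingface-cli login
   # Pre-download models
   python -c "from transformers import AutoModel; AutoModel.from_pretrained('meta-llama/Llama-2-7b-chat-hf')"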
   ```

3. **Dataset Loading Issues**
   ```bash
   # Manual data setup
   python data/setup_data.py --dataset tofu
   python data/setup_data.py --dataset wmdp --domain bio
   ```

4. **CUDA Errors**
   ```bash
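   # Clear GPU cache (frees only memory cached by the calling process;
   # stop stale processes listed by nvidia-smi to release the rest)
   python -c "import torch; torch.cuda.empty_cache()"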
   
   # Check GPU status
   nvidia-smi
   ```

### Performance Optimization

1. **Enable Mixed Precision**
   ```python
   # Pass to from_pretrained() when loading the model (loads weights in float16)
   torch_dtype=torch.float16
   ```

2. **Use Gradient Checkpointing**
   ```python
   model.gradient_checkpointing_enable()
   ```

3. **Parallel Data Loading**
   ```yaml
   hardware:
     num_workers: 4  # Adjust based on CPU cores
   ```

## Validation

To verify your setup is working correctly:

```bash
# Quick validation run (small subset)
python experiments/tofu_experiments.py --model llama3 --forget_scenario forget01 --method ofmu --num_epochs 1

# Check outputs
ls results/
ls plots/

# Verify key metrics are reasonable
python -c "
import json
with open('results/ofmu_llama3_forget01.json') as f:
    result = json.load(f)
print(f'FQ: {result[\"forget_quality\"]:.3f}')
print(f'MU: {result[\"model_utility\"]:.3f}')
"
```
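
What counts as "reasonable" can be made explicit with a small check. The sketch below assumes both metrics are reported on a 0 to 1 scale, which this guide does not state, so adjust the bounds if your results use a different scale. `check_result` is a hypothetical helper, and the inputs are dummy values rather than results from the paper.

```python
import math


def check_result(result, keys=("forget_quality", "model_utility")):
    """Return a list of problems found; an empty list means the check passed."""
    problems = []
    for key in keys:
        value = result.get(key)
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{key}: missing or not a number")
        elif math.isnan(value) or not 0.0 <= value <= 1.0:
            problems.append(f"{key}: {value} is outside [0, 1]")
    return problems


# In practice `result` would be loaded with json.load() from a results file.
ok = check_result({"forget_quality": 0.5, "model_utility": 0.5})
bad = check_result({"forget_quality": float("nan")})
print(ok)
print(bad)
```

A NaN metric or a missing key usually points to a failed or interrupted run rather than a poor unlearning result, so it is worth checking `logs/` before rerunning.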

## Citation

If you use this code in your research, please cite:

```bibtex
@inproceedings{ofmu2025,
  title={OFMU: Optimization-Driven Framework for Machine Unlearning},
  author={[Authors]},
  booktitle={International Conference on Learning Representations},
  year={2025}
}
```

## Contact

For questions about reproducing results or technical issues:
- Create an issue in the repository
- Email: [contact information]

## License

This code is released under the MIT License. See LICENSE file for details.