# DataOpt: Data-Centric Unlearning Framework

This repository contains the implementation code for the paper **"Data-Centric Unlearning: Optimizing Labels and Retain Data via Learning Dynamics"**.

## Overview

DataOpt is a data-centric machine unlearning framework that optimizes both label assignment and retain set construction to improve the effectiveness of existing unlearning algorithms. The framework provides:

1. **Optimal Label Assignment**: Theoretical framework for assigning optimal soft labels to both forget and retain samples
2. **Strategic Retain Set Construction**: Selection of neighborhood, boundary, and adversarial samples for effective unlearning
3. **Universal Enhancement**: Applicable to any existing unlearning method as a preprocessing step
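
To make the idea of a soft label concrete, here is a minimal sketch of one generic way to build a soft target for a forget sample: mask the true class in the model's predicted distribution and renormalize. This is an illustration only and is not the label assignment derived in the paper; the function name `masked_soft_label` is hypothetical.

```python
def masked_soft_label(probs, true_class):
    """Zero the true class in a probability vector and renormalize the rest.

    Generic illustration of a soft target for a forget sample; NOT the
    paper's optimal label assignment.
    """
    masked = [0.0 if i == true_class else p for i, p in enumerate(probs)]
    total = sum(masked)
    if total == 0.0:
        # All mass sat on the true class: fall back to uniform over the others.
        n_other = len(probs) - 1
        return [0.0 if i == true_class else 1.0 / n_other
                for i in range(len(probs))]
    return [m / total for m in masked]


soft = masked_soft_label([0.7, 0.2, 0.1], true_class=0)
print(soft)
```

The resulting vector still sums to one, so it can serve as a target for a standard cross-entropy loss.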

## Key Features

- **Classification Tasks**: Enhanced unlearning for CIFAR-100, Tiny-ImageNet, CIFAR-10
- **LLM Tasks**: Improved unlearning for large language models on TOFU benchmark
- **Controllable Unlearning**: Adjustable unlearning degree through the parameter `k`
- **Comprehensive Baselines**: Implementation of SOTA methods (NEGGRAD, SCRUB, Bad Teacher, SalUn, GA, NPO, ICU)

## Installation

1. Clone the repository:
```bash
git clone <repository-url>
cd 11-unlearning
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. (Optional but recommended) Verify that PyTorch can see a CUDA-capable GPU:
```bash
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
```
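
The same check can be wrapped in a small helper that falls back to the CPU. This is a sketch; `pick_device` is a hypothetical name and not part of this repository.

```python
def pick_device(prefer_gpu=True):
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so only the CPU path is available.
        return "cpu"
    if prefer_gpu and torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(device)
```

The `cpu` value matches the `--device cpu` flag used elsewhere in this README; whether the scripts accept other device strings is not confirmed here.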

## Quick Start

### Run All Experiments
```bash
python run_experiments.py --all
```

### Run Specific Experiment
```bash
# SOTA Enhancement experiment
python run_experiments.py --experiment exp1

# LLM unlearning experiment
python run_experiments.py --experiment exp2

# Retain set composition analysis
python run_experiments.py --experiment exp3
```

### Run with Custom Parameters
```bash
python run_experiments.py --experiment exp1 --args "--dataset cifar100 --device cpu"
```
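
The quoted `--args` string has to be split into separate tokens before it reaches the experiment script. How `run_experiments.py` does this internally is not shown here; the sketch below illustrates one common approach with `shlex.split`, and `build_command` is a hypothetical helper.

```python
import shlex
import sys


def build_command(script, extra_args=""):
    """Build an argv list that runs `script` with a quoted argument string."""
    # shlex.split respects shell-style quoting inside the string.
    return [sys.executable, script, *shlex.split(extra_args)]


cmd = build_command(
    "experiments/exp1_sota_enhancement.py",
    "--dataset cifar100 --device cpu",
)
print(cmd[1:])
```

A list built this way can be passed directly to `subprocess.run` without invoking a shell.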

## Experiments

### Experiment 1: SOTA Enhancement
Tests DataOpt's ability to enhance existing SOTA unlearning methods.

- **Datasets**: CIFAR-100 (class unlearning), Tiny-ImageNet (random subset unlearning)
- **Methods**: NEGGRAD, SCRUB, Bad Teacher, SalUn
- **Metrics**: Acc_rt (retain accuracy), Acc_ft (forget accuracy), MIA, RUD

```bash
cd experiments
python exp1_sota_enhancement.py --dataset both --baselines NEGGRAD SCRUB BadTeacher SalUn
```
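
As a reference for the two accuracy metrics above, the sketch below computes accuracy separately on a retain set and a forget set. It is a minimal illustration with toy predictions, not the code in `utils/metrics.py`.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy inputs for illustration only.
acc_rt = accuracy([0, 1, 2, 2], [0, 1, 2, 1])  # retain set: 3 of 4 correct
acc_ft = accuracy([3, 4, 5], [9, 9, 9])        # forget set: 0 of 3 correct
print(acc_rt, acc_ft)
```

After class unlearning, a high Acc_rt together with a low Acc_ft is the usual sign that the forget class was removed without harming the rest.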

### Experiment 2: LLM Unlearning
Evaluates DataOpt on large language model unlearning tasks.

- **Models**: Llama-3-8B, Phi-3 (proxy models are substituted for local execution)
- **Dataset**: TOFU benchmark (synthetic version)
- **Methods**: GA, NPO, ICU, DataOpt
- **Metrics**: Forget Quality, Model Utility

```bash
cd experiments
python exp2_llm_unlearning.py --models llama-3-8b phi-3 --forget_ratios 0.01 0.05 0.10
```
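
A forget ratio determines how much of the data is placed in the forget set. The sketch below shows one simple way to produce such a split by shuffling indices with a fixed seed; `split_by_ratio` and the dataset size of 1000 are assumptions for illustration, not the script's actual logic.

```python
import random


def split_by_ratio(indices, forget_ratio, seed=0):
    """Randomly split indices into (forget, retain) lists by ratio."""
    rng = random.Random(seed)  # local RNG keeps the split reproducible
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_forget = round(len(shuffled) * forget_ratio)
    return sorted(shuffled[:n_forget]), sorted(shuffled[n_forget:])


for ratio in (0.01, 0.05, 0.10):
    forget, retain = split_by_ratio(range(1000), ratio)
    print(ratio, len(forget), len(retain))
```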

### Experiment 3: Retain Set Composition Analysis
Analyzes the impact of different retain set selection strategies.

- **Dataset**: CIFAR-10 (class unlearning)
- **Strategies**: Random, Neighborhood, Boundary, DataOpt (Mixed)
- **Fixed Algorithm**: NEGGRAD

```bash
cd experiments
python exp3_retain_composition.py --strategies Random Neighborhood Boundary DataOpt
```
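
The Neighborhood and Boundary strategies can be pictured with the following sketch. It assumes that "neighborhood" means retain samples closest to the forget samples in feature space and that "boundary" means retain samples with a small gap between the top two predicted probabilities. Both definitions and both function names are assumptions for illustration; the repository may define them differently.

```python
import math


def neighborhood_samples(retain_feats, forget_feats, n):
    """Indices of the n retain samples closest to any forget sample."""
    def nearest(feat):
        return min(math.dist(feat, f) for f in forget_feats)
    order = sorted(range(len(retain_feats)),
                   key=lambda i: nearest(retain_feats[i]))
    return order[:n]


def boundary_samples(retain_probs, n):
    """Indices of the n retain samples with the smallest top-2 margin."""
    def margin(probs):
        top = sorted(probs, reverse=True)
        return top[0] - top[1]
    order = sorted(range(len(retain_probs)),
                   key=lambda i: margin(retain_probs[i]))
    return order[:n]


# Toy features and probabilities for illustration only.
retain_feats = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
forget_feats = [(1.2, 0.9)]
retain_probs = [(0.9, 0.1), (0.55, 0.45), (0.7, 0.3)]
print(neighborhood_samples(retain_feats, forget_feats, 1))
print(boundary_samples(retain_probs, 2))
```

A mixed strategy could take the union of both index lists, although how DataOpt weights the two groups is not specified here.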

### Experiment 4: Unlearning Controllability
Demonstrates controllable unlearning through the degree parameter `k`.

- **Dataset**: CIFAR-10 (class unlearning)
- **k values**: 1, 3, 5, 7, 9
- **Framework**: Complete DataOpt

```bash
cd experiments
python exp4_controllability.py --k_values 1 3 5 7 9 --runs 3
```
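
With several runs per `k` value, results are usually reported as a mean and standard deviation. The sketch below shows that aggregation; `summarize_runs` is a hypothetical helper and the numbers are placeholders, not measured results.

```python
from statistics import mean, stdev


def summarize_runs(results):
    """Map each k to (mean, sample standard deviation) over its runs."""
    summary = {}
    for k, values in sorted(results.items()):
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[k] = (mean(values), spread)
    return summary


# Placeholder values for illustration only, not experimental output.
toy = {1: [0.5, 0.75, 1.0], 3: [0.25, 0.25, 0.25]}
for k, (avg, spread) in summarize_runs(toy).items():
    print(k, avg, spread)
```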

### Experiment 5: DELETE Framework Comparison
Compares the DataOpt label strategy with the DELETE framework.

- **Dataset**: CIFAR-10 (no retain set)
- **Methods**: DELETE-Label vs DataOpt-Label
- **Setting**: Forget-only fine-tuning

```bash
cd experiments
python exp5_delete_comparison.py --runs 5
```

## Core Components

### DataOpt Framework (`src/dataopt.py`)
Main framework implementing:
- Label assignment optimization (Eqs. 9-12 in the paper)
- Retain set construction (neighborhood, boundary, adversarial samples)
- LLM-specific adaptations

### Baseline Algorithms
- **Classification** (`baselines/classification.py`): NEGGRAD, SCRUB, Bad Teacher, SalUn, DELETE
- **LLM** (`baselines/llm.py`): Gradient Ascent, NPO, ICU, DataOpt-enhanced

### Evaluation Metrics (`utils/metrics.py`)
- **Classification**: Retain Accuracy, Forget Accuracy, MIA, RUD
- **LLM**: Forget Quality, Model Utility

## Results

Results are automatically saved to the `results/` directory in both JSON and CSV formats. The JSON outputs are:

```
results/
├── exp1_sota_enhancement_cifar100_results.json
├── exp2_llm_unlearning_llama-3-8b_results.json
├── exp3_retain_composition_results.json
├── exp4_controllability_results.json
├── exp5_delete_comparison_results.json
└── experiment_suite_summary.json
```
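
The JSON files can be loaded for further analysis with the standard library. The sketch below makes no assumption about the schema inside each file; `load_results` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_results(results_dir="results"):
    """Return {file stem: parsed JSON} for every .json file in results_dir."""
    root = Path(results_dir)
    if not root.is_dir():
        # Nothing has been written yet, so there is nothing to load.
        return {}
    loaded = {}
    for path in sorted(root.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            loaded[path.stem] = json.load(handle)
    return loaded


for name, payload in load_results().items():
    print(name, type(payload).__name__)
```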

## Expected Results

Based on the theoretical analysis, you should observe:

1. **Enhanced Performance**: DataOpt-enhanced methods consistently outperform vanilla baselines
2. **Controllable Unlearning**: Higher k values lead to stronger forgetting with stable retain performance
3. **Strategic Retain Sets**: Neighborhood and boundary samples significantly outperform random selection
4. **Superior Label Assignment**: DataOpt labels provide better utility-privacy trade-off than DELETE

## Customization

### Adding New Datasets
1. Create dataset loader in appropriate experiment file
2. Implement data splitting logic for forget/retain sets
3. Update evaluation metrics if needed
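
For class unlearning, the splitting logic in step 2 can be as simple as partitioning indices by label. A minimal sketch, with `split_by_class` as a hypothetical name:

```python
def split_by_class(labels, forget_classes):
    """Partition sample indices into (forget, retain) by class label."""
    forget_classes = set(forget_classes)
    forget_idx = [i for i, y in enumerate(labels) if y in forget_classes]
    retain_idx = [i for i, y in enumerate(labels) if y not in forget_classes]
    return forget_idx, retain_idx


forget_idx, retain_idx = split_by_class([0, 1, 2, 1, 0], forget_classes=[1])
print(forget_idx, retain_idx)
```

The index lists can then be used to build dataset subsets, for example with `torch.utils.data.Subset`.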

### Adding New Baselines
1. Implement baseline class in `baselines/classification.py` or `baselines/llm.py`
2. Follow the interface pattern of existing methods
3. Add to experiment runner scripts
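
The interface used by the existing baselines is not documented in this README. As a rough guide, a baseline might look like the hypothetical sketch below; the class names, the `unlearn` signature, and the `REGISTRY` dictionary are all assumptions, so check `baselines/classification.py` for the real pattern.

```python
from abc import ABC, abstractmethod


class UnlearningBaseline(ABC):
    """Hypothetical interface; the repository's real base class may differ."""

    name = "base"

    @abstractmethod
    def unlearn(self, model, forget_data, retain_data):
        """Return the model after unlearning forget_data."""


class IdentityBaseline(UnlearningBaseline):
    """Trivial baseline that returns the model unchanged (a sanity check)."""

    name = "Identity"

    def unlearn(self, model, forget_data, retain_data):
        return model


# Hypothetical registry a runner script could use to look up methods by name.
REGISTRY = {cls.name: cls for cls in (IdentityBaseline,)}
print(sorted(REGISTRY))
```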

### Custom Evaluation Metrics
1. Add new metrics to `utils/metrics.py`
2. Update experiment evaluation calls
3. Modify result logging format

## Hardware Requirements

- **GPU**: CUDA-compatible GPU with ≥8GB memory (recommended)
- **CPU**: Multi-core processor (the minimum requirement when running without a GPU)
- **RAM**: ≥16GB system memory
- **Storage**: ≥10GB free space for datasets and results

## Troubleshooting

### Common Issues

1. **CUDA Out of Memory**: Reduce batch sizes in experiment files
2. **Dataset Download Errors**: Check your internet connection, or download the dataset manually into `./data/`
3. **Import Errors**: Ensure all dependencies are installed with correct versions
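
To diagnose import errors, it helps to print the installed versions and compare them with `requirements.txt`. The sketch below uses `importlib.metadata`; the package list is an assumption (`torch` appears in the installation check above, `transformers` is a guess based on the HuggingFace acknowledgment).

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed package names; adjust to match requirements.txt.
print(installed_versions(["torch", "transformers"]))
```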

### Debug Mode
Add `--device cpu` to run experiments on CPU for debugging:
```bash
python run_experiments.py --experiment exp1 --args "--device cpu"
```

## Citation

If you use this code, please cite our paper:

```bibtex
@article{dataopt2024,
  title={Data-Centric Unlearning: Optimizing Labels and Retain Data via Learning Dynamics},
  author={[Authors]},
  journal={[Journal]},
  year={2024}
}
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## Contact

For questions about the implementation, please open an issue or contact [contact-email].

## Acknowledgments

- Original paper authors and research team
- Open-source unlearning community
- PyTorch and HuggingFace teams for excellent frameworks