# ImageNet Training with TMPD and TPD

This document shows how to train a spiking neural network (SNN) student on ImageNet with TMPD (Temporal Masked Progressive Distillation) and TPD (Temporal Progressive Distillation), distilling from a pre-trained ANN teacher.

## Prerequisites

1. **Teacher Model**: You need a pre-trained ANN teacher model (ResNet18/ResNet34) saved as a `.pth` file.
2. **ImageNet Dataset**: The ImageNet dataset should be organized in the standard format with `train/` and `val/` directories.
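
Before launching a long run, it can help to confirm that the dataset directory looks as expected. The following is a minimal sketch, not part of this repository: `check_imagenet_layout` is a hypothetical helper, and it assumes that "standard format" means one sub-directory per class inside `train/` and `val/` (the layout torchvision's `ImageFolder` expects).

```python
from pathlib import Path


def check_imagenet_layout(root):
    """Return {split: number of class sub-directories} for train/ and val/.

    Assumption (not confirmed by this README): each split contains one
    sub-directory per class, as torchvision's ImageFolder expects.
    """
    root = Path(root)
    counts = {}
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing directory: {split_dir}")
        # Count class folders; loose files directly under the split are ignored.
        counts[split] = sum(1 for p in split_dir.iterdir() if p.is_dir())
    return counts
```

Calling `check_imagenet_layout("/path/to/imagenet")` on a full ImageNet-1k copy should report the same number of class folders for both splits if the assumption above holds.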

## Basic Usage

### 1. Traditional Training (without TMPD/TPD)
```bash
python experiment/imagenet/main.py \
    --dataset imagenet \
    --data_path /path/to/imagenet \
    --tea_path /path/to/teacher_model.pth \
    --stu_arch preact_resnet34 \
    --tea_arch resnet34 \
    --alpha 0.1 \
    --beta 0.1 \
    --T 4 \
    --num_epoch 100
```

### 2. Training with TPD Only
```bash
python experiment/imagenet/main.py \
    --dataset imagenet \
    --data_path /path/to/imagenet \
    --tea_path /path/to/teacher_model.pth \
    --stu_arch preact_resnet34 \
    --tea_arch resnet34 \
    --alpha 0.1 \
    --beta 0.1 \
    --use_tpd \
    --tpd_weight 0.5 \
    --tpd_temp 3.0 \
    --T 4 \
    --num_epoch 100
```

### 3. Training with TMPD Only
```bash
python experiment/imagenet/main.py \
    --dataset imagenet \
    --data_path /path/to/imagenet \
    --tea_path /path/to/teacher_model.pth \
    --stu_arch preact_resnet34 \
    --tea_arch resnet34 \
    --alpha 0.1 \
    --beta 0.1 \
    --use_tmpd \
    --mask_prob 0.3 \
    --mask_lambda 0.5 \
    --T 4 \
    --num_epoch 100
```

### 4. Training with Both TMPD and TPD (Recommended)
```bash
python experiment/imagenet/main.py \
    --dataset imagenet \
    --data_path /path/to/imagenet \
    --tea_path /path/to/teacher_model.pth \
    --stu_arch preact_resnet34 \
    --tea_arch resnet34 \
    --alpha 0.1 \
    --beta 0.1 \
    --use_tmpd \
    --mask_prob 0.3 \
    --mask_lambda 0.5 \
    --use_tpd \
    --tpd_weight 0.5 \
    --tpd_temp 3.0 \
    --T 4 \
    --num_epoch 100
```

## Multi-GPU Training

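For distributed training, launch one process per GPU. The example below uses a single node with 4 GPUs; set `--nproc_per_node` to your GPU count. Note that recent PyTorch releases deprecate `torch.distributed.launch` in favor of `torchrun`.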

```bash
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr="localhost" \
    --master_port=12355 \
    experiment/imagenet/main.py \
    --dataset imagenet \
    --data_path /path/to/imagenet \
    --tea_path /path/to/teacher_model.pth \
    --stu_arch preact_resnet34 \
    --tea_arch resnet34 \
    --alpha 0.1 \
    --beta 0.1 \
    --use_tmpd \
    --mask_prob 0.3 \
    --mask_lambda 0.5 \
    --use_tpd \
    --tpd_weight 0.5 \
    --tpd_temp 3.0 \
    --T 4 \
    --train_batch_size 512 \
    --val_batch_size 512 \
    --num_epoch 100
```

## Parameter Explanations

### Core Parameters
- `--T`: Number of SNN timesteps (default: 4)
- `--alpha`: Weight for the teacher-student distillation loss
- `--beta`: Weight for the teacher-label distillation loss
- `--stu_arch`: Student SNN architecture (`preact_resnet18` or `preact_resnet34`)
- `--tea_arch`: Teacher ANN architecture (`resnet18` or `resnet34`)

### TMPD Parameters
- `--use_tmpd`: Enable Temporal Masked Progressive Distillation
- `--mask_prob`: Probability of masking elements in teacher features (0.0-1.0)
- `--mask_lambda`: Mixing weight for masked teacher distillation (0.0-1.0)
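
To make `--mask_prob` concrete, here is a rough sketch of element-wise random masking with an independent mask per timestep. It is an illustration under assumptions, not the repository's implementation: `mask_features` is a hypothetical name, masked elements are assumed to be set to zero, and features are plain Python lists rather than tensors. How `--mask_lambda` mixes the masked and unmasked terms is not shown, since this README does not specify it.

```python
import random


def mask_features(features, mask_prob, timesteps, seed=None):
    """Return one independently masked copy of `features` per timestep.

    Assumption: a masked element is replaced by 0.0. Each element is
    masked with probability `mask_prob`, drawn separately per timestep.
    """
    rng = random.Random(seed)
    masked = []
    for _ in range(timesteps):
        # rng.random() is in [0, 1), so mask_prob=0.0 masks nothing
        # and mask_prob=1.0 masks everything.
        masked.append([0.0 if rng.random() < mask_prob else x for x in features])
    return masked
```

With `mask_prob=0.3` and `timesteps=4`, each of the four copies has roughly 30% of its elements zeroed on average, at different positions per timestep.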

### TPD Parameters
- `--use_tpd`: Enable Temporal Progressive Distillation
- `--tpd_weight`: Weight for temporal progressive distillation loss
- `--tpd_temp`: Temperature parameter for TPD (default: 3.0)
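
The effect of a distillation temperature can be seen with a few lines of plain Python. This sketch assumes `--tpd_temp` is used in the usual knowledge-distillation way, dividing logits by the temperature before the softmax; that usage is an assumption, and `softmax_with_temperature` is a hypothetical helper rather than a function from this repository.

```python
import math


def softmax_with_temperature(logits, temperature=3.0):
    """Softmax over logits / temperature (higher temperature = softer output)."""
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=3.0)
print("T=1.0:", [round(p, 3) for p in sharp])
print("T=3.0:", [round(p, 3) for p in soft])
```

A higher temperature spreads probability mass across classes while keeping their ranking, which is why it is commonly raised when the softened distribution itself is the training target.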

### Training Parameters
- `--lr`: Learning rate (default: 0.2)
- `--train_batch_size`: Training batch size (default: 512)
- `--val_batch_size`: Validation batch size (default: 512)
- `--num_epoch`: Number of training epochs (default: 100)
- `--wd`: Weight decay (default: 2e-5)

## Expected Output

During training, you'll see output like:
```
============================================================
Training Configuration
============================================================
✓ TMPD: Enabled
  - Use random mask areas for each timestep
  - Mask probability: 0.3
  - Mixing weight: 0.5
🎯 Training Method: Traditional method (temporal-wise comparison)
🔄 TPD: Enabled
  - Loss weight: 0.5
📊 Dataset: imagenet
⏱️ Timesteps: 4
🎓 Student Network: preact_resnet34
👨‍🏫 Teacher Network: resnet34
📈 Distillation Weight: α=0.1, β=0.1
============================================================

Epoch 001 - Loss: Hard:6.9573 | KD:1.3415→0.2683 | TPD:6.9076→3.4538 | Total:10.5794
```
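
For tracking runs, the per-epoch loss line can be parsed into numbers. The sketch below is a hypothetical helper, not part of this repository, and it relies on the sample line above being representative. It also assumes that the arrow separates a raw loss from its weighted contribution; in the sample, `TPD:6.9076→3.4538` is consistent with `tpd_weight=0.5`, but that reading is inferred from the sample rather than confirmed.

```python
import re

# Matches "Name:1.2345" optionally followed by "→0.1234".
_TERM = re.compile(r"(\w+):([\d.]+)(?:→([\d.]+))?")


def parse_loss_line(line):
    """Return {term: (first_value, value_after_arrow_or_None)}."""
    terms = {}
    for name, first, second in _TERM.findall(line):
        terms[name] = (float(first), float(second) if second else None)
    return terms


line = ("Epoch 001 - Loss: Hard:6.9573 | KD:1.3415→0.2683 | "
        "TPD:6.9076→3.4538 | Total:10.5794")
for name, (first, second) in parse_loss_line(line).items():
    print(name, first, second)
```

Dividing the second value by the first gives the effective weight applied to each term, which is a quick way to check that the flags you passed took effect.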

## Tips for Best Results

1. **Teacher Model Quality**: Use a well-trained, high-accuracy teacher model
2. **Hyperparameter Tuning**:
   - Start with `--mask_prob 0.3` and `--mask_lambda 0.5` for TMPD
   - Start with `--tpd_weight 0.5` and `--tpd_temp 3.0` for TPD
   - Balance `--alpha` and `--beta` (typically 0.1-0.5)
3. **Timesteps**: More timesteps (`--T 6` to `--T 8`) may improve accuracy but slow down training
4. **Batch Size**: Larger batch sizes tend to be more stable in distributed training
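
When tuning the TMPD settings, one option is to generate the command lines for a small grid rather than editing them by hand. This is a sketch only: `tmpd_sweep` is a hypothetical helper, the grid values are illustrative rather than recommended, and the base arguments simply repeat the flags used in the examples above.

```python
import itertools
import shlex

# Base command copied from the usage examples; paths are placeholders.
BASE = [
    "python", "experiment/imagenet/main.py",
    "--dataset", "imagenet",
    "--data_path", "/path/to/imagenet",
    "--tea_path", "/path/to/teacher_model.pth",
    "--stu_arch", "preact_resnet34",
    "--tea_arch", "resnet34",
    "--alpha", "0.1", "--beta", "0.1",
    "--T", "4", "--num_epoch", "100",
]


def tmpd_sweep(mask_probs, mask_lambdas):
    """Return one shell command per (mask_prob, mask_lambda) combination."""
    commands = []
    for p, lam in itertools.product(mask_probs, mask_lambdas):
        args = BASE + ["--use_tmpd", "--mask_prob", str(p), "--mask_lambda", str(lam)]
        commands.append(shlex.join(args))
    return commands


# Illustrative grid around the suggested starting point (0.3, 0.5).
for cmd in tmpd_sweep([0.2, 0.3, 0.4], [0.3, 0.5]):
    print(cmd)
```

Each printed line can be pasted into a shell or a job script; since a full ImageNet run is expensive, a shorter `--num_epoch` may be worth using while comparing settings.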