
This directory contains the code and resources for reproducing two main approaches to mitigating overthinking in LLMs: decode-based methods (SEAL and PROBE) and train-based methods (SFT and DPO).

## Structure

```
/
├── decode-based/
│   ├── get_act_probe.py
│   ├── get_act_seal.py
│   ├── modeling_utils/
│   ├── probe_stop.py
│   ├── probe_train.py
│   ├── compute_steering_vector.py
│   └── seal_decode.py
├── train-based/
│   ├── dpo_train.py
│   ├── sft_train.py
│   └── trl/
└── README.md
```

### Decode-based Methods
The `decode-based/` directory contains scripts and resources for implementing the SEAL and PROBE methods, which mitigate overthinking during decoding. It includes:
- `get_act_probe.py`: Extracts hidden states from the model for training the PROBE probe
- `get_act_seal.py`: Extracts hidden states for SEAL
- `modeling_utils/`: Utility modules that support the SEAL implementation
- `probe_stop.py`: Implements the confidence-based stopping criterion for PROBE
- `probe_train.py`: Trains a probe to predict when to stop decoding
- `compute_steering_vector.py`: Computes the steering vector used in SEAL
- `seal_decode.py`: Implements the SEAL decoding strategy
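
The two ideas behind these scripts can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in this directory: it assumes the steering vector is a difference of mean hidden states between two groups of examples, and that the probe is a logistic (linear plus sigmoid) classifier whose output is compared with a threshold. The function names, the `alpha` scale, and the `threshold` value are all hypothetical.

```python
import math
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(keep_states, suppress_states) -> List[float]:
    """Assumed recipe: mean of the states to encourage minus mean of the states to suppress."""
    keep_mean = mean_vector(keep_states)
    suppress_mean = mean_vector(suppress_states)
    return [k - s for k, s in zip(keep_mean, suppress_mean)]


def steer(hidden: Sequence[float], vector: Sequence[float], alpha: float = 1.0) -> List[float]:
    """Shift one hidden state along the steering vector, scaled by alpha (hypothetical knob)."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


def probe_confidence(hidden: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic probe: sigmoid of a linear function of the hidden state."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def should_stop(hidden, weights, bias, threshold: float = 0.9) -> bool:
    """Confidence-based stopping: stop once the probe's confidence reaches the threshold."""
    return probe_confidence(hidden, weights, bias) >= threshold


if __name__ == "__main__":
    # Toy 2-dimensional hidden states, made up for illustration.
    vec = steering_vector([[1.0, 0.0], [3.0, 2.0]], [[0.0, 0.0], [2.0, 0.0]])
    steered = steer([0.5, 0.5], vec, alpha=2.0)
    print(vec, steered, should_stop(steered, [1.0, 1.0], 0.0))
```

In a real decoding loop the hidden states would come from a chosen model layer and the probe weights from training. Which layer and which token positions are used is decided by the scripts above and is not reproduced here.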

### Train-based Methods
The `train-based/` directory contains scripts and resources for training models using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). It includes:
- `dpo_train.py`: Script to train models using DPO
- `sft_train.py`: Script to train models using SFT
- `trl/`: Contains supporting modules and configuration files for using the Transformer Reinforcement Learning (TRL) library
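
For orientation, the two training objectives can be written out on plain numbers. This is a sketch of the standard SFT and DPO losses, not the TRL implementation used by the scripts above. The function names and the default `beta` are illustrative, and the inputs are assumed to be log-probabilities that have already been computed for each response.

```python
import math
from typing import Sequence


def sft_loss(token_logprobs: Sequence[float]) -> float:
    """SFT objective: mean negative log-likelihood of the target tokens."""
    return -sum(token_logprobs) / len(token_logprobs)


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # illustrative default, not taken from this repository
) -> float:
    """DPO objective for one preference pair: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = -beta * margin
    # Numerically stable softplus: log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # Made-up log-probabilities; the policy prefers the chosen response more than the reference does.
    print(sft_loss([-1.0, -3.0]))
    print(dpo_loss(-10.0, -20.0, -12.0, -18.0))
```

When the policy and the reference agree, the margin is zero and the DPO loss equals `log(2)`. The loss falls as the policy raises the chosen response relative to the rejected one. For mitigating overthinking, one would presumably pair a concise response (chosen) with a longer one (rejected), though how the pairs are built here is an assumption.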