# CW-DPO: Cooling-Weighted Direct Preference Optimization

[Paper (under review at ICLR 2026)](./ICLR_STF_DPO_ICLR__进度100_.pdf)

## 1. Environment Setup

1. Prepare at least one NVIDIA GPU with CUDA support.
2. Install Python (>=3.9) and a PyTorch build that matches your CUDA version.
3. Install dependencies:

```bash
pip install transformers accelerate datasets peft qwen-vl-utils pandas
```

(Optional: For visualization and logging, install `wandb` or `swanlab`.)
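
Before training, it can help to confirm that the interpreter and packages listed above are in place. The following is a minimal sketch, not part of the repository: the helper names `missing_packages` and `python_ok` are hypothetical, and the module list assumes the usual import names (for example, `qwen-vl-utils` is imported as `qwen_vl_utils`).

```python
import importlib.util
import sys

# Import names for the packages from the pip command above, plus torch.
# The qwen-vl-utils distribution is imported as qwen_vl_utils.
REQUIRED = ["torch", "transformers", "accelerate", "datasets", "peft",
            "qwen_vl_utils", "pandas"]


def missing_packages(names):
    """Return the top-level module names that cannot be located."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def python_ok(min_version=(3, 9)):
    """True if the running interpreter meets the minimum version."""
    return sys.version_info >= min_version


if __name__ == "__main__":
    print("Python >= 3.9:", python_ok())
    missing = missing_packages(REQUIRED)
    print("Missing packages:", missing or "none")
    if not missing:
        # Only import torch once we know it is installed.
        import torch
        print("CUDA available:", torch.cuda.is_available())
        print("Visible GPUs:", torch.cuda.device_count())
```

If `CUDA available` prints `False`, the installed PyTorch build most likely does not match the local CUDA driver.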

---

## 2. Data Preparation

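We use the **COCO Captions** dataset: the `train2017` images and the JSON annotations, split into one part for SFT and one for DPO.
Note that the example commands in Section 3 pass `coco/annotations/coco_train/...` to `--gentle_json` and `coco/train2017/train2017` to `--img_root`; adjust those arguments to match where your files actually live.
Arrange the data as follows before training: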

```
coco/
 ├── train2017/
 └── annotations/
       ├── coco_train.json.part1_18k.json   # for SFT
       └── coco_train.json.part2_10k.json   # for DPO
```
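
The layout above can be checked with a few lines of Python before launching a run. This is a sketch under the assumption that the tree is taken literally; `missing_paths` is a hypothetical helper, and the paths in `EXPECTED` should be edited if your annotations sit in a subfolder.

```python
from pathlib import Path

# Paths copied from the directory tree above, relative to the coco/ root.
EXPECTED = [
    "train2017",
    "annotations/coco_train.json.part1_18k.json",  # SFT split
    "annotations/coco_train.json.part2_10k.json",  # DPO split
]


def missing_paths(root):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths("coco")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```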

---

## 3. Training Pipeline

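CW-DPO follows a **two-stage training pipeline**: supervised fine-tuning on the first annotation split, then CW-DPO fine-tuning on the second split, with the Stage One checkpoint passed as the reference model.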

### (1) Stage One: Supervised Fine-Tuning (SFT)

Example (multi-GPU):

```bash
#!/usr/bin/env bash
set -euo pipefail

GPUS=4
DEVICES=1,2,3,4   # zero-based GPU indices; this selection leaves GPU 0 free
MASTER_PORT=29500

export CUDA_VISIBLE_DEVICES="${DEVICES}"

# `Qwen2___5` looks like the directory name ModelScope uses for a local copy of
# Qwen2.5; point --pretrained_model at your own copy of Qwen2.5-VL-7B-Instruct.
torchrun --standalone --nproc_per_node="${GPUS}" --master_port="${MASTER_PORT}" \
  Qwen2.5-VL-Finetune/train_SFT_now.py \
  --pretrained_model Qwen/Qwen2___5-VL-7B-Instruct \
  --gentle_json coco/annotations/coco_train/coco_train.json.part1_18k.json \
  --img_root coco/train2017/train2017 \
  --output_dir Coco_model/SFT/SFT \
  --batch_size 4 \
  --grad_accum 8 \
  --epochs 1 \
  --lr 1e-4 \
  --neg_weight 0.4 \
  --ld_every 20 \
  --ld_size 6
```
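
When changing the GPU count, it is useful to know how many samples contribute to each optimizer step. The sketch below assumes that `--batch_size` is a per-process batch size and that gradients are averaged across processes in the usual data-parallel way; the training scripts may define these differently, and `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(per_device_batch, grad_accum, num_gpus):
    """Samples per optimizer step under plain data-parallel training."""
    return per_device_batch * grad_accum * num_gpus


if __name__ == "__main__":
    # Stage One example above: --batch_size 4, --grad_accum 8, 4 GPUs.
    print("SFT:", effective_batch_size(4, 8, 4))
    # Stage Two example: BATCH_SIZE=2, GRAD_ACCUM=16, 1 GPU.
    print("CW-DPO:", effective_batch_size(2, 16, 1))
```

Under these assumptions, halving the number of GPUs while doubling `--grad_accum` keeps the effective batch size unchanged.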

---

### (2) Stage Two: CW-DPO Fine-Tuning
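
Stage Two takes a Stage One checkpoint as its reference model; the script below hard-codes `checkpoint-720`. If your SFT run ends at a different step, a helper like the following can locate the most recent checkpoint. It is a sketch that assumes checkpoints are saved as `checkpoint-<step>` subdirectories of the output directory, as the example path suggests; `latest_checkpoint` is a hypothetical name.

```python
import re
from pathlib import Path

_CKPT_RE = re.compile(r"checkpoint-(\d+)")


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    output_dir = Path(output_dir)
    if not output_dir.is_dir():
        return None
    best_step, best_path = -1, None
    for path in output_dir.iterdir():
        match = _CKPT_RE.fullmatch(path.name)
        if path.is_dir() and match:
            step = int(match.group(1))  # compare numerically, not as text
            if step > best_step:
                best_step, best_path = step, path
    return best_path


if __name__ == "__main__":
    print(latest_checkpoint("Coco_model/SFT/SFT"))
```

The printed path can then be used as the value of `REF_MODEL`.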

Example (single-GPU):

```bash
#!/usr/bin/env bash
set -euo pipefail

SCRIPT="VLM_iclr/save/train_DPO_now.py"

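GPUS=1
DEVICES=4            # zero-based index of the GPU to use

# Base model: same path as in Stage One.
PRETRAINED="Qwen/Qwen2___5-VL-7B-Instruct"
# Reference model: a checkpoint produced by Stage One (SFT).
REF_MODEL="Coco_model/SFT/SFT/checkpoint-720"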
GENTLE_JSON="coco/annotations/coco_train/coco_train.json.part2_10k.json"
IMG_ROOT="coco/train2017/train2017"
OUT_DIR="Coco_model/DPO_LR/DPO_COCO"

BATCH_SIZE=2
GRAD_ACCUM=16
EPOCHS=1
LR=5e-5
BETA=0.1
NEG_MODE="mixed"     # dataset | onpolicy | mixed
ONPOLICY_PROB=0.5
COOL_FLOOR=-10.0
COOL_TAU=2.0
RESIZE_H=280
RESIZE_W=280
MAX_LEN=8192

export CUDA_VISIBLE_DEVICES="${DEVICES}"
export OMP_NUM_THREADS=1
export TOKENIZERS_PARALLELISM=false
export WANDB_DISABLED=true

mkdir -p "${OUT_DIR}"

COMMON_ARGS=(
  --pretrained_model "${PRETRAINED}"
  --ref_model "${REF_MODEL}"
  --gentle_json "${GENTLE_JSON}"
  --img_root "${IMG_ROOT}"
  --output_dir "${OUT_DIR}"
  --batch_size "${BATCH_SIZE}"
  --grad_accum "${GRAD_ACCUM}"
  --epochs "${EPOCHS}"
  --lr "${LR}"
  --beta "${BETA}"
  --neg_mode "${NEG_MODE}"
  --onpolicy_prob "${ONPOLICY_PROB}"
  --cooling_floor_logp "${COOL_FLOOR}"
  --cooling_tau "${COOL_TAU}"
  --resize_h "${RESIZE_H}"
  --resize_w "${RESIZE_W}"
  --max_len "${MAX_LEN}"
)

if [[ "${GPUS}" -gt 1 ]]; then
  torchrun --nproc_per_node="${GPUS}" "${SCRIPT}" "${COMMON_ARGS[@]}"
else
  python "${SCRIPT}" "${COMMON_ARGS[@]}"
fi
```
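
For orientation, the role of `BETA` can be illustrated with the standard per-pair DPO loss, written here in plain Python on scalar sequence log-probabilities. This README does not spell out the cooling weight, so `cooling_weight` below is purely a hypothetical reading of the `--cooling_floor_logp` and `--cooling_tau` flag names (a sigmoid of the negative's average token log-probability), and applying it to the rejected log-ratio is likewise an assumption. Consult the paper and `train_DPO_now.py` for the actual objective.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def neg_log_sigmoid(x):
    """-log(sigmoid(x)), computed without overflow."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def cooling_weight(avg_token_logp, floor=-10.0, tau=2.0):
    # HYPOTHETICAL form, not taken from the paper: negatives whose average
    # token log-probability is far below `floor` get a weight near 0.
    return sigmoid((avg_token_logp - floor) / tau)


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
             beta=0.1, rejected_weight=1.0):
    """Per-pair DPO loss; rejected_weight=1.0 gives the standard form."""
    chosen_ratio = pol_chosen - ref_chosen        # log pi/pi_ref, chosen
    rejected_ratio = pol_rejected - ref_rejected  # log pi/pi_ref, rejected
    margin = beta * (chosen_ratio - rejected_weight * rejected_ratio)
    return neg_log_sigmoid(margin)


if __name__ == "__main__":
    w = cooling_weight(avg_token_logp=-3.0)
    print("weight:", round(w, 3))
    print("loss:", round(dpo_loss(-5.0, -20.0, -6.0, -15.0,
                                  rejected_weight=w), 4))
```

With `rejected_weight=1.0` and a policy identical to the reference, the loss equals `log 2`; a larger `beta` penalizes a given log-ratio gap more sharply.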

---

## 4. Inference

Run `test.py` with `--model_path` set to the Stage Two output directory:

```bash
python test.py --model_path Coco_model/DPO_LR/DPO_COCO
```

