# ANSWER-CONSISTENT CHAIN-OF-THOUGHT REINFORCEMENT LEARNING FOR MULTI-MODAL LARGE LANGUAGE MODELS (ACRE)

This repository contains the source code for ACRE, a reinforcement learning framework designed to improve the video reasoning capabilities of Multimodal Large Language Models (MLLMs).

## 🚀 What is ACRE?

ACRE is a training technique that makes a model's answers robust to the order of multiple-choice options by:

1. **Generating responses with original multiple-choice options**
2. **Extracting the reasoning process (`<think>` part)**
3. **Re-generating responses with shuffled options using the same reasoning**
4. **Rewarding consistency between the two responses**

This approach encourages the model to ground its answer in the reasoning itself rather than in memorized option positions, which leads to predictions that stay stable when the options are reordered.
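
The option shuffling in step 3 can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `shuffle_options` and the `seed` parameter are hypothetical, and the sketch assumes that option texts are unique so the correct answer can be tracked by its text.

```python
import random


def shuffle_options(options, answer_letter, seed=None):
    """Shuffle option texts and return (shuffled_options, new_answer_letter).

    `options` lists the option texts in A, B, C, ... order.
    `answer_letter` is the letter of the correct option before shuffling.
    Assumption: option texts are unique.
    """
    letters = [chr(ord("A") + i) for i in range(len(options))]
    correct_text = options[letters.index(answer_letter)]

    rng = random.Random(seed)
    shuffled = options[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)

    # The correct answer is identified by its text, so look up its new letter.
    new_answer = letters[shuffled.index(correct_text)]
    return shuffled, new_answer


options = ["A dog", "A cat", "A bird", "A fish"]
shuffled, new_answer = shuffle_options(options, "B", seed=0)
for letter, text in zip("ABCD", shuffled):
    print(f"{letter}. {text}")
print("Correct answer is now:", new_answer)
```

Tracking the answer by text rather than by letter is what allows the two responses to be compared after the letters have been reassigned.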

## 🏗️ Architecture

### Key Components

- **GRPO Training Pipeline**: Modified Group Relative Policy Optimization for video understanding
- **Dual Reasoning Reward Function**: Evaluates consistency between original and shuffled option responses
- **KL Consistency Check**: Optional regularization that penalizes changes in the answer distribution between the original and shuffled option orders
- **Temporal Modeling**: Support for both standard GRPO and T-GRPO (Temporal GRPO)

### Reward System

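The dual reasoning reward depends on whether the original and shuffled responses agree and on whether the answer is correct. The values below match, in order, the default `--dual_reasoning_reward_list 1.0 0.7 -0.5 -1.0`: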

- **Perfect Consistency + Correct**: 1.0 reward
- **Perfect Consistency + Incorrect**: 0.7 reward
- **Inconsistent + Original Correct**: -0.5 reward
- **Inconsistent + Both Wrong**: -1.0 reward
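
The reward table can be read as a small decision function. The sketch below is illustrative only: the name `dual_reasoning_reward` and its signature are hypothetical, and answers are compared by option text so that they remain comparable after shuffling. The list above does not say what happens when only the shuffled answer is correct; the sketch assumes that case falls into the last bucket.

```python
def dual_reasoning_reward(original_choice, shuffled_choice, correct_choice,
                          rewards=(1.0, 0.7, -0.5, -1.0)):
    """Map a pair of answers to one of the four reward values.

    All three arguments are option texts (not letters), so the same option
    compares equal regardless of where it landed after shuffling.
    """
    consistent = original_choice == shuffled_choice
    original_correct = original_choice == correct_choice

    if consistent:
        # Same option picked both times: full or partial reward.
        return rewards[0] if original_correct else rewards[1]
    if original_correct:
        # The answer changed even though the first one was right.
        return rewards[2]
    # Assumption: covers "both wrong" and the unspecified case where only
    # the shuffled answer is correct.
    return rewards[3]


print(dual_reasoning_reward("A cat", "A cat", "A cat"))
print(dual_reasoning_reward("A dog", "A dog", "A cat"))
print(dual_reasoning_reward("A cat", "A dog", "A cat"))
print(dual_reasoning_reward("A dog", "A bird", "A cat"))
```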

## 📁 Project Structure

```
ACRE/
├── config/
│   └── dual_reasoning_config.py    # Configuration settings
├── data/                           # Training datasets
├── models/                         # Pre-trained models
├── logs/                          # Training logs and checkpoints
├── scripts/
│   └── run_dual_reasoning.sh      # Training script
├── src/
│   ├── qwen-vl-utils/            # Video processing utilities
│   └── r1-v/
│       └── src/open_r1/
│           ├── grpo.py           # Main training script
│           └── trainer/
│               └── grpo_trainer.py  # Custom GRPO trainer
└── README.md
```

## 🛠️ Setup

### Prerequisites

- Python 3.11+
- CUDA-compatible GPUs (minimum 4x H20 or 5x A100)
- PyTorch with CUDA support

### Installation

1. **Clone the repository:**
```bash
git clone <your-repo-url>
cd ACRE
```

2. **Create a conda environment:**
```bash
conda create -n ACRE python=3.11
conda activate ACRE
```

3. **Install dependencies:**
```bash
bash setup.sh

# Install qwen video utilities
cd src/qwen-vl-utils
pip install -e ".[decord]"   # quoted so that shells such as zsh do not expand the brackets
cd ../..
```

4. **Download required models and data:**
```bash
# Place your pre-trained model in:
mkdir -p models/
# Download Qwen2.5-VL-7B-COT-SFT to models/

# Place your training dataset in:
mkdir -p data/
# Add your training dataset JSON file to data/
```

## 📊 Data Format

Your training dataset should be in JSON format with the following structure:

```json
[
  {
    "id": "example_001",
    "video": "path/to/video.mp4",
    "conversations": [
      {
        "from": "human",
        "value": "<video>\nQuestion with multiple choice options\nA. Option 1\nB. Option 2\nC. Option 3\nD. Option 4"
      },
      {
        "from": "gpt",
        "value": "<think>\nReasoning process here\n</think>\n\nThe answer is A."
      }
    ]
  }
]
```
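
Before launching a run, it can help to check that entries follow this structure. The following is a minimal sketch, not part of the repository: `check_entry` and both regular expressions are hypothetical, and they assume that options appear one per line as `A. text` and that the reference response ends with `The answer is X.` as in the example above.

```python
import json
import re

# Assumption: each option sits on its own line in the form "A. text".
OPTION_RE = re.compile(r"^([A-Z])\.[ \t]*(.+)$", re.MULTILINE)
# Assumption: the reference response states the answer as "answer is X".
ANSWER_RE = re.compile(r"answer is ([A-Z])\b")


def check_entry(entry):
    """Return (options, answer_letter) for one entry or raise ValueError."""
    for key in ("id", "video", "conversations"):
        if key not in entry:
            raise ValueError(f"missing key: {key}")

    turns = {turn["from"]: turn["value"] for turn in entry["conversations"]}
    if "human" not in turns or "gpt" not in turns:
        raise ValueError("expected one 'human' and one 'gpt' turn")
    if "<video>" not in turns["human"]:
        raise ValueError("human turn lacks the <video> placeholder")

    options = dict(OPTION_RE.findall(turns["human"]))
    if len(options) < 2:
        raise ValueError("fewer than two options found")

    match = ANSWER_RE.search(turns["gpt"])
    if match is None or match.group(1) not in options:
        raise ValueError("answer letter missing or not among the options")
    return options, match.group(1)


# A raw string keeps the JSON "\n" escapes intact until json.loads decodes them.
dataset = json.loads(r'''
[{"id": "example_001", "video": "path/to/video.mp4", "conversations": [
  {"from": "human", "value": "<video>\nQuestion\nA. Option 1\nB. Option 2\nC. Option 3\nD. Option 4"},
  {"from": "gpt", "value": "<think>\nReasoning process here\n</think>\n\nThe answer is A."}
]}]
''')
options, answer = check_entry(dataset[0])
print(options, answer)
```

For a real dataset, the same check can be applied to every entry returned by `json.load` on the file referenced by `DATASET_PATH`.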

## 🚀 Usage

### Basic Training

1. **Configure your setup:**
```bash
export MODEL_PATH="/path/to/your/Qwen2.5-VL-7B-COT-SFT"
export DATASET_PATH="/path/to/your/dataset.json"
export NUM_GPUS=8
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
```

2. **Run dual reasoning training:**
```bash
bash scripts/run_dual_reasoning.sh
```

### Advanced Configuration

Edit `config/dual_reasoning_config.py` to customize:

- **Reward values**: Modify `dual_reasoning_reward_list`
- **KL consistency**: Enable `use_kl_check` and adjust `kl_lambda`
- **Training hyperparameters**: Batch size, learning rate, etc.
- **GPU settings**: Number of GPUs and CUDA devices

### Resume Training

To resume from a checkpoint:

```bash
# Add to your training script:
--resume_from_checkpoint "./logs/your-checkpoint-directory/checkpoint-N"
```
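
If a run directory holds several checkpoints, the most recent one can be located with a small helper. This is a sketch under the assumption that checkpoints are stored as `checkpoint-N` subdirectories, as in the path above; the helper name `latest_checkpoint` is hypothetical.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(run_dir):
    """Return the checkpoint-N subdirectory with the largest N, or None."""
    best, best_step = None, -1
    for path in Path(run_dir).glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        # Compare step numbers numerically so checkpoint-1000 beats checkpoint-400.
        if match and path.is_dir() and int(match.group(1)) > best_step:
            best, best_step = path, int(match.group(1))
    return best


# Demo on a throwaway directory that mimics a run folder under logs/.
with tempfile.TemporaryDirectory() as tmp:
    for step in (200, 400, 1000):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    found = latest_checkpoint(tmp)
    print(found.name)  # checkpoint-1000
```

The returned path can then be passed to `--resume_from_checkpoint`.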

## 📈 Key Parameters

### Dual Reasoning Specific

- `--dual_reasoning true`: Enables the dual reasoning algorithm
- `--dual_reasoning_reward_list 1.0 0.7 -0.5 -1.0`: Reward values, in order, for consistent + correct, consistent + incorrect, inconsistent + original correct, and inconsistent + both wrong
- `--use_kl_check false`: Toggles KL consistency regularization (disabled in this example; set to `true` to enable it)
- `--kl_lambda 0.3`: Weight `λ` of the KL consistency loss
- `--reward_funcs dual_reasoning format`: Reward functions to use

### Training Configuration

- `--temporal false`: Uses standard GRPO (set to `true` for T-GRPO)
- `--len_control true`: Enables the length control reward
- `--num_generations 8`: Number of responses sampled per prompt (the GRPO group size)
- `--beta 0.04`: GRPO `beta`, the coefficient of the KL penalty against the reference model
- `--max_grad_norm 5`: Maximum gradient norm used for gradient clipping
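
To show how `--num_generations` interacts with the rewards, the sketch below computes group-relative advantages in the standard GRPO way: each reward is normalized by the mean and standard deviation of its group. It is an illustration only. The function name and the `eps` value are assumptions, and the trainer in this repository may compute the statistics differently.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, eight sampled responses (matching --num_generations 8),
# scored with the four dual reasoning reward values.
rewards = [1.0, 1.0, 0.7, 0.7, -0.5, -0.5, -1.0, -1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Because advantages are relative to the group, a group in which all responses receive the same reward contributes no learning signal.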

## 🔍 Monitoring

### Debug Mode

Set `DEBUG_MODE` to write model rollouts to the file given by `LOG_PATH`:

```bash
export DEBUG_MODE="true"
export LOG_PATH="./logs/debug_dual_reasoning.txt"
```

### Logs and Checkpoints

- **Training logs**: `logs/{model}-DualReasoning_{dataset}_bf16_frame_16/`
- **Checkpoints**: Saved every 200 steps
- **Debug logs**: `logs/debug_dual_reasoning.txt`

## 🧪 Algorithm Details

### Dual Reasoning Process

1. **Original Generation**: The model generates a response with the original option order
2. **Reasoning Extraction**: The `<think>` section is extracted from that response
3. **Shuffled Generation**: The model generates a new response from the shuffled options plus the extracted reasoning
4. **Consistency Evaluation**: The final answers of the original and shuffled responses are compared
5. **Reward Assignment**: Rewards are assigned based on consistency and correctness
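
Steps 2 and 3 can be sketched with a few string helpers. None of these names come from the repository: `extract_reasoning`, `extract_answer`, and `build_shuffled_prompt` are hypothetical, the answer pattern assumes the `The answer is X.` format from the data example, and prefilling the prompt with the `<think>` block is one possible way to reuse the reasoning.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
# Assumption: the final answer is phrased as "answer is X".
ANSWER_RE = re.compile(r"answer is\s*([A-Z])\b")


def extract_reasoning(response):
    """Return the text inside <think>...</think>, or None if it is absent."""
    match = THINK_RE.search(response)
    return match.group(1).strip() if match else None


def extract_answer(response):
    """Return the answer letter stated after the reasoning, or None."""
    # Only look after </think> so letters inside the reasoning are ignored.
    tail = response.split("</think>")[-1]
    match = ANSWER_RE.search(tail)
    return match.group(1) if match else None


def build_shuffled_prompt(question, shuffled_options, reasoning):
    """Re-letter the shuffled options and prefill the extracted reasoning."""
    lines = [question]
    lines += [f"{chr(ord('A') + i)}. {text}" for i, text in enumerate(shuffled_options)]
    prompt = "\n".join(lines)
    return f"{prompt}\n<think>\n{reasoning}\n</think>\n"


response = "<think>\nThe animal barks, so it is a dog.\n</think>\n\nThe answer is A."
reasoning = extract_reasoning(response)
answer = extract_answer(response)
prompt = build_shuffled_prompt("What animal is shown?", ["A cat", "A dog"], reasoning)
print(answer)
print(prompt)
```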

### KL Consistency Check (Optional)

When enabled, this check adds a regularization term that penalizes large changes in the answer distribution:

```
L_total = L_GRPO + λ * L_consistency
```

Here `L_consistency` is the KL divergence between the answer distributions under the original and shuffled option orders, and `λ` is set by `--kl_lambda`.
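
A minimal numeric sketch of this term follows. It makes several assumptions that the text above does not confirm: the two distributions are aligned by option content before comparison, the divergence is taken as KL(original || shuffled), and all function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions over the same options."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def align_to_original(shuffled_probs, original_options, shuffled_options):
    """Reorder shuffled probabilities so index i refers to original_options[i]."""
    return [shuffled_probs[shuffled_options.index(opt)] for opt in original_options]


def total_loss(grpo_loss, p_original, p_shuffled_aligned, kl_lambda=0.3):
    """L_total = L_GRPO + lambda * L_consistency."""
    return grpo_loss + kl_lambda * kl_divergence(p_original, p_shuffled_aligned)


original_options = ["dog", "cat", "bird", "fish"]
shuffled_options = ["bird", "dog", "fish", "cat"]
p_original = [0.7, 0.1, 0.1, 0.1]   # probabilities in original option order
p_shuffled = [0.1, 0.6, 0.1, 0.2]   # probabilities in shuffled option order

aligned = align_to_original(p_shuffled, original_options, shuffled_options)
print(aligned)
print(total_loss(0.5, p_original, aligned))
```

Aligning by content matters because the same letter refers to different options in the two orderings, so comparing the raw letter distributions would penalize a perfectly consistent model.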

## 📋 Requirements

### Hardware

- **Minimum**: 4x H20 GPUs (96GB) or 5x A100 GPUs (80GB)
- **Recommended**: 8x GPUs for faster training
- **Memory**: ~24GB VRAM per GPU with DeepSpeed ZeRO-3

### Software

- PyTorch >= 2.0
- transformers (custom version included)
- deepspeed >= 0.12.0
- trl == 0.16.0
- flash-attn >= 2.3.0

## 🎯 Expected Results

Dual reasoning training is intended to improve:

- **Consistency**: More stable answers across different option orderings
- **Reasoning Quality**: Better chain-of-thought reasoning
- **Robustness**: Less sensitive to option position bias
- **Accuracy**: Improved performance on video reasoning benchmarks

## 🐛 Troubleshooting

### Common Issues

1. **CUDA Out of Memory**: Reduce `per_device_train_batch_size` or enable gradient checkpointing
2. **Model Path Not Found**: Ensure `MODEL_PATH` points to a valid Qwen2.5-VL model
3. **Dataset Format Error**: Verify JSON format matches expected structure
4. **DeepSpeed Issues**: Check DeepSpeed configuration and GPU compatibility

### Debug Tips

- Enable `DEBUG_MODE` to see detailed training logs
- Check GPU memory usage with `nvidia-smi`
- Verify dataset loading with a small subset first
- Monitor training loss curves for convergence

## 📄 License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

## 🔗 Related Work

- [Video-R1 Original Paper](https://arxiv.org/pdf/2503.21776)
- [Qwen2.5-VL Model](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- [TRL Library](https://github.com/huggingface/trl)

---

For questions or support, please open an issue or contact [your-contact-info].