# DR<sup>2</sup>Seg: Decomposed Two-Stage Rollouts for Efficient Reasoning Segmentation in Multimodal Large Language Models

This repo is the official implementation of **"DR<sup>2</sup>Seg: Decomposed Two-Stage Rollouts for Efficient Reasoning Segmentation in Multimodal Large Language Models"**.  

Overview of DR<sup>2</sup>Seg:

<div align=center>
<img width="60%" src="assets/overview.png"/>
</div>

DR<sup>2</sup>Seg has the following features:
1. DR<sup>2</sup>Seg is a simple yet effective self-reward framework that enhances both efficiency and segmentation accuracy using only the model’s intrinsic capability, without requiring extra MLLMs or supervision.
2. DR<sup>2</sup>Seg designs a two-stage rollout strategy that decouples multimodal reasoning and perception in MLLM for reasoning segmentation, combined with a length-based self-reward to reduce redundant reasoning.
3. Extensive experiments validate the effectiveness and generalization of DR<sup>2</sup>Seg across MLLMs of varying scales and segmentation models, offering valuable insights into efficient reasoning perception.

**Code Highlights**:
1. This code is built on [EasyR1](https://github.com/hiyouga/EasyR1), [veRL](https://github.com/volcengine/verl), and [SegZero](https://github.com/JIA-Lab-research/Seg-Zero); it supports model splitting during sampling and is more GPU-memory friendly.
2. It supports both the Qwen2-VL and Qwen2.5-VL series of MLLMs, and both the [SAM2](https://github.com/facebookresearch/sam2) and [SAM3](https://github.com/facebookresearch/sam3) segmentation models.


## Contents
- [Model](#model)
- [Examples](#examples)
- [Installation](#installation)
- [Inference](#inference)
- [Evaluation](#evaluation)
- [Training](#training)
- [Build Your Data](#build-your-own-training-data-optional)
- [Acknowledgement](#acknowledgement)



## Model
<div align=center>
<img width="98%" src="assets/pipeline.png"/>
</div>

DR<sup>2</sup>Seg performs a two-stage rollout. In the first pass, the model takes an image-query pair and produces a structured output comprising a CoT, a description, and an answer. In the second pass, the model is re-prompted with the image and the generated description, which replaces the original query.

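The two passes described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the tag names (`<think>`, `<description>`, `<answer>`), the helper names `parse_first_pass` and `build_second_pass_query`, the prompt template, and the sample output are all assumptions made for the example.

```python
import re
from typing import Dict, Optional

# Hypothetical tag names; the actual output format used by the repo may differ.
_TAGS = ("think", "description", "answer")


def parse_first_pass(output: str) -> Dict[str, Optional[str]]:
    """Extract the CoT, description, and answer fields from a first-pass output."""
    fields: Dict[str, Optional[str]] = {}
    for tag in _TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", output, flags=re.DOTALL)
        fields[tag] = match.group(1).strip() if match else None
    return fields


def build_second_pass_query(description: str) -> str:
    """Build the second-pass text prompt: the description replaces the original query."""
    # Hypothetical template, for illustration only.
    return f"Please find the object described as: {description}"


# Made-up first-pass output for a query such as "What keeps the rain off?"
first_pass_output = (
    "<think>The question asks what keeps the rain off; that is the umbrella.</think>"
    "<description>the red umbrella held by the woman</description>"
    "<answer>[10, 20, 110, 220]</answer>"
)

fields = parse_first_pass(first_pass_output)
second_query = build_second_pass_query(fields["description"])
print(second_query)
```

In the second pass, `second_query` would be sent to the model together with the same image, so the model grounds the short description rather than re-reasoning over the original query.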

## Examples

<div align=center>
<img width="98%" src="assets/examples.png"/>
</div>


## Installation

```bash
conda create -n dr2seg python=3.12
conda activate dr2seg
pip install torch==2.6.0 torchvision==0.21.0
pip install -e .
```


## Inference
Download the pretrained model with the following commands:
```bash
mkdir models
cd models
git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct

```

> [!TIP]
> If you encounter issues with connecting to Hugging Face, consider using `export HF_ENDPOINT=https://hf-mirror.com`.

Then run inference using: 
```bash
python inference_scripts/infer_multi_object.py
```

The resulting mask will appear in the **inference_scripts** folder. 


You can also provide your own image path and text:
```bash
python inference_scripts/infer_multi_object.py --image_path "your_image_path" --text "your question text"
```

## Evaluation  

Evaluation Data: [🤗 ReasonSeg-Test](https://huggingface.co/datasets/Ricky06662/ReasonSeg_test)  [🤗 ReasonSeg-Val](https://huggingface.co/datasets/Ricky06662/ReasonSeg_val)   

```bash
bash evaluation_scripts/eval_reasonseg_dr2seg.sh
```  

## Training

### 1. GRPO Training  

> [!NOTE]
> The recommended setup for training the 7B model is a server with 4x80G GPUs or 8x46G GPUs.   

Training Data:  [🤗 MultiObject-7K](https://huggingface.co/datasets/Ricky06662/VisionReasoner_multi_object_7k_840)   
Download the dataset with this script: 
```bash
python training_scripts/download_dataset.py
```

> [!TIP]
> If you have less GPU memory, try resizing the images and recalculating the corresponding bbox/point coordinates. Remember to change the corresponding `resize_size` in evaluation and inference as well.    

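When images are resized, box and point coordinates need to be scaled by the same factors. The sketch below shows one way to do that; the function name `rescale_annotations`, the `(x1, y1, x2, y2)` box layout, and the `(width, height)` size convention are assumptions for illustration and may not match the dataset's actual schema.

```python
from typing import List, Tuple


def rescale_annotations(
    bbox: Tuple[float, float, float, float],
    points: List[Tuple[float, float]],
    orig_size: Tuple[int, int],
    new_size: Tuple[int, int],
) -> Tuple[Tuple[int, int, int, int], List[Tuple[int, int]]]:
    """Map an (x1, y1, x2, y2) box and (x, y) points from orig_size to new_size.

    Sizes are (width, height). Coordinates are rounded to the nearest pixel.
    """
    orig_w, orig_h = orig_size
    new_w, new_h = new_size
    # Independent scale factors per axis, since the aspect ratio may change.
    sx, sy = new_w / orig_w, new_h / orig_h
    x1, y1, x2, y2 = bbox
    new_bbox = (round(x1 * sx), round(y1 * sy), round(x2 * sx), round(y2 * sy))
    new_points = [(round(x * sx), round(y * sy)) for x, y in points]
    return new_bbox, new_points


# Example: a 1680x840 image resized to 840x840 (sizes chosen for illustration).
bbox, points = rescale_annotations(
    bbox=(200, 100, 600, 500),
    points=[(400, 300)],
    orig_size=(1680, 840),
    new_size=(840, 840),
)
print(bbox, points)  # (100, 100, 300, 500) [(200, 300)]
```

The image itself would be resized to the same `new_size` (for example with Pillow's `Image.resize`), so that pixels and annotations stay aligned.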
Download the pretrained model with the following commands:
```bash
mkdir pretrained_models
cd pretrained_models
git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
```

Start training using this script:
```bash
bash training_scripts/run_dr2seg_7b.sh
```  

If you have more GPU memory, you can try increasing the following hyper-parameters.
```bash
worker.actor.micro_batch_size_per_device_for_update=1 or 2 or 4 or 8 or 16 \
worker.actor.micro_batch_size_per_device_for_experience=1 or 2 or 4 or 8 or 16 \
```
If your GPU has less memory, you can change the following settings. The right values depend on your GPU memory.
```bash
worker.rollout.tensor_parallel_size=[your number between 1-4]
worker.rollout.gpu_memory_utilization=[your number between 0-1]
worker.rollout.n=[your number between 2-32]
```

### 2. Merge Checkpoint in Hugging Face Format

```bash
python3 training_scripts/model_merger.py --local_dir [path_to_your_actor_checkpoint]
```


## Build Your Own Training Data (Optional)
Please refer to our training data preparation [tutorial](prepare_dataset/training_data_prepare_toturial.ipynb).


## Acknowledgement
We would like to thank the following repos for their great work: 

- This work is built upon [EasyR1](https://github.com/hiyouga/EasyR1), [veRL](https://github.com/volcengine/verl), and [SegZero](https://github.com/JIA-Lab-research/Seg-Zero).
- This work uses models from [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct), [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct), [SAM2](https://huggingface.co/facebook/sam2-hiera-large), and [SAM3](https://github.com/facebookresearch/sam3).
