## Model reasoning generation

Before proceeding, follow the instructions [here](../eval/README.md) to install the required packages and run the evaluation script, and [here](../data_factory/README.md) to prepare the `AiR-D` dataset.

The following commands load the weights for each LVLM and generate model reasoning for each image with bounding boxes. In-context learning (ICL) examples are provided [here](./static/icl_examples). The `QvQ` model is the exception: it receives no ICL examples and generates its reasoning zero-shot.

In our testing, small models (7B-9B) fit on one NVIDIA A30 GPU with 24GB of memory, and medium-sized models (11B-32B) required one NVIDIA A100 40GB card. Large models (72B-90B) fit on either two 40GB A100 cards or one 80GB A100 card. All commands below support inference across multiple CUDA GPUs with tensor parallelism.
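As a quick pre-flight check, the sketch below lists the memory of each GPU and restricts later commands to specific cards. `nvidia-smi` and `CUDA_VISIBLE_DEVICES` are standard NVIDIA tooling rather than part of this repository. The device indices `0,1` are placeholders, and it is an assumption that the script spreads the model over every visible GPU.

```bash
# Print index, name, and total memory for each GPU, if nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,name,memory.total --format=csv \
    || echo "nvidia-smi could not query the GPUs" >&2
fi

# Expose only GPUs 0 and 1 to later commands in this shell (placeholder indices).
# Assumption: explain_bounding_boxes.py shards the model across all visible GPUs.
export CUDA_VISIBLE_DEVICES=0,1
```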

### Small (7B-9B) models

#### LLaVA-OneVision-Chat 7B

```bash
python3 explain_bounding_boxes.py --runner huggingface --images-path AiR-D/images_with_boxes/full --annotations-path AiR-D/questions.json --annotations-bbox-path AiR-D/images_with_boxes/full_ground_truth.json --icl-examples-path static/icl_examples/ --output-path explain-cond_bbox_full_exp_first_llava-onevision-qwen2-7b-ov-chat.jsonl --pretrained-model-name llava-hf/llava-onevision-qwen2-7b-ov-chat-hf --generate-explanation-first --enable-additional-conditions --enable-flash-attention-2 --dtype float16
```
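Once a run finishes, it can help to sanity-check the file passed to `--output-path`. The helper below is a minimal sketch that applies to any output file in this section. It assumes the `.jsonl` file holds one JSON object per line; the function name `summarize_jsonl` is hypothetical and not part of this repository, and no particular record keys are assumed.

```bash
# summarize_jsonl FILE: print the record count and pretty-print the first record.
# Assumes FILE holds one JSON object per line (JSON Lines).
summarize_jsonl() {
  local file="$1"
  # Count non-empty lines, i.e. records written so far.
  echo "records: $(grep -c . "$file")"
  # Pretty-print the first record to reveal its keys.
  head -n 1 "$file" | python3 -m json.tool
}
```

For example, `summarize_jsonl explain-cond_bbox_full_exp_first_llava-onevision-qwen2-7b-ov-chat.jsonl` reports how many records the command above has written.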

#### Qwen2.5-VL 7B

```bash
VLLM_USE_V1=0 python3 explain_bounding_boxes.py --runner vllm --images-path AiR-D/images_with_boxes/full --annotations-path AiR-D/questions.json --annotations-bbox-path AiR-D/images_with_boxes/full_ground_truth.json --icl-examples-path static/icl_examples/ --output-path explain-cond_bbox_full_exp_first_qwen2-5-7b.jsonl --pretrained-model-name Qwen/Qwen2.5-VL-7B-Instruct --generate-explanation-first --enable-additional-conditions
```

#### InternVL2.5 8B-MPO

```bash
python3 explain_bounding_boxes.py --runner huggingface --images-path AiR-D/images_with_boxes/full --annotations-path AiR-D/questions.json --annotations-bbox-path AiR-D/images_with_boxes/full_ground_truth.json --icl-examples-path static/icl_examples/ --output-path explain-cond_bbox_full_exp_first_internvl2-5-8b-mpo.jsonl  --pretrained-model-name OpenGVLab/InternVL2_5-8B-MPO --generate-explanation-first --enable-additional-conditions 
```

#### InternVL3 9B

```bash
python3 explain_bounding_boxes.py --runner huggingface --images-path AiR-D/images_with_boxes/full --annotations-path AiR-D/questions.json --annotations-bbox-path AiR-D/images_with_boxes/full_ground_truth.json --icl-examples-path static/icl_examples/ --output-path explain-cond_bbox_full_exp_first_internvl3-9b.jsonl  --pretrained-model-name OpenGVLab/InternVL3-9B --generate-explanation-first --enable-additional-conditions 
```

#### Aya Vision 8B

```bash
python3 explain_bounding_boxes.py --runner huggingface --images-path AiR-D/images_with_boxes/full --annotations-path AiR-D/questions.json --annotations-bbox-path AiR-D/images_with_boxes/full_ground_truth.json --icl-examples-path static/icl_examples/ --output-path explain-cond_bbox_full_exp_first_aya-vision-8b.jsonl  --pretrained-model-name CohereForAI/aya-vision-8b --enable-flash-attention-2 --generate-explanation-first --enable-additional-conditions --dtype float16
```

### Medium (11B-32B) models

#### Gemma 3 12B

```bash
VLLM_USE_V1=0 python3 explain_bounding_boxes.py --runner vllm --images-path AiR-D/images_with_boxes/full --annotations-path AiR-D/questions.json --annotations-bbox-path AiR-D/images_with_boxes/full_ground_truth.json --icl-examples-path static/icl_examples/ --output-path explain-cond_bbox_full_exp_first_gemma-3-12b-it.jsonl --pretrained-model-name google/gemma-3-12b-it --generate-explanation-first --enable-additional-conditions
```

#### Gemma 3 27B

```bash
VLLM_USE_V1=0 python3 explain_bounding_boxes.py --runner vllm --images-path AiR-D/images_with_boxes/full --annotations-path AiR-D/questions.json --annotations-bbox-path AiR-D/images_with_boxes/full_ground_truth.json --icl-examples-path static/icl_examples/ --output-path explain-cond_bbox_full_exp_first_gemma-3-27b-it.jsonl --pretrained-model-name google/gemma-3-27b-it --generate-explanation-first --enable-additional-conditions
```

#### InternVL2.5 26B-MPO

```bash
python3 explain_bounding_boxes.py --runner huggingface --images-path AiR-D/images_with_boxes/full --annotations-path AiR-D/questions.json --annotations-bbox-path AiR-D/images_with_boxes/full_ground_truth.json --icl-examples-path static/icl_examples/ --output-path explain-cond_bbox_full_exp_first_internvl2-5-26b-mpo.jsonl  --pretrained-model-name OpenGVLab/InternVL2_5-26B-MPO --generate-explanation-first --enable-additional-conditions
```

#### Qwen2.5-VL 32B

```bash
VLLM_USE_V1=0 python3 explain_bounding_boxes.py --runner vllm --images-path AiR-D/images_with_boxes/full --annotations-path AiR-D/questions.json --annotations-bbox-path AiR-D/images_with_boxes/full_ground_truth.json --icl-examples-path static/icl_examples/ --output-path explain-cond_bbox_full_exp_first_qwen2-5-32b.jsonl --pretrained-model-name Qwen/Qwen2.5-VL-32B-Instruct --generate-explanation-first --enable-additional-conditions
```

#### Aya Vision 32B

```bash
python3 explain_bounding_boxes.py --runner huggingface --images-path AiR-D/images_with_boxes/full --annotations-path AiR-D/questions.json --annotations-bbox-path AiR-D/images_with_boxes/full_ground_truth.json --icl-examples-path static/icl_examples/ --output-path explain-cond_bbox_full_exp_first_aya-vision-32b.jsonl  --pretrained-model-name CohereForAI/aya-vision-32b --enable-flash-attention-2 --generate-explanation-first --enable-additional-conditions --dtype float16
```

### Large (72B-90B) models

#### Llama 3.2 90B Vision

```bash
VLLM_USE_V1=0 python3 explain_bounding_boxes.py --runner vllm --images-path AiR-D/images_with_boxes/full --annotations-path AiR-D/questions.json --annotations-bbox-path AiR-D/images_with_boxes/full_ground_truth.json --icl-examples-path static/icl_examples/ --output-path explain-cond_bbox_full_exp_first_llama3-2-90b.jsonl --pretrained-model-name meta-llama/Llama-3.2-90B-Vision-Instruct --generate-explanation-first --enable-additional-conditions
```

#### Qwen2.5-VL 72B

```bash
VLLM_USE_V1=0 python3 explain_bounding_boxes.py --runner vllm --images-path AiR-D/images_with_boxes/full --annotations-path AiR-D/questions.json --annotations-bbox-path AiR-D/images_with_boxes/full_ground_truth.json --icl-examples-path static/icl_examples/ --output-path explain-cond_bbox_full_exp_first_qwen2-5-72b.jsonl --pretrained-model-name Qwen/Qwen2.5-VL-72B-Instruct --generate-explanation-first --enable-additional-conditions
```

#### QvQ 72B-Preview

```bash
VLLM_USE_V1=0 python3 explain_bounding_boxes.py --runner vllm --images-path AiR-D/images_with_boxes/full --annotations-path AiR-D/questions.json --annotations-bbox-path AiR-D/images_with_boxes/full_ground_truth.json --output-path explain-no-icl-cond_bbox_full_exp_first_qvq-72b-preview.jsonl --pretrained-model-name Qwen/QVQ-72B-Preview --generate-explanation-first --enable-additional-conditions
```

### OpenAI models

We used the [OpenAI Batch API](https://platform.openai.com/docs/guides/batch) to generate reasoning with the GPT-4o and GPT-4.1 models. The following command generates the request payload for the API.

```bash
python3 explain_bounding_boxes.py --runner openai_batch_export --images-path AiR-D/images_with_boxes/full --annotations-path AiR-D/questions.json --annotations-bbox-path AiR-D/images_with_boxes/full_ground_truth.json --output-path explain-no-icl-cond_bbox_full_exp_first_openai.jsonl --generate-explanation-first --enable-additional-conditions
```
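Submitting the exported payload is not covered by the script, so the following is only a sketch of one way to do it with `curl` against the public Batch API endpoints. It assumes that `OPENAI_API_KEY` is set and that the payload follows the documented Batch input format (one JSON request per line with `custom_id`, `method`, `url`, and `body`). The helper names `batch_endpoint`, `upload_payload`, and `create_batch` are hypothetical.

```bash
# Assumption: OPENAI_API_KEY is set and the payload follows the Batch API input
# format (one JSON request per line with custom_id, method, url, and body).
PAYLOAD=explain-no-icl-cond_bbox_full_exp_first_openai.jsonl

# batch_endpoint FILE: print the "url" field of the first request in FILE.
batch_endpoint() {
  head -n 1 "$1" | python3 -c 'import json, sys; print(json.load(sys.stdin)["url"])'
}

# upload_payload FILE: upload FILE with purpose "batch" and print the API response.
upload_payload() {
  curl -sS https://api.openai.com/v1/files \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -F purpose="batch" \
    -F file="@$1"
}

# create_batch FILE_ID ENDPOINT: start a batch job over an uploaded file.
create_batch() {
  curl -sS https://api.openai.com/v1/batches \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"input_file_id\": \"$1\", \"endpoint\": \"$2\", \"completion_window\": \"24h\"}"
}
```

In this sketch, `upload_payload "$PAYLOAD"` returns a JSON object whose `id` is the file ID, which is then passed to `create_batch` together with `"$(batch_endpoint "$PAYLOAD")"` so that the batch endpoint matches the one used in the payload.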
