# Decoupling Reasoning and Perception: An LLM-LMM Framework for Faithful Visual Reasoning

This repository contains the code and evaluation scripts used in our paper:

> **Decoupling Reasoning and Perception: An LLM-LMM Framework for Faithful Visual Reasoning**  
> We propose a simple yet effective training-free pipeline that decouples reasoning and perception for visual reasoning tasks.  
> A Large Language Model (LLM) handles high-level reasoning, while a Large Multimodal Model (LMM) serves purely as a visual question-answering engine to provide grounded perceptual inputs.  
> This lightweight approach effectively reduces visually-unfounded reasoning steps and improves reasoning fidelity.

---
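
The decoupling described in the abstract can be illustrated with a short control loop. This is only a minimal sketch under stated assumptions, not the implementation used in the paper: the `llm` and `lmm` callables, the `ASK:`/`ANSWER:` reply protocol, the prompt wording, and the `max_rounds` parameter are all hypothetical names introduced for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: each model is wrapped as a plain Python callable.
LLM = Callable[[str], str]       # text prompt -> text reply
LMM = Callable[[str, str], str]  # (image path, visual question) -> short answer
Evidence = List[Tuple[str, str]]


def build_prompt(question: str, evidence: Evidence, force_answer: bool = False) -> str:
    """Assemble a text-only prompt; the LLM never sees the image itself."""
    lines = [f"Question: {question}", "Visual facts gathered so far:"]
    lines += [f"- Q: {q} A: {a}" for q, a in evidence] or ["- (none)"]
    if force_answer:
        lines.append("Reply with 'ANSWER: <final answer>'.")
    else:
        lines.append("Reply with 'ASK: <visual question>' or 'ANSWER: <final answer>'.")
    return "\n".join(lines)


def strip_prefix(text: str, prefix: str) -> str:
    """Remove a leading prefix if present (works on any Python 3 version)."""
    return text[len(prefix):].strip() if text.startswith(prefix) else text


def decoupled_reasoning(question: str, image: str, llm: LLM, lmm: LMM,
                        max_rounds: int = 2) -> Tuple[str, Evidence]:
    """The LLM does the reasoning; the LMM only answers visual questions."""
    evidence: Evidence = []
    for _ in range(max_rounds):
        reply = llm(build_prompt(question, evidence)).strip()
        if not reply.startswith("ASK:"):
            # Anything that is not a visual query is treated as the final answer.
            return strip_prefix(reply, "ANSWER:"), evidence
        visual_question = strip_prefix(reply, "ASK:")
        evidence.append((visual_question, lmm(image, visual_question)))
    # Query budget exhausted: ask the LLM to commit to an answer.
    final = llm(build_prompt(question, evidence, force_answer=True)).strip()
    return strip_prefix(final, "ANSWER:"), evidence
```

Because every perceptual fact enters the prompt as an explicit question-answer pair, each reasoning step can be traced back to something the LMM actually reported about the image.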

## Quick Start

Before running the evaluation scripts, **configure** the vision-language models (VLMs) and set the correct model paths.  
You can then use the `run.py` script to run inference and evaluation across multiple VLMs and benchmarks.

### Step 0 — Install and Set up API Keys

**Installation:**

```bash
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
```

**API Keys:**

To use API-based models (e.g., GPT-4V, Gemini-Pro-V) for inference or as **judges/choice extractors**, you need to set the appropriate API keys.

VLMEvalKit will:
- Use a judging LLM to extract answers when API keys are set.
- Fall back to **Exact-Match mode** when keys are not set (only suitable for Yes/No and multiple-choice tasks).
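
To see why exact matching only suits Yes/No and multiple-choice tasks, consider the rough sketch below. It is an illustration of the idea, not VLMEvalKit's actual matching code; the function names and the accepted answer formats are assumptions.

```python
import re
from typing import Optional


def extract_choice(prediction: str, options: str = "ABCD") -> Optional[str]:
    """Accept only replies that consist of a single option letter, e.g. 'B', '(B)', 'B.'."""
    match = re.fullmatch(rf"\(?([{options}])\)?[.:]?", prediction.strip())
    return match.group(1) if match else None


def extract_yes_no(prediction: str) -> Optional[str]:
    """Accept only a bare 'yes' or 'no', ignoring case and trailing punctuation."""
    word = prediction.strip().strip(".!").lower()
    return word if word in {"yes", "no"} else None


def exact_match(prediction: str, answer: str, options: str = "ABCD") -> bool:
    """True only when the reply is already in a directly comparable form."""
    extracted = extract_choice(prediction, options) or extract_yes_no(prediction)
    return extracted is not None and extracted.lower() == answer.strip().lower()


print(exact_match("(B)", "B"))              # reply is a bare option letter
print(exact_match("The answer is B", "B"))  # free-form reply is not recognized
```

A free-form reply such as "The answer is B" is rejected by this kind of matcher, which is the case a judging LLM is meant to handle.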

You can store keys in `./.env` or set them as environment variables. Example `.env`:

```bash
# Example .env file under $VLMEvalKit

# Proprietary VLMs API Keys
DASHSCOPE_API_KEY=
GOOGLE_API_KEY=
OPENAI_API_KEY=
OPENAI_API_BASE=
STEPAI_API_KEY=
REKA_API_KEY=
GLMV_API_KEY=
CW_API_BASE=
CW_API_KEY=
SENSENOVA_API_KEY=
HUNYUAN_SECRET_KEY=
HUNYUAN_SECRET_ID=
LMDEPLOY_API_BASE=

# Optional: proxy for evaluation API calls
EVAL_PROXY=
```

Fill in the required keys to enable API-based inference and evaluation.
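
Before launching a long run, it can help to check which keys are actually filled in. The helper below is a small sketch that uses only the standard library; `parse_env_text` and `key_status` are hypothetical names, and the parser handles only simple `KEY=value` lines like those in the example above.

```python
import os
from pathlib import Path
from typing import Dict, Mapping


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def key_status(file_values: Mapping[str, str],
               environ: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """For each key named in the file, report whether it has a non-empty value
    either in the file or in the environment."""
    return {name: bool(file_values.get(name) or environ.get(name))
            for name in sorted(file_values)}


env_file = Path(".env")  # assumed location: the VLMEvalKit root directory
if env_file.is_file():
    for name, is_set in key_status(parse_env_text(env_file.read_text())).items():
        print(f"{name}: {'set' if is_set else 'missing'}")
```

Keys reported as missing only matter if the corresponding API-based model or judge is part of your run.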

---

### Step 1 — Model Configuration

All VLM configurations are defined in `vlmeval/config.py`.  
The configurations used in our experiments include:

- `Qwen2-5VL72BInstructQwen332BRound2`
- `Qwen2-5VL32BInstructQwen332BRound2`
- `Qwen2-5VL7BInstructQwen332BRound2`
- `Qwen2-5VL3BInstructQwen332BRound2`
- `Qwen2-5VL7BInstructQwQ32BRound2`

---

### Step 2 — Evaluation

Run evaluations with `run.py`.  
You can call it directly via `$VLMEvalKit/run.py`, or create a symlink so that you can run the script from any directory.

**Arguments:**

- `--data` (`list[str]`): Dataset names supported by VLMEvalKit.
- `--model` (`list[str]`): VLM names defined in `vlmeval/config.py` under `supported_VLM`.
- `--mode` (`str`, default: `"all"`, options: `["all", "infer"]`): `"all"` runs both inference and evaluation, `"infer"` runs inference only.
- `--api-nproc` (`int`, default: `4`): Number of API threads.
- `--work-dir` (`str`, default: `"."`): Directory to store results.
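
The types and defaults listed above can be mirrored in a small `argparse` parser, which also shows how the list-valued arguments are passed (space-separated values after a single flag). This is a sketch based only on the argument list above, not a copy of the actual `run.py`, which accepts further options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser covering only the arguments documented in this README."""
    parser = argparse.ArgumentParser(description="Sketch of the documented run.py arguments")
    parser.add_argument("--data", type=str, nargs="+", required=True)
    parser.add_argument("--model", type=str, nargs="+", required=True)
    parser.add_argument("--mode", type=str, default="all", choices=["all", "infer"])
    parser.add_argument("--api-nproc", type=int, default=4)
    parser.add_argument("--work-dir", type=str, default=".")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args(
    ["--data", "MathVision", "--model", "Qwen2-5VL7BInstructQwen332BRound2"]
)
print(args.data, args.model, args.mode, args.api_nproc, args.work_dir)
```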

**Example Command** — Evaluate a multimodal dataset:

```bash
python run.py --data MathVision --model Qwen2-5VL72BInstructQwen332BRound2 --api-nproc 16 --reuse
```
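
To evaluate several configurations in one go, the command can be assembled programmatically. The sketch below only builds and prints the command line; `build_command` is a hypothetical helper, and `--reuse` is kept opt-in because it appears in the example above but is not described in the argument list.

```python
import shlex
from typing import List, Sequence


def build_command(datasets: Sequence[str], models: Sequence[str], api_nproc: int = 4,
                  work_dir: str = ".", mode: str = "all", reuse: bool = False) -> List[str]:
    """Assemble a run.py invocation from the flags shown in this README."""
    cmd = ["python", "run.py", "--data", *datasets, "--model", *models,
           "--mode", mode, "--api-nproc", str(api_nproc), "--work-dir", work_dir]
    if reuse:
        cmd.append("--reuse")
    return cmd


models = ["Qwen2-5VL7BInstructQwen332BRound2", "Qwen2-5VL3BInstructQwen332BRound2"]
for model in models:
    # Print each command; pass the list to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(build_command(["MathVision"], [model], api_nproc=16, reuse=True)))
```

Running one model per command keeps a failure in one configuration from interrupting the others.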

---

## Notes

This code is adapted from the open-source [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) repository.
Please cite VLMEvalKit if you use our code or theirs in your work:

```bibtex
@inproceedings{duan2024vlmevalkit,
  title={{VLMEvalKit}: An open-source toolkit for evaluating large multi-modality models},
  author={Duan, Haodong and Yang, Junming and Qiao, Yuxuan and Fang, Xinyu and Chen, Lin and Liu, Yuan and Dong, Xiaoyi and Zang, Yuhang and Zhang, Pan and Wang, Jiaqi and others},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  pages={11198--11201},
  year={2024}
}
```
