
<h1 align="center"> Visually Descriptive Language Model for Vector Graphics Reasoning Code </h1>

## 💻 Environment Setup
- Minimum requirements:
    ```
    conda env create -f environment.yml
    conda activate vdlm
    ```
- (Optional) For LLaVA inference:
    ```
    cd third_party
    git clone https://github.com/haotian-liu/LLaVA.git
    cd LLaVA
    pip install -e .
    ```
- (Optional) For ViperGPT inference:
    ```
    cd third_party
    git clone (hidden for anonymous submission)
    ```  
    Set up the ViperGPT environment by following the instructions in that repository.
    

## 🚀 Quick Start (Inference Demo)

- Download the pretrained SVG-to-PVD model from [here]() **(hidden for anonymous submission)**. It is an LLM finetuned from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1). Make sure it is stored at `data/ckpts/PVD-160k-Mistral-7b`:
    ```
    mkdir -p data/ckpts
    cd data/ckpts
    git lfs install
    git clone (hidden for anonymous submission)
    ```

- Serve the model with vLLM:
    ```bash
    CUDA_VISIBLE_DEVICES=0 ./vllm_serve_model.sh
    ```

- A detailed inference demo 🚀 can be found [here](demo.ipynb).

## 📊 Downstream Task Evaluation

### Downstream Task Data Download
You can download the data for downstream tasks from [here]() **(hidden for anonymous submission)**. Unzip the file and place the `downstream_tasks` folder under `data/datasets/`.
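
Before running the evaluation scripts, it can help to confirm that the checkpoint and the downstream data landed where this README expects them. The following is a minimal sketch and not part of the repository: the helper name `missing_paths` is hypothetical, and it checks only the two directories named above.

```python
from pathlib import Path

# Locations named in this README; adjust if your layout differs.
EXPECTED_DIRS = [
    "data/ckpts/PVD-160k-Mistral-7b",
    "data/datasets/downstream_tasks",
]


def missing_paths(repo_root=".", expected=EXPECTED_DIRS):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing directories:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected directories are in place.")
```

Run it from the repository root; an empty result means both directories were found.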

### Run VDLM Perception: Image -> SVG -> PVD (in JSON format)
```
bash scripts/perception/eval_perception.sh    
```

### Run Reasoning: PVD + question -> answer

- VDLM-mm:
    - GPT-4o:
        ```
        bash scripts/reasoning/vdlm_mm_gpt4o_pvd.sh
        ```
    - GPT-4V:
        ```
        bash scripts/reasoning/vdlm_mm_gpt4v_pvd.sh
        ```

- VDLM-txt:
    - GPT-4 Chat API *without* Code Interpreter:
        ```
        bash scripts/reasoning/vdlm_txt_gpt4_pvd.sh
        ```
    - GPT-4 Assistant API *with* Code Interpreter:
        ```
        bash scripts/reasoning/vdlm_txt_gpt4_assistant_pvd.sh
        ```
    
- Image-based Baselines:
    - GPT-4o + Image input: 
        ```
        bash scripts/reasoning/gpt4o_image.sh
        ```
    - GPT-4V + Image input:
        ```
        bash scripts/reasoning/gpt4v_image.sh
        ```
    - LLaVA-v1.5 + Image input:
        ```
        # 7b
        bash scripts/reasoning/llava_1.5_7b_image.sh
        # 13b
        bash scripts/reasoning/llava_1.5_13b_image.sh
        ```
    - ViperGPT w/ GPT-4 + Image input:
        ```
        bash scripts/reasoning/vipergpt_inference.sh
        ```

## 📂 SVG-to-PVD Model Data

### PVD-160k Dataset
The dataset used for training our SVG-to-PVD model can be downloaded from [here]() **(hidden for anonymous submission)**. It contains the preprocessed instruction-tuning data instances, one JSON object per line, in the following format:
```
{
    "id": "XXX",
    "conversations": [
        {"role": "system", "content": "XXX"},
        {"role": "user", "content": "XXX"},
        {"role": "assistant", "content": "XXX"}
        // ...
    ]
}
```
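
To sanity-check a downloaded or custom file against the format above, something like the following sketch could be used. It is an illustration rather than part of the repository: `validate_instance` and `validate_file` are hypothetical helpers, the allowed roles are taken from the example above, and treating `id` as a string is an assumption based on that example.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # roles shown in the example above


def validate_instance(line):
    """Parse one JSONL line and check that it has the documented structure."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    if not isinstance(record.get("id"), str):  # assumption: ids are strings
        raise ValueError("'id' must be a string")
    conversations = record.get("conversations")
    if not isinstance(conversations, list) or not conversations:
        raise ValueError("'conversations' must be a non-empty list")
    for turn in conversations:
        if not isinstance(turn, dict):
            raise ValueError("each conversation turn must be a JSON object")
        if turn.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {turn.get('role')!r}")
        if not isinstance(turn.get("content"), str):
            raise ValueError("'content' must be a string")
    return record


def validate_file(path):
    """Validate every non-empty line of a JSONL file and return the count."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                validate_instance(line)
            except ValueError as err:  # also covers json.JSONDecodeError
                raise ValueError(f"line {line_no}: {err}") from err
            count += 1
    return count
```

Calling `validate_file` on the path of the downloaded JSONL file returns the number of valid instances, or raises a `ValueError` naming the first offending line.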

Additionally, the raw PNGs, SVGs, and PVD annotations generated by our data generator can be downloaded from [here]() **(hidden for anonymous submission)**.
<!-- By default, the dataset is stored in `data/datasets/pretraining_data/pvd_160k.jsonl`. -->

### Generating custom PVD data
`pvd_data_generator/generate_pvd_img_svg.py` contains the procedural data generator we used to generate the 160K Image/SVG/PVD pairs.

Example usage: `bash pvd_data_generator/gen_dataset_pvd_160K.sh`

To specify custom configurations, one can modify the `main()` function in `pvd_data_generator/generate_pvd_img_svg.py`.

Once the SVGs and PVD annotations are generated, use `pvd_data_generator/get_instruction_pair.py` to construct instruction-tuning data instances in Vicuna or OpenAI/Mistral format. Fill in the `#TODO` parts of the script with the information for your custom dataset, then run: `python pvd_data_generator/get_instruction_pair.py`
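
For orientation, the kind of record such a script needs to emit in the openai/mistral format can be sketched as follows. This is not the logic of `get_instruction_pair.py`: `build_instance` and `write_jsonl` are hypothetical helpers, and placing the SVG in the user turn and the JSON-serialized PVD in the assistant turn is an assumption based on the format shown in [PVD-160k Dataset](#pvd-160k-dataset). The system prompt, SVG, and PVD values are made-up placeholders, not the real prompt or PVD schema.

```python
import json


def build_instance(instance_id, svg_text, pvd, system_prompt):
    """Assemble one instruction-tuning record (hypothetical helper)."""
    return {
        "id": instance_id,
        "conversations": [
            {"role": "system", "content": system_prompt},
            # Assumption: the SVG is the input and the PVD JSON is the target.
            {"role": "user", "content": svg_text},
            {"role": "assistant", "content": json.dumps(pvd)},
        ],
    }


def write_jsonl(instances, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for instance in instances:
            f.write(json.dumps(instance) + "\n")


if __name__ == "__main__":
    example = build_instance(
        instance_id="example-0",
        svg_text="<svg></svg>",                 # placeholder SVG
        pvd={"type": "circle"},                 # placeholder, not the real schema
        system_prompt="(your system prompt)",   # placeholder
    )
    print(json.dumps(example, indent=2))
```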



## 📘 SVG-to-PVD Model Training
We finetune a Mistral-7B model using Megatron-LLM on the [PVD-160K dataset](#pvd-160k-dataset).
We follow **(hidden for anonymous submission)** for preprocessing and postprocessing the model and data. We train the model on a SLURM cluster with 4 NVIDIA A100 40GB GPUs.

Example usage:
- Clone the code-act repo:
    ```
    cd third_party
    git clone (hidden for anonymous submission)
    ```
- Follow the instructions in **(hidden for anonymous submission)** for environment setup, model preprocessing, and data conversion.

- Modify the `TODO:` items in `scripts/training/finetune_4xA100_4tp_mistral__pvd_3ep.slurm` and `scripts/training/finetune_4xA100_4tp_mistral__pvd_3ep.sh`.

- Copy `scripts/training/finetune_4xA100_4tp_mistral__pvd_3ep.slurm` into `code-act/scripts/slurm/configs`, and copy `scripts/training/finetune_4xA100_4tp_mistral__pvd_3ep.sh` into `code-act/scripts/models/megatron`.

- Run training:
    ```
    cd third_party/code-act
    sbatch scripts/slurm/configs/finetune_4xA100_4tp_mistral__pvd_3ep.slurm scripts/models/megatron/finetune_4xA100_4tp_mistral__pvd_3ep.sh
    ```

- Follow **(hidden for anonymous submission)** to convert the trained model back to Hugging Face format. The converted model can be served with vLLM for inference.

