# G-Substrate / GraphAGI Project

This repository contains the codebase for the G-Substrate / GraphAGI project, including data preparation, fine-tuning, inference, and evaluation pipelines.

## 1. Environment Setup

Make sure Python is installed, plus Conda if you choose the Conda route. Set up the environment from either `environment.yml` or `requirements.txt`.

### Using Conda (Recommended)
```bash
conda env create -f environment.yml
# Replace <env_name> with the name defined in the `name:` field of environment.yml
conda activate <env_name>
```

### Using pip
```bash
pip install -r requirements.txt
```

---

## 2. Data Preparation

### Visual Genome (VG) Dataset
The Visual Genome dataset is required for this project but is too large to be included in the repository.

1.  **Download** the VG dataset images.
2.  **Place** the downloaded data under the `graphAGI/` directory.
3.  **Alignment**: Arrange the directory layout so that every image path referenced in the SGG (Scene Graph Generation) data resolves to a downloaded image file.
4.  **Ground Truth for Evaluation**:
    *   Construct a ground truth file named `vg150_gt.pkl` and place it at `Framework/evaluation/ssg_eval/vg150_gt.pkl`.
    *   **Format**: A pickle file containing a single Python dictionary that maps each image ID to that image's annotations.
    *   **Structure**:
        ```python
        {
            image_id (int): {
                'boxes': numpy.ndarray, shape=(N, 4), dtype=float32,  # N objects, format [x1, y1, x2, y2]
                'labels': numpy.ndarray, shape=(N,), dtype=int32,     # Object category labels for the N objects
                'relation_tuple': numpy.ndarray, shape=(M, 3), dtype=int32 # M relations, format [subject_idx, object_idx, predicate_label]
            },
            ...
        }
        ```
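
The structure above can be produced and sanity-checked with a short script. The following is a minimal sketch, not the project's own tooling: the helper names `make_gt_entry`, `validate_gt`, and `save_gt` are hypothetical, and the image ID, box coordinates, and label IDs are made-up toy values. The demo writes to a temporary directory so it does not touch the real file.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def make_gt_entry(boxes, labels, relations):
    """Build one per-image entry with the documented shapes and dtypes."""
    return {
        "boxes": np.asarray(boxes, dtype=np.float32).reshape(-1, 4),
        "labels": np.asarray(labels, dtype=np.int32).reshape(-1),
        "relation_tuple": np.asarray(relations, dtype=np.int32).reshape(-1, 3),
    }


def validate_gt(gt):
    """Raise ValueError if an entry deviates from the documented structure."""
    for image_id, entry in gt.items():
        boxes, labels, rels = entry["boxes"], entry["labels"], entry["relation_tuple"]
        if not isinstance(image_id, int):
            raise ValueError(f"image id {image_id!r} is not an int")
        if boxes.ndim != 2 or boxes.shape[1] != 4 or boxes.dtype != np.float32:
            raise ValueError(f"{image_id}: 'boxes' must be float32 with shape (N, 4)")
        n = boxes.shape[0]
        if labels.shape != (n,) or labels.dtype != np.int32:
            raise ValueError(f"{image_id}: 'labels' must be int32 with shape (N,)")
        if rels.ndim != 2 or rels.shape[1] != 3 or rels.dtype != np.int32:
            raise ValueError(f"{image_id}: 'relation_tuple' must be int32 with shape (M, 3)")
        # Subject and object indices must refer to one of the N boxes.
        if rels.size and (rels[:, :2].min() < 0 or rels[:, :2].max() >= n):
            raise ValueError(f"{image_id}: relation index outside [0, {n})")


def save_gt(gt, path):
    """Validate the dictionary, then pickle it to `path`."""
    validate_gt(gt)
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(gt, f)


# Toy data: the image id, coordinates, and label ids below are made up.
gt = {
    1: make_gt_entry(
        boxes=[[10, 20, 110, 220], [50, 60, 150, 160]],
        labels=[3, 7],
        relations=[[0, 1, 5]],  # box 0 is the subject, box 1 the object, predicate 5
    )
}

# Round-trip through a temporary file to confirm the pickle loads back intact.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "vg150_gt.pkl"
    save_gt(gt, out_path)
    with out_path.open("rb") as f:
        loaded = pickle.load(f)

print(loaded[1]["boxes"].shape, loaded[1]["relation_tuple"].tolist())
```

For the real file, build `gt` from your VG annotations and call `save_gt(gt, "Framework/evaluation/ssg_eval/vg150_gt.pkl")` from the repository root.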

### LLaMA Factory
This project relies on [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) for training and inference.

1.  **Clone** LLaMA-Factory into the root directory of this workspace:
    ```bash
    git clone https://github.com/hiyouga/LLaMA-Factory.git
    cd LLaMA-Factory
    pip install -e "."
    cd ..
    ```

### Dataset Registration (SFT Data)
Every Supervised Fine-Tuning (SFT) dataset must be registered before LLaMA-Factory can load it by name.

1.  Open `graphAGI/dataset_info.json`.
2.  Add your dataset configurations following the existing format (e.g., `train_gskel_interleave_v2_algorithm`).
3.  Ensure each `file_name` points to the correct JSON file, given as a path relative to `graphAGI/`.
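
A wrong `file_name` usually surfaces only when training starts, so it can help to check the registry up front. The sketch below is an illustration rather than part of the project: `find_missing_files` is a hypothetical helper, and it assumes the registry is a JSON object keyed by dataset name whose entries may carry a `file_name` resolved relative to the registry's own directory. The demo uses made-up entries in a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def find_missing_files(info_path):
    """Return {dataset_name: file_name} for entries whose file is not on disk."""
    info_path = Path(info_path)
    base_dir = info_path.parent  # file_name is resolved relative to this directory
    with info_path.open(encoding="utf-8") as f:
        dataset_info = json.load(f)

    missing = {}
    for name, cfg in dataset_info.items():
        file_name = cfg.get("file_name")
        if file_name is None:
            continue  # entry does not reference a local file
        if not (base_dir / file_name).is_file():
            missing[name] = file_name
    return missing


# Demo with made-up entries: one file exists, the other does not.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "demo_present.json").write_text("[]", encoding="utf-8")
    info = {
        "demo_present": {"file_name": "demo_present.json"},
        "demo_missing": {"file_name": "does_not_exist.json"},
    }
    (base / "dataset_info.json").write_text(json.dumps(info), encoding="utf-8")
    missing = find_missing_files(base / "dataset_info.json")

print(missing)
```

Against the real registry, call `find_missing_files("graphAGI/dataset_info.json")` from the repository root; an empty dictionary means every referenced file was found.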

---

## 3. Fine-tuning

To fine-tune a model:

1.  Navigate to the fine-tuning directory:
    ```bash
    cd Fine_tune
    ```
2.  Open `submit_train_all.sh` and configure the following:
    *   **MODELS**: Set the model name or path (e.g., `"Qwen/Qwen2-VL-7B-Instruct"`).
    *   **Mode Selection**: Choose the training mode (e.g., `"multi"` or `"multi_gskel"`) by modifying the script arguments or uncommenting the desired section.
    *   **Resume**: Set `RESUME="false"` for a fresh start or `RESUME="true"` to continue a previous training run.
3.  Execute the training script:
    ```bash
    bash submit_train_all.sh
    ```

---

## 4. Inference

After fine-tuning completes, run inference to generate results.

1.  From the repository root, navigate to the inference directory:
    ```bash
    cd Framework/infer
    ```
2.  Open `vllm_infer_submit.sh` and configure:
    *   **MODEL_PATHS**: Add the absolute path(s) to your fine-tuned model checkpoint(s).
    *   **Fixed Datasets**: Check that `PREDICT_DATASET` names the dataset you want to run inference on (default: `"test_gskel"`).
3.  Submit the inference jobs:
    ```bash
    bash vllm_infer_submit.sh
    ```

---

## 5. Evaluation

Once inference finishes, evaluate the generated results.

1.  From the repository root, navigate to the evaluation directory:
    ```bash
    cd Framework/evaluation
    ```
2.  Open `run_evaluator.sh` and configure:
    *   **PATHS**: Add the paths to the inference output directories you want to evaluate.
3.  Run the evaluation script:
    ```bash
    bash run_evaluator.sh
    ```
