## How to Preprocess Our Data

### Caption Generation using VideoLlama2
Regardless of whether your dataset includes detailed annotations, you should run this section on at least a small subset of the data and compare the quality of the generated captions against the original manual captions.

##### Setup
1. Clone and set up the [VideoLlama2 repository](https://github.com/DAMO-NLP-SG/VideoLLaMA2).

##### Get the Captions!
2. In `VideoLlama.py`, modify **Line 10** to point to the location of your VideoLlama2 repository.
3. Uncomment the VideoLlama2-related code and update the data directory in `main.py`.
4. Run the VideoLlama2 section in `main.py`. This will generate captions in a `videollama.json` file, stored in your dataset’s folder.
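
For the comparison against manual captions suggested above, a minimal sketch might look like the following. It assumes that both `videollama.json` and your manual caption file are flat JSON objects mapping a video ID to a caption string; the actual layout of `videollama.json` is not documented here, so adapt the loading code to what you find in the file. The helper names `load_captions` and `pair_captions` are hypothetical and not part of this repository.

```python
import json


def load_captions(path):
    """Load a JSON file ASSUMED to map video IDs to caption strings."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def pair_captions(generated, manual, limit=5):
    """Return (video_id, generated_caption, manual_caption) tuples.

    Only video IDs present in both mappings are paired, in sorted order,
    and at most `limit` pairs are returned for a quick manual review.
    """
    shared = sorted(set(generated) & set(manual))
    return [(vid, generated[vid], manual[vid]) for vid in shared[:limit]]
```

To use it, pass the two loaded mappings to `pair_captions` and print each tuple, then read the pairs side by side to judge caption quality.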

### GPT Specification Generation and Extraction
This step converts natural language captions into programmatic specifications. You should run this code regardless of whether the dataset includes natural language captions.

##### Setup
1. If you are using captions generated by VideoLlama2, ensure the **VideoLlama2** section is complete. If you're using the original captions from the dataset, modify the caption path in the `gpt-specs-1` section of `main.py`, and adjust the data processing scheme accordingly.

##### Get the GPT Specifications!
2. Uncomment and run the **GPT spec 1** section in `main.py`.
3. If you receive a message recommending a smaller batch size, start with a batch size of **10** and reduce it gradually (e.g., to **5**) until the message disappears.
4. Once it completes, this section generates a `gpt_specs1.json` file.
5. Uncomment and run the **GPT spec 2** section to parse all GPT responses into runnable specifications.
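
The batch-size advice above can also be automated. The sketch below is only an illustration of the idea, not the logic in `main.py`: it assumes the "reduce batch size" condition can be surfaced as an exception (the hypothetical `BatchTooLarge`) and that your request function (the hypothetical `process_batch`) takes a list of captions and returns a list of responses.

```python
class BatchTooLarge(Exception):
    """Hypothetical stand-in for the 'reduce batch size' message."""


def process_all(items, process_batch, batch_size=10, min_size=1):
    """Process items in batches, halving the batch size on failure.

    Returns the collected results and the batch size that finally worked.
    """
    results = []
    i = 0
    while i < len(items):
        batch = items[i:i + batch_size]
        try:
            results.extend(process_batch(batch))
            i += len(batch)  # advance only after a successful batch
        except BatchTooLarge:
            if batch_size <= min_size:
                raise  # cannot shrink any further
            batch_size = max(min_size, batch_size // 2)
    return results, batch_size
```

Halving is one possible reduction schedule (it takes 10 to 5 in a single step); decrementing by one would follow the "gradually reduce" wording more literally at the cost of more retries.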

### Generate Negative Examples!

1. Ensure both the **GPT spec 1** and **GPT spec 2** sections are completed.
2. Uncomment and run the **negative sampling** section. This will generate a `neg_samples.json` file.

### Object Identification and Tracking through SAM2
If your dataset already includes bounding boxes or STSGs, you can skip this section.

##### Setup
Ensure the following file structure for your dataset:
```plaintext
src/
├── data/
│   ├── <YOUR_DATASET>/
│   │   ├── masks/
│   │   ├── videos/
│   │   ├── ...
│   └── SAM/
│       └── sam2_checkpoints/
│           └── sam2_hiera_base_plus.pt
```

- Create a `masks` folder inside your dataset’s directory under `data`.
- Clone and set up the [SAM2 repository](https://github.com/facebookresearch/segment-anything-2).
- Download the `sam2_hiera_base_plus.pt` file from SAM2’s GitHub repository.
- Create a `sam2_checkpoints` folder inside the `data/SAM` directory and place the `.pt` file there.
- Update `SAM2.py` to include the path to your SAM2 repository.
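
Before running SAM2, it can help to confirm that the layout above is in place. The following is a small sketch; the function name `missing_paths` is hypothetical, and `data_dir` is assumed to be the `src/data` directory from the tree above.

```python
from pathlib import Path


def missing_paths(data_dir, dataset_name):
    """Return the expected files and folders that do not exist yet."""
    data_dir = Path(data_dir)
    expected = [
        data_dir / dataset_name / "masks",
        data_dir / dataset_name / "videos",
        data_dir / "SAM" / "sam2_checkpoints" / "sam2_hiera_base_plus.pt",
    ]
    return [p for p in expected if not p.exists()]
```

An empty list means every expected path was found; otherwise the returned paths show what still needs to be created or downloaded.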

##### Important Arguments
- **`video_segment_length`**: Defines the length, in seconds, of each video segment. The default is `5`, but you can adjust it as needed.
- **`frames_per_second`**: Limits the number of frames processed per second. If you modify this, you must also update the `vid_to_jpgs.sh` script.
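
To see how the two arguments interact, the sketch below computes the frame ranges of each segment. It illustrates the arithmetic only (frames per segment = `video_segment_length` × `frames_per_second`) and is not taken from `SAM2.py`; the function name `segment_bounds` is hypothetical, and no default is assumed for `frames_per_second`.

```python
def segment_bounds(num_frames, frames_per_second, video_segment_length=5):
    """Split a frame count into (start, end) index ranges, end exclusive.

    Each full segment holds video_segment_length * frames_per_second frames;
    the last segment may be shorter.
    """
    size = video_segment_length * frames_per_second
    if size <= 0:
        raise ValueError("segment size must be positive")
    return [
        (start, min(start + size, num_frames))
        for start in range(0, num_frames, size)
    ]
```

For example, 60 extracted frames at 5 frames per second with the default 5-second segments give two full segments of 25 frames and a final segment of 10.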

##### Run SAM2!
1. Open `vid_to_jpgs.sh` and update the `DATA_DIR` variable to point to the correct dataset location.
2. Execute the `vid_to_jpgs.sh` script. This converts your videos into JPEG images, which SAM2 requires as input. Depending on the video length, this process may take some time.
3. In `main.py`, comment out the other sections and run the SAM2 section to generate bounding boxes for your dataset.
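
The contents of `vid_to_jpgs.sh` are not shown in this document, so the following is only an assumption about what such a conversion step typically does: it builds an `ffmpeg` command that samples a video at a fixed frame rate and writes numbered JPEG files. The helper names are hypothetical, and `ffmpeg` must be installed for `extract_frames` to work. If you change `frames_per_second` for SAM2, the same value has to be used for frame extraction.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(video_path, out_dir, frames_per_second):
    """Build an ffmpeg command that writes frames as 00000.jpg, 00001.jpg, ..."""
    return [
        "ffmpeg",
        "-i", str(video_path),
        "-vf", f"fps={frames_per_second}",  # sample at a fixed frame rate
        "-q:v", "2",                        # high JPEG quality
        "-start_number", "0",               # number frames from zero
        str(Path(out_dir) / "%05d.jpg"),
    ]


def extract_frames(video_path, out_dir, frames_per_second):
    """Create out_dir and run ffmpeg; raises CalledProcessError on failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ffmpeg_command(video_path, out_dir, frames_per_second), check=True
    )
```

Separating the command builder from the call that runs it makes it easy to print and inspect the command before processing a whole dataset.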
