# LLaRA: Large Language and Robotics Assistant

![llara](./assets/llara.png)

**LLaRA: Supercharging Robot Learning Data for Vision-Language Policy** 
The anonymous version.

<p float="left">
  <img src="assets/llara-vid1.gif" width="49%" />
  <img src="assets/llara-vid2.gif" width="49%" /> 
</p>

## Installation

1. **Set Up Python Environment**:

   Create the same Python environment that [LLaVA](https://github.com/haotian-liu/LLaVA) uses:
   ```
   conda create -n llara python=3.10 -y
   conda activate llara
   conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia
   conda install cuda=12.1 cuda-compiler=12.1 cuda-nvcc=12.1 cuda-version=12.1 -c nvidia
   ```

2. **Install Revised LLaVA**:

   Navigate to `train-llava` in this repo and install the llava package there:
   ```
   cd train-llava && pip install -e ".[train]"
   pip install flash-attn --no-build-isolation
   ```

3. **Install VIMABench**:

   Complete the setup for [VIMABench](https://github.com/vimalabs/VIMABench).
   ```
   git clone https://github.com/vimalabs/VimaBench && cd VimaBench
   pip install -e .
   ```
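After the three steps above, it can help to confirm that the installed packages match the versions pinned in step 1. The following is a minimal sketch, not part of this repo: the helper names `installed_versions` and `find_mismatches` are hypothetical, and it assumes the conda `pytorch` package exposes its metadata under the distribution name `torch`.

```python
from importlib import metadata

# Versions pinned by the conda command in step 1.
PINNED = {"torch": "2.1.2", "torchvision": "0.16.2", "torchaudio": "2.1.2"}


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(installed, pinned=PINNED):
    """Return human-readable problems; an empty list means all pins match."""
    problems = []
    for name, wanted in pinned.items():
        have = installed.get(name)
        # Some builds append a local tag such as "+cu121"; ignore it.
        base = have.split("+")[0] if have else None
        if base != wanted:
            problems.append(
                f"{name}: expected {wanted}, found {have or 'not installed'}"
            )
    return problems


if __name__ == "__main__":
    issues = find_mismatches(installed_versions(PINNED))
    print("\n".join(issues) if issues else "All pinned packages match.")
```

Run it inside the activated `llara` environment. It only compares version strings and does not check that CUDA is usable.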

## Demo

1. **Download the Pretrained Model**:

   Download the following model to `./checkpoints/`:
   - llava-1.5-7b-D-inBC + Aux(B) trained on VIMA-80k [Google Drive](https://drive.google.com/drive/folders/1YN1jqttAvo2k_DwKYNbC2k_57MNzmr8i?usp=drive_link)
   
2. **Run the evaluation**:

   ```
   cd eval
   # evaluate the model with the oracle object detector
   python3 eval-llara.py D-inBC-AuxB-VIMA-80k --model-path ../checkpoints/llava-1.5-7b-llara-D-inBC-Aux-B-VIMA-80k --prompt-mode hso
   
   # the results will be saved to ../results/[hso]D-inBC-AuxB-VIMA-80k.json
   ```

3. **Check the results**:

   Refer to [llara-result.ipynb](./results/llara-result.ipynb).
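Before opening the notebook, a quick look at the saved file can confirm that the evaluation wrote something. The sketch below builds the path from the `[prompt-mode]name.json` pattern shown in step 2 and reports only the top-level shape of the JSON. The helper names `result_path` and `describe_results` are hypothetical, and no schema is assumed because this README does not describe one.

```python
import json
from pathlib import Path


def result_path(run_name, prompt_mode, results_dir="../results"):
    """Build the path reported in step 2, e.g. ../results/[hso]<run_name>.json."""
    return Path(results_dir) / f"[{prompt_mode}]{run_name}.json"


def describe_results(path):
    """Load the JSON file and report its top-level shape without assuming a schema."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        keys = sorted(str(k) for k in data)[:10]  # show at most ten keys
        return {"type": "dict", "size": len(data), "keys": keys}
    if isinstance(data, list):
        return {"type": "list", "size": len(data), "keys": []}
    return {"type": type(data).__name__, "size": 1, "keys": []}


if __name__ == "__main__":
    # The default results_dir is relative to the eval directory used in step 2.
    path = result_path("D-inBC-AuxB-VIMA-80k", "hso")
    if path.exists():
        print(describe_results(path))
    else:
        print(f"No results at {path}; run the evaluation first.")
```

Run it from the `eval` directory so that the relative `../results` path resolves as in step 2.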

## Quick Start Guide

0. **Minimum Hardware Requirements**:

   - Inference: at least one GPU with 24GB or more of GPU memory.
   - Training: at least 300GB of system RAM and four Ampere (or newer) GPUs, each with 24GB or more of GPU memory.

1. **Prepare the Dataset**:

   Visit the [datasets directory](./datasets/README.md) to prepare your dataset for training.

2. **Finetune a LLaVA Model**:

   To start finetuning a LLaVA model, refer to the instructions in [train-llava](./train-llava/README.md).

3. **Evaluate the Trained Model**:

   Follow the steps in [eval](./eval/README.md) to assess the performance of your trained model.

4. **Train a MaskRCNN for Object Detection**:

   If you want to train a MaskRCNN for object detection, check out [train-maskrcnn](./train-maskrcnn/README.md) for detailed steps.
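The GPU part of the hardware requirements in step 0 can be checked before starting a long run. This is a rough sketch that is not part of this repo: it reads per-GPU memory from `nvidia-smi`, the helper names are hypothetical, and the 24,000 MiB threshold is an assumption chosen because some cards sold as 24GB report slightly less than 24 × 1024 MiB. It does not check the GPU architecture or system RAM.

```python
import subprocess

# Assumption: treat "24GB" as roughly 24,000 MiB to tolerate driver-reported totals.
MIN_GPU_MIB = 24_000


def parse_gpu_memory(csv_text):
    """Parse one memory total (MiB) per line, as printed by nvidia-smi with nounits."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def meets_requirement(memory_mib, min_gpus, min_mib=MIN_GPU_MIB):
    """True if at least `min_gpus` GPUs each report `min_mib` MiB or more."""
    return sum(m >= min_mib for m in memory_mib) >= min_gpus


def query_gpu_memory():
    """Ask nvidia-smi for the total memory of every visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_memory(out)


if __name__ == "__main__":
    try:
        gpus = query_gpu_memory()
    except (FileNotFoundError, subprocess.CalledProcessError):
        gpus = []  # nvidia-smi is missing or failed; treat as no usable GPU
    print("inference GPU check passed:", meets_requirement(gpus, 1))
    print("training GPU check passed:", meets_requirement(gpus, 4))
```

A passing training check covers GPU count and memory only; the 300GB system RAM and Ampere-or-newer requirements still need to be confirmed separately.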

## License

This project is licensed under the [Apache-2.0 License](LICENSE). See the LICENSE file for details.
