# SutureBot

SutureBot extends OpenVLA-OFT to perform surgical suturing tasks on the da Vinci Surgical System. The main changes relative to the base repository are summarized below.

### Key Modifications

1. **Configuration System**:
   - Added a comprehensive config file (`vla-scripts/config.py`) with predefined configurations for different training scenarios
   - Easily toggle between training configurations using the `--config_name` parameter
   - Configurations include parameters for targeting strategies, learning rates, image inputs, and more

2. **Targeting Methods**:
   - Implemented multiple targeting strategies for precise suturing:
     - `dot`: Draws blue/green circles directly on the image to highlight entry/exit points
     - `mask`: Creates a separate mask image with highlighted entry/exit points
     - `heatmap`: Generates a color gradient visualization between entry and exit points
     - `none`: No targeting visualization (default)

3. **DVRK Dataset Integration**:
   - Custom `EpisodicDatasetDvrkGeneric` class for handling da Vinci surgical robot data
   - Support for loading multi-view imagery (endoscope + wrist cameras)
   - Automatically handles suturing-specific data with entry/exit point annotations
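
The targeting strategies above can be illustrated with a minimal NumPy sketch. It is not the repository's implementation: the function names, the circle radius, the blue-for-entry and green-for-exit color assignment, and the linear blue-to-green blend used for `heatmap` are all assumptions made for illustration.

```python
import numpy as np

BLUE = (0, 0, 255)   # assumed marker color for the entry point (RGB)
GREEN = (0, 255, 0)  # assumed marker color for the exit point (RGB)


def disk(shape, center, radius):
    """Boolean mask of the pixels within `radius` of `center` = (row, col)."""
    rows, cols = np.ogrid[: shape[0], : shape[1]]
    return (rows - center[0]) ** 2 + (cols - center[1]) ** 2 <= radius ** 2


def draw_dots(image, entry, exit_, radius=5):
    """`dot`: paint filled circles directly onto a copy of the camera image."""
    out = image.copy()
    out[disk(image.shape[:2], entry, radius)] = BLUE
    out[disk(image.shape[:2], exit_, radius)] = GREEN
    return out


def make_mask(image, entry, exit_, radius=5):
    """`mask`: the same circles, drawn on a separate all-black image."""
    return draw_dots(np.zeros_like(image), entry, exit_, radius)


def make_heatmap(shape, entry, exit_):
    """`heatmap` (assumed form): blend from BLUE at entry to GREEN at exit."""
    rows, cols = np.mgrid[: shape[0], : shape[1]]
    entry = np.asarray(entry, dtype=float)
    direction = np.asarray(exit_, dtype=float) - entry  # entry and exit must differ
    # Fraction of the way from entry (0.0) to exit (1.0), clipped outside the segment.
    t = ((rows - entry[0]) * direction[0] + (cols - entry[1]) * direction[1]) / direction.dot(direction)
    t = np.clip(t, 0.0, 1.0)[..., None]
    return ((1.0 - t) * np.array(BLUE) + t * np.array(GREEN)).astype(np.uint8)


# Toy 64x64 gray image with made-up entry/exit annotations in (row, col) order.
image = np.full((64, 64, 3), 128, dtype=np.uint8)
entry, exit_ = (20, 16), (40, 48)

dotted = draw_dots(image, entry, exit_)
mask = make_mask(image, entry, exit_)
heat = make_heatmap(image.shape[:2], entry, exit_)
```

In this sketch, `dot` modifies the image the model already sees, while `mask` and `heatmap` produce an additional image, which is presumably why the strategy interacts with the image-input settings in the configuration.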

### Setup and Training

1. **Initial Setup**:
   - Follow the base OpenVLA setup instructions in [SETUP.md](SETUP.md)

2. **Preprocess Suturing Data**:
   - Place [process_data.py](./process_data.py) in the directory containing your suturing data and run it. The script creates a new directory called `processed_suturing_data_zipped`; point the dataloaders at this directory.

3. **Computing Normalization Statistics**:
   - Before training, calculate action normalization statistics by running:
     ```bash
     python prismatic/vla/datasets/dvrk_dataset.py
     ```
   - The script computes the statistics used to normalize actions during training
   - Copy the resulting statistics into the `assets` directory, following the instructions the script prints

4. **Configuring Training**:
   - Modify configurations in `vla-scripts/config.py` based on your GPU resources
   - Adjust learning rate, batch size, and targeting strategy as needed
   - Select an appropriate predefined configuration or create a new one

5. **Starting Training**:
   - Launch training using the following command:
     ```bash
     torchrun --standalone --nnodes 1 --nproc-per-node 8 vla-scripts/finetune.py --config_name suturing_final_datasets_lr_5e5_heatmap
     ```
   - Adjust `--nproc-per-node` based on available GPUs
   - Replace `suturing_final_datasets_lr_5e5_heatmap` with your chosen configuration
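
As an illustration of how a `--config_name` flag can select among predefined configurations, here is a minimal sketch using only the standard library. The `TrainConfig` fields, the `CONFIGS` registry, and the `example_*` names are hypothetical and do not reflect the actual contents of `vla-scripts/config.py`.

```python
import argparse
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    # Field names and defaults are illustrative, not the repository's schema.
    learning_rate: float = 5e-5
    batch_size: int = 8
    targeting: str = "none"  # one of: "dot", "mask", "heatmap", "none"
    num_images_in_input: int = 1


BASE = TrainConfig()

# Registry of named configurations; new variants override only what differs.
CONFIGS = {
    "example_base": BASE,
    "example_heatmap": replace(BASE, targeting="heatmap"),
    # Variant for GPUs with less memory: smaller per-device batch size.
    "example_heatmap_small_gpu": replace(BASE, targeting="heatmap", batch_size=2),
}


def load_config(argv=None):
    """Parse --config_name and return the matching predefined configuration."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config_name", choices=sorted(CONFIGS), required=True)
    args = parser.parse_args(argv)
    return CONFIGS[args.config_name]


cfg = load_config(["--config_name", "example_heatmap_small_gpu"])
print(cfg)
```

Deriving each variant from a shared base with `dataclasses.replace` keeps the differences between training scenarios visible at a glance, which is one reasonable way to organize such a file.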

**Note**: Normalization statistics must be computed before training (step 3). Without them, action values cannot be scaled consistently across action dimensions.
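
To show what action normalization statistics can look like, the following sketch computes per-dimension statistics over toy data. The choice of statistics (mean, standard deviation, minimum, maximum), the function names, and the output format are assumptions; the statistics actually produced by `prismatic/vla/datasets/dvrk_dataset.py` may differ.

```python
import json

import numpy as np


def compute_action_stats(episodes):
    """Per-dimension statistics over all timesteps of all episodes."""
    actions = np.concatenate(episodes, axis=0)  # shape: (total_steps, action_dim)
    return {
        "mean": actions.mean(axis=0),
        "std": actions.std(axis=0),
        "min": actions.min(axis=0),
        "max": actions.max(axis=0),
    }


def normalize(action, stats, eps=1e-8):
    """Scale actions to roughly zero mean and unit variance per dimension."""
    return (action - stats["mean"]) / (stats["std"] + eps)


def unnormalize(normalized, stats, eps=1e-8):
    """Invert `normalize`, e.g. to turn model outputs back into robot commands."""
    return normalized * (stats["std"] + eps) + stats["mean"]


# Two toy episodes with 3-dimensional actions on very different scales.
rng = np.random.default_rng(0)
scales = np.array([0.001, 1.0, 100.0])
episodes = [rng.normal(size=(50, 3)) * scales, rng.normal(size=(30, 3)) * scales]

stats = compute_action_stats(episodes)
normalized = normalize(np.concatenate(episodes), stats)

# Statistics are typically saved to disk so training and inference share them.
print(json.dumps({k: v.tolist() for k, v in stats.items()}, indent=2))
```

After normalization, every action dimension has a comparable scale, so no single dimension dominates the regression loss simply because of its units.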

# Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

**Project website: https://openvla-oft.github.io/**

**Paper: https://arxiv.org/abs/2502.19645**

**Summary video: https://youtu.be/T3Zkkr_NTSA**

## System Requirements

Inference:
* 1 GPU with ~16 GB VRAM for LIBERO sim benchmark tasks
* 1 GPU with ~18 GB VRAM for ALOHA robot tasks

Training:
* 1-8 GPUs with 27-80 GB VRAM, depending on the desired training setup (with the default bfloat16 data type). See [this FAQ on our project website](https://openvla-oft.github.io/#train-compute) for details.

## Quick Start

First, set up a conda environment (see instructions in [SETUP.md](SETUP.md)).

Then, run the Python script below to download a pretrained OpenVLA-OFT checkpoint and run inference to generate an action chunk:

```python
import pickle
from experiments.robot.libero.run_libero_eval import GenerateConfig
from experiments.robot.openvla_utils import get_action_head, get_processor, get_proprio_projector, get_vla, get_vla_action
from prismatic.vla.constants import NUM_ACTIONS_CHUNK, PROPRIO_DIM

# Instantiate config (see class GenerateConfig in experiments/robot/libero/run_libero_eval.py for definitions)
cfg = GenerateConfig(
    pretrained_checkpoint = "moojink/openvla-7b-oft-finetuned-libero-spatial",
    use_l1_regression = True,
    use_diffusion = False,
    use_film = False,
    num_images_in_input = 2,
    use_proprio = True,
    load_in_8bit = False,
    load_in_4bit = False,
    center_crop = True,
    num_open_loop_steps = NUM_ACTIONS_CHUNK,
    unnorm_key = "libero_spatial_no_noops",
)

# Load OpenVLA-OFT policy and inputs processor
vla = get_vla(cfg)
processor = get_processor(cfg)

# Load MLP action head to generate continuous actions (via L1 regression)
action_head = get_action_head(cfg, llm_dim=vla.llm_dim)

# Load proprio projector to map proprio to language embedding space
proprio_projector = get_proprio_projector(cfg, llm_dim=vla.llm_dim, proprio_dim=PROPRIO_DIM)

# Load sample observation:
#   observation (dict): {
#     "full_image": primary third-person image,
#     "wrist_image": wrist-mounted camera image,
#     "state": robot proprioceptive state,
#     "task_description": task description,
#   }
with open("experiments/robot/libero/sample_libero_spatial_observation.pkl", "rb") as file:
    observation = pickle.load(file)

# Generate robot action chunk (sequence of future actions)
actions = get_vla_action(cfg, vla, processor, observation, observation["task_description"], action_head, proprio_projector)
print("Generated action chunk:")
for act in actions:
    print(act)
```
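
The `num_open_loop_steps` setting means a generated chunk is executed without re-querying the policy at every step. A minimal sketch of such a control loop follows, assuming a simple action queue; `stub_policy`, `run_episode`, and the chunk length of 8 are stand-ins for illustration and are not part of the repository's API.

```python
from collections import deque

NUM_OPEN_LOOP_STEPS = 8  # stand-in for cfg.num_open_loop_steps; value is illustrative


def stub_policy(observation):
    """Placeholder for a real policy call: returns a chunk of future actions."""
    start = observation["t"]
    return [f"action_{start + i}" for i in range(NUM_OPEN_LOOP_STEPS)]


def run_episode(policy, max_steps):
    """Execute actions open loop, querying the policy only when the queue is empty."""
    action_queue = deque()
    executed, num_queries = [], 0
    for t in range(max_steps):
        observation = {"t": t}  # a real loop would read cameras and robot state here
        if not action_queue:
            # Queue is empty: query the policy once and buffer the whole chunk.
            action_queue.extend(policy(observation))
            num_queries += 1
        executed.append(action_queue.popleft())  # a real loop would send this to the robot
    return executed, num_queries


executed, num_queries = run_episode(stub_policy, max_steps=20)
print(f"{len(executed)} actions executed with {num_queries} policy queries")
```

With a chunk length of 8, the 20 steps in this toy episode need only 3 policy queries, which is the source of the inference speedup from action chunking.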

## Installation

See [SETUP.md](SETUP.md) for instructions on setting up the conda environment.

## Training and Evaluation

See [LIBERO.md](LIBERO.md) for fine-tuning/evaluating on LIBERO simulation benchmark task suites.

See [ALOHA.md](ALOHA.md) for fine-tuning/evaluating on real-world ALOHA robot tasks.

## Support

If you run into any issues, please open a new GitHub issue. If you do not receive a response within 2 business days, please email Moo Jin Kim (moojink@cs.stanford.edu) to bring the issue to his attention.

## Citation

If you use our code in your work, please cite [our paper](https://arxiv.org/abs/2502.19645):

```bibtex
@article{kim2025fine,
  title={Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success},
  author={Kim, Moo Jin and Finn, Chelsea and Liang, Percy},
  journal={arXiv preprint arXiv:2502.19645},
  year={2025}
}
```