# SutureBot (High-Level Policy and ACT)


This repository contains the training code for SutureBot's high-level policy and language-conditioned ACT.


If you encounter any issues, feel free to contact jchen396 (at) jh (dot) edu

## Installation
1. Clone this repository
```bash
git clone git@github.com:JuoTungChen/srth.git
cd srth
```

2. Create a virtual environment
```bash 
conda create -n srth python=3.8.10 
conda activate srth
```

3. Install packages
```bash
pip install -r requirements_ll.txt
```

(Optional)
1. Install [Whisper](https://github.com/openai/whisper). It requires ffmpeg, which can be installed with:
```bash
sudo apt update && sudo apt install ffmpeg
```

2. Install the packages for audio recording
```bash
sudo apt install portaudio19-dev python3-pyaudio
```


## Adding path variables to bashrc
Add the following path variables to your `~/.bashrc` file. They let the scripts locate the repository and the dataset, for example when appending `$PATH_TO_SUTUREBOT/src` to `sys.path`, which avoids Python module import errors.

export PATH_TO_SUTUREBOT=[path to this repository]
export PATH_TO_DATASET=[path to dataset folders]
export YOUR_CKPT_PATH="$PATH_TO_SUTUREBOT/model_ckpts"
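
As a minimal sketch of how these variables can be consumed from Python, the snippet below reads `PATH_TO_SUTUREBOT` and puts its `src` directory on `sys.path`. The helper name `add_src_to_path` is hypothetical and not part of this repo; the scripts here may handle this differently.

```python
import os
import sys


def add_src_to_path(env_var="PATH_TO_SUTUREBOT"):
    """Append $PATH_TO_SUTUREBOT/src to sys.path (hypothetical helper)."""
    root = os.environ.get(env_var)
    if root is None:
        # The variable is missing, e.g. ~/.bashrc was not sourced yet.
        raise EnvironmentError(f"{env_var} is not set; add it to ~/.bashrc first")
    src_dir = os.path.join(root, "src")
    if src_dir not in sys.path:
        sys.path.append(src_dir)
    return src_dir
```

Remember to open a new terminal or run `source ~/.bashrc` after editing the file so the variables are visible to Python.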



# Training and Evaluation

## Train Low-Level Policy

### Relevant files
 ``` 
.
├── src                                  # main packages
|   └── act                              # where the low-level policy code is stored
|   |   ├── dvrk_scripts              
|   |   |   └── constants_dvrk.py        # the task configs for training
|   |   ├── generic_dataset.py           # dataset class
|   |   ├── auto_training_suturing.py    # script for initiating training
|   |   ├── imitate_episodes.py          # training class and functions
|   |   └── img_aug.py                   # class for image data augmentations
├── script                               # useful scripts
|   ├── calculate_std_mean.py            # script for calculating norm stats
|   ├── suture_point_labeling.py         # script for labeling suture points
|   └── encode_instruction.py            # generate candidate text embeddings before training
...
 ``` 

### Data format
To train a policy, make sure your data is stored in the following structure:


 ``` 
 $PATH_TO_DATASET
├── [DATASET_NAME]       # the dataset base dir
|   └── tissue_1                      # data subset
|   |   ├── 1_[task_name]             # task name
|   |   |   ├── [episode]             # should be timestamp when the data was recorded
|   |   |   |      ├── left_img_dir   # left endoscope cam images (frame000000_left.jpg)
|   |   |   |      ├── right_img_dir  # right endoscope cam images (frame000000_right.jpg)
|   |   |   |      ├── endo_psm1      # right wrist cam images (frame000000_psm1.jpg)
|   |   |   |      ├── endo_psm2      # left wrist cam images (frame000000_psm2.jpg)
|   |   |   |      └── ee_csv.csv     # kinematics
|   └── tissue_2                      # data subset
|   |   ├── 1_[task_name]             # task name
|   |   |   ├── [episode]             # should be timestamp when the data was recorded
|   |   |   |      ├── left_img_dir   # left endoscope cam images (frame000000_left.jpg)
|   |   |   |      ├── right_img_dir  # right endoscope cam images (frame000000_right.jpg)
|   |   |   |      ├── endo_psm1      # right wrist cam images (frame000000_psm1.jpg)
|   |   |   |      ├── endo_psm2      # left wrist cam images (frame000000_psm2.jpg)
|   |   |   |      └── ee_csv.csv     # kinematics
...
 ``` 
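
Before training, it can help to check that every episode follows this layout. The following is a minimal sketch, not a script from this repo: the function name `find_incomplete_episodes` is hypothetical, and it assumes the `tissue_*/<task>/<episode>` nesting shown above.

```python
from pathlib import Path

# Per-episode entries taken from the layout above.
EXPECTED_ENTRIES = ("left_img_dir", "right_img_dir", "endo_psm1", "endo_psm2", "ee_csv.csv")


def find_incomplete_episodes(dataset_dir):
    """Return {episode_path: [missing entries]} for episodes that miss something.

    Assumes the layout <dataset_dir>/tissue_*/<task>/<episode>/ (hypothetical helper).
    """
    problems = {}
    for tissue in sorted(Path(dataset_dir).glob("tissue_*")):
        for task in sorted(p for p in tissue.iterdir() if p.is_dir()):
            for episode in sorted(p for p in task.iterdir() if p.is_dir()):
                missing = [name for name in EXPECTED_ENTRIES if not (episode / name).exists()]
                if missing:
                    problems[str(episode)] = missing
    return problems
```

An empty result means every episode directory contains the four image folders and `ee_csv.csv`.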

### Training procedure
Once the folder structure is correct, follow these steps to train your low-level policy:
1. Run the following script to generate candidate embeddings before training:
```
python encode_instruction.py --dataset_dir $PATH_TO_DATASET/[DATASET_NAME] --encoder distilbert --from_count
```
This should generate a JSON file called `candidate_embeddings_distilbert.json` containing the embeddings for the task names as well as the embeddings for the direction correction commands.

2. Calculate the std and mean using the following script:
```
python script/calculate_std_mean.py
```
Make sure to specify the tissue ids and `data_dir` in the script before running it.
The results are printed to the terminal and also stored in script/chole/std_mean.txt

3. Next, copy and paste the max, min, std, and mean into the corresponding task config you created in src/act/dvrk_scripts/constants_dvrk.py

4. Set the task configs in src/act/dvrk_scripts/constants_dvrk.py.
In the task config, you need to specify the following:

    a. __`dataset_dir`__: which dataset to use and where it is stored.

    b. __`num_episodes`__: the total number of episodes; should match the number printed by calculate_std_mean.py.

    c. __`tissue_samples_ids`__: the tissue ids you want to train on.

    d. __`camera_file_suffixes`__: the suffixes of the image files; the list must have the same length as `camera_names`.

    e. __`camera_names`__: the cameras you want to use (possible options: left, right, left_wrist, right_wrist).

    f. __`episode_len`__: you don't need to worry about this.

    g. __`action_mode`__: the action representation and its norm stats. Possible options: "hybrid", "ego", "relative_endoscope". We typically use "hybrid".

    h. __`norm_scheme`__: the normalization scheme to use. Possible options: "std", "min_max". We typically use "std".

    i. __`save_frequency`__: how often to save checkpoints.

    j. __`goal_condition_style`__: "plot" means the labeled needle insertion point is plotted on the image during training.
 
5. Create your own auto_training.py or modify an existing one (see auto_training_suturing.py for an example). This file launches training in a subprocess; if training is interrupted by a segmentation fault, it waits a few seconds and then continues from the last checkpoint. You can set the following training parameters:

    a. __`task_name`__: should correspond to the task config you created in src/act/dvrk_scripts/constants_dvrk.py.

    b. __`ckpt_dir`__: the checkpoint name.

    c. __`policy_class`__: use "ACT".

    d. __`batch_size`__: typically set to 16, which takes around 21GB of GPU memory on an RTX 4090.

    e. __`num_epochs`__: the total number of epochs to train for.

    f. __`use_language`__: if set, use FiLM for the vision backbones.

    g. __`language_encoder`__: we typically use "distilbert". Possible options: ["distilbert", "clip"].

    h. __`image_encoder`__: we typically use "efficientnet_b3film". The default is 'resnet18'. Possible options: ['resnet18', 'resnet34', 'resnet50', 'efficientnet_b0', 'efficientnet_b3', 'resnet18film', 'resnet34film', 'resnet50film', 'efficientnet_b0film', 'efficientnet_b3film', 'efficientnet_b5film'].

    i. __`gpu`__: which GPU to use (if you have multiple GPUs).

    j. __`multi_gpu`__: if set to True, all the GPUs listed in CUDA_VISIBLE_DEVICES are used for training.

    k. __`policy_level`__: use "low" for training the low-level policy.

    Example:
    ```
    python auto_training_suturing.py
    ```
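
The restart behavior described in step 5 can be sketched as follows. This is not the code of auto_training_suturing.py: the function name `run_with_restart`, the wait time, and the restart limit are hypothetical, and the sketch assumes that re-running the same training command resumes from the last saved checkpoint.

```python
import signal
import subprocess
import time


def run_with_restart(cmd, wait_seconds=5.0, max_restarts=10):
    """Run cmd; if it dies from a segmentation fault, wait and run it again.

    Returns the final return code. Assumes (not confirmed by this repo) that
    re-running the same command resumes from the last saved checkpoint.
    """
    restarts = 0
    while True:
        returncode = subprocess.run(cmd).returncode
        # On POSIX, a process killed by a signal reports -<signal number>.
        crashed = returncode == -signal.SIGSEGV
        if not crashed or restarts >= max_restarts:
            return returncode
        restarts += 1
        time.sleep(wait_seconds)
```

Only segmentation faults trigger a restart here; any other non-zero exit code is returned immediately so that genuine errors are not retried in a loop.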

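To illustrate the two `norm_scheme` options from step 4, here is a minimal sketch of how the statistics from step 2 could be applied. The function names are hypothetical, and the exact formulas used in this repo may differ; in particular, the `[0, 1]` target range for "min_max" is an assumption.

```python
import numpy as np


def compute_norm_stats(values):
    """Column-wise mean, std, min and max of an (N, D) array (hypothetical helper)."""
    values = np.asarray(values, dtype=np.float64)
    return {
        "mean": values.mean(axis=0),
        "std": values.std(axis=0),
        "min": values.min(axis=0),
        "max": values.max(axis=0),
    }


def normalize(values, stats, norm_scheme="std", eps=1e-8):
    """Normalize values with the "std" or "min_max" scheme.

    "std" gives zero mean and unit variance; "min_max" is assumed here to
    rescale to [0, 1]. eps guards against division by zero.
    """
    values = np.asarray(values, dtype=np.float64)
    if norm_scheme == "std":
        return (values - stats["mean"]) / np.maximum(stats["std"], eps)
    if norm_scheme == "min_max":
        span = np.maximum(stats["max"] - stats["min"], eps)
        return (values - stats["min"]) / span
    raise ValueError(f"unknown norm_scheme: {norm_scheme!r}")
```
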

## Train High-Level Policy
### Data
- Set DATA_DIR to the correct dataset folder in skay_jhu_private/src/instructor/constants_daVinci.py
    - There you can also define some training constants, e.g. the camera folder names, the val/test tissues, etc.
- Labels are read directly from the directory names (follow the directory structure of the SRT-H chole dataset).
- Compute the dataset mean and std for standardization using skay_jhu_private/script/chole/dataset_rgb_mean_std.py
    - Alternatively, use the mean and std from ImageNet, though this might work slightly worse.
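
For reference, per-channel standardization statistics can be computed along the following lines. This is a minimal sketch and not the code of dataset_rgb_mean_std.py; the function name `rgb_mean_std` is hypothetical, and it assumes the images are already loaded as HxWx3 uint8 arrays.

```python
import numpy as np


def rgb_mean_std(images):
    """Per-channel mean and std, in [0, 1], over an iterable of HxWx3 uint8 images."""
    pixel_count = 0
    channel_sum = np.zeros(3, dtype=np.float64)
    channel_sq_sum = np.zeros(3, dtype=np.float64)
    for image in images:
        # Flatten to (num_pixels, 3) and scale to [0, 1].
        pixels = np.asarray(image, dtype=np.float64).reshape(-1, 3) / 255.0
        pixel_count += pixels.shape[0]
        channel_sum += pixels.sum(axis=0)
        channel_sq_sum += (pixels ** 2).sum(axis=0)
    if pixel_count == 0:
        raise ValueError("no images given")
    mean = channel_sum / pixel_count
    # Var[x] = E[x^2] - E[x]^2; clamp tiny negative values from rounding.
    variance = np.maximum(channel_sq_sum / pixel_count - mean ** 2, 0.0)
    return mean, np.sqrt(variance)
```

The running sums avoid holding the whole dataset in memory. The commonly used ImageNet values are mean `[0.485, 0.456, 0.406]` and std `[0.229, 0.224, 0.225]`.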

### Data Curation:
- Check that all recordings look good:
    - Use script/chole/concatenate_all_tissue_demos.py to create a video with all demonstrations of one tissue concatenated, then check that every demonstration looks good and is in the correct task directory (a demonstration is sometimes saved in the previous task folder, or vice versa).
    - Specifically check that the demonstrations are complete. If a demonstration is incomplete (started too late or ended too early), the concatenation of neighboring task recordings will be erroneous.
- If a task recording started too early or is too long, you can add an “indices_curated.json” file to the demonstration directory with the keys “start” and/or “end”, set to the frame index of the curated start/end.
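
The curation file can be read as in the sketch below. The function name `curated_frame_range` is hypothetical, and treating "end" as an inclusive frame index is an assumption; the dataset code in this repo may interpret it differently.

```python
import json
from pathlib import Path


def curated_frame_range(episode_dir, num_frames):
    """Return the (start, end) frame indices to use for one demonstration.

    Falls back to the full recording when indices_curated.json or one of its
    keys is missing. "end" is treated as inclusive (an assumption).
    """
    start, end = 0, num_frames - 1
    curated_file = Path(episode_dir) / "indices_curated.json"
    if curated_file.is_file():
        with curated_file.open() as f:
            curated = json.load(f)
        start = int(curated.get("start", start))
        end = int(curated.get("end", end))
    if not 0 <= start <= end < num_frames:
        raise ValueError(f"invalid curated range ({start}, {end}) for {num_frames} frames")
    return start, end
```

For example, a file containing `{"start": 30}` drops the first 30 frames and keeps the rest of the recording.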

### Files 
Models:
- model_daVinci.py -> Contains all temporal model code (Transformer, etc.)
- backbone_models_daVinci.py -> Backbone models that can be selected in models, e.g. ResNet, SwinT, etc.
- dataset_daVinci.py -> Contains the dataset/dataloader code (loads defined data, applies augmentations, etc.)
    - Concatenates the recordings of neighboring tasks
- Start training using the example commands in the training_configs directory (this will call the train() function in train.py with example parameters)
- Inference via instructor_pipeline.py
    - Select the ckpt you just trained

### Example Training Command:
```
python train_daVinci.py \
    --dataset_names [dataset_name] \
    --ckpt_dir ./model_ckpts/hl/suturing_hl_3 \
    --gpu 0 \
    --recovery_probability 0.6 \
    --batch_size 16 \
    --num_epochs 2000 \
    --lr 4e-4 \
    --min_lr 1e-5 \
    --lr_cycle 25 \
    --warmup_epochs 5 \
    --weight_decay 0.05 \
    --validation_interval 10 \
    --prediction_offset 15 \
    --history_len 4 \
    --save_ckpt_interval 5 \
    --history_step_size 30 \
    --one_hot_flag \
    --early_stopping_interval 300 \
    --seed 5 \
    --plot_val_images_flag \
    --max_num_images 5 \
    --cameras_to_use left_img_dir \
    --backbone_model swin-t \
    --model_init_weights imagenet \
    --image_dim 224 224 \
    --freeze_backbone_until none \
    --multitask_loss_weight 0.6 \
    --uniform_sampling_flag \
    --extra_repeated_phase_last_frame_sampling_flag \
    --extra_repeated_phase_last_frame_sampling_probability 0.15 \
    --add_center_crop_view_flag \
    --global_pool_image_features_flag \
    --dataset_mean_std_file_names "dataset_mean_std_camera_type='left_img_dir'_image_step_size=1.json" \
    --val_split_number 0 \
    --use_complexer_multitask_mlp_head_flag \
    --selected_multitasks dominant_moving_direction 
```


### Ignore:
- future_frame_predictor_model.py
- hl_correction_publisher_ui_w_whisper.py
- temporal_models.py -> only the TCN is included here (not the Transformer architecture that was used in the end)

Note: Many features were experimental, so the dataset/model code contains many features that could be removed to clean up and simplify the code.
