# Code for "Multimodal Subtask Graph Generation from Instructional Videos"


## PyTorch env install script
```
# Create a conda environment for this project
conda create --name vidlang python=3.8 -y
conda activate vidlang
python -V
pip install --upgrade pip


# Install numerical and graph dependencies
conda install -c anaconda numpy==1.20.3 scipy==1.7.1 mkl==2021.3.0 scikit-learn python-graphviz -y


# Install PyTorch (GPU build, CUDA 11.3)
conda install -c conda-forge cudatoolkit=11.3 cudnn=8.2 -y
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 torchtext==0.12.0 cudatoolkit=11.3 -c pytorch -c conda-forge -y

# Install other packages
pip install tqdm ffmpeg-python tensorboard graphviz matplotlib absl-py
```


## How To Run
Please check the 'train_example.sh' file carefully. On top of that, there is a set of options you can control:
1. With/without skip connection in multi-head attention (modality fusion):
- With skip: --text_att_type=3
- (default) Without skip: --text_att_type=4
- Vision-only: --text_att_type=0
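
To illustrate what the skip connection changes, here is a minimal sketch of attention-based fusion with an optional skip (residual) connection. It is not the repository's implementation: it uses single-head scaled dot-product attention in plain Python with no learned projections, and it assumes video features act as queries over text features. The names `fuse` and `use_skip` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(video: List[Vector], text: List[Vector], use_skip: bool) -> List[Vector]:
    """Attend from each video feature (query) over the text features (keys/values).

    Assumption: single head, no learned projections, equal feature dimensions.
    """
    dim = len(video[0])
    fused = []
    for q in video:
        # Scaled dot-product attention weights over the text features.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in text]
        weights = softmax(scores)
        attended = [sum(w * v[j] for w, v in zip(weights, text)) for j in range(dim)]
        if use_skip:
            # Skip connection: add the original video feature back to the attended text.
            fused.append([x + y for x, y in zip(q, attended)])
        else:
            fused.append(attended)
    return fused
```

In this sketch, `use_skip=True` plays the role of --text_att_type=3 and `use_skip=False` the role of --text_att_type=4. With the skip connection, the visual signal is preserved even when the attended text is uninformative.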

2. Subsample the input sequence length:
- (default) use the entire sequence: No flag
- subsample the input sequence: pass the keep ratio via --resample_lowerbound (e.g., --resample_lowerbound=0.75)
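
As a rough illustration of what a keep ratio means, the sketch below keeps about 75% of a sequence while preserving order. This README does not state how the repository selects items, and the flag name suggests the value may be a lower bound on a sampled ratio rather than a fixed one. The function `subsample_sequence` and its uniform random selection are assumptions for illustration only.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def subsample_sequence(
    seq: Sequence[T], keep_ratio: float, rng: Optional[random.Random] = None
) -> List[T]:
    """Keep roughly `keep_ratio` of the items in `seq`, preserving their order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if rng is None:
        rng = random.Random()
    if len(seq) == 0:
        return []
    # Keep at least one item so the result is never empty for a non-empty input.
    n_keep = max(1, round(len(seq) * keep_ratio))
    # Sample indices without replacement, then sort to preserve temporal order.
    kept = sorted(rng.sample(range(len(seq)), n_keep))
    return [seq[i] for i in kept]
```

For example, `subsample_sequence(list(range(8)), 0.75)` returns 6 of the 8 items in their original order.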

3. Resume (load) and inference-only:
- To resume, provide the same parameters as before, and pass the model name (the tail folder) as --resume
- To get only the output for ILP from a pretrained checkpoint, also pass --infer_only. Don't forget to provide (a) the validation set path and (b) the extract_ilp flag.
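
The flag combination above can be sketched as a small helper that assembles the extra arguments. This is only an illustration: `build_resume_args` is a hypothetical helper, `my_run` is a placeholder model name, and passing `--infer_only` and `--extract_ilp` as bare boolean flags is an assumption. The flag for the validation set path is not named in this README, so it is left out here and must be added separately.

```python
from typing import List


def build_resume_args(
    model_name: str, infer_only: bool = False, extract_ilp: bool = False
) -> List[str]:
    """Assemble the resume/inference flags described above as an argument list."""
    if not model_name:
        raise ValueError("model_name (the tail folder) must be non-empty")
    args = [f"--resume={model_name}"]
    if infer_only:
        # Assumption: boolean flags are enabled by passing them without a value.
        args.append("--infer_only")
    if extract_ilp:
        args.append("--extract_ilp")
    return args


# Example: inference-only run that extracts the ILP output.
print(" ".join(build_resume_args("my_run", infer_only=True, extract_ilp=True)))
```

These arguments would be appended to the same parameters used for the original training run.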

4. Others
- text_att_n_head: number of heads in multi-head modality fusion
- text_feedforward_dim: feedforward dimension in multi-head modality fusion
- hidden_dim: Transformer input dimension
- num_layers: number of Transformer layers
- extract_ilp: use all of the data to generate the graph. This flag includes both the train and the test set.
- next_step_pred: enables next-step prediction. The model predicts the subsequence starting from index 1, and the complete prediction for the very last item is obtained one step at a time.
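
One common way to set up next-step prediction is to pair each input position with the item that follows it, so that the targets are the subsequence starting from index 1. The sketch below shows that shift. Whether the repository builds its targets exactly this way is not stated here, and `make_next_step_pairs` is a hypothetical name.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def make_next_step_pairs(steps: Sequence[T]) -> Tuple[List[T], List[T]]:
    """Split a sequence into inputs and targets shifted by one position."""
    if len(steps) < 2:
        raise ValueError("need at least two steps to form a next-step pair")
    inputs = list(steps[:-1])  # everything except the last step
    targets = list(steps[1:])  # the subsequence starting from index 1
    return inputs, targets
```

For the sequence `["a", "b", "c", "d"]`, the inputs are `["a", "b", "c"]` and the targets are `["b", "c", "d"]`, so the final target is the prediction for the very last item.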


## Referenced source implementations
- MIL-NCE (primary): https://github.com/antoine77340/MIL-NCE_HowTo100M
- ProceL (optimization): https://github.com/Yuhan-Shen/VisualNarrationProceL-CVPR21
- All other references are noted as in-line comments in the code.
