SpikingVTG
======== 
This code implements the methodology described in the paper titled: "SpikingVTG: Saliency Feedback Gating Enabled Spiking Video Temporal Grounding".

Video Temporal Grounding (VTG) seeks to retrieve consecutive intervals or specific clips from a video based on specified natural language queries.
VTG requires accurately aligning video segments with corresponding natural language instructions, highlighting the need for effective methodologies to capture semantic correspondence and maintain temporal coherence. Spiking neural networks (SNNs), previously underexplored in this domain, present a unique opportunity to tackle VTG challenges from both the architectural and energy-efficiency perspectives. In this paper, we leverage sparse spike-based communication of SNNs to propose a multimodal architecture tailored for VTG tasks, namely SpikingVTG, providing a biologically inspired and efficient solution. Leveraging temporal saliency feedback, our proposed spiking video-language model (VLM) achieves competitive performance with non-spiking VLMs across diverse moment retrieval and highlight detection tasks. We introduce a Saliency Feedback Gating (SFG) mechanism that improves performance while reducing overall neural activity. To efficiently train our spiking VLM, we analyze the convergence dynamics of each neuronal layer and utilize equilibrium states to enable training using implicit differentiation at equilibrium. This approach eliminates the need for computationally expensive backpropagation through time while also enabling the use of knowledge distillation for efficient model training. To further improve operational efficiency and facilitate the on-chip deployability of our model, we leverage a multi-stage training pipeline that focuses on eliminating non-local computations, such as softmax and layer normalization, leading to the development of the Normalization Free (NF)-SpikingVTG model. Additionally, we create an extremely quantized variant, a 1-bit NF-SpikingVTG model, which vastly improves computational efficiency during inference while maintaining minimal performance degradation from our base model. Our work introduces the first spiking model to demonstrate competitive performance on VTG benchmarks, including QVHighlights and Charades-STA.
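
The equilibrium-based training mentioned above depends on average spiking rates settling to a steady value. The toy example below is only a sketch of that idea, assuming a single integrate-and-fire neuron with a constant input and a subtractive reset; the function name `average_spike_rate` and its parameters are illustrative and are not taken from this repository, whose layer dynamics live in `ide_methods/snn_module.py`.

```python
def average_spike_rate(input_current, num_steps=1000, threshold=1.0):
    """Simulate one integrate-and-fire neuron and return its mean firing rate.

    Illustrative assumption: constant input, subtractive (soft) reset,
    no leak. This is not the repository's neuron model.
    """
    membrane = 0.0
    spikes = 0
    for _ in range(num_steps):
        membrane += input_current      # integrate the constant input
        if membrane >= threshold:      # fire once the threshold is reached
            spikes += 1
            membrane -= threshold      # soft reset: subtract the threshold
    return spikes / num_steps


# The average rate approaches input / threshold, clipped to [0, 1].
for x in (0.1, 0.3, 0.75):
    print(f"input={x:.2f}  average rate={average_spike_rate(x):.3f}")
```

Because the rate converges to a fixed value, gradients can in principle be taken at that steady state rather than through every simulation step, which is the motivation for avoiding backpropagation through time.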

Installation
============
Run the command below to install the required packages (**using Python 3**).
```bash
pip install -r requirements.txt
```

## Overall Repository Structure

```
ide_methods/    
    snn_vtg_modules.py                    Model components for SpikingVTG
    snn_vtg_modules_no_norm.py            Model components for NF-SpikingVTG
    snn_vtg_modules_quantized_no_norm.py  Model components for 1-bit NF-SpikingVTG
    snn_module.py                         Network dynamics are implemented in this file
    snnide_vtg_multilayer_module.py       Code for training the SNN

main/
    training_spiking_kd.py                Code for doing KD
    train_spiking_output.py               Code for finetuning
    
model/
    spikingVTG.py                         Code where model is initialized
    
spiking_student_model/
    config.py                             Configuration of the student
    pytorch_model.bin                     Distilled model 

data/                                     Store datasets in this folder
```

Multi-stage Training Pipeline
====================

**Stage 1:**
In this step we use a pre-trained UniVTG model as the "teacher" to train our SpikingVTG "student". The dataset used is QVHighlights.

(a) Download the pretrained UniVTG model from the UniVTG paper, and preprocess the input as done in that paper.

(b) Create the student model configuration in a separate folder (`spiking_student_model/config.py`).

(c) Perform internal-layer KD as described in the paper.


```
python training_spiking_kd.py \
--dset_type mr \
--dset_name qvhighlights \
--clip_length 2 \
--gpu_id 0 \
--device 0 \
--exp_id qvhl \
--model_id univtg_original \
--v_feat_types slowfast_clip \
--t_feat_type clip \
--ctx_mode video_tef \
--train_path data/qvhighlights/metadata/qvhighlights_train.jsonl \
--eval_path data/qvhighlights/metadata/qvhighlights_val.jsonl \
--eval_split_name val \
--v_feat_dirs data/qvhighlights/vid_slowfast data/qvhighlights/vid_clip \
--v_feat_dim 2816 \
--t_feat_dir data/qvhighlights/txt_clip \
--t_feat_dim 512 \
--dim_feedforward 1024 \
--input_dropout 0.0 \
--dropout 0 \
--droppath 0.0 \
--bsz 32 \
--eval_bsz 4 \
--n_epoch 10 \
--num_workers 16 \
--lr 0.0001 \
--lr_drop 80 \
--lr_warmup 10 \
--wd 0.0001 \
--enc_layers 4 \
--hidden_dim 1024 \
--resume saved_non_spiking_models/qvhl_pt/model_best.ckpt           
```
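
Step (c) can be pictured with a small sketch. It assumes that internal-layer KD compares the student's per-layer average spiking rates with the teacher's hidden states using a mean squared error averaged over layers. That loss form, and the names `mean_squared_error` and `internal_layer_kd_loss`, are assumptions made for illustration; the loss actually used by `training_spiking_kd.py` may differ.

```python
def mean_squared_error(student, teacher):
    """MSE between two equally sized vectors of floats."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher vectors must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def internal_layer_kd_loss(student_layers, teacher_layers):
    """Average the per-layer MSE over all matched layers (hypothetical loss)."""
    if len(student_layers) != len(teacher_layers):
        raise ValueError("student and teacher must expose the same number of layers")
    per_layer = [
        mean_squared_error(s, t) for s, t in zip(student_layers, teacher_layers)
    ]
    return sum(per_layer) / len(per_layer)


# Toy values: two layers, two features each.
student_rates = [[0.0, 1.0], [1.0, 1.0]]   # student average spiking rates
teacher_states = [[0.0, 0.0], [1.0, 1.0]]  # teacher hidden states
print(internal_layer_kd_loss(student_rates, teacher_states))
```

In a real run these vectors would be tensors produced by the two models, and the loss would be minimized with respect to the student's weights only.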


**Stage 2:** In this step we fine-tune the distilled student model. The student model obtained after distillation should be stored in `spiking_student_model/`. The hyperparameters below are for the QVHighlights dataset.


``` 
python train_spiking_output.py \
--dset_type mr \
--dset_name qvhighlights \
--clip_length 2 \
--gpu_id 0 \
--device 0 \
--exp_id qvhl \
--model_id univtg_original \
--v_feat_types slowfast_clip \
--t_feat_type clip \
--ctx_mode video_tef \
--train_path data/qvhighlights/metadata/qvhighlights_train.jsonl \
--eval_path data/qvhighlights/metadata/qvhighlights_val.jsonl \
--eval_split_name val \
--eval_epoch 1 \
--v_feat_dirs data/qvhighlights/vid_slowfast data/qvhighlights/vid_clip \
--v_feat_dim 2816 \
--t_feat_dir data/qvhighlights/txt_clip \
--t_feat_dim 512 \
--dim_feedforward 1024 \
--input_dropout 0.5 \
--dropout 0 \
--droppath 0.1 \
--bsz 32 \
--eval_bsz 8 \
--n_epoch 200 \
--num_workers 16 \
--lr 0.0001 \
--lr_drop 80 \
--lr_warmup 10 \
--wd 0.0001 \
--use_cache 1 \
--enc_layers 4 \
--main_metric MR-full-R1@0.7-key \
--nms_thd 0.7 \
--max_before_nms 1000 \
--easy_negative_only 1 \
--b_loss_coef 10 \
--g_loss_coef 10 \
--eos_coef 0.1 \
--f_loss_coef 10 \
--s_loss_intra_coef 0.1 \
--s_loss_inter_coef 0.1 \
--round_multiple -1 \
--eval_mode add \
--hidden_dim 1024 \
--resume saved_non_spiking_models/qvhl_pt/model_best.ckpt

```
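
Both commands above expect the features, metadata and teacher checkpoint to be in place before launch. The following sketch checks for them first; the path list is copied from the commands above, while the helper name `missing_paths` and the assumption that the script is run from the directory those relative paths resolve against are illustrative.

```python
from pathlib import Path

# Paths copied from the Stage 1 and Stage 2 commands.
REQUIRED_PATHS = [
    "data/qvhighlights/metadata/qvhighlights_train.jsonl",
    "data/qvhighlights/metadata/qvhighlights_val.jsonl",
    "data/qvhighlights/vid_slowfast",
    "data/qvhighlights/vid_clip",
    "data/qvhighlights/txt_clip",
    "saved_non_spiking_models/qvhl_pt/model_best.ckpt",
]


def missing_paths(paths, root="."):
    """Return the entries of `paths` that do not exist under `root`."""
    return [p for p in paths if not (Path(root) / p).exists()]


missing = missing_paths(REQUIRED_PATHS)
if missing:
    print("Missing before training can start:")
    for path in missing:
        print(f"  {path}")
else:
    print("All required files and folders were found.")
```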

Stage 3 and Stage 4 both involve fine-tuning after architectural changes. The steps are explained below.


**Stage 3:** To train the NF-SpikingVTG model, edit `spikingVTG.py` in the `model/` folder so that it imports the model components from `snn_vtg_modules_no_norm` instead of `snn_vtg_modules`. The fine-tuning code (`train_spiking_output.py`) can then be run.

**Stage 4:** To train the 1-bit NF-SpikingVTG model, edit `spikingVTG.py` in the `model/` folder so that it imports the model components from `snn_vtg_modules_quantized_no_norm` instead of `snn_vtg_modules`. The fine-tuning code (`train_spiking_output.py`) can then be run.
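
Instead of editing the import by hand for each stage, the variant could be selected by name. The sketch below is not part of the repository: the variant keys, `module_name_for` and `load_components` are hypothetical. The module names follow the repository structure listing and assume `ide_methods` is importable as a package from the repository root; adjust them if the files are named differently in your checkout.

```python
import importlib

# Hypothetical variant keys mapped to the modules in the repository listing.
VARIANT_MODULES = {
    "base": "ide_methods.snn_vtg_modules",
    "nf": "ide_methods.snn_vtg_modules_no_norm",
    "nf_1bit": "ide_methods.snn_vtg_modules_quantized_no_norm",
}


def module_name_for(variant):
    """Map a variant key to the dotted module path that implements it."""
    try:
        return VARIANT_MODULES[variant]
    except KeyError:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {sorted(VARIANT_MODULES)}"
        ) from None


def load_components(variant):
    """Import and return the module holding the chosen model components."""
    return importlib.import_module(module_name_for(variant))


print(module_name_for("nf"))
```

Calling `load_components("nf_1bit")` from the repository root would then stand in for the manual import change described in Stage 4.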

