# Supplementary Material

**Supplementary Videos (2x)**  
These videos compare the captions predicted by our model against the ground truth (GT) captions.

- **Ground Truth (GT)** captions are displayed at the **bottom-left corner** of the video frames in **purple font**.  
- **Predicted captions** from our model are shown at the **bottom-right corner** in **green font**.
