This file describes the supplementary materials for our ICLR 2026 submission.
Paper title: 'How Do You Watch a Movie? HourHDVC: Hour-Long Hierarchical Dense Video Captioning'


We provide our implementation code as supplementary material: the training and inference code, the model (LOCO) structure code, the ConSim evaluation code, and the code for generating HourHDVC. We also provide the evaluation dataset of HourHDVC.


1. Our experimental setup is as follows.

Environment: Linux, GCC >= 5.4, CUDA >= 9.2, Python >= 3.7, PyTorch >= 1.5.1
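
The requirements above can be checked before running the code. The following is a minimal sketch, not part of the released code: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the GCC version is not checked here because it depends on the local toolchain.

```python
import platform
import sys


def parse_version(text):
    """Turn a version string such as '1.5.1+cu92' into a tuple like (1, 5, 1)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(found, minimum):
    """Compare two version strings numerically, component by component."""
    return parse_version(found) >= parse_version(minimum)


def check_environment():
    """Return a dict mapping each requirement to True or False."""
    report = {
        "Linux": platform.system() == "Linux",
        "Python>=3.7": sys.version_info >= (3, 7),
    }
    try:
        import torch
    except ImportError:
        report["PyTorch>=1.5.1"] = False
        return report
    report["PyTorch>=1.5.1"] = meets_minimum(torch.__version__, "1.5.1")
    cuda = torch.version.cuda  # None for CPU-only builds
    report["CUDA>=9.2"] = cuda is not None and meets_minimum(cuda, "9.2")
    return report


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(requirement, "OK" if ok else "NOT MET")
```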


2. The contents are as follows:

model/loco.py -> LOCO model structure. The learnable memory_token_video and mem_token_speech are the memory tokens for inter-window memory.
Data_Generation.py -> Automated requests to GPT-4o for LLM-based data generation (segmentation and summarization).
LLM_Sim_eval.py -> Evaluation with LLM-Sim. We used 4 A6000 GPUs to run Llama3-70B when computing the metric.
main.py -> Training and evaluation.

3. Datasets (data/HourHDVC_eval.json)

We train and evaluate LOCO on our HourHDVC dataset.
The dataset was developed by leveraging data from the [MAD and AutoAD research teams](https://github.com/Soldelli/MAD?tab=readme-ov-file).
For research purposes, we obtained and used the movie video features provided by the MAD team. The annotations are newly generated by our pipeline.
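
To get a first look at the evaluation file, a sketch like the one below can help. It makes no assumption about the annotation schema and only reports the top-level container type, its size, and a few keys if it is a dictionary. The function name `summarize_json` is hypothetical and not part of the released code.

```python
import json


def summarize_json(path):
    """Load a JSON file and describe its top-level structure."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    summary = {
        "type": type(data).__name__,
        "size": len(data) if isinstance(data, (dict, list)) else None,
    }
    if isinstance(data, dict):
        # Show only a few keys so that large files stay readable.
        summary["sample_keys"] = list(data)[:3]
    return summary
```

For example, `summarize_json("data/HourHDVC_eval.json")` returns a small dictionary that can be printed before writing any dataset-specific loading code.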


4. Metric

In the paper, we use the LLM-Sim metric and traditional N-gram based metrics. ConSim is given in ConSim_eval.py. For the N-gram based metrics, we follow most previous methods and use the [evaluation toolkit in ActivityNet Challenge 2018](https://github.com/ranjaykrishna/densevid_eval/tree/deba7d7e83012b218a4df888f6c971e21cfeea33).
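
For readers unfamiliar with N-gram based metrics, the sketch below computes a clipped n-gram precision between a generated caption and a reference caption, which is the basic quantity behind BLEU-style scores. It is an illustration only, and not the ActivityNet toolkit or our evaluation code: it assumes whitespace tokenization and lowercasing, and the function names `ngrams` and `clipped_precision` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is counted at most as often as it appears in
    the reference (clipping), so repeating a word does not inflate the score.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total
```

The reported numbers should still come from the toolkit linked above, which also handles temporal matching between predicted and reference segments.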