

# <img src="assets/flownar_icon.png" height="30"> <span style="font-variant:small-caps;">FlowNar</span>: Scalable Streaming Narration for Long-Form Videos

## Demo Videos
[![Click to Play](assets/demo_video.png)](https://drive.google.com/file/d/1aT5tbPAp4gqZyxFVCH-6T4PUi1lWX0PW/view?usp=drive_link)


## Introduction
FlowNar is a scalable approach to streaming video narration. Compared with other popular multimodal models, it introduces three features:

- **Dynamic context management (DCM).** DCM prunes the visual KV cache after each narration segment. 
This provides a dual benefit: it ensures the LLM’s active visual context does not grow unboundedly with past frame features, 
and it reduces error propagation that can arise from conditioning on potentially misaligned history.

- **Cross linear attentive memory (CLAM).** CLAM iteratively compresses visual information from processed segments into a fixed-size set of memory tokens. 
These tokens serve as a condensed summary of past visual frames and are passed to subsequent segments, so memory usage and per-step computation for historical visual frames stay constant, O(1), regardless of video length.
Additionally, we provide an [efficient CUDA implementation](models/clustering/scan.py) to ensure high parallelizability during training.

- **Realistic autoregressive narration generation.** We provide an autoregressive pipeline,
in which the model generates predictions conditioned on its own previously generated narrations.
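
The interaction between cache pruning and a fixed-size memory can be illustrated with a toy sketch. Everything here is an assumption made for illustration: frame features are plain floats rather than KV-cache tensors, the names `compress` and `stream` are hypothetical, and the averaging rule is only a placeholder for CLAM's learned compression, not the method implemented in this repository.

```python
def compress(memory, segment, decay=0.5):
    """Fold one segment into a fixed-size memory (placeholder rule, not CLAM)."""
    k = len(memory)
    updated = []
    for i in range(k):
        # Slot i sees every k-th frame starting at i, so each frame is used once.
        bucket = segment[i::k]
        if bucket:
            mean = sum(bucket) / len(bucket)
            updated.append(decay * memory[i] + (1 - decay) * mean)
        else:
            # Fewer frames than slots: leave this slot unchanged.
            updated.append(memory[i])
    return updated


def stream(segments, memory_size=4):
    """Process segments in order and report the peak active visual context."""
    memory = [0.0] * memory_size
    visual_cache = []
    peak_context = 0
    for segment in segments:
        for frame in segment:
            visual_cache.append(frame)
            # Active context = memory slots + frames of the current segment only.
            peak_context = max(peak_context, len(memory) + len(visual_cache))
        # End of segment: summarize into memory, then prune the cache (DCM-style).
        memory = compress(memory, visual_cache)
        visual_cache.clear()
    return memory, peak_context


memory, peak = stream([[1.0] * 6, [2.0] * 3, [3.0] * 10], memory_size=4)
print(len(memory), peak)
```

In this toy run the peak context is bounded by the memory size plus the longest single segment, rather than by the total number of frames seen so far.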

## Installation

Ensure you have Conda and Python version >= 3.10 installed, then run:
```sh
conda env create -f environment.yaml
```

The ffmpeg that gets installed alongside PyTorch is an old version and usually produces very low-quality preprocessing. Please install the latest ffmpeg as follows:
```sh
wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
tar xvf ffmpeg-release-amd64-static.tar.xz
rm ffmpeg-release-amd64-static.tar.xz
mv ffmpeg-*-amd64-static ffmpeg
```
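
To confirm which ffmpeg build will be used, a small check along these lines may help. It is a sketch under assumptions: the helper names are hypothetical, the binary is assumed to sit at `ffmpeg/ffmpeg` after the steps above, and the first line of `ffmpeg -version` is assumed to start with `ffmpeg version`.

```python
import subprocess


def parse_ffmpeg_version(version_output: str) -> str:
    """Return the version token from the first line of `ffmpeg -version` output."""
    lines = version_output.splitlines()
    parts = lines[0].split() if lines else []
    if len(parts) < 3 or parts[:2] != ["ffmpeg", "version"]:
        raise ValueError("unexpected `ffmpeg -version` output")
    return parts[2]


def ffmpeg_version(binary: str = "ffmpeg/ffmpeg") -> str:
    """Run the given ffmpeg binary and return its reported version string."""
    result = subprocess.run(
        [binary, "-version"], capture_output=True, text=True, check=True
    )
    return parse_ffmpeg_version(result.stdout)
```

Calling `ffmpeg_version()` from the repository root should report the static build, while `ffmpeg_version("ffmpeg")` reports whichever binary is first on `PATH`.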


## Preprocessing

### Visual feature extraction

Video frames are preprocessed in a distributed fashion at 2 FPS and a resolution of 384. We then use `google/siglip-large-patch16-384` to extract, for each frame, the CLS token plus 3x3 average-pooled spatial tokens. Please refer to the instructions under [data/preprocess/](data/preprocess/).
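
As a rough illustration of the per-frame token layout, the sketch below average-pools a patch grid into 3x3 cells and counts the resulting tokens. It assumes a square 384x384 input with 16x16 patches (a 24x24 grid) and uses a single scalar channel per patch for readability. The function names are hypothetical and this is not the code under `data/preprocess/`.

```python
PATCH = 16        # patch size implied by "patch16"
RESOLUTION = 384  # input resolution (assumed square after padding)
POOLED = 3        # 3x3 pooled spatial grid


def tokens_per_frame(pooled: int = POOLED) -> int:
    """One CLS token plus pooled x pooled spatial tokens."""
    return 1 + pooled * pooled


def avg_pool_grid(grid, out_size: int = POOLED):
    """Average-pool a square grid (list of rows) into out_size x out_size cells."""
    n = len(grid)
    assert n % out_size == 0, "grid side must be divisible by the output size"
    step = n // out_size
    pooled = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            block = [
                grid[y][x]
                for y in range(r * step, (r + 1) * step)
                for x in range(c * step, (c + 1) * step)
            ]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


side = RESOLUTION // PATCH   # patches per side
step = side // POOLED        # patches per pooled cell, per side
# Toy grid: each patch stores the index of the pooled cell it belongs to.
grid = [[float((y // step) * POOLED + (x // step)) for x in range(side)]
        for y in range(side)]
pooled = avg_pool_grid(grid)
print(side, tokens_per_frame(), pooled)
```

Under these assumptions each frame contributes 10 tokens, so sampling at 2 FPS adds 20 visual tokens per second of video.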

**Note**: This step can be skipped if you only want to understand the code without running actual experiments (see [Quick debug](#quick-debug)).

### Narration refinement
Please refer to the instructions under [data/preprocess/](data/preprocess/).
For EgoExo4D and EpicKitchens100, we use code similar to that for Ego4D, but use GPT-4o for a second round of refinement.

**Note**: This step can be skipped, as we provide our refined narrations for all three datasets [here](https://drive.google.com/file/d/1VXU2lYI-50B0g6X74xcwyGsrCxubDS86/view?usp=drive_link).

## Training and Evaluation

### Quick debug
To make the codebase easier to understand without running real experiments, we provide a `local_debug` flag.
If `local_debug` is set to `True`, video features are replaced with randomly initialized vectors,
so the visual feature extraction step can be skipped.
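
The idea behind such a flag can be sketched as follows. This is only an illustration: `load_frame_features`, `load_fn`, and the default feature width of 1024 are assumptions, and the repository's actual data loading code may be organized differently.

```python
import random

TOKENS_PER_FRAME = 10  # CLS + 3x3 pooled tokens, as listed in the Model Zoo
FEATURE_DIM = 1024     # assumed feature width; adjust to the real encoder output


def load_frame_features(num_frames, local_debug, load_fn=None,
                        feature_dim=FEATURE_DIM, seed=0):
    """Return features shaped [num_frames][TOKENS_PER_FRAME][feature_dim]."""
    if local_debug:
        # Debug path: random vectors stand in for extracted features.
        rng = random.Random(seed)
        return [
            [[rng.gauss(0.0, 1.0) for _ in range(feature_dim)]
             for _ in range(TOKENS_PER_FRAME)]
            for _ in range(num_frames)
        ]
    if load_fn is None:
        raise ValueError("load_fn is required when local_debug is False")
    # Real path: delegate to a caller-supplied loader (hypothetical).
    return load_fn(num_frames)


features = load_frame_features(num_frames=4, local_debug=True)
print(len(features), len(features[0]), len(features[0][0]))
```

Because the random features have the same shape as the real ones, the rest of the pipeline can be exercised end to end without any preprocessed data on disk.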

### Oracle protocol
Please refer to the scripts under [scripts/*/narration](scripts/ek100/narration). The key engine is [trainer_with_gen2eval.py](engine/trainer_with_gen2eval.py).
Please update the root path in [each dataset](data/ego4d/ego4d.py) file.

### Autoregressive protocol
Please refer to the scripts under [scripts/*/stream_generate](scripts/ek100/stream_generate). The key engine is [trainer_stream_generate.py](engine/trainer_stream_generate.py).
Please update the root path in [each dataset](data/ego4d/segsummary.py) file and in the [evaluation file](evaluate_metrics.py).
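
The difference between the two protocols can be sketched with a toy loop. This sketch assumes that the oracle protocol conditions each step on reference narrations, whereas the autoregressive protocol conditions on the model's own earlier outputs. The README does not define "oracle" explicitly, so treat that reading, and the names `narrate_stream` and `toy_model`, as assumptions.

```python
def narrate_stream(segments, model, references=None):
    """Generate one narration per segment.

    With `references`, the history holds reference narrations (oracle-style).
    Without them, the history holds the model's own predictions (autoregressive).
    """
    history = []
    outputs = []
    for i, segment in enumerate(segments):
        prediction = model(segment, history)
        outputs.append(prediction)
        history.append(references[i] if references is not None else prediction)
    return outputs


def toy_model(segment, history):
    """Stand-in model: echoes the segment and the most recent history entry."""
    previous = history[-1] if history else "start"
    return f"{segment}|{previous}"


oracle = narrate_stream(["a", "b"], toy_model, references=["A", "B"])
autoregressive = narrate_stream(["a", "b"], toy_model)
print(oracle)          # ['a|start', 'b|A']
print(autoregressive)  # ['a|start', 'b|a|start']
```

In the autoregressive run, any error in an early narration is carried into later steps, which is why this setting is the more realistic test of a streaming model.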


## Model Zoo
We will release our model zoo after the review process. The two model versions are listed below.
### <span style="font-variant:small-caps;">FlowNar</span>-1B
* LLM: meta-llama/Llama-3.2-1B-Instruct
* Vision Strategy:
    * Frame Encoder: google/siglip-large-patch16-384
    * Frame Tokens: CLS token + 3x3 average pooled spatial tokens
    * Frame FPS: 2 for training, >25 for inference on an H100 GPU
    * Frame Resolution: max resolution 384, with zero-padding to keep aspect ratio

### <span style="font-variant:small-caps;">FlowNar</span>-8B
* LLM: meta-llama/Meta-Llama-3-8B-Instruct
* Vision Strategy:
    * Frame Encoder: google/siglip-large-patch16-384
    * Frame Tokens: CLS token + 3x3 average pooled spatial tokens
    * Frame FPS: 2 for training, >10 for inference on an H100 GPU
    * Frame Resolution: max resolution 384, with zero-padding to keep aspect ratio

## Anonymous Submission
This repository contains the code associated with our anonymous submission.

For evaluation only. Please do not share or cite.

Acknowledgments and related references will be added after the review process.
