Abstract: In this paper we present a text-conditioned video resampler
(TCR) module that uses a pre-trained and frozen visual encoder and large
language model (LLM) to process long video sequences for a task. TCR
localises relevant visual features from the video given a text condition and
provides them to an LLM to generate a text response. Due to its lightweight
design and use of cross-attention, TCR can process more than 100 frames
at a time with plain attention and without optimised implementations.
We make the following contributions: (i) we design a transformer-based
sampling architecture that can process long videos conditioned on a task,
together with a training method that enables it to bridge pre-trained
visual and language models; (ii) we identify tasks that could benefit from
longer video perception; and (iii) we empirically validate its efficacy on a
wide variety of evaluation tasks including NextQA, EgoSchema, and the
EGO4D-LTA challenge.
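
As a rough illustration of the idea described above, the sketch below shows a text-conditioned cross-attention resampler in PyTorch: learnable queries, concatenated with text-condition embeddings, cross-attend over a long sequence of frozen frame features and return a small fixed set of tokens for the LLM. This is not the authors' implementation; module names, dimensions, and layer counts are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the paper's code) of a text-conditioned
# cross-attention resampler over frozen visual features.
import torch
import torch.nn as nn


class TextConditionedResampler(nn.Module):
    def __init__(self, dim=768, num_queries=32, num_layers=2, num_heads=8):
        super().__init__()
        # Learnable query tokens that carry the resampled video summary.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                # Queries (and text tokens) cross-attend to frame features.
                "cross_attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "norm1": nn.LayerNorm(dim),
                "ffn": nn.Sequential(
                    nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
                ),
                "norm2": nn.LayerNorm(dim),
            })
            for _ in range(num_layers)
        ])

    def forward(self, frame_feats, text_feats):
        # frame_feats: (B, T*P, dim) visual tokens from a frozen encoder
        # text_feats:  (B, L, dim)   embeddings of the text condition
        B = frame_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Concatenating text tokens with the queries conditions the
        # attention on the task / text prompt.
        x = torch.cat([q, text_feats], dim=1)
        for layer in self.layers:
            attn_out, _ = layer["cross_attn"](
                query=layer["norm1"](x), key=frame_feats, value=frame_feats
            )
            x = x + attn_out
            x = x + layer["ffn"](layer["norm2"](x))
        # Keep only the query positions: a fixed-size set of visual tokens
        # that can be prepended to the LLM's input embeddings.
        return x[:, : self.queries.size(0)]


# Usage sketch: 128 frames x 16 patch tokens each, reduced to 32 tokens.
if __name__ == "__main__":
    resampler = TextConditionedResampler()
    frames = torch.randn(1, 128 * 16, 768)
    text = torch.randn(1, 12, 768)
    print(resampler(frames, text).shape)  # torch.Size([1, 32, 768])
```

Because the queries, not the frame tokens, set the output length, the cost of the LLM stage stays constant as the number of input frames grows; only the cross-attention scales with video length, which is what allows over 100 frames to be processed with plain attention.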