Keywords: video understanding
Abstract: We present an approach to understanding video directly from encoded bytes, e.g., mp4 files. These compressed representations are roughly 99\% smaller than the RGB pixel representations commonly used for video understanding. Video codecs achieve this compression by exploiting redundancy across frames, using mechanisms such as key frames and motion residuals. However, standard video understanding models do not take advantage of the compression already available for every video; instead, they heavily subsample frames or operate only on short segments. We note that simply applying existing models, e.g., Transformers or State-Space models, to video byte sequences does not work, both because the byte sequences are very long and because such models overfit easily. To address these challenges, we design a State-Space model with sequence parallelism that handles very long byte sequences, reaching 15 million tokens in training and essentially unlimited tokens at inference. We also propose a multilevel SSM activation fusion that reduces sequence length, which we find also benefits video understanding. We evaluate on common video understanding tasks and their natural extension to joint video + audio understanding, and demonstrate competitive performance, illustrating, for the first time, the feasibility of learning from compressed video byte representations.
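For illustration, below is a minimal PyTorch sketch of the pipeline the abstract describes: the raw encoded bytes of a video file are treated as a token sequence, passed through stacked SSM-style blocks, and pooled between levels to shrink the sequence. This is not the authors' implementation: the `SimpleSSMBlock` stand-in recurrence, the stride-4 average pooling used as the "multilevel activation fusion", and the file name `video.mp4` are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class SimpleSSMBlock(nn.Module):
    """Toy diagonal linear recurrence h_t = a * h_{t-1} + b * x_t (stand-in for an SSM)."""
    def __init__(self, dim):
        super().__init__()
        self.a = nn.Parameter(torch.full((dim,), 0.9))
        self.b = nn.Parameter(torch.ones(dim))
        self.out = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, length, dim)
        h = torch.zeros(x.size(0), x.size(2), device=x.device)
        ys = []
        for t in range(x.size(1)):  # sequential scan, for clarity only
            h = self.a * h + self.b * x[:, t]
            ys.append(h)
        return self.out(torch.stack(ys, dim=1))

class ByteVideoModel(nn.Module):
    def __init__(self, dim=64, levels=3, pool=4, num_classes=10):
        super().__init__()
        self.embed = nn.Embedding(256, dim)  # one token per byte value
        self.blocks = nn.ModuleList(SimpleSSMBlock(dim) for _ in range(levels))
        self.pool = nn.AvgPool1d(pool)       # assumed fusion: stride-4 pooling
        self.head = nn.Linear(dim, num_classes)

    def forward(self, byte_ids):  # byte_ids: (batch, length) integer byte values
        x = self.embed(byte_ids)
        for blk in self.blocks:
            x = blk(x)
            # Shrink the sequence between levels, as the multilevel fusion suggests.
            x = self.pool(x.transpose(1, 2)).transpose(1, 2)
        return self.head(x.mean(dim=1))

# Read a short prefix of an mp4's raw bytes and classify it (hypothetical file).
with open("video.mp4", "rb") as f:
    data = torch.frombuffer(bytearray(f.read(4096)), dtype=torch.uint8)
logits = ByteVideoModel()(data.long().unsqueeze(0))
```

In practice, training on full-length videos at this granularity is exactly what motivates the sequence parallelism described in the abstract; the sketch above elides that and uses a short 4096-byte prefix to stay self-contained.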
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2135