Continuous Perception Benchmark

Zeyu Wang; Zhenzhen Weng; Serena Yeung-Levy

28 May 2024 (modified: 13 Nov 2024) | Submitted to NeurIPS 2024 Track Datasets and Benchmarks | CC BY 4.0

Keywords: Video understanding, Continuous perception

Abstract: Humans continuously perceive and process visual signals. However, current video models typically either sample key frames sparsely or divide videos into chunks and densely sample within each chunk. This approach stems from the fact that most existing video benchmarks can be addressed by analyzing key frames or aggregating information from separate chunks. We anticipate that the next generation of vision models will emulate human perception by processing visual input continuously and holistically. To facilitate the development of such models, we propose the Continuous Perception Benchmark, a video question answering task that cannot be solved by focusing solely on a few frames or by captioning small chunks and then summarizing using language models. Extensive experiments demonstrate that existing vision models, whether commercial or open-source, struggle with these tasks, indicating the need for new technical advancements in this direction.

Supplementary Material: pdf

Submission Number: 1281
