Understanding Complexity in VideoQA via Visual Program Generation

Cristobal Eyzaguirre; Igor Vasiljevic; Achal Dave; Jiajun Wu; Rares Andrei Ambrus; Thomas Kollar; Juan Carlos Niebles; Pavel Tokmakov

Published: 01 May 2025, Last Modified: 18 Jun 2025. Venue: ICML 2025 poster. License: CC BY 4.0

Abstract: We propose a data-driven approach to analyzing query complexity in Video Question Answering (VideoQA). Previous efforts in benchmark design have relied on human expertise to design challenging questions, yet we experimentally show that humans struggle to predict which questions are difficult for machine learning models. Our automatic approach leverages recent advances in code generation for visual question answering, using the complexity of generated code as a proxy for question difficulty. We demonstrate that this measure correlates significantly better with model performance than human estimates. To operationalize this insight, we propose an algorithm for estimating question complexity from code. It identifies fine-grained primitives that correlate with the hardest questions for any given set of models, making it easy to scale to new approaches in the future. Finally, to further illustrate the utility of our method, we extend it to automatically generate complex questions, constructing a new benchmark that is 1.9 times harder than the popular NExT-QA.
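To make the idea of "complexity of generated code as a proxy for question difficulty" concrete, here is a minimal sketch that computes two generic complexity measures (AST node count and a rough cyclomatic estimate) for a Python program. It is an illustration under stated assumptions, not the paper's algorithm: the choice of metrics, the set of node types counted as branches, and the helper names `query`, `find_objects`, and `video` inside the example programs are all hypothetical.

```python
import ast

# Node types treated as decision points (an illustrative choice, not from the paper).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.BoolOp, ast.IfExp, ast.comprehension)


def code_complexity(source: str) -> dict:
    """Return two simple complexity proxies for a Python program."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    branches = sum(isinstance(n, _BRANCH_NODES) for n in nodes)
    # Cyclomatic complexity is approximated as decision points + 1.
    return {"ast_nodes": len(nodes), "cyclomatic": branches + 1}


# Two hypothetical generated programs. The names `query`, `find_objects`,
# and `video` are made up; the code is only parsed, never executed.
easy = "answer = query(video, 'what color is the car?')"
hard = """
count = 0
for frame in video:
    for obj in find_objects(frame, 'person'):
        if query(obj, 'is this person running?') == 'yes':
            count += 1
answer = count
"""

easy_score = code_complexity(easy)
hard_score = code_complexity(hard)
```

Under this sketch, the counting program scores higher than the single-call program on both measures, which is the kind of signal one could then compare against model accuracy per question.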

Lay Summary: In this paper we tackle the problem of estimating how hard it is for video models to answer questions about videos. This matters because not all questions are equally difficult, and we want to understand why models answer some questions correctly and fail on others. We also show that humans and VideoLLMs are poor at estimating question difficulty. Instead, we use existing models to generate computer code that represents how to answer each question, and show that the complexity of this code is a good proxy for how hard the question really is for models. Our most effective approach is to train a model (CodePlexity) that takes the code as input and estimates the question's complexity. Finally, we show that the resulting model can be used to create a new benchmark for video QA: we generate questions and then filter out the easy ones.
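The generate-then-filter step can be pictured with a short sketch. Everything here is an assumption made for illustration: the `Candidate` type, the `filter_hard` function, the threshold, and the line-count scorer are hypothetical stand-ins, and the interface of the actual CodePlexity model is not described on this page.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    question: str
    program: str  # code generated for the question


def filter_hard(
    candidates: List[Candidate],
    score: Callable[[str], float],
    threshold: float,
) -> List[Candidate]:
    """Keep only candidates whose estimated complexity reaches the threshold."""
    return [c for c in candidates if score(c.program) >= threshold]


def line_count_score(program: str) -> float:
    """Stand-in scorer: number of non-empty lines. A learned estimator would go here."""
    return float(sum(1 for line in program.splitlines() if line.strip()))


candidates = [
    Candidate("What color is the car?", "answer = query(video, 'color of car')"),
    Candidate(
        "How many people run before the dog barks?",
        "t = find_event(video, 'dog barks')\n"
        "clip = video[:t]\n"
        "people = find_objects(clip, 'person')\n"
        "answer = sum(is_running(p) for p in people)",
    ),
]

kept = filter_hard(candidates, line_count_score, threshold=3)
```

Passing the scorer as a function keeps the filter independent of how complexity is estimated, so a trained model could replace the line-count heuristic without changing the filtering logic.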

Link To Code: https://github.com/ceyzaguirre4/codeplexity

Primary Area: Applications->Computer Vision

Keywords: videoqa, complexity, codegen

Submission Number: 7577
