Abstract: Movie story understanding necessitates modeling of, and reasoning about, characters and their relationships with the surroundings and with one another as the story unfolds. In Movie QA, this poses the challenges of effectively capturing the visual moments relevant to questions in long videos, and of efficiently navigating the web of dynamic, contextual character-centric relationships over time. This paper presents a novel character-centric method that efficiently supports reasoning about relational dynamics for Movie QA. Central to the method is a Time-Evolving Conditional cHaracter-centric graph network (TECH), which models the characters, objects, and their question-conditioned relationships in space-time. TECH first maps the raw video data into a question-focused temporal neural graph over visual entities within and across shots, and then distills the graph into a character-centric network which gives rise to the answer. At the core of this graph reasoning machine, TECH uses a two-stage refinement process for the features of movie characters and their relationships, using their interactions with the surroundings as contextual information. TECH draws its efficiency over long videos from a "skim and scan" technique that rapidly localizes the most query-relevant moments in the movie. Tested on three large-scale datasets, TECH clearly shows advantages over recent state-of-the-art models.