Abstract: In this report, we describe SPELL, a novel spatial-temporal graph learning framework for active speaker detection (ASD). First, each person in a video frame is encoded as a unique node for that frame. The nodes corresponding to the same person across frames are connected to encode that person's temporal dynamics. Nodes within the same frame are also connected to encode inter-person relationships. SPELL thus reduces ASD to a node classification task. Importantly, SPELL can reason over long temporal contexts for all nodes at low computational cost.
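The graph construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_graph`, the `(frame_index, person_id)` input format, and the `max_gap` parameter (how many frames apart two nodes of the same person may be and still be linked) are assumptions made for illustration. Node features and the node classifier are omitted.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(detections, max_gap=1):
    """Build an undirected spatial-temporal graph from face detections.

    detections: iterable of (frame_index, person_id) pairs, one per visible
                person per frame (hypothetical input format).
    max_gap:    assumed temporal window; two nodes of the same person are
                linked if their frames are at most this far apart.
    Returns (nodes, edges), where edges are index pairs (i, j) with i < j.
    """
    # One node per (frame, person) pair, in a deterministic order.
    nodes = sorted(set(detections))
    index = {node: i for i, node in enumerate(nodes)}

    by_frame = defaultdict(list)
    by_person = defaultdict(list)
    for frame, person in nodes:
        by_frame[frame].append((frame, person))
        by_person[person].append((frame, person))

    edges = set()

    # Spatial edges: connect all persons that appear in the same frame.
    for frame_nodes in by_frame.values():
        for a, b in combinations(frame_nodes, 2):
            edges.add((index[a], index[b]))

    # Temporal edges: connect nodes of the same person across nearby frames.
    for person_nodes in by_person.values():
        for a, b in combinations(person_nodes, 2):
            if b[0] - a[0] <= max_gap:
                edges.add((index[a], index[b]))

    return nodes, sorted(edges)


if __name__ == "__main__":
    # Person "A" is visible in frames 0-2, person "B" in frames 0-1.
    dets = [(0, "A"), (0, "B"), (1, "A"), (1, "B"), (2, "A")]
    nodes, edges = build_graph(dets, max_gap=1)
    print(nodes)
    print(edges)
```

With this structure, deciding whether a person is speaking in a given frame becomes a binary label on the corresponding node, and widening the assumed `max_gap` window is one way to expose longer temporal context to each node.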