Abstract: Crossmodal grounding is a key technical challenge when generating relevant and well-timed gestures from spoken language. The same gesture can often accompany semantically different spoken language phrases, which makes crossmodal grounding especially challenging. For example, a gesture (semi-circular, with both hands) could co-occur with the semantically different phrases "entire bottom row" (referring to a physical point) and "molecules expand and decay" (referring to a scientific phenomenon). In this paper, we introduce a self-supervised approach to learn representations better suited to such many-to-one grounding relationships between spoken language and gestures. As part of this approach, we propose a new contrastive loss function, Crossmodal Cluster NCE, that guides the model to learn spoken language representations consistent with the similarities in the gesture space. This gesture-aware space can help us generate more relevant gestures given language as input. We demonstrate the effectiveness of our approach on a publicly available dataset through quantitative and qualitative evaluations. Our proposed methodology significantly outperforms prior approaches for gesture-language grounding. Link to code: https://github.com/dondongwon/CC_NCE_GENEA.
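To make the idea of a cluster-aware crossmodal contrastive loss concrete, below is a minimal PyTorch sketch, assuming that cluster assignments obtained from the gesture space (e.g., via k-means) define which language-gesture pairs count as positives. The function name, shapes, and hyperparameters are illustrative assumptions, not the paper's actual implementation; see the linked repository for the authors' code.

```python
# Hypothetical sketch of a crossmodal, cluster-aware NCE-style contrastive loss.
# Positives for a language sample are all gestures whose cluster matches the
# cluster of its paired gesture; all names and shapes are illustrative.
import torch
import torch.nn.functional as F

def crossmodal_cluster_nce(lang_emb, gest_emb, gest_cluster, temperature=0.1):
    """lang_emb, gest_emb: (N, D) embeddings; gest_cluster: (N,) cluster ids
    from clustering the gesture space (assumed precomputed, e.g., k-means)."""
    lang = F.normalize(lang_emb, dim=-1)
    gest = F.normalize(gest_emb, dim=-1)
    logits = lang @ gest.t() / temperature  # (N, N) crossmodal similarity scores

    # Positive mask: gesture j shares a cluster with the gesture paired to language i.
    pos = (gest_cluster.unsqueeze(0) == gest_cluster.unsqueeze(1)).float()  # (N, N)

    # Softmax log-probabilities over all gestures for each language sample.
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Average log-likelihood over all cluster-consistent positives per language sample.
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()

# Example usage with random tensors
if __name__ == "__main__":
    N, D, K = 32, 128, 8
    lang = torch.randn(N, D)
    gest = torch.randn(N, D)
    clusters = torch.randint(0, K, (N,))
    print(crossmodal_cluster_nce(lang, gest, clusters).item())
```

Under these assumptions, the loss reduces to standard InfoNCE when every gesture sits in its own cluster, and broadens the set of positives as more semantically different phrases map to similar gestures, which is the many-to-one relationship the abstract describes.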