Crossmodal clustered contrastive learning: Grounding of spoken language to gesture

Dong Won Lee; Chaitanya Ahuja; Louis-Philippe Morency

Published: 19 Jul 2021, Last Modified: 05 May 2023, GENEA Workshop 2021 Oral, Readers: Everyone

Abstract: Crossmodal grounding is a key challenge for the task of generating relevant and well-timed gestures from spoken language alone. Often, the same gesture can be accompanied by semantically different spoken language phrases, which makes crossmodal grounding especially challenging. For example, a deictic gesture of spanning a region could co-occur with the semantically different phrases "entire bottom row" (referring to a physical point) and "molecules expand and decay" (referring to a scientific phenomenon). In this paper, we introduce a self-supervised approach to learn such many-to-one grounding relationships between spoken language and gestures. As part of this approach, we propose a new contrastive loss function, Crossmodal Cluster NCE, which guides the model to learn spoken language representations that are consistent with the similarities in the gesture space. By doing so, we impose a greater level of grounding between spoken language and gestures in the model. We demonstrate the effectiveness of our approach on a publicly available dataset through quantitative and qualitative studies. Our proposed methodology significantly outperforms prior approaches for grounding gestures to language. Link to code: https://github.com/dondongwon/CC_NCE_GENEA.
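
The abstract does not give the formula for Crossmodal Cluster NCE, so the following is only a rough sketch of the general idea it describes: language embeddings whose paired gestures fall into the same gesture cluster are treated as positives in an InfoNCE-style loss. The function name `cluster_nce_loss`, the `temperature` parameter, and the assumption that gesture cluster ids are already available are all illustrative choices, not taken from the paper; the authors' actual implementation is in the linked repository.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Assumes a non-zero vector.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def cluster_nce_loss(lang_embs, gesture_clusters, temperature=0.1):
    """Illustrative cluster-based InfoNCE loss (hypothetical helper).

    lang_embs: list of language embedding vectors.
    gesture_clusters: cluster id of the gesture paired with each embedding
        (assumed to be precomputed, e.g. by clustering gesture features).
    """
    z = [normalize(e) for e in lang_embs]
    n = len(z)
    losses = []
    for i in range(n):
        # Cosine similarity of anchor i to every other sample, temperature-scaled.
        logits = {j: dot(z[i], z[j]) / temperature for j in range(n) if j != i}
        # Positives: other samples whose gestures share the anchor's cluster.
        positives = [j for j in logits if gesture_clusters[j] == gesture_clusters[i]]
        if not positives:
            continue  # no same-cluster partner in this batch
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average negative log-probability of selecting a same-cluster sample.
        losses.append(-sum(logits[j] - log_denom for j in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: two pairs of nearby language embeddings.
lang = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = cluster_nce_loss(lang, [0, 0, 1, 1])      # clusters match the language geometry
misaligned = cluster_nce_loss(lang, [0, 1, 0, 1])   # clusters cut across it
```

In this toy batch the loss is lower when the gesture clusters agree with the layout of the language embeddings, which is the kind of consistency between the two spaces that the abstract describes.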
