Abstract: Crossmodal grounding is a key challenge for the task of generating relevant and well-timed gestures from just spoken language as an input. Often, the same gesture can be accompanied by semantically different spoken language phrases which makes crossmodal grounding especially challenging. For example, a deictic gesture of spanning a region could co-occur with semantically different phrases "entire bottom row" (referring to a physical point) and "molecules expand and decay" (referring to a scientific phenomena). In this paper, we introduce a self-supervised approach to learn such many-to-one grounding relationships between spoken language and gestures. As part of this approach, we propose a new contrastive loss function, Crossmodal Cluster NCE , that guides the model to learn spoken language representations which are consistent with the similarities in the gesture space. By doing so, we impose a greater level of grounding between spoken language and gestures in the model. We demonstrate the effectiveness of our approach on a publicly available dataset through quantitative and qualitative studies. Our proposed methodology significantly outperforms prior approaches for grounding gestures to language. Link to code: https://github.com/dondongwon/CC_NCE_GENEA.