Combining diverse sources to guess the right dog


04 Mar 2022 (modified: 05 May 2023)
Submitted to NLP for ConvAI
Readers: Everyone
Keywords: multimodal, dialogue, guesswhat, ensemble
TL;DR: Ensembling and better features boost the referential guessing game performance of a lightweight model to match that of a multimodal Transformer-based model.
Abstract: Representing the candidates in referential guessing tasks is understudied compared to representing the multimodal input. We investigate how to improve candidate representations in a grounded dialogue guessing game, GuessWhat?!. We find improvements in guessing accuracy by using richer combinations of complementary representations. Furthermore, using ensembles of models leads to large accuracy gains and enables uncertainty analyses. Finally, we show that an ensemble of lightweight encoders paired with rich representations of the candidates can match the performance of a model based on a state-of-the-art universal multimodal encoder.
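The abstract does not say how ensemble members are combined or how uncertainty is measured. As a rough illustration only, the sketch below assumes each guesser outputs one score per candidate object, that the ensemble averages the members' softmax distributions, and that uncertainty is read off as the entropy of that average. The function names (`softmax`, `ensemble_guess`) and the example scores are hypothetical and not taken from the paper.

```python
import math


def softmax(scores):
    """Turn raw candidate scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_guess(member_scores):
    """Combine several guessers by averaging their candidate distributions.

    member_scores: one list of scores per ensemble member, all over the
    same candidates in the same order (an assumption of this sketch).
    Returns (index of guessed candidate, averaged distribution, entropy).
    """
    member_probs = [softmax(s) for s in member_scores]
    n_models = len(member_probs)
    n_cands = len(member_probs[0])
    mean_probs = [
        sum(p[i] for p in member_probs) / n_models for i in range(n_cands)
    ]
    guess = max(range(n_cands), key=lambda i: mean_probs[i])
    # Entropy of the averaged distribution: higher means less certain.
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0)
    return guess, mean_probs, entropy


# Three hypothetical guessers scoring three candidate objects.
guess, probs, entropy = ensemble_guess(
    [[3.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
)
```

Here two of the three members favor the first candidate strongly enough that the averaged distribution also selects it, while the dissenting member raises the entropy relative to a unanimous ensemble.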
