Keywords: multimodal, dialogue, guesswhat, ensemble
TL;DR: Ensembling and better candidate features boost the performance of a lightweight model on a referential guessing game to match that of a multimodal Transformer-based model.
Abstract: How to represent the candidates in referential guessing tasks is understudied compared to how to represent the multimodal input.
We investigate how to improve candidate representations in a grounded dialogue guessing game, GuessWhat?!.
We find improvements in guessing accuracy by using richer combinations of complementary representations.
Furthermore, using ensembles of models yields large accuracy gains and enables uncertainty analyses.
Finally, we show that an ensemble of lightweight encoders paired with rich representations of the candidates can match the performance of a model based on a state-of-the-art universal multimodal encoder.
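
The abstract does not say how the ensemble members are combined or how uncertainty is measured. As a minimal sketch, assuming each member outputs one score per candidate object and that members are combined by averaging their softmax distributions (a common choice, not necessarily the authors'), the guess and an entropy-based uncertainty could be computed as follows. The names `softmax` and `ensemble_guess` are hypothetical and introduced only for illustration.

```python
import math


def softmax(scores):
    """Turn one model's raw candidate scores into a probability distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_guess(member_scores):
    """Combine per-model candidate scores by averaging their softmax outputs.

    member_scores: one list of scores per ensemble member, all over the
    same candidates in the same order (an assumption of this sketch).
    Returns (index of guessed candidate, averaged distribution, entropy).
    """
    member_probs = [softmax(scores) for scores in member_scores]
    n_models = len(member_probs)
    n_candidates = len(member_probs[0])
    mean_probs = [
        sum(probs[i] for probs in member_probs) / n_models
        for i in range(n_candidates)
    ]
    guess = max(range(n_candidates), key=lambda i: mean_probs[i])
    # Entropy of the averaged distribution: higher means less certain.
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0)
    return guess, mean_probs, entropy
```

Under these assumptions, the entropy is near zero when the members agree confidently on one candidate and approaches the logarithm of the number of candidates when the averaged distribution is close to uniform.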