Adapting Uni-Modal Language Models for Dense Multi-Modal Co-Reference Resolution using Parameter Augmentation

Published: 11 Mar 2024, Last Modified: 15 Mar 2024
Venue: LLMAgents @ ICLR 2024 Poster
License: CC BY 4.0
Keywords: multi-modality, co-referencing, parameter-augmentation
Abstract: The context of modern smart voice assistants is often multi-modal, with users consuming images, audio, and video content simultaneously. In such a setup, co-reference resolution is especially challenging because references span both modalities and dialogue turns. We explore the problem of multi-modal co-reference resolution in multi-turn dialogues and quantify the performance of multi-modal LLMs on a specially curated dataset of long, image-interleaved conversations between a voice assistant and a human for a shopping use case. We propose and evaluate a custom architecture for multi-modal embedding alignment using a novel parameter augmentation technique. Our proposed Parameter Augmented LLM approach shows a $4.9\%$ absolute F1 improvement over a baseline while reducing the number of trained parameters by $13.3\%$ on a complex co-referencing task over a multi-turn shopping dataset.
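To make the parameter-augmentation idea concrete, below is a minimal, generic sketch of how a frozen uni-modal language model can be augmented with a small set of new trainable parameters that align image embeddings with the model's input embedding space. This is an illustration of the general technique only, not the authors' exact architecture; the class name, the choice of a simple linear projector, and all dimensions are assumptions made for the example.

# Minimal sketch (assumed design, not the paper's exact method): a frozen
# uni-modal LM is augmented with a trainable projection that maps image
# embeddings into the LM's embedding space, so only the new parameters train.
import torch
import torch.nn as nn


class ParameterAugmentedLM(nn.Module):
    def __init__(self, language_model: nn.Module, lm_embed_dim: int, image_embed_dim: int):
        super().__init__()
        self.language_model = language_model
        # Freeze all original LM parameters; only the augmented parameters are trained.
        for p in self.language_model.parameters():
            p.requires_grad = False
        # Newly added ("augmented") parameters for multi-modal embedding alignment.
        self.image_projector = nn.Linear(image_embed_dim, lm_embed_dim)

    def forward(self, text_embeds: torch.Tensor, image_embeds: torch.Tensor) -> torch.Tensor:
        # text_embeds:  (batch, text_len, lm_embed_dim) from the LM's own embedding table
        # image_embeds: (batch, num_images, image_embed_dim) from a frozen image encoder
        projected_images = self.image_projector(image_embeds)
        # Prepend projected image tokens to the dialogue token embeddings.
        fused = torch.cat([projected_images, text_embeds], dim=1)
        return self.language_model(fused)


if __name__ == "__main__":
    # Stand-in LM: any module consuming a (batch, seq, dim) embedding sequence.
    toy_lm = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True), num_layers=2
    )
    model = ParameterAugmentedLM(toy_lm, lm_embed_dim=256, image_embed_dim=512)
    text = torch.randn(2, 16, 256)
    images = torch.randn(2, 3, 512)
    print(model(text, images).shape)  # torch.Size([2, 19, 256])

Because the base LM stays frozen, the trainable parameter count is limited to the projector, which is consistent with the reported reduction in parameters being trained; how the paper interleaves image tokens across dialogue turns is not specified in the abstract.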
Submission Number: 104