Abstract: We present \(\mathbb{MST}_\mathbb{MIXER}\) – a novel video dialog model operating over a generic multi-modal state tracking scheme. Current models that claim to perform multi-modal state tracking fall short in two major aspects: they either (1) track only a single modality (mostly the visual input) or (2) target synthetic datasets that do not reflect the complexity of real-world in-the-wild scenarios. Our model addresses both limitations in an attempt to close this crucial research gap. Specifically, \(\mathbb{MST}_\mathbb{MIXER}\) first tracks the most important constituents of each input modality. Then, it predicts the missing underlying structure of the selected constituents of each modality by learning local latent graphs with a novel multi-modal graph structure learning method. Subsequently, the learned local graphs and features are parsed together to form a global graph operating on the mix of all modalities, which further refines its structure and node embeddings. Finally, the fine-grained graph node features are used to enhance the hidden states of the backbone Vision-Language Model (VLM). \(\mathbb{MST}_\mathbb{MIXER}\) achieves new state-of-the-art results on five challenging benchmarks.
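To make the described pipeline concrete, the following is a minimal, hypothetical PyTorch sketch of its four stages: selecting the most important constituents per modality, learning a local latent graph for each modality, mixing the local graphs into a global graph, and fusing the refined node features back into the backbone VLM hidden states. All class names, dimensions, and the similarity-based graph learner are illustrative assumptions and not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MSTMixerSketch(nn.Module):
    """Hypothetical sketch of the multi-modal state tracking pipeline.

    Per modality: (1) score and keep the top-k constituents, (2) learn a local
    latent graph over them, then (3) mix all modalities into one global graph
    whose refined node features (4) enhance the backbone VLM hidden states.
    The linear "GNN" layers and similarity-based graph learner are simple
    stand-ins, not the paper's exact components.
    """

    def __init__(self, dim: int = 768, top_k: int = 16, num_modalities: int = 3):
        super().__init__()
        self.top_k = top_k
        self.scorers = nn.ModuleList(nn.Linear(dim, 1) for _ in range(num_modalities))
        self.local_gnn = nn.Linear(dim, dim)   # stand-in for a local graph layer
        self.global_gnn = nn.Linear(dim, dim)  # stand-in for a global graph layer
        self.fuse = nn.Linear(dim, dim)

    def learn_graph(self, x: torch.Tensor) -> torch.Tensor:
        # Latent adjacency from normalized feature similarity: one simple
        # choice of graph structure learning (assumption).
        x_norm = F.normalize(x, dim=-1)
        return torch.softmax(x_norm @ x_norm.transpose(-2, -1), dim=-1)

    def forward(self, modality_feats: list, vlm_hidden: torch.Tensor) -> torch.Tensor:
        local_nodes = []
        for feats, scorer in zip(modality_feats, self.scorers):
            # (1) Track the most important constituents of this modality.
            scores = scorer(feats).squeeze(-1)                    # (B, N)
            idx = scores.topk(self.top_k, dim=-1).indices         # (B, k)
            sel = torch.gather(
                feats, 1, idx.unsqueeze(-1).expand(-1, -1, feats.size(-1))
            )
            # (2) Learn a local latent graph and refine the selected nodes.
            adj = self.learn_graph(sel)
            local_nodes.append(F.relu(self.local_gnn(adj @ sel)))
        # (3) Mix all modalities into a global graph and refine again.
        mixed = torch.cat(local_nodes, dim=1)
        refined = F.relu(self.global_gnn(self.learn_graph(mixed) @ mixed))
        # (4) Enhance the VLM hidden states with the pooled graph features.
        return vlm_hidden + self.fuse(refined.mean(dim=1, keepdim=True))


# Minimal usage with random tensors: batch of 2, three modalities.
if __name__ == "__main__":
    feats = [torch.randn(2, 32, 768) for _ in range(3)]
    hidden = torch.randn(2, 64, 768)
    out = MSTMixerSketch()(feats, hidden)
    print(out.shape)  # torch.Size([2, 64, 768])
```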