RecFormer: Recurrent Multi-modal Transformer with History-Aware Contrastive Learning for Visual Dialog

Published: 01 Jan 2023 · Last Modified: 14 May 2025 · PRCV (1) 2023 · License: CC BY-SA 4.0
Abstract: Recently, benefiting from the powerful representations learned through large-scale image-text pre-training, pre-trained vision-language models have shown significant improvements on the visual dialog task. However, these works face two main challenges: 1) how to incorporate the sequential nature of multi-turn dialog to better capture the temporal dependencies in visual dialog; 2) how to align the semantics across modality-specific features for better multi-modal interaction and understanding. To address these issues, we propose a recurrent multi-modal transformer (named RecFormer) that captures temporal dependencies between utterances by encoding dialog utterances and interacting with visual information turn by turn. Specifically, we equip a pre-trained transformer with a recurrent function that maintains a cross-modal history encoding for the dialog agent, allowing it to make better predictions by taking temporal dependencies into account. In addition, we propose history-aware contrastive learning as an auxiliary task that aligns visual features with dialog history features, further improving visual dialog understanding. The experimental results demonstrate that our RecFormer achieves new state-of-the-art performance on both the VisDial v0.9 (72.52 MRR and 60.47 R@1 on the val split) and VisDial v1.0 (69.29 MRR and 55.90 R@1 on the test-std split) datasets.
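The abstract describes the recurrent function at a conceptual level: a pre-trained transformer jointly encodes the image and the current utterance each round, while a maintained history state carries cross-modal context forward between turns. The following is a minimal PyTorch-style sketch of that turn-by-turn loop; every concrete choice here (nn.TransformerEncoder standing in for the pre-trained backbone, a GRU cell as the recurrent history update, mean pooling, the dimensions) is an illustrative assumption, not the authors' released implementation.

```python
# Hedged sketch of turn-by-turn recurrent encoding with a maintained
# cross-modal history state. Module choices are assumptions for illustration.
import torch
import torch.nn as nn

class RecurrentDialogEncoder(nn.Module):
    def __init__(self, d_model=768, n_heads=12, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # stand-in for the pre-trained transformer
        # Recurrent update that folds the current turn into the history state.
        self.history_update = nn.GRUCell(d_model, d_model)

    def forward(self, image_feats, turn_feats_per_round):
        """image_feats: (B, N_img, D); turn_feats_per_round: list of (B, N_txt, D)."""
        B, _, D = image_feats.shape
        history = torch.zeros(B, D, device=image_feats.device)  # initial history state
        outputs = []
        for turn_feats in turn_feats_per_round:
            # Prepend the history state as an extra token, then jointly encode
            # image regions and the current utterance.
            tokens = torch.cat([history.unsqueeze(1), image_feats, turn_feats], dim=1)
            encoded = self.backbone(tokens)
            turn_summary = encoded.mean(dim=1)                    # pooled cross-modal turn encoding
            history = self.history_update(turn_summary, history)  # recurrent state update
            outputs.append(encoded)
        return outputs, history
```

The history-aware contrastive auxiliary task can likewise be sketched as an InfoNCE-style objective over matched (visual, dialog-history) pairs within a batch; the symmetric two-direction form and the temperature value below are assumptions rather than details taken from the paper.

```python
# Hedged sketch of a history-aware contrastive loss aligning pooled visual
# features with pooled dialog-history features from the same dialogs.
import torch
import torch.nn.functional as F

def history_aware_contrastive_loss(visual_feats, history_feats, temperature=0.07):
    """visual_feats, history_feats: (B, D) pooled features; matched rows are positives."""
    v = F.normalize(visual_feats, dim=-1)
    h = F.normalize(history_feats, dim=-1)
    logits = v @ h.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)    # matched pairs on the diagonal
    # Symmetric cross-entropy over visual->history and history->visual directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```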