Deep Residual Weight-Sharing Attention Network With Low-Rank Attention for Visual Question Answering

Published: 01 Jan 2023 · Last Modified: 11 Apr 2025 · IEEE Trans. Multim. 2023 · CC BY-SA 4.0
Abstract: Attention-based networks have recently become prevalent in visual question answering (VQA) due to their strong performance. However, the extensive memory consumption of attention-based models places excessive demands on deployment hardware, raising concerns about their future application scenarios. Designing an efficient and lightweight VQA model is therefore central to expanding the range of possible applications. Our work presents a novel lightweight attention-based VQA model, the residual weight-sharing attention network (RWSAN), which consists of residual weight-sharing attention (RWSA) layers cascaded in depth. Each RWSA layer models the textual representation with self residual weight-sharing attention (SRWSA) and captures question features and question-image interactions with self-guided residual weight-sharing attention (SGRWSA). Inside each RWSA layer, the proposed low-rank attention (LRA) units perform residual learning with learned connection patterns and shared parameters, and every stacked RWSA layer reuses the same parameters. Extensive ablation experiments with quantitative and qualitative analysis illustrate the effectiveness and generality of RWSA. Experiments on the VQA-v2, GQA, and CLEVR datasets show that RWSAN achieves performance competitive with state-of-the-art methods while using far fewer parameters. We release our code at https://github.com/BrightQin/RWSAN .
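The sketch below is a minimal PyTorch illustration of the two parameter-saving ideas the abstract names: low-rank factorization of attention projections and weight sharing across stacked layers. The class names (`LowRankLinear`, `LowRankAttention`), the single-head simplification, and the choice of rank are assumptions for illustration only; the paper's actual LRA units, learned connection patterns, and SRWSA/SGRWSA structure differ, and the released repository contains the authors' real implementation.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Linear map factorized as W ~ V @ U, so parameters scale with
    rank r << d instead of d*d (hypothetical stand-in for an LRA projection)."""
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.U = nn.Linear(d_in, rank, bias=False)   # d_in -> r
        self.V = nn.Linear(rank, d_out, bias=True)   # r -> d_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.V(self.U(x))

class LowRankAttention(nn.Module):
    """Single-head attention with low-rank Q/K/V projections and a
    residual connection; self-attention when no context is given,
    guided (cross) attention otherwise."""
    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.q = LowRankLinear(d_model, d_model, rank)
        self.k = LowRankLinear(d_model, d_model, rank)
        self.v = LowRankLinear(d_model, d_model, rank)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor, context: torch.Tensor | None = None) -> torch.Tensor:
        ctx = x if context is None else context
        scores = self.q(x) @ self.k(ctx).transpose(-2, -1) * self.scale
        attn = torch.softmax(scores, dim=-1)
        return x + attn @ self.v(ctx)                # residual learning

# Weight sharing across depth: one layer instance reused at every step,
# so depth adds computation but no extra parameters.
d_model, rank, depth = 512, 64, 6
layer = LowRankAttention(d_model, rank)
q_feats = torch.randn(2, 14, d_model)                # toy question features
img_feats = torch.randn(2, 36, d_model)              # toy image region features
for _ in range(depth):                               # same weights at every depth
    q_feats = layer(q_feats)                         # self-attention pass
fused = layer(q_feats, context=img_feats)            # guided (question-image) pass
```

Sharing one module across all depths mirrors the abstract's statement that every stacked RWSA layer uses the same parameters, which, combined with low-rank projections, is what keeps the model lightweight.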