Abstract: In this paper, we investigate the integration of transformer-based feature extractors in a Remote Sensing Visual Question Answering (RSVQA) framework. Our findings demonstrate an improvement over the baseline, achieved by adding attention modules after feature extraction and by using MUTAN (Multimodal Tucker Fusion). Further, we delve into the potential of multi-task learning, observing a considerable boost in performance when the feature extractors are trained. Our results suggest a promising future research avenue in multi-task learning for RSVQA, while also emphasizing the need for careful selection of hyperparameters per question type, as well as the need to find the proper balance between training the shared backbone and the individual classifiers simultaneously, to further improve performance.
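As a rough illustration of the fusion step mentioned in the abstract, the sketch below shows a minimal MUTAN-style bilinear fusion module in PyTorch, where visual and question embeddings are projected and combined through a rank-constrained Tucker-like interaction. All dimensions, the rank, and the number of answer classes are hypothetical placeholders and do not reflect the configuration used in the paper.

```python
import torch
import torch.nn as nn


class MutanFusion(nn.Module):
    """Minimal MUTAN-style fusion sketch (dimensions are illustrative only)."""

    def __init__(self, dim_v=2048, dim_q=768, dim_hidden=512, dim_out=1024, rank=10):
        super().__init__()
        self.rank = rank
        # Modal projections (factor matrices of the Tucker decomposition).
        self.proj_v = nn.Linear(dim_v, dim_hidden)
        self.proj_q = nn.Linear(dim_q, dim_hidden)
        # Rank-constrained core: R pairs of projections whose elementwise
        # products are summed, approximating the full 3-way interaction tensor.
        self.core_v = nn.ModuleList(nn.Linear(dim_hidden, dim_out) for _ in range(rank))
        self.core_q = nn.ModuleList(nn.Linear(dim_hidden, dim_out) for _ in range(rank))

    def forward(self, v, q):
        v = torch.tanh(self.proj_v(v))  # (B, dim_hidden)
        q = torch.tanh(self.proj_q(q))  # (B, dim_hidden)
        z = 0
        for r in range(self.rank):
            z = z + self.core_v[r](v) * self.core_q[r](q)  # rank-r slice
        return torch.tanh(z)  # fused multimodal embedding


# Usage sketch: fuse pooled image and question features before an answer classifier.
fusion = MutanFusion()
img_feat = torch.randn(4, 2048)  # e.g. pooled visual-backbone features
txt_feat = torch.randn(4, 768)   # e.g. pooled transformer question features
logits = nn.Linear(1024, 9)(fusion(img_feat, txt_feat))  # 9 = hypothetical answer count
```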