What to do if language models disagree? Black-box model ensembling for textual and visual question answering
Abstract: A diverse range of large language models (LLMs), e.g., ChatGPT, and visual question answering (VQA) models, e.g., BLIP, have been developed to address textual and visual question answering tasks. However, both LLMs and VQA models encounter challenges when applied to out-of-domain datasets. Fine-tuning these models for domain adaptation is often either impossible (many are accessible only through APIs as black-box models) or computationally expensive (due to their large size), and typically only limited labeled out-of-domain data is available. Under these constraints, ensemble techniques provide a compelling alternative. In this paper, we aim to improve out-of-domain performance by leveraging the capabilities of existing black-box models with limited computational cost and labeled data. To address this challenge, we introduce a novel data-efficient ensemble method, InfoSel, which trains small (<120M parameters) ensemble models that select the best answer for both textual and visual question answering without relying on prediction confidences. Our results demonstrate that InfoSel outperforms the ensembled base models on four mini-datasets sampled from SQuAD-V2, NQ-Open, GQA, and VizWiz.
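The abstract characterizes InfoSel as a small learned selector that picks among black-box model predictions without using their confidence scores. Below is a minimal sketch of what such an answer-selection ensemble could look like, assuming a DistilBERT-style encoder scoring (question, candidate answer) pairs and exact-match supervision; the model names, helper functions, and training setup are illustrative assumptions, not the paper's actual InfoSel implementation.

```python
# Hypothetical sketch of learned answer selection over black-box QA model outputs.
# NOTE: not the paper's InfoSel implementation; encoder choice, supervision signal,
# and all names here are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

ENCODER = "distilbert-base-uncased"  # small (<120M-parameter) encoder, assumed for illustration


class AnswerSelector(nn.Module):
    """Scores each (question, candidate answer) pair and selects the best candidate."""

    def __init__(self):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(ENCODER)
        self.encoder = AutoModel.from_pretrained(ENCODER)
        self.scorer = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, question: str, candidates: list) -> torch.Tensor:
        # One "question [SEP] candidate" sequence per base-model prediction.
        pairs = [f"{question} {self.tokenizer.sep_token} {c}" for c in candidates]
        batch = self.tokenizer(pairs, padding=True, truncation=True, return_tensors="pt")
        cls = self.encoder(**batch).last_hidden_state[:, 0]  # [num_candidates, hidden]
        return self.scorer(cls).squeeze(-1)                  # one score per candidate


def train_step(model, optimizer, question, candidates, gold_answer):
    """Supervise selection with the index of the candidate matching the gold answer."""
    if gold_answer not in candidates:
        return  # skip examples where no base model produced the correct answer
    scores = model(question, candidates)
    target = torch.tensor([candidates.index(gold_answer)])
    loss = nn.functional.cross_entropy(scores.unsqueeze(0), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


# Inference: pick the candidate with the highest learned score;
# no base-model confidence scores are required.
# selector = AnswerSelector()
# best = candidates[selector(question, candidates).argmax().item()]
```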
Paper Type: long
Research Area: Question Answering
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.