Abstract: The advancement of Large Language Models~(LLMs) has brought substantial attention to the Chain of Thought~(CoT) approach, primarily due to its ability to enhance the capability of LLMs on tasks requiring complex reasoning. Moreover, the significance of CoT approaches extends to the application of LLMs in multi-modal tasks. However, the selection of optimal CoT demonstration examples for multi-modal reasoning remains underexplored, owing to the inherent complexity of multi-modal examples. In this paper, we introduce a novel approach that addresses this challenge by using retrieval mechanisms to dynamically and automatically select demonstration examples based on cross-modal and intra-modal similarities. Furthermore, we employ a stratified sampling method that categorises demonstration examples into groups by type and retrieves examples from each group separately, promoting the diversity of demonstration examples. Through a series of experiments on two popular benchmark datasets, ScienceQA and MathVista, we demonstrate that our approach significantly improves the performance of LLMs by more than 2.5\%, achieving state-of-the-art results in multi-modal reasoning tasks.
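The abstract describes selecting demonstration examples by similarity to the query while stratifying the pool by example type. A minimal sketch of that idea is below; the function names, the dictionary schema, and the toy vectors are all illustrative assumptions (the paper's actual encoders and similarity measures are not specified in the abstract), shown here only to make the stratified, similarity-ranked retrieval concrete.

```python
# Illustrative sketch (NOT the paper's implementation): stratified,
# similarity-based selection of demonstration examples. Embeddings here
# are toy vectors; in practice they would come from text/image encoders.
from collections import defaultdict
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_demonstrations(query_vec, pool, per_group=1):
    """pool: list of dicts with 'type' (group label) and 'vec' (embedding).

    Groups the candidate examples by type, then retrieves the top
    `per_group` most similar examples from each group, so the selected
    demonstrations stay diverse across example types."""
    groups = defaultdict(list)
    for ex in pool:
        groups[ex["type"]].append(ex)
    selected = []
    for examples in groups.values():
        ranked = sorted(examples,
                        key=lambda e: cosine(query_vec, e["vec"]),
                        reverse=True)
        selected.extend(ranked[:per_group])
    return selected

# Toy usage: two groups, pick the closest example from each.
pool = [
    {"type": "diagram", "vec": [1.0, 0.0], "id": "d1"},
    {"type": "diagram", "vec": [0.0, 1.0], "id": "d2"},
    {"type": "text",    "vec": [0.9, 0.1], "id": "t1"},
]
demos = select_demonstrations([1.0, 0.0], pool, per_group=1)
print(sorted(d["id"] for d in demos))  # -> ['d1', 't1']
```

One example is drawn from each type group rather than taking a global top-k, which is what prevents a single dominant example type from filling the demonstration budget.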
Paper Type: long
Research Area: Question Answering
Contribution Types: NLP engineering experiment
Languages Studied: English