Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor

Published: 20 Jul 2024, Last Modified: 04 Aug 2024 · MM2024 Poster · CC BY 4.0
Abstract: Large multimodal models (LMMs) have shown remarkable performance on the visual commonsense reasoning (VCR) task, which aims to answer a multiple-choice question based on visual commonsense within an image. However, the ability of LMMs to correct potential visual commonsense errors in distractors when such errors occur remains under-explored. Drawing inspiration from how a human teacher crafts challenging distractors to test students' comprehension of concepts or skills and assists them in identifying and correcting errors on the way to the answer, ours is, to our knowledge, the first study to have LMMs simulate this error-correction learning process. To this end, we employ GPT-4 as a ``teacher'' to collect the explainable feedback dataset VCR-DF for error correction, which serves as a benchmark for evaluating the ability of LMMs to identify misconceptions in VCR distractors and clarify the reasons behind each error on the way to the final answer. In addition, we propose an LMM-based Pedagogical Expert Instructed Feedback Generation (PEIFG) model that incorporates learnable expert prompts and multimodal instruction as guidance for feedback generation. Experimental results show that PEIFG significantly outperforms existing LMMs. We believe our benchmark carves out a new direction for evaluating the capabilities of LMMs.
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Experience] Multimedia Applications, [Engagement] Summarization, Analytics, and Storytelling, [Content] Vision and Language
Relevance To Conference: This work contributes to multimedia/multimodal processing by addressing a critical aspect of the visual commonsense reasoning (VCR) task: the ability of large multimodal models (LMMs) to correct visual commonsense errors. By introducing an error-correction dataset, VCR-DF, and a Pedagogical Expert Instructed Feedback Generation (PEIFG) model, this research pioneers the simulation of an error-correction learning process akin to a human teacher's method of providing feedback. This approach enhances LMMs' capability to generate explainable feedback for error correction. The introduction of a benchmark for evaluating error correction in LMMs, together with the PEIFG model, marks an advancement in multimedia processing. We believe our benchmark carves out a new direction for assessing the capabilities of LMMs.
Supplementary Material: zip
Submission Number: 4892