DPO-Finetuned Large Multi-Modal Planner with Retrieval-Augmented Generation @ EgoPlan Challenge ICML 2024
Keywords: Multi-modal Foundation Models, Embodied AI, Fine-tuning
TL;DR: This paper details the method of fine-tuning MLLMs with DPO and RAG for the EgoPlan Challenge.
Abstract: This paper presents the technical details of our solution to a multi-modal task, EgoPlan-Bench. Our model adapts Direct Preference Optimization (DPO), originally developed for single-modal tasks, to a multi-modal setting. This DPO adaptation improves prediction accuracy by emphasizing positive answers over negative choices. Additionally, we apply Retrieval-Augmented Generation (RAG) to further enhance the generation performance of Multi-modal Large Language Models (MLLMs). However, in our setting, RAG does not yield a performance improvement because few sufficiently similar tasks can be retrieved. Our DPO-based model achieves 53.98% test accuracy, compared to 41.35% for the baseline method. Our code is available at https://github.com/aailabkaist/EgoPlan_Challenge_Team_AAILab.
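For illustration, the sketch below shows the standard DPO objective the abstract refers to, which raises the likelihood of the positive answer relative to the negative choices. It is not the authors' implementation: the function name `dpo_loss`, its parameters, and the `beta` value are assumptions, and the multi-modal context (video observations and task instruction) is assumed to have already been reduced to per-sequence log-probabilities.

```python
# Minimal, illustrative DPO loss over preference pairs (chosen vs. rejected answers).
# Hypothetical names; not taken from the paper's released code.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss for a batch of preference pairs."""
    # Log-ratio of the trainable policy vs. the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the positive and the negative answer.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


if __name__ == "__main__":
    # Dummy per-sequence log-probabilities for a batch of 4 pairs.
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(loss.item())
```

In practice, each pair would come from an EgoPlan-style question where the ground-truth plan is the chosen answer and the remaining candidate actions are the rejected ones.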
Submission Number: 28