DPO-Finetuned Large Multi-Modal Planner with Retrieval-Augmented Generation @ EgoPlan Challenge ICML 2024
Keywords: Multi-modal Foundation Models, Embodied AI, Fine-tuning
TL;DR: This paper details the method of fine-tuning MLLMs with DPO and RAG for the EgoPlan Challenge.
Abstract: This paper presents the technical details of our solution to a multi-modal task, EgoPlan-Bench. Our model adapts Direct Preference Optimization (DPO), originally developed for single-modal tasks, to a multi-modal setting. This DPO adaptation improves prediction accuracy by emphasizing positive answers over negative choices. Additionally, we apply Retrieval-Augmented Generation (RAG) to further enhance the generation performance of Multi-modal Large Language Models (MLLMs). However, in our setting, RAG does not yield a performance improvement because few sufficiently similar tasks can be retrieved. Our DPO-based model achieves 53.98% test accuracy, compared to 41.35% for the baseline method. Our code is available at https://github.com/aailabkaist/EgoPlan_Challenge_Team_AAILab.
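For illustration, the sketch below shows the standard DPO objective the abstract refers to, which raises the likelihood of the positive answer relative to the negative choices. It is not the authors' implementation: the function name `dpo_loss`, its parameters, and the `beta` value are assumptions, and the multi-modal context (video observations and task instruction) is assumed to have already been reduced to per-sequence log-probabilities.

```python
# Minimal, illustrative DPO loss over preference pairs (chosen vs. rejected answers).
# Hypothetical names; not taken from the paper's released code.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss for a batch of preference pairs."""
    # Log-ratio of the trainable policy vs. the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the positive and the negative answer.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


if __name__ == "__main__":
    # Dummy per-sequence log-probabilities for a batch of 4 pairs.
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(loss.item())
```

In practice, each pair would come from an EgoPlan-style question where the ground-truth plan is the chosen answer and the remaining candidate actions are the rejected ones.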
Submission Number: 28