Two-Stage LVLM System: 1st Place Solution for the ECCV 2024 Corner Case Scene Understanding Challenge

Published: 07 Sept 2024 · Last Modified: 15 Sept 2024 · ECCV 2024 W-CODA Workshop Abstract Paper Track · CC BY 4.0
Keywords: Multimodal Large Language Models, Autonomous Driving, Scene Understanding
Subject: Large Language Model techniques adaptable to self-driving systems
Confirmation: I have read and agree with the submission policies of ECCV 2024 and the W-CODA Workshop on behalf of myself and my co-authors.
Abstract: This technical report describes our solution to Track 1 (Corner Case Scene Understanding) of the ECCV 2024 Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving Challenge. The task is to generate a general perception description, region-level perception descriptions, and driving suggestions for corner-case driving scenes. We propose a two-stage method consisting of a preliminary output stage and a refinement stage: we first fine-tune LLaVA-Next with LoRA to produce a coarse output, then use GPT-4 to refine it. The system combines the task-specific learning ability of the fine-tuned LLaVA-Next with the strong reasoning ability of GPT-4. As a result, it achieved the top score of 72.12 on the final leaderboard. The code and checkpoints are released at https://github.com/Chloe-gra/ECCV2024_Challenge_llmforad_solution.
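To make the two-stage pipeline concrete, the sketch below outlines one plausible Python implementation. It is an assumption-laden illustration, not the authors' released code: the LLaVA-Next checkpoint, LoRA hyperparameters, prompt templates, and the GPT-4 refinement prompt are all illustrative stand-ins; the actual configuration is in the linked repository.

```python
# Hypothetical sketch of the two-stage pipeline described in the abstract.
# Model names, hyperparameters, and prompts are illustrative assumptions.
from openai import OpenAI
from peft import LoraConfig, get_peft_model
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

# --- Stage 1: LoRA fine-tuning of LLaVA-Next for a coarse output ---
BASE_MODEL = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed checkpoint
processor = LlavaNextProcessor.from_pretrained(BASE_MODEL)
model = LlavaNextForConditionalGeneration.from_pretrained(BASE_MODEL)

lora_cfg = LoraConfig(
    r=16,                                 # assumed rank
    lora_alpha=32,                        # assumed scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
)
model = get_peft_model(model, lora_cfg)
# ... standard supervised fine-tuning loop over the challenge training split ...

def coarse_answer(image, question: str) -> str:
    """Stage-1 inference: produce a preliminary perception/suggestion draft."""
    # Mistral-style LLaVA-Next prompt template (assumed).
    prompt = f"[INST] <image>\n{question} [/INST]"
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    out_ids = model.generate(**inputs, max_new_tokens=256)
    return processor.decode(out_ids[0], skip_special_tokens=True)

# --- Stage 2: GPT-4 refinement of the coarse output ---
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def refine(question: str, coarse: str) -> str:
    """Ask GPT-4 to polish the stage-1 draft (the prompt is an assumption)."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Refine the draft answer about a driving scene: "
                        "keep its facts, improve clarity and structure."},
            {"role": "user",
             "content": f"Question: {question}\nDraft: {coarse}"},
        ],
    )
    return resp.choices[0].message.content
```

The split mirrors the abstract's division of labor: stage 1 supplies a task-adapted draft grounded in the image, and stage 2 spends one extra API call to apply GPT-4's stronger language reasoning to the final wording.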
Submission Number: 4