The Third Edition of the Large Vision-Language Model Learning and Applications Grand Challenge (LAVA Challenge)
Keywords: vision and language, large vision-language models, question-answering task, document understanding
TL;DR: How can LVLMs understand multi-page documents and slides?
Abstract: Recent advances in Large Vision-Language Models (LVLMs) hold immense promise across various domains, including healthcare, education, entertainment, transportation, and finance, by enabling more sophisticated and context-aware multimedia interactions.
Indeed, the outcomes of our previous challenges, held in conjunction with the Asian Conference on Computer Vision (ACCV) 2024 in Hanoi, Vietnam, and the ACM International Conference on Multimedia (ACMMM) 2025 in Dublin, Ireland (https://lava-workshop.github.io), underscored the limitations of LVLMs in processing multi-page documents and presentations.
To enhance the capability of LVLMs to accurately interpret and generate descriptive text from complex visual inputs within multi-page business-related documents and slides, we continue to organize the Large Vision-Language Model Learning and Applications (LAVA) Challenge, building on the success of previous editions.
The LAVA Challenge focuses on question-answering tasks, comprising both multiple-choice and open-ended questions, over multi-page documents and slides that contain diverse data representations such as graphs, charts, tables, diagrams, data flow diagrams (DFDs), class diagrams, Gantt charts, and building design drawings.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 14