From Multimodal LLM to Human-level AI: Modality, Instruction, Reasoning and Beyond

Hao Fei; Xiangtai Li; Haotian Liu; Fuxiao Liu; Zhuosheng Zhang; Hanwang Zhang; Shuicheng Yan

Published: 01 Jan 2024, Last Modified: 13 Dec 2024, Venue: ACM Multimedia 2024, License: CC BY-SA 4.0

Abstract: Artificial intelligence (AI) encompasses knowledge acquisition and real-world grounding across various modalities, including language, visual, auditory, and sensory data. Multimodal large language models (MLLMs) have thus recently garnered growing interest in both academia and industry, showing unprecedented progress toward human-level AI. This tutorial aims to deliver a comprehensive review of cutting-edge research in MLLMs, focusing on three key areas: architecture design, instructional learning, and multimodal reasoning. We will explore technical advancements, synthesize key challenges, and discuss potential avenues for future research. All resources and materials will be made available online at https://mllm2024.github.io/ACM-MM2024
