Advances in Large Multi-Modal Models from the Perspective of Representation Space Extension: A Survey

ACL ARR 2024 December Submission 9 Authors

02 Dec 2024 (modified: 05 Feb 2025) · CC BY 4.0
Abstract: The success of large language models (LLMs) has attracted considerable interest in extending these models to multi-modal domains, giving rise to large multi-modal models (LMMs). Unlike existing reviews that focus on specific model frameworks or scenarios, this paper aims to provide an encyclopedic survey of LMMs from a general perspective, namely representation space extension. By systematically analyzing the input-output representations of existing LMMs, this paper summarizes the design of model architectures used to align the constructed multi-modal representation space. This paper then demonstrates the extensibility of LMMs as embodied agents in view of the proposed representation space extension. Drawing on the insights revealed through surveying the field, this paper concludes by discussing several fundamental problems in constructing LMMs and suggesting directions for future work.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Large Multimodal Models, Representation Space Extension
Contribution Types: Surveys
Languages Studied: English
Submission Number: 9