Advances in Large Multi-Modal Models from the Perspective of Representation Space Extension: A Survey
Abstract: The success of large language models (LLMs) has attracted much focus on extending these models to multi-modal domains, giving rise to large multi-modal models (LMMs). Unlike existing reviews that focus on specific model frameworks or scenarios, this paper aims to provide an encyclopedic survey of LMMs from a general perspective, i.e., representation space extension. By systematically analyzing the input-output representations of existing LMMs, this paper summarizes the design of model architectures for aligning the constructed multi-modal representation space. Lastly, this paper demonstrates the extensibility of LMMs as embodied agents in view of the proposed representation space extension. With the insights revealed through surveying the field, this paper discusses several fundamental problems in constructing LMMs and, in closing, suggests directions for future work.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Large Multimodal Models, Representation Space Extension
Contribution Types: Surveys
Languages Studied: English
Submission Number: 9