LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Large Language Model, Large Multi-modal Model, Large Agent
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We introduce LLaVA-Plus, an end-to-end training approach for systematically expanding the capabilities of large multimodal models (LMMs).
Abstract: In this paper, we introduce LLaVA-Plus, an end-to-end training approach for systematically expanding the capabilities of large multimodal models (LMMs) towards building general-purpose multimodal agents. It maintains a skill repository containing a wide range of vision and vision-language pre-trained models that serve as multimodal tools. Given the user instruction and input image, the LMM is trained to activate the appropriate tools when needed, grasping skills on the fly and aggregating the tool execution results to complete real-world tasks in the wild. To facilitate the model's ability to learn to use skills, we make the first attempt to build multimodal instruction-following data for tool use, covering skills in visual understanding, generation, external knowledge, and their compositions. Empirical results show that LLaVA-Plus outperforms LLaVA on existing capabilities and exhibits many new ones. Compared with large language model (LLM) based tool-use methods, LLaVA-Plus is distinct in that the query image is considered throughout the entire interaction process, yielding higher multimodal tool-use performance and enabling new scenarios.
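To make the workflow in the abstract concrete, here is a minimal sketch of the two-pass tool-use loop it describes: the LMM reads the instruction together with the image, optionally selects a tool from the skill repository, and folds the tool output back into its final answer, with the query image kept in context throughout. All names below (SKILL_REPOSITORY, call_lmm, answer) are hypothetical illustrations under our own assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a LLaVA-Plus-style inference loop; the tools are
# stand-ins for pre-trained vision / vision-language models in the skill
# repository (e.g., a detector, a segmenter, an external-knowledge retriever).

SKILL_REPOSITORY = {
    "detection": lambda image, query: f"[boxes for '{query}' in {image}]",
    "segmentation": lambda image, query: f"[masks for '{query}' in {image}]",
    "retrieval": lambda image, query: f"[external knowledge about '{query}']",
}

def call_lmm(image, instruction, tool_result=None):
    """Stand-in for the trained LMM: returns either a tool call or an answer.

    A real model would be prompted with image tokens, the instruction, and
    (on the second pass) the serialized tool output; here we use a trivial
    rule so the sketch runs end to end.
    """
    if tool_result is None and "count" in instruction:
        return {"action": "tool", "name": "detection", "query": instruction}
    return {"action": "answer", "text": f"final answer using {tool_result!r}"}

def answer(image, instruction):
    step = call_lmm(image, instruction)
    if step["action"] == "tool":
        # Execute the selected skill; the query image stays in the loop, in
        # contrast to text-only LLM tool-use pipelines.
        result = SKILL_REPOSITORY[step["name"]](image, step["query"])
        step = call_lmm(image, instruction, tool_result=result)
    return step["text"]

print(answer("photo.jpg", "count the dogs in the image"))
```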
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5663