LLaVA-OneVision: Easy Visual Task Transfer

Bo Li; Yuanhan Zhang; Dong Guo; Renrui Zhang; Feng Li; Hao Zhang; Kaichen Zhang; Peiyuan Zhang; Yanwei Li; Ziwei Liu; Chunyuan Li

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li

Published: 10 Feb 2025, Last Modified: 10 Feb 2025Accepted by TMLREveryoneRevisionsBibTeXCC BY 4.0

Abstract: We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

Submission Length: Long submission (more than 12 pages of main content)

Previous TMLR Submission Url: https://openreview.net/forum?id=zKv8qULV6n&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DTMLR%2FAuthors%23your-submissions)

Changes Since Last Submission: Dear Editor and Reviewers, We sincerely appreciate your time and consideration. We have updated the camera-ready version and would be grateful if you could kindly review it to ensure it aligns with the template. In conclusion, during the revision phase, we have addressed the issues raised during the rebuttal and added two subsections in Appendix Section C to discuss “Dataset Statistics” and “Training Resource Details.” Additionally, all supplementary experiments and ablation studies have been incorporated into Section E, covering the aspects of "Transferability Evaluations," "Scaling Effects on Grids/Tokens" and "Different Training Strategy Design". Furthermore, we included insights in a dedicated "Discussion" subsection.

Supplementary Material: zip

Assigned Action Editor: ~Jianbo_Jiao2

Submission Number: 3432

Loading