Abstract: We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations from the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision enables strong transfer learning across different modalities and scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=zKv8qULV6n
Changes Since Last Submission: Dear Editor and Reviewers,
We sincerely appreciate your time and consideration. We have updated the camera-ready version and would be grateful if you could review it to confirm that it aligns with the template.
In summary, during the revision phase we addressed the issues raised during the rebuttal and added two subsections to Appendix Section C discussing "Dataset Statistics" and "Training Resource Details." All supplementary experiments and ablation studies have been incorporated into Section E, covering "Transferability Evaluations," "Scaling Effects on Grids/Tokens," and "Different Training Strategy Design." Furthermore, we have included additional insights in a dedicated "Discussion" subsection.
Supplementary Material: zip
Assigned Action Editor: ~Jianbo_Jiao2
Submission Number: 3432