Abstract: Multimodal large language models (MLLMs) have made rapid progress in recent years, yet continue to struggle with low-level visual perception (LLVP)---particularly the ability to accurately describe the geometric details of an image. In this paper, we first demonstrate this limitation by introducing Geoperception, a benchmark designed to evaluate an MLLM’s ability to accurately transcribe 2D geometric information from an image. We then conduct a comprehensive empirical study to explore strategies for improving LLVP performance through the use of synthetic high-fidelity visual description data. Our findings highlight the benefits of certain model architectures and training techniques, including the use of CNN-based visual encoders and multi-stage training with a data curriculum. Notably, we find that a data curriculum enables models to learn challenging geometry understanding tasks that they fail to learn from scratch. Lastly, we develop \emph{Euclid}, a family of models specifically optimized for strong low-level geometric perception. Although trained solely on synthetic multimodal data, Euclid generalizes well to novel real-world geometric shapes. For instance, Euclid outperforms the best closed-source model by up to 58.56% on certain Geoperception tasks and by 10.65% on average across all tasks.
Keywords: Multimodal LLMs, Geometric Perception, Low-level Visual Perception
TL;DR: We study low-level geometric understanding in multimodal LLMs by (1) releasing a benchmark (Geoperception), (2) conducting an empirical study on the MLLM design space, and (3) training a model (Euclid) with strong geometric understanding abilities.
Submission Number: 64