Abstract: Multimodal large language models (MLLMs) have made rapid progress in recent years, yet continue to struggle with low-level visual perception (LLVP)---particularly the ability to accurately describe the geometric details of an image. In this paper, we first demonstrate this limitation by introducing Geoperception, a benchmark designed to evaluate an MLLM’s ability to accurately transcribe 2D geometric information from an image. We then conduct a comprehensive empirical study to explore strategies for improving LLVP performance through the use of synthetic high-fidelity visual description data. Our findings highlight the benefits of certain model architectures and training techniques, including the use of CNN-based visual encoders and multi-stage training with a data curriculum. Notably, we find that a data curriculum enables models to learn challenging geometry understanding tasks that they fail to learn from scratch. Lastly, we develop \emph{Euclid}, a family of models specifically optimized for strong low-level geometric perception. Although trained solely on synthetic multimodal data, Euclid generalizes well to novel real-world geometric shapes. For instance, Euclid outperforms the best closed-source model by up to 58.56% on certain Geoperception tasks and by 10.65% on average across all tasks.
Keywords: Multimodal LLMs, Geometric Perception, Low-level Visual Perception
TL;DR: We study low-level geometric understanding in multimodal LLMs by (1) releasing a benchmark (Geoperception), (2) conducting an empirical study on the MLLM design space, and (3) training a model (Euclid) with strong geometric understanding abilities.
Submission Number: 64