Euclid: Lessons for Geometric Low-level Visual Perception in Multimodal LLMs

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Multimodal LLMs, Geometric Perception, Low-level Visual Perception
TL;DR: We study low-level geometric visual perception in multimodal LLMs by (1) releasing a benchmark (Geoperception), (2) conducting an empirical study on the MLLM design space, and (3) training a model (Euclid) with strong geometric perception abilities.
Abstract: Multimodal large language models (MLLMs) have advanced rapidly in recent years, yet they continue to struggle with *low-level visual perception*, particularly with accurately identifying and describing geometric relationships within images. In this paper, we first diagnose this shortcoming by introducing a dedicated benchmark, *Geoperception*, which focuses exclusively on evaluating the geometric low-level perceptual capabilities that serve as essential prerequisites for higher-level visual reasoning. We then present a comprehensive empirical study that investigates strategies for improving model performance in this setting, making use of synthetic geometry data. Our findings highlight the benefits of certain architectural and training choices, including training with a data curriculum and the use of CNN-based visual encoders, as well as the effects of LLM size and of finetuning the visual encoder. Notably, we find that adopting a data curriculum enables models to learn challenging geometric concepts that they fail to acquire when trained from scratch. Finally, we explore how well a single generalist MLLM can be endowed with multiple geometric low-level visual perception capabilities. We demonstrate that training on such data with a carefully chosen composition can significantly enhance a model's geometric visual perception *without compromising its general multimodal capabilities*, shedding light on the development of future generalist MLLMs that excel simultaneously across multiple challenging domains.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14972