MATP-BENCH: Can MLLM Be a Good Automated Theorem Prover for Multimodal Problems?

ACL ARR 2026 January Submission5648 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Multimodal; Benchmarking; Multilingual corpora; Multimodal Automated Theorem Proving
Abstract: Theorem proving in fields such as geometry often relies on visual reasoning over combined text and diagrams. While Multimodal Large Language Models (MLLMs) have shown potential in mathematics, their application to multimodal automated theorem proving remains largely unexplored. In this paper, we introduce the Multimodal Automated Theorem Proving benchmark (MATP-BENCH), a novel multimodal, multi-level, and multi-language benchmark designed to evaluate MLLMs as multimodal automated theorem provers. MATP-BENCH consists of 1,056 multimodal theorems drawn from high school, university, and competition-level mathematics. Every problem is accompanied by formalizations in Lean 4, Coq, and Isabelle, making the benchmark compatible with a wide range of theorem-proving frameworks. Grounding our analysis in a Structural Causal Model, we identify the distinct challenge of MATP: it requires not only direct mapping of explicit inputs for theorem formalization but, more critically, latent causal planning to discover unobserved auxiliary constructions. Our evaluation reveals that while advanced MLLMs show promise in formalization, they struggle significantly with synthesizing these latent auxiliary constructions, often generating ineffective auxiliary steps or ignoring visual constraints.
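To make the formalization task concrete, the sketch below shows what a MATP-BENCH-style Lean 4 theorem statement for a simple geometry fact might look like. This is a hypothetical illustration, not an item from the benchmark; the theorem name is invented, and it assumes Mathlib's Euclidean geometry API. The `sorry` marks where the prover must supply the proof, possibly via the auxiliary constructions the paper discusses.

```lean
-- Hypothetical example: "the base angles of an isosceles triangle are equal",
-- stated over an abstract Euclidean space as in Mathlib. Illustrative only.
import Mathlib.Geometry.Euclidean.Angle.Unoriented.Basic

open EuclideanGeometry

theorem isosceles_base_angles
    {V : Type*} [NormedAddCommGroup V] [InnerProductSpace ℝ V]
    {P : Type*} [MetricSpace P] [NormedAddTorsor V P]
    (A B C : P) (h : dist A B = dist A C) :
    ∠ A B C = ∠ A C B := by
  sorry  -- proof to be synthesized by the automated theorem prover
```

In the benchmark setting, the model would have to produce both such a formal statement (from the figure and natural-language text) and a complete proof term or tactic script in place of `sorry`.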
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Multimodal; benchmarking; multilingual corpora; mathematical NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: English
Submission Number: 5648