Boosting Handwritten Mathematical Expression Recognition Through Contextual Reasoning with Vision Large Language Models (vLLMs)
Abstract: Handwritten Mathematical Expression Recognition (HMER) has traditionally relied on dedicated models designed specifically for the task. With the emergence of Vision Large Language Models (vLLMs), we investigate their potential for HMER. This paper presents a comprehensive benchmark study evaluating various state-of-the-art vLLMs on the CROHME dataset. We develop an experimental pipeline with three stages: language model inference, cleaning and conversion of the output to structural graphs, and evaluation using graph-based metrics. Our experiments explore two prompting strategies: without context (image-to-text only) and with context (image-to-text with surrounding contextual information). Results show that, while vLLMs demonstrate promising capabilities, dedicated models still outperform general-purpose vLLMs on HMER. We analyze the performance gaps and discuss future directions for improving vLLM-based mathematical expression recognition.
External IDs: dblp:conf/icdar/ZouaidXM25