Abstract: While encoder-decoder architectures have significantly advanced Handwritten Mathematical Expression Recognition (HMER), current language model (LM)-based correction methods face inherent limitations. Although LMs can mitigate contextual errors by leveraging textual semantics, their effectiveness is ultimately hindered by the semantic-poverty problem: mathematical expressions convey complex structural relationships that go beyond linguistic representation alone. In this paper, we argue that visual features, such as the spatial relationships between symbols, should be given equal importance alongside linguistic information when correcting HMER system outputs. We introduce the Position-Aware Cross-Modality correction framework, which places visual and linguistic evidence on an equal footing for error rectification. Specifically, we propose a Dual-Stream Fusion Module that integrates symbol-level visual features with linguistic context in parallel, enabling joint visual-linguistic reasoning to resolve ambiguities caused by segmentation errors and symbol misinterpretations. Additionally, we introduce a Position-Aware Module that applies dynamic spatial attention driven by symbol-density heatmaps, allowing adaptive focus on both global layout and local symbol interactions. Experiments on two widely used benchmark datasets validate the superiority of our method, achieving state-of-the-art performance of 62.32% / 61.58% / 63.38% on CROHME 2014 / 2016 / 2019 and 70.81% on HME100K.
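The abstract does not give implementation details for either module, so the following PyTorch sketch is only one plausible reading of the two ideas: a gated dual-stream fusion of symbol-level visual features with linguistic context, and spatial attention whose logits are biased by a symbol-density heatmap. All class names, signatures, and the specific gating and bias mechanisms here are our assumptions for illustration, not the paper's actual method.

```python
import torch
import torch.nn as nn


class DualStreamFusion(nn.Module):
    """Hypothetical sketch: fuse symbol-level visual features with
    linguistic context in parallel via a learned per-dimension gate."""

    def __init__(self, dim: int):
        super().__init__()
        self.vis_proj = nn.Linear(dim, dim)   # visual stream
        self.lang_proj = nn.Linear(dim, dim)  # linguistic stream
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        v, l = self.vis_proj(vis), self.lang_proj(lang)
        g = self.gate(torch.cat([v, l], dim=-1))  # mixing weight in (0, 1)
        return g * v + (1.0 - g) * l              # joint visual-linguistic feature


class PositionAwareAttention(nn.Module):
    """Hypothetical sketch: spatial attention biased by a symbol-density
    heatmap, so dense regions attract local attention while sparse
    regions leave the global layout visible."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, tokens: torch.Tensor, feat_map: torch.Tensor,
                density: torch.Tensor) -> torch.Tensor:
        # tokens:   (B, T, D) decoder states for the candidate correction
        # feat_map: (B, N, D) flattened visual feature grid
        # density:  (B, N)    symbol-density heatmap, values in [0, 1]
        q, k, v = self.q(tokens), self.k(feat_map), self.v(feat_map)
        logits = torch.einsum('btd,bnd->btn', q, k) * self.scale
        # Additive log-density bias; clamp avoids -inf where density is 0.
        logits = logits + density.unsqueeze(1).log().clamp(min=-10.0)
        attn = logits.softmax(dim=-1)
        return torch.einsum('btn,bnd->btd', attn, v)
```

The additive log-density bias is just one simple way to realize "dynamic spatial attention driven by symbol density"; the paper's Position-Aware Module may compute or inject the heatmap differently.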
External IDs: dblp:conf/icdar/LiWSMWZ25