Abstract: Image-text matching aims to bridge the semantic gap between the visual and textual modalities and is a fundamental task in multimodal learning. However, most Transformer-based multimodal retrieval architectures ignore the capture and learning of textual semantic structure. To address this issue, we propose a novel architecture named the Syntactic Dependency-Oriented Vision and Language Transformer (SDO-ViLT). Firstly, we introduce a syntax dependency parsing module that leverages directional syntax dependency graphs. From the syntax dependency graph with dependency directions, syntactic dependency features are obtained through Graph Convolutional Networks (GCNs), enhancing the modeling of structural semantic relations. This enforces the learning of predicate-centric structured dependency semantics, thereby addressing ambiguity issues. Secondly, we propose a dependency distance-aware strategy. To overcome the limitations imposed by positional (physical) distance in text processing and to improve retrieval efficiency, we construct a dependency distance-aware attention mechanism that prunes the dependency tree based on the dependency matrix. When computing the semantic matrix, we directly up-weight the core semantics represented by short dependency distances, enhancing the model's semantic understanding. The strategy retains the original accuracy while reducing the time required for text-image retrieval. A series of experiments verifies the effectiveness of the proposed method: recall (R@K) is improved by 1.7% on average, and by up to 2.2% in the best case.
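The two ingredients described above can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the sentence, the dependency edges, the single GCN layer with random weights, and the additive distance bias are all hypothetical stand-ins, chosen only to show how a directed dependency graph feeds a GCN and how shortest-path dependency distance can bias attention toward syntactically close tokens.

```python
import numpy as np

# Hypothetical 4-token sentence with token 2 as the root predicate.
# Dependency edges run head -> dependent (directed, as in SDO-ViLT's graphs).
edges = [(2, 0), (2, 1), (2, 3)]
n = 4

# Directed adjacency with self-loops (A_hat = A + I), row-normalized.
A = np.eye(n)
for head, dep in edges:
    A[head, dep] = 1.0
A_norm = np.diag(1.0 / A.sum(axis=1)) @ A

# One GCN layer: H' = ReLU(A_norm @ H @ W); W is random here, learned in practice.
rng = np.random.default_rng(0)
H = rng.standard_normal((n, 8))   # token features
W = rng.standard_normal((8, 8))
H_next = np.maximum(A_norm @ H @ W, 0.0)

# Dependency distance = shortest-path length in the undirected dependency tree,
# computed by BFS from every token.
U = np.minimum(A + A.T, 1.0)
dist = np.full((n, n), np.inf)
np.fill_diagonal(dist, 0.0)
for i in range(n):
    frontier, d = [i], 0
    while frontier:
        d += 1
        nxt = []
        for u in frontier:
            for v in range(n):
                if U[u, v] and dist[i, v] == np.inf:
                    dist[i, v] = d
                    nxt.append(v)
        frontier = nxt

# Distance-aware attention: an additive bias favors short dependency distances,
# so syntactically central (predicate-adjacent) tokens dominate the semantic matrix.
scores = H_next @ H_next.T / np.sqrt(H_next.shape[1]) - dist
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)
```

A pruning step in the same spirit would simply mask out token pairs whose dependency distance exceeds a threshold before the softmax, which is what lets the mechanism skip long-range pairs and reduce retrieval time.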