Abstract: Text-based referring expression comprehension requires reading and understanding scene text in an image to locate a specific object described by a natural language expression. Existing methods predominantly focus on the literal interpretation of scene text and often fall short when the scene text has only a tenuous connection to the objects in the referring expressions. To address this limitation, we introduce a novel approach that leverages the implicit contextual knowledge underlying scene text. Specifically, we construct a comprehensive knowledge base from the Amazon Review Data dataset, which serves as the foundation for our proposed knowledge-enhanced scene text encoder. This encoder excels at integrating common knowledge to extract features more effectively. A key advantage of our method is its compatibility with existing text-based referring expression comprehension frameworks, into which it can be easily integrated to improve performance. Furthermore, we improve the quality of the expressions in the TextREC dataset by re-annotating the referring expressions with richer semantic information. Experiments on two benchmark datasets, TextREC-v2 and RefText, show that our method outperforms the state of the art by 3.3% and 1.0%, respectively, in terms of the precision@1 measure.