Hierarchical Multi-label Learning for Incremental Multilingual Text Recognition

Published: 20 Jul 2024, Last Modified: 24 Jul 2024, Venue: MM2024 Poster, License: CC BY 4.0
Abstract: Multilingual text recognition (MLTR) is increasingly essential for facilitating cultural communication. However, existing methods often struggle to retain knowledge of previously learned languages when learning new ones. A straightforward solution is to apply incremental learning (IL) to MLTR tasks, but this ignores the words and characters shared across incremental languages, which we are the first to term the incremental sharing problem. Motivated by this observation, we propose a HierArchical Multi-label learning framework for Multilingual tExt Recognition, termed HAMMER. An online knowledge analysis is designed to identify shared knowledge and provide the corresponding multi-label language supervision. Specifically, only words and characters that appear in multiple languages are considered shared knowledge. Additionally, to further capture language dependencies, we introduce a hierarchical language evaluation mechanism that predicts language scores at the word and character levels. These scores, supervised by the knowledge analysis, guide the specific recognizers to effectively utilize both old and new language knowledge, thereby mitigating the catastrophic forgetting caused by imbalanced rehearsal sets. Extensive experiments on the benchmark datasets MLT17 and MLT19 show that HAMMER outperforms other state-of-the-art approaches.
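The shared-knowledge idea in the abstract can be illustrated with a small sketch. This is not the HAMMER implementation: the function names (build_language_labels, multi_hot, shared_items), the toy vocabularies, and the offline, set-based bookkeeping are all assumptions made for illustration. The sketch only shows how words and characters that occur in more than one language could be identified and given multi-label (multi-hot) language targets.

```python
from collections import defaultdict


def build_language_labels(corpora):
    """Map each word and each character to the set of languages it occurs in.

    corpora: dict mapping a language name to an iterable of word strings.
    Returns (word_langs, char_langs), each a dict of str -> set of languages.
    """
    word_langs = defaultdict(set)
    char_langs = defaultdict(set)
    for lang, words in corpora.items():
        for word in words:
            word_langs[word].add(lang)
            for ch in word:
                char_langs[ch].add(lang)
    return dict(word_langs), dict(char_langs)


def multi_hot(langs_present, all_langs):
    """Encode a set of languages as a multi-hot vector ordered like all_langs."""
    return [1 if lang in langs_present else 0 for lang in all_langs]


def shared_items(item_langs):
    """Keep only the items that occur in two or more languages."""
    return {item: langs for item, langs in item_langs.items() if len(langs) > 1}


# Toy vocabularies, invented for illustration only (not from MLT17 or MLT19).
corpora = {
    "english": ["taxi", "hotel", "open"],
    "french": ["taxi", "hotel", "ouvert"],
}
all_langs = sorted(corpora)  # fixed label order: ['english', 'french']

word_langs, char_langs = build_language_labels(corpora)
shared_words = shared_items(word_langs)  # words seen in both languages
shared_chars = shared_items(char_langs)  # characters seen in both languages

# Multi-label targets: a shared word gets more than one active language.
taxi_target = multi_hot(word_langs["taxi"], all_langs)
open_target = multi_hot(word_langs["open"], all_langs)
```

In this toy setting, "taxi" and "hotel" count as shared words, while "open" and "ouvert" each keep a single active language. In an incremental setting the mappings would presumably be updated as each new language arrives, which this sketch does not model.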
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: Text recognition is closely related to multimedia because it involves extracting text from images, videos, and documents. Text appears in many media types, and recognizing it enables tasks such as content retrieval, translation, and sentiment analysis. Integrating text recognition with multimedia processing allows for more comprehensive analysis and improves the accessibility and searchability of multimedia content. Scene text recognition (STR) identifies text in natural scene images, yielding textual descriptions of the images that help computers understand their content. Multilingual text recognition (MLTR) is a subfield of scene text recognition that identifies scene text in multiple languages, which makes it a more challenging task.
Supplementary Material: zip
Submission Number: 3520