Abstract: Recently, the Detection Transformer (DETR) has become a popular paradigm in object detection by virtue of eliminating complicated post-processing procedures. Several previous works have explored DETR for scene text detection. However, arbitrary-shaped texts in the wild vary greatly in scale, so directly predicting the control points of text instances may yield sub-optimal training efficiency and performance. To address this problem, this paper proposes the Scalable Text Detection Transformer (SText-DETR), a concise DETR framework that uses scalable queries and a content prior to improve detection performance and accelerate training. The whole pipeline is built upon the two-stage variant of Deformable-DETR. In particular, we present a Scalable Query Module in the decoder stage that modulates the positional query with each text instance's width and height, making each text instance more sensitive to its scale. Moreover, a Content Prior is introduced as auxiliary information to provide a better prior and speed up the training process. We conduct extensive experiments on three curved-text benchmarks: Total-Text, CTW1500, and ICDAR19 ArT. Results show that the proposed SText-DETR surpasses most existing methods and achieves performance comparable to the state-of-the-art.
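To make the idea of scale-modulated positional queries concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation): a sinusoidal positional embedding of a box center is rescaled by the ratio of a reference size to the instance's width and height, so that wider or taller text instances induce a correspondingly adjusted positional query, in the spirit of modulated queries in DAB-DETR-style decoders. All function names and the `ref_size` parameter are illustrative assumptions.

```python
import math

def sine_embed(coord, dim=8, temperature=10000.0):
    # Standard sinusoidal encoding of one scalar coordinate in [0, 1].
    out = []
    for i in range(dim // 2):
        freq = temperature ** (2 * i / dim)
        out.append(math.sin(coord * 2 * math.pi / freq))
        out.append(math.cos(coord * 2 * math.pi / freq))
    return out

def scalable_position_query(x, y, w, h, dim=8, ref_size=0.1):
    # Hypothetical sketch: scale each coordinate's embedding by the ratio
    # of a reference size to the instance's width/height, so the query's
    # magnitude (and hence its attention spread) adapts to text scale.
    qx = [v * ref_size / max(w, 1e-6) for v in sine_embed(x, dim)]
    qy = [v * ref_size / max(h, 1e-6) for v in sine_embed(y, dim)]
    return qx + qy
```

Under this sketch, two instances at the same center but with different widths receive positional queries of different magnitudes along the x-dimension, which is one simple way a decoder could be made sensitive to per-instance scale.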