Abstract: In recent years, the dominant paradigm for text spotting has been to combine the tasks of text detection and recognition into a single end-to-end framework. Under this paradigm, both tasks are accomplished by operating over a shared global feature map extracted from the input image. Among the main challenges that end-to-end approaches face is performance degradation when recognizing text across scale variations (smaller or larger text) and arbitrary word rotation angles. In this
work, we address these challenges by proposing a novel global-to-local attention mechanism for text spotting, termed GLASS, which fuses global and local features. The global features are extracted from the shared backbone, preserving contextual information from the entire image, while the local features are computed individually on resized, high-resolution rotated word crops. The information extracted from the local crops alleviates much of the inherent difficulty with scale and word rotation. We present a performance analysis across scales and angles, highlighting improvements at scale and angle extremities. In addition, we
introduce an orientation-aware loss term supervising the detection task,
and show its contribution to both detection and recognition performance
across all angles. Finally, we demonstrate the generality of GLASS by incorporating it into other leading text spotting architectures, improving their performance. Our method achieves state-of-the-art results on
multiple benchmarks, including the newly released TextOCR.
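To make the fusion concrete, below is a minimal sketch of a global-to-local attention gate in PyTorch. The module name GlobalLocalFusion, the channel dimensions, and the sigmoid gating are illustrative assumptions of ours; the sketch shows one plausible way to blend RoI-pooled global context with features from high-resolution word crops, not the paper's actual implementation.

```python
# Minimal sketch of fusing per-word global and local features with a
# learned attention gate. Module name, dimensions, and the sigmoid
# gating are illustrative assumptions, not the GLASS implementation.
import torch
import torch.nn as nn

class GlobalLocalFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Predict a per-channel, per-location gate from the
        # concatenated global and local features.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, global_feats: torch.Tensor,
                local_feats: torch.Tensor) -> torch.Tensor:
        # global_feats: RoI-pooled features from the shared backbone (N, C, H, W)
        # local_feats:  features computed on resized, rotated word crops (N, C, H, W)
        g = self.gate(torch.cat([global_feats, local_feats], dim=1))
        # Blend: the gate decides how much to trust global context
        # versus high-resolution local detail at each position.
        return g * global_feats + (1.0 - g) * local_feats

# Usage: fuse features for a batch of 4 detected words.
fusion = GlobalLocalFusion(channels=256)
g = torch.randn(4, 256, 8, 32)   # pooled from the shared feature map
l = torch.randn(4, 256, 8, 32)   # from per-word high-resolution crops
fused = fusion(g, l)             # (4, 256, 8, 32), fed to the recognizer
```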