Abstract: Text samples captured in natural scenes often come from different scripts written in various directions, resulting in significant diversity in character sets, orientations, and text lengths. Current methods handle this diversity by routing input images to corresponding (sub)modules using hardwired rules, usually tied to image aspect-ratio ranges. However, setting hardwired rules requires extensive knowledge of the test data and model-building experience, and is therefore often suboptimal. To resolve this, we propose an end-to-end trainable watch-and-act (WnA) framework, which first watches thumbnails to generate a routing plan and then routes the samples to the corresponding experts to produce the recognition results. The framework improves robustness over the corresponding rule-based baselines by 2% line accuracy on the recent multi-orientation open-set text recognition benchmark. As a system, the proposed framework also shows a 3–10% line-accuracy advantage over previous open-set text recognition methods on horizontal samples.
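The watch-and-act pipeline described above (inspect a cheap thumbnail, plan a route, then dispatch to an expert) can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the `plan` function below uses a toy aspect-ratio threshold as a stand-in, whereas the paper's point is that this decision is produced by a trainable planner rather than a hardwired rule. All function and expert names are assumptions for illustration.

```python
# Minimal sketch of a watch-then-route recognition pipeline (hypothetical).
from typing import Callable, Dict, Tuple

def aspect_ratio(width: int, height: int) -> float:
    """Cheap 'thumbnail' descriptor: width / height of the image."""
    return width / height

def plan(descriptor: float) -> str:
    """Stand-in for the planner. Here it is a hardwired threshold rule;
    in the WnA framework this decision would come from a trainable module."""
    return "horizontal_expert" if descriptor >= 1.0 else "vertical_expert"

def recognize(sample: Tuple[int, int, str],
              experts: Dict[str, Callable[[str], str]]) -> str:
    """Watch (compute the descriptor), act (route to the chosen expert)."""
    width, height, payload = sample
    route = plan(aspect_ratio(width, height))
    return experts[route](payload)

# Toy experts that just tag their input so the routing is visible.
experts = {
    "horizontal_expert": lambda text: "H:" + text,
    "vertical_expert": lambda text: "V:" + text,
}

print(recognize((128, 32, "hello"), experts))  # wide image -> H:hello
print(recognize((32, 128, "world"), experts))  # tall image -> V:world
```

The design choice this illustrates is the separation of a lightweight routing decision from the heavyweight per-expert recognition, which is what makes the planner cheap enough to run on every sample.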
External IDs: dblp:conf/icdar/LiuS25