Counting the Corner Cases: Revisiting Robust Reading Challenge Data Sets, Evaluation Protocols, and Metrics

Published: 01 Jan 2024, Last Modified: 07 Apr 2025, ICDAR (4) 2024, License: CC BY-SA 4.0
Abstract: For two decades, robust reading challenges (RRCs) have driven and measured progress of text recognition systems in new and difficult domains. Such standardized benchmarks benefit the field by allowing participants and observers to systematically track steady performance improvements as interest in the problem continues to grow. To better understand their impacts and create opportunities for further improvements, this work empirically analyzes three important aspects of several challenges from the last decade: data sets, evaluation protocols, and competition metrics. First, we explore implications of certain annotation protocols. Second, we identify limitations in existing evaluation protocols that cause even the ground truth annotations to receive less than perfect scores. To remedy this, we propose evaluation protocol updates that boost both recall and precision. Accounting for these corner cases causes almost no changes to current rankings; however, such cases may become more prominent and important to consider as challenges focus on increasingly complex reading tasks. Finally, inspired by the recent HierText challenge’s use of Panoptic Quality (PQ), we explore the impact of this simple, parameter-free tightness-aware metric on six prior challenges, and we propose a new variant—Panoptic Character Quality (PCQ)—for simultaneously measuring character-level accuracy and word detection tightness. We find PQ-based metrics have a greater re-ranking impact on detection-only tasks, but predict end-to-end rankings slightly better than F-score. In sum, our empirical analysis and associated code should allow future challenge designers to make better-informed choices.
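For context, the Panoptic Quality metric referenced above follows the standard definition from panoptic segmentation: predictions and ground truths are matched one-to-one at an IoU above 0.5, and the score averages the tightness of true-positive matches while penalizing unmatched predictions and unmatched ground truths. The sketch below shows that standard formulation only; the paper's PCQ variant additionally incorporates character-level accuracy, and its exact form is defined in the full text.

$$
\mathrm{PQ} \;=\; \frac{\sum_{(p,\,g)\in TP} \mathrm{IoU}(p,g)}{|TP| + \tfrac{1}{2}|FP| + \tfrac{1}{2}|FN|}
$$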