Abstract: This paper highlights a problem with the evaluation metrics adopted in open-vocabulary segmentation. The evaluation process relies heavily on closed-set metrics in zero-shot or cross-dataset pipelines, without considering the similarity between predicted and ground-truth categories. To tackle this issue, we first survey eleven similarity measurements between two categorical words, based on WordNet linguistic statistics, text embeddings, or language models, through comprehensive quantitative analysis and a user study. Building on these measurements, we design novel evaluation metrics, Open mIoU, Open AP, and Open PQ, tailored to three open-vocabulary segmentation tasks. We benchmark the proposed metrics on twelve open-vocabulary methods across the three segmentation tasks. Despite the inherent subjectivity of similarity distances, we demonstrate that our metrics can still faithfully evaluate the open ability of existing open-vocabulary segmentation methods. We hope our work brings the community new thinking about how to evaluate model ability for open-vocabulary segmentation.
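To make the idea of similarity-aware scoring concrete, below is a minimal illustrative sketch, not the paper's actual Open mIoU definition. It assumes NLTK's WordNet interface (one of the similarity sources mentioned in the abstract) and uses hypothetical helpers `category_similarity` and `open_iou` to show how a semantic similarity between a predicted and a ground-truth category name could act as partial credit on top of a standard IoU.

```python
# Illustrative sketch (assumption: not the paper's exact metric). Uses NLTK's
# WordNet to score category-name similarity, then weights IoU by it.
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')


def category_similarity(pred_name: str, gt_name: str) -> float:
    """Max Wu-Palmer similarity over noun synsets of the two category words."""
    pred_syns = wn.synsets(pred_name, pos=wn.NOUN)
    gt_syns = wn.synsets(gt_name, pos=wn.NOUN)
    if not pred_syns or not gt_syns:
        return 1.0 if pred_name == gt_name else 0.0
    return max(p.wup_similarity(g) or 0.0 for p in pred_syns for g in gt_syns)


def open_iou(intersection: float, union: float, pred_name: str, gt_name: str) -> float:
    """Similarity-weighted IoU: full credit for exact category matches,
    partial credit for semantically close predictions."""
    if union == 0:
        return 0.0
    return category_similarity(pred_name, gt_name) * (intersection / union)


if __name__ == "__main__":
    # A 'sofa' mask predicted where the ground truth is 'couch' keeps most of
    # its score instead of being counted as a plain miss by a closed-set metric.
    print(open_iou(80.0, 100.0, "sofa", "couch"))
```

In this toy version, the similarity term is what distinguishes an open-vocabulary metric from its closed-set counterpart; the paper instead studies eleven such measurements and folds the chosen one into mIoU, AP, and PQ.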
DOI: 10.1109/TPAMI.2025.3562930