LCA-on-the-Line: Benchmarking Out-of-Distribution Generalization with Class Taxonomies

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Out-of-Distribution Generalization, representation evaluation, Hierarchy, Vision Language Model, Class Taxonomy, Zero-shot
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose using LCA distance on the WordNet hierarchy to estimate ImageNet OOD performance and, for the first time, show a strong linear correlation across 75 models (including vision-only models and VLMs) over five natural distribution shift datasets.
Abstract: In this paper, we address the challenge of assessing model generalization under Out-of-Distribution (OOD) conditions. We reintroduce the Least Common Ancestor (LCA) distance, a metric that has been largely overlooked since the early ImageNet era. Leveraging the WordNet hierarchy, we use the LCA to measure the taxonomic distance between labels and predictions, presenting it as a benchmark for model generalization. The LCA metric proves especially robust compared to previous state-of-the-art metrics when evaluating diverse models, including both vision-only and vision-language models, on natural distribution shift datasets. To validate our benchmark's efficacy, we perform an extensive empirical study on 75 models spanning five distinct ImageNet-OOD datasets. Our findings reveal a strong linear correlation between in-domain ImageNet LCA scores and OOD Top-1 performance across ImageNet-S/R/A/ObjectNet. This discovery gives rise to a novel evaluation framework, termed "LCA-on-the-Line", that facilitates unified and consistent assessment across a broad spectrum of models and datasets. Besides introducing an evaluative tool, we also delve into the ties between the LCA metric and model generalization. By aligning model predictions more closely with the WordNet hierarchy and refining prompt engineering in zero-shot vision-language models, we offer tangible strategies to improve model generalization. We challenge the prevailing notion that LCA offers no added evaluative value over top-1 accuracy; our research provides actionable insights and techniques to enhance model robustness and generalization across various tasks and scenarios.
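To make the metric concrete, below is a minimal sketch of one common LCA-distance definition on a toy child-to-parent taxonomy. All class names here are hypothetical illustrations; the paper itself measures taxonomic distance on the full WordNet hierarchy, and other variants (e.g., counting edges from both nodes to the LCA) exist.

```python
# Hypothetical toy taxonomy: child -> parent map with a single root.
PARENT = {
    "tabby": "cat", "siamese": "cat",
    "cat": "carnivore", "dog": "carnivore",
    "carnivore": "mammal", "salmon": "fish",
    "mammal": "animal", "fish": "animal",
    "animal": None,  # root
}

def ancestors(node):
    """Return the path from node up to the root, inclusive."""
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path

def lca_distance(label, prediction):
    """Edges from the label up to the lowest common ancestor of
    label and prediction (0 when prediction is exactly right)."""
    pred_ancestors = set(ancestors(prediction))
    for hops, node in enumerate(ancestors(label)):
        if node in pred_ancestors:
            return hops
    raise ValueError("nodes share no common ancestor")

print(lca_distance("tabby", "tabby"))    # 0: exact match
print(lca_distance("tabby", "siamese"))  # 1: nearest shared ancestor is 'cat'
print(lca_distance("tabby", "salmon"))   # 4: only the root 'animal' is shared
```

A mistake within the same fine-grained family ("tabby" vs. "siamese") thus incurs a far smaller penalty than a cross-branch error ("tabby" vs. "salmon"), which is exactly the severity information that plain top-1 accuracy discards.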
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4697