ELSA: Evaluating Localization of Social Activities in Urban Streets using Open-Vocabulary Detection

30 May 2024 (modified: 13 Nov 2024), Submitted to NeurIPS 2024 Track Datasets and Benchmarks, CC BY 4.0
Keywords: computer vision, grounded vision-language models, open-vocabulary object detection, social interaction recognition, benchmark
TL;DR: A benchmark dataset and evaluation framework for social action recognition in urban street-view imagery using open-vocabulary object detection models.
Abstract: Existing Open Vocabulary Detection (OVD) models face several challenges. They often struggle with semantic consistency across diverse inputs and are sensitive to slight variations in input phrasing, which leads to inconsistent performance. The calibration of their predictive confidence, especially in complex multi-label scenarios, remains suboptimal and frequently results in overconfident predictions that do not accurately reflect their understanding of context. Understanding these limitations requires multi-label detection benchmarks, and social interaction is one particularly challenging domain for them. Because multi-label benchmarks for social interactions are lacking, we present ELSA: Evaluating Localization of Social Activities. ELSA draws on theoretical frameworks in urban sociology and design and uses in-the-wild street-level imagery, where the size of social groups and the types of activities can vary significantly. ELSA includes more than 900 manually annotated images with more than 4,000 multi-labeled bounding boxes for individual and group activities. We introduce a novel re-ranking method for predictive confidence and new evaluation techniques for OVD models. We report results on Grounding DINO, a widely used state-of-the-art (SOTA) model. Our evaluation protocol considers semantic stability and localization accuracy and sheds more light on the limitations of existing approaches.
Supplementary Material: zip
Submission Number: 2390