Keywords: Vision-Language Models, Zero-shot Learning, Anomaly Detection, Dataset Benchmarking, Medical Imaging, Brain MRI, Multi-modal Data, Rare Diseases
TL;DR: NOVA is an extreme OOD stress-test dataset of ∼900 multi-modal brain MRI scans (covering 281 rare pathologies) for benchmarking VLMs on three clinical tasks: anomaly localization, captioning, and diagnostic reasoning.
Abstract: In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Open-world recognition ensures that such systems remain robust as ever-emerging, previously _unknown_ categories appear and must be addressed without retraining.
Foundation and vision-language models are pre-trained on large and diverse datasets with the expectation of broad generalization across domains, including medical imaging.
However, benchmarking these models on test sets with only a few common outlier types silently collapses the evaluation back to a closed-set problem, masking failures on rare or truly novel conditions encountered in clinical use.
We therefore present NOVA, a challenging, real-life _evaluation-only_ benchmark of $\sim$900 brain MRI scans that span 281 rare pathologies and heterogeneous acquisition protocols. Each case includes rich clinical narratives and double-blinded expert bounding-box annotations. Together, these enable joint assessment of anomaly localisation, visual captioning, and diagnostic reasoning.
Because NOVA is never used for training, it serves as an _extreme_ stress-test of out-of-distribution generalisation: models must bridge a distribution gap both in sample appearance and in semantic space.
Baseline results with leading vision-language models (GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B) reveal substantial performance drops, with approximately a 65\% gap in localisation compared to natural-image benchmarks and 40\% and 20\% gaps in captioning and reasoning, respectively, compared to resident radiologists. NOVA thus establishes a testbed for advancing models that can detect, localise, and reason about truly unknown anomalies.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/c-i-ber/Nova
Supplementary Material: pdf
Primary Area: AI/ML Datasets & Benchmarks for health sciences (e.g. climate, health, life sciences, physics, social sciences)
Submission Number: 2090