Keywords: Building year estimation, Ordinal regression, VLM
TL;DR: We introduce YearGuessr, a 55K-image building-age dataset revealing that vision-language models achieve up to 34% higher accuracy on famous landmarks, suggesting they memorize popular structures rather than learning architectural features.
Abstract: Building age is a crucial yet underexplored factor for sustainability, heritage, and safety, yet it lacks a public benchmark that is both global and ordinal. We find that state-of-the-art vision-language models achieve up to 34% better accuracy on famous landmarks than on ordinary buildings, suggesting they memorize popular structures from training data rather than learning generalizable architectural features. To investigate this phenomenon, we introduce the largest open benchmark for building age estimation: the **YearGuessr** dataset and our proposed baseline model, **YearCLIP**. **YearGuessr** comprises 55,546 Wikipedia facades spanning 157 countries (albeit with a geographic skew toward Western architecture), with continuous ordinal labels from 1001 to 2024 CE and rich multi-modal attributes including GPS coordinates, captions, and page-view counts. We frame age prediction as ordinal regression and introduce popularity-based MAE and interval accuracy ($\pm$5/20/50/100 yr) metrics. In addition, we benchmark 30+ models, including CNN-based, Transformer-based, and CLIP-based models as well as VLMs. Our **YearCLIP** model shows that ordinal training halves MAE and that GPS priors further reduce error. Zero-shot VLMs excel on landmarks but struggle on unrecognized buildings, exposing a popularity bias that our metric captures. We will make our dataset and code publicly available, offering the largest open benchmark for building age estimation and reasoning.
Primary Area: datasets and benchmarks
Submission Number: 2844