Not All That’s Colorful Is Real: Rethinking Metrics for Image Colorization

Published: 24 Sept 2025, Last Modified: 07 Nov 2025
Venue: NeurIPS 2025 Workshop GenProCC
License: CC BY 4.0
Track: Regular paper
Keywords: colorization, metric, evaluation, image editing, benchmark
Abstract: Image colorization is the task of adding plausible color to grayscale images. Unlike tasks with a well-defined ground truth, colorization is inherently ambiguous: a grayscale scene admits many plausible colorizations. Consequently, reference-based metrics are ill-suited to the problem. Distribution-level metrics such as FID cannot evaluate a single image, and colorfulness scores often fail to reflect perceptual naturalness. We study how to evaluate image colorization at the single-image level. We benchmark 20+ no-reference image quality assessment (NR-IQA) metrics and colorfulness variants across three datasets and over 100k colorized images, and we introduce a rank-based framework that measures how well each metric ranks the real image relative to its colorized variants. We find that NR-IQA metrics, especially HyperIQA and TOPIQ, consistently prefer real images over synthetic ones and align with distribution-level trends while providing per-image interpretability. Our study positions NR-IQA as a practical tool for evaluating colorization realism and offers a diagnostic benchmark for future methods.
Submission Number: 20
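
The rank-based comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact rank definition, so the choices here are assumptions: the real image is ranked against its own colorized variants under one metric, rank 1 is best, and ties are resolved in favor of the real image. The function names `real_image_rank` and `summarize_ranks` are hypothetical and not taken from the paper.

```python
from typing import Dict, Sequence


def real_image_rank(
    real_score: float,
    variant_scores: Sequence[float],
    higher_is_better: bool = True,
) -> int:
    """Rank of the real image among itself and its colorized variants.

    Rank 1 means the metric scored the real image best. Ties count in
    favor of the real image (an assumption, not the paper's rule).
    """
    # Flip the sign for metrics where a lower score means better quality.
    sign = 1.0 if higher_is_better else -1.0
    strictly_better = sum(
        1 for s in variant_scores if sign * s > sign * real_score
    )
    return 1 + strictly_better


def summarize_ranks(ranks: Sequence[int]) -> Dict[str, float]:
    """Aggregate per-image ranks into a mean rank and a top-1 rate."""
    n = len(ranks)
    if n == 0:
        raise ValueError("ranks must not be empty")
    return {
        "mean_rank": sum(ranks) / n,
        "top1_rate": sum(1 for r in ranks if r == 1) / n,
    }


if __name__ == "__main__":
    # Made-up scores for one grayscale source: the real image and three
    # colorized variants, scored by a single higher-is-better metric.
    rank = real_image_rank(0.8, [0.5, 0.9, 0.7])
    print(rank)
    print(summarize_ranks([rank, 1, 1, 4]))
```

Under these assumptions, a metric that tracks realism well would give a mean rank close to 1 and a top-1 rate close to 1 across a dataset, while the per-image ranks remain available for inspecting individual failures.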