The Photographer’s Eye: Teaching Multimodal Large Language Models to See and Critique like Photographers

Daiqing Qi, Handong Zhao, Jing Shi, Simon Jenni, Yifei Fan, Franck Dernoncourt, Scott Cohen, Sheng Li

Published: 20 Jun 2025, Last Modified: 23 Sept 2025, CVPR 2025, Everyone, CC BY 4.0

Abstract: The photographer and curator Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding: while the former focuses on identifying the factual element in an image (sky), the latter transcends such object identification, viewing it instead as an aesthetic component, a pure color block (blue). Such fundamental distinctions between general (detection, localization, etc.) and aesthetic (color, lighting, composition, etc.) visual understanding present a significant challenge for Multimodal Large Language Models (MLLMs). Although some recent works have made initial explorations, they are often limited to general and basic aesthetic commonsense. As a result, they frequently fall short in real-world scenarios (Fig.~\ref{fig:head}), which require extensive expertise, including photographic techniques, photo pre/post-processing knowledge, and more, to provide a detailed analysis and description. To fundamentally enhance the aesthetic understanding of MLLMs, we first introduce a \textbf{novel dataset}, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts and characterized by its large scale, expertise, and diversity. Then, to better learn visual aesthetics from PhotoCritique, we further propose a \textbf{novel model}, PhotoEye, featuring a language-guided multi-view vision fusion mechanism to understand image aesthetics from multiple perspectives. Finally, we present a \textbf{novel benchmark}, PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. On existing benchmarks and PhotoBench, our model demonstrates clear advantages over existing models. Datasets and code will be publicly available.