The Photographer’s Eye: Teaching Multimodal Large Language Models to See and Critique like Photographers
Abstract: The photographer and curator Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding: while the former focuses on identifying the factual element in an image (sky), the latter transcends such object identification, viewing it instead as an aesthetic component, a pure color block (blue).
% a pure expanse of blue, appreciated for its contribution to visual aesthetics purely as a color block.
%
Such fundamental distinctions between general (detection, localization, etc.) and aesthetic (color, lighting, composition, etc.) visual understanding present a significant challenge for Multimodal Large Language Models (MLLMs).
% in understanding image aesthetics.
%
% limitations
%
Although some recent works have made initial explorations, they are often limited to general and basic aesthetic commonsense. As a result, they frequently fall short in real-world scenarios (Fig.~\ref{fig:head}), which require extensive expertise, including photographic techniques and photo pre/post-processing knowledge, to provide a detailed analysis and description.
%
To fundamentally enhance the aesthetic understanding of MLLMs, we first introduce a \textbf{novel dataset}, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts and characterized by its large scale, expertise, and diversity.
% characterized by (1) scale: a collection with 20 times the images of existing works, (2) expertise: insights from extensive discussions by photographers and enthusiasts, and (3) diversity: a wide range of photo types and aesthetic perspectives.
%
Then, to better learn visual aesthetics from PhotoCritique, we further propose a \textbf{novel model}, PhotoEye, featuring a language-guided multi-view vision fusion mechanism to understand image aesthetics from multiple perspectives.
%
Finally, we present a \textbf{novel benchmark}, PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding.
%
On both existing benchmarks and PhotoBench, our model demonstrates clear advantages over existing models. Datasets and code will be made publicly available.