Abstract: Multimodal large language models (MLLMs), which can answer complex questions about an image, struggle to tell the time on analog clocks. Reading an analog clock requires identifying the hands and their directions and then computing the corresponding time, a task that involves several distinct functions while remaining simple enough to allow a straightforward analysis of performance. In this article, we use this simple task to explore how MLLMs learn during training and fine-tuning. The results of our evaluation illustrate the limitations of MLLMs in generalizing and abstracting, even on simple tasks, and call for approaches that enable learning at higher levels of abstraction.
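The final step of the reading pipeline sketched above, converting hand directions into a time value, is purely arithmetic. A minimal Python sketch (a hypothetical helper, not code from the article) shows the conversion, assuming the hand angles are measured in degrees clockwise from the 12 o'clock position:

```python
def angles_to_time(hour_angle: float, minute_angle: float) -> tuple[int, int]:
    """Convert hand angles (degrees clockwise from 12) to (hour, minute)."""
    minute = round(minute_angle / 6) % 60  # the minute hand sweeps 6 degrees per minute
    hour = int(hour_angle // 30) % 12      # the hour hand sweeps 30 degrees per hour
    return (hour or 12, minute)            # report 12 rather than 0 on a clock face

# e.g. a clock showing 3:30: hour hand at 105 degrees, minute hand at 180 degrees
print(angles_to_time(105.0, 180.0))  # → (3, 30)
```

The arithmetic is trivial for a program; the article's point is that the perceptual steps feeding it (locating the hands and estimating their directions) are where MLLMs fall short.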
External IDs: doi:10.1109/mic.2025.3618144