Abstract: Multimodal large language models (MLLMs), which can answer complex questions about an image, struggle to tell the time on analog clocks. Reading an analog clock requires identifying the hands and their directions and then computing the corresponding time, a task that involves several distinct functions while remaining simple enough to allow a straightforward analysis of performance. In this article, we use this simple task to explore how MLLMs learn during training and fine-tuning. The results of our evaluation illustrate the limitations of MLLMs in generalizing and abstracting, even on simple tasks, and call for approaches that enable learning at higher levels of abstraction.
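The final step of the reading pipeline sketched above, converting hand directions into a time value, is purely arithmetic. A minimal Python sketch (a hypothetical helper, not code from the article) shows the conversion, assuming the hand angles are measured in degrees clockwise from the 12 o'clock position:

```python
def angles_to_time(hour_angle: float, minute_angle: float) -> tuple[int, int]:
    """Convert hand angles (degrees clockwise from 12) to (hour, minute)."""
    minute = round(minute_angle / 6) % 60  # the minute hand sweeps 6 degrees per minute
    hour = int(hour_angle // 30) % 12      # the hour hand sweeps 30 degrees per hour
    return (hour or 12, minute)            # report 12 rather than 0 on a clock face

# e.g. a clock showing 3:30: hour hand at 105 degrees, minute hand at 180 degrees
print(angles_to_time(105.0, 180.0))  # → (3, 30)
```

The arithmetic is trivial for a program; the article's point is that the perceptual steps feeding it (locating the hands and estimating their directions) are where MLLMs fall short.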
External IDs: doi:10.1109/mic.2025.3618144