Relative Drawing Identification Complexity is Invariant to Modality in Vision-Language Models

Diogo Nuno Freitas; Brigt Håvardstun; Cesar Ferri; Dario Garigliotti; Jan Arne Telle; Jose Hernandez-Orallo

27 Sept 2024 (modified: 05 Feb 2025) | Submitted to ICLR 2025 | Readers: Everyone | License: CC BY 4.0

Keywords: Multimodal Language Models, Machine Teaching, Concept Identification, Drawing Simplification

TL;DR: Multimodal GPT-4 is evaluated using machine teaching on concepts presented in image and TikZ formats. The image-based presentation is more effective, but both modalities rank concept complexity similarly, suggesting that simplicity transcends modality.

Abstract: Large language models have become multimodal, and many of them are said to integrate their modalities using common representations. If this were true, a drawing of a car as an image, for instance, should map to a similar area in the latent space as a textual description of the strokes that make up the drawing. To explore this under black-box access to these models, we propose the use of machine teaching, a theory that studies the minimal set of examples a teacher needs to choose so that the learner captures the concept. In particular, we apply this to GPT-4V, a multimodal version of GPT-4 that includes support for image analysis, to evaluate the complexity of teaching a subset of objects in the _Quick, Draw!_ dataset using two presentations: raw images as bitmaps and trace coordinates in TikZ format. The results indicate that image-based representations generally require fewer segments and achieve higher accuracy than coordinate-based representations. Surprisingly, however, for concepts recognized in both modalities, the teaching size ranks concepts similarly across the two, even when controlling for (a human proxy of) concept priors. This could also suggest that the simplicity of concepts is an inherent property that transcends modality representations.
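The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not specify the exact protocol: the function names, the `learner` callable (standing in for a query to the model), and the use of Spearman rank correlation to compare rankings are all assumptions made for illustration. The sketch renders a drawing's segments as TikZ text, takes the teaching size of a concept to be the smallest number of segments at which the learner identifies it, and compares how two modalities rank concepts by that size.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def segments_to_tikz(segments: Sequence[Segment]) -> str:
    """Render straight-line segments as a TikZ picture (hypothetical format)."""
    lines = [r"\begin{tikzpicture}"]
    for (x0, y0), (x1, y1) in segments:
        lines.append(rf"  \draw ({x0:g},{y0:g}) -- ({x1:g},{y1:g});")
    lines.append(r"\end{tikzpicture}")
    return "\n".join(lines)


def teaching_size(
    simplifications: Sequence[Sequence[Segment]],
    learner: Callable[[Sequence[Segment]], str],
    concept: str,
) -> Optional[int]:
    """Fewest segments among the candidate drawings that the learner
    identifies as `concept`; None if no candidate is identified.
    `learner` is an assumed stand-in for a black-box model query."""
    for segs in sorted(simplifications, key=len):
        if learner(segs) == concept:
            return len(segs)
    return None


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors.
    Undefined (division by zero) if either input is constant."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5
```

Under these assumptions, one would compute a teaching size per concept in each modality, keep the concepts identified in both, and pass the two lists to `spearman`; a value near 1 would correspond to the two modalities ranking concepts similarly.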

Supplementary Material: pdf

Primary Area: interpretability and explainable AI

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 9649
