Keywords: multimodal deep learning, vision-and-language, semantic knowledge, conceptual knowledge, representational analysis, human-likeness
Abstract: Learning conceptual structures requires acquiring knowledge of how members of a class share a set of semantic properties. The challenge is that some properties are more efficiently learned through perceptual experience (e.g., an image of a dog that shows its texture, shape and color), while others benefit from language input (e.g., “a dog is a mammal”). Unlike human brains, unimodal machine learning systems are therefore fundamentally limited in this respect. In contrast, systems integrating multimodal information should be able to learn a more human-like representational space, since they can leverage both complementary sources of information. Multimodal neural network models offer a unique opportunity to test this hypothesis. We evaluate this proposal through a series of experiments on architecturally diverse vision-and-language networks trained on massive image-caption datasets. We introduce an analytic framework that characterizes the semantic information underlying the discrimination of concepts (i.e., lexicalized categories) through image-text matching tasks and representational similarity analysis. We further compare how this discrimination (i.e., the model’s “conceptual behavior”) differs from that of humans and unimodal networks, and to what extent it depends on the multimodal encoder mechanism. Our results suggest promising avenues to align human and machine representational invariants via multimodal inputs.
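Below is a minimal sketch (not the authors' code) of the kind of representational similarity analysis the abstract mentions: it builds representational dissimilarity matrices over concept embeddings and compares two representational spaces via their second-order (Spearman) correlation. All names, shapes, and the random stand-in data are hypothetical assumptions for illustration.

```python
# Minimal RSA sketch (assumed, not the paper's implementation): compare two
# representational spaces, e.g., a vision-and-language encoder vs. a
# human-derived feature space, over the same set of concepts.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(embeddings: np.ndarray) -> np.ndarray:
    """Condensed representational dissimilarity matrix: pairwise cosine
    distances between concept embeddings (one row per concept)."""
    return pdist(embeddings, metric="cosine")

def rsa_score(embeddings_a: np.ndarray, embeddings_b: np.ndarray) -> float:
    """Second-order similarity: Spearman correlation between the upper
    triangles of the two RDMs."""
    rho, _ = spearmanr(rdm(embeddings_a), rdm(embeddings_b))
    return rho

# Hypothetical usage with random stand-ins for model and reference embeddings.
rng = np.random.default_rng(0)
model_emb = rng.normal(size=(50, 512))   # e.g., multimodal encoder outputs
human_emb = rng.normal(size=(50, 300))   # e.g., human-derived feature vectors
print(f"RSA (Spearman rho): {rsa_score(model_emb, human_emb):.3f}")
```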
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning
Supplementary Material: zip