A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models

Published: 01 Feb 2024, Last Modified: 01 Feb 2024. Accepted by TMLR.
Abstract: Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions. However, observing this grounding in contemporary models is complex, even if it is generally expected to take place if the task is addressed in a way that is conducive to generalization. We propose a framework to jointly study task performance and phrase grounding, and propose three benchmarks to study the relation between the two. Our results show that contemporary models demonstrate inconsistency between their ability to ground phrases and solve tasks. We show how this can be addressed through brute-force training on phrase grounding annotations, and analyze the dynamics it creates. Code and data are available at https://github.com/lil-lab/phrase_grounding.
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: - Information for accessing the dataset, code, and models has been added. - "a short discussion on this aspect, e.g. including if the training on one dataset e.g. Flickr 30k and evaluating on Touchdown SDR, could be used for such a setting as part of future work." has been added. - The format has been tailored for the camera-ready version.
Assigned Action Editor: ~Marcus_Rohrbach1
Submission Number: 1533