Language Controls More Than Top-Down Attention: Modulating Bottom-Up Visual Processing with Referring Expressions

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission
Keywords: Referring Expression Understanding, Language-Vision Problems, Grounded Language Understanding
Abstract: How best to integrate linguistic and perceptual processing in multimodal tasks is an important open problem. In this work, we argue that the common technique of using language to direct visual attention over high-level visual features may not be optimal; using language throughout the bottom-up visual pathway, from pixels to high-level features, may be necessary. Our experiments on several English referring expression datasets show significant improvements when language is used to control the filters for bottom-up visual processing in addition to top-down attention.
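The mechanism the abstract describes can be illustrated with a minimal PyTorch-style sketch. This is not the paper's implementation; it assumes a FiLM-style scheme in which an encoded referring expression predicts per-channel scale and shift parameters that modulate convolutional features at every bottom-up stage, alongside a standard language-driven top-down attention map over the final feature grid. All module names, shapes, and hyperparameters below are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageModulatedBlock(nn.Module):
    """One bottom-up stage: conv features scaled and shifted by language (FiLM-style)."""

    def __init__(self, in_ch, out_ch, lang_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # The language embedding predicts per-channel gamma (scale) and beta (shift).
        self.film = nn.Linear(lang_dim, 2 * out_ch)

    def forward(self, x, lang):
        h = self.conv(x)
        gamma, beta = self.film(lang).chunk(2, dim=-1)   # (B, C) each
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)        # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return F.relu(gamma * h + beta)


class ReferringModel(nn.Module):
    """Toy model: language modulates every bottom-up stage and drives top-down attention."""

    def __init__(self, lang_dim=256, channels=(3, 32, 64, 128)):
        super().__init__()
        self.blocks = nn.ModuleList(
            LanguageModulatedBlock(c_in, c_out, lang_dim)
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )
        self.attn_query = nn.Linear(lang_dim, channels[-1])  # top-down attention query
        self.head = nn.Linear(channels[-1], 1)                # e.g., referred-object score

    def forward(self, image, lang):
        h = image
        for block in self.blocks:
            h = block(h, lang)          # bottom-up processing modulated by language
            h = F.max_pool2d(h, 2)
        b, c, ht, wd = h.shape
        feats = h.flatten(2).transpose(1, 2)                 # (B, H*W, C)
        q = self.attn_query(lang).unsqueeze(-1)              # (B, C, 1)
        attn = torch.softmax(feats @ q / c**0.5, dim=1)      # (B, H*W, 1) top-down attention
        pooled = (attn * feats).sum(dim=1)                   # (B, C)
        return self.head(pooled)


if __name__ == "__main__":
    model = ReferringModel()
    image = torch.randn(2, 3, 64, 64)
    lang = torch.randn(2, 256)          # stand-in for an encoded referring expression
    print(model(image, lang).shape)     # torch.Size([2, 1])
```

In this sketch the language signal enters twice: once through the FiLM parameters that gate each convolutional stage (the bottom-up control the abstract argues for), and once through the attention query over the final feature grid (the conventional top-down use).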
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
One-sentence Summary: We modulate both top-down and bottom-up visual processing with referring expressions.
Reviewed Version (pdf): https://openreview.net/references/pdf?id=HCnZWJgNTb