Language Modulated Detection and Detection Modulated Language Grounding in 2D and 3D Scenes

29 Sept 2021 (modified: 13 Feb 2023) · ICLR 2022 Conference Withdrawn Submission
Keywords: Language Grounding, Modulated Object Detection, Attention, Vision and Language
Abstract: To localize an object referent, humans attend to different locations in the scene and to different visual cues depending on the utterance. Existing language-and-vision systems often model such task-driven attention using object-proposal bottlenecks: a pre-trained detector proposes objects in the scene, and the model is trained to selectively process those proposals and predict the answer without attending to the original image. Object detectors are typically trained on a fixed vocabulary of objects and attributes that is often too restrictive for open-domain language grounding, where the utterance may refer to visual entities at various levels of abstraction, such as a cat, the leg of a cat, or the stain on the front leg of the chair. This paper proposes a model that reconciles language grounding and object detection with two main contributions: i) Architectures that exhibit iterative attention across the language stream, the pixel stream, and object detection proposals. In this way, the model learns to condition on easy-to-detect objects (e.g., “table”) and language hints (e.g., “on the table”) to detect harder objects (e.g., “mugs”) mentioned in the utterance. ii) Optimization objectives that treat object detection as language grounding of a large predefined set of object categories. In this way, cheap object annotations supervise our model, yielding performance improvements over models that are not co-trained on both referential grounding and object detection. Our model has a much lighter computational footprint, converges faster, and performs on par with or better than both detection-bottlenecked and non-detection-bottlenecked language-vision models on 2D and 3D language grounding benchmarks.
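As a rough illustration of the two contributions described above, one round of the iterative tri-stream attention and a shared text-to-proposal grounding head could be sketched as follows. This is a minimal sketch only, assuming PyTorch; all module names, dimensions, and the dot-product grounding head are hypothetical and are not taken from the paper itself:

```python
import torch
import torch.nn as nn

class TriStreamAttentionLayer(nn.Module):
    """Hypothetical single round of iterative attention across the
    language stream, the pixel stream, and object-proposal features."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Each stream cross-attends to the concatenation of the other two.
        self.lang_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pix_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.prop_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, lang, pix, prop):
        # lang: (B, L, D) word features; pix: (B, P, D) pixel/patch features;
        # prop: (B, N, D) object-proposal features.
        ctx_lang = torch.cat([pix, prop], dim=1)
        ctx_pix = torch.cat([lang, prop], dim=1)
        ctx_prop = torch.cat([lang, pix], dim=1)
        lang = lang + self.lang_attn(lang, ctx_lang, ctx_lang)[0]
        pix = pix + self.pix_attn(pix, ctx_pix, ctx_pix)[0]
        prop = prop + self.prop_attn(prop, ctx_prop, ctx_prop)[0]
        return lang, pix, prop

def grounding_scores(prop_feats, text_feats):
    # Shared scoring head (assumed): dot-product similarity between
    # proposal features (B, N, D) and text features (B, T, D), where the
    # text tokens are either a referring phrase or, under objective (ii),
    # embedded category names such as "chair" or "mug".
    return torch.einsum("bnd,btd->bnt", prop_feats, text_feats)
```

Under objective (ii), the same scoring head used for referential phrases would also be trained against embedded category names, which is how cheap detection annotations could supervise the grounding model alongside referring expressions.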
One-sentence Summary: We attend over the language stream, the visual stream, and object proposals to detect objects mentioned in the utterance, conditioning on easy-to-detect objects and language hints.
Supplementary Material: zip