A Spoken Language Dataset of Descriptions for Speech-Based Grounded Language Learning

Gaoussou Youssouf Kebe; Padraig Higgins; Patrick Jenkins; Kasra Darvish; Rishabh Sachdeva; Ryan Barron; John Winder; Donald Engel; Edward Raff; Francis Ferraro; Cynthia Matuszek

Published: 29 Jul 2021, Last Modified: 24 May 2023

NeurIPS 2021 Datasets and Benchmarks Track (Round 1)

Readers: Everyone

Keywords: Grounded Language Acquisition, Speech Processing, Computer Vision, Natural Language Processing

TL;DR: We present a multimodal dataset of objects with spoken as well as textual descriptions.

Abstract: Grounded language acquisition is a major area of research combining aspects of natural language processing, computer vision, and signal processing, compounded by domain issues requiring sample efficiency and other deployment constraints. In this work, we present a multimodal dataset of RGB+depth objects with spoken as well as textual descriptions. We analyze the differences between the two types of descriptive language, and our experiments demonstrate that the different modalities affect learning. This dataset will enable researchers studying the intersection of robotics, NLP, and HCI to better investigate how the multiple modalities of image, depth, text, speech, and transcription interact, as well as how differences in the vernacular of these modalities impact results.

Supplementary Material: zip

URL: https://github.com/iral-lab/gold

Contribution Process Agreement: Yes

Dataset URL: https://github.com/iral-lab/gold

Author Statement: Yes

License: Creative Commons Attribution 4.0 International (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
