NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data

Tahir Javed; Kaushal Santosh Bhogale; Mitesh M Khapra

NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data

Tahir Javed, Kaushal Santosh Bhogale, Mitesh M Khapra

28 Sept 2024 (modified: 05 Feb 2025)Submitted to ICLR 2025EveryoneRevisionsBibTeXCC BY 4.0

Keywords: continual learning, speech, recognition, datasets, indian languages, multilingual asr

TL;DR: This paper introduces Nirantar, a large-scale continual learning benchmark for Automatic Speech Recognition, featuring real-world data episodes that incrementally introduce new languages, domains, or both.

Abstract: We present Nirantar based on a large-scale effort to collect extempore and conversational speech data from participants spanning 22 languages across diverse locations in India. Given the extensive number of languages and locations involved, data is collected in incremental batches. Each batch introduces new languages, new domains (locations), or both, creating a practical playground for continual learning (CL). Nirantar contains a total of 3250 hours of human-transcribed speech data covering 208 Indian districts across 22 languages, with 1720 hours newly released as a part of this work. The data inflow and resulting multilingual multi-domain episodes are based on real-world data collection rather than simulated episodes commonly found in existing CL datasets. In particular, the amount of data collected and the number of languages and domains involved are not uniform across episodes, reflecting a practical and real-world continual learning scenario. This dataset serves as a playground for training and evaluating CL approaches in three different scenarios: Language-Incremental (LIL), Domain-Incremental (DIL), and the novel Language-Incremental Domain-Incremental Learning (LIDIL), which has not been studied before. To establish the dataset's usefulness, we evaluate several existing CL approaches within these scenarios. Our findings indicate that the behaviour of these algorithms varies across the three scenarios, emphasizing the need for detailed independent studies of each.

Primary Area: datasets and benchmarks

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 13603

Loading