Keywords: automatic speech recognition, low-resource languages, universal phone recognition
Working Group: WG1: Corpus annotation, WG3: Multilingual and cross-lingual language technology
WG1 Tasks: Task 1.5: Annotation of Spoken data
Abstract: The predominance of English in language technology applications and Natural Language Processing (NLP) (Bender, 2009) has been gaining visibility (Schwartz, 2022), as research strives to become more linguistically diverse and inclusive. Most commonly, a lack of data [for a specific task] (Nigatu et al., 2024) is cited as the reason why some languages are catered to by research and technology, while others (often called “low-resource languages”) are seemingly left behind. However, many NLP technologies also implicitly demand that the training data for current, data-hungry models take the form of standardised, digitised orthographic text. In contrast, many of the world’s languages are primarily (or exclusively) oral and may have either no written form at all, or one that is non-standardised. As a consequence, NLP models often handle variation poorly (Bergmanis et al., 2020), aspects of sociolinguistic variation are missed, and the particularities of spoken language are comparatively under-researched, even though speech is the most common modality for human natural language. We add to the recently emerging studies in speech-to-IPA, which use Automatic Speech Recognition (ASR) models specifically trained to predict phonetic transcriptions in a language-agnostic format. Our aims are to i) take first steps towards making spoken data tangible across more linguistically diverse and low-resource languages, ii) pay greater attention to sociolinguistic detail within languages, and iii) avoid standardising languages or imposing orthographies on them.
WG3 Tasks: Task 3.1: Documentation of multilingual tools and resources
Tracks For Type Of Contribution: Complete work (including previously published work)
Do You Need Visa To Attend The 4th UniDive General Meeting In Romania: No
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 59