Abstract: In this paper, we introduce a novel dataset curated for detecting vulgar content in audio, focusing on two low-resource Indic languages, Hindi and Telugu. Unlike previous work, we propose a new class, \textit{Playful}, which distinguishes vulgar expressions that lack intent to incite hate from more extreme forms. The dataset is sourced from diverse platforms and contains audio recordings featuring potentially offensive or inappropriate language. To evaluate the dataset, we use state-of-the-art models as baselines, which achieve F1 scores of 0.66 for Hindi and 0.58 for Telugu. These results highlight the challenges and opportunities this dataset presents for further research in low-resource language processing.
Paper Type: Short
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: Indic Vulgar Detection, Social Computing, Hate Speech, Speech Processing
Contribution Types: Data resources, Data analysis
Languages Studied: Hindi, Telugu
Submission Number: 51