Abstract: Social communication systems must identify toxic voice audio to support moderation that protects the safety and civility of their communities. Toxicity classification for voice depends on both audio style, such as volume and tone, and content, such as the words spoken, both individually and in context. We introduce a novel end-to-end multi-task learning (MTL) paradigm for audio-based toxicity detection that addresses the challenges of existing automatic speech recognition (ASR) and text-based systems. Employing a hard parameter-sharing backbone with flexible soft-attention task adapters, our model performs two tasks: a multi-label toxicity classification task that targets specific categories of toxic behavior, and an auxiliary audio-to-keyword detection task that transcribes only toxic keywords, thereby improving computational efficiency and complementing the classification output. We observe that the classification task significantly improves the quality of keyword detection. We also contribute a data pipeline for automated offline labeling of training sets.
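As a rough illustration of the architecture the abstract describes, the PyTorch sketch below pairs a hard parameter-sharing backbone with per-task soft-attention adapters and two output heads. Every name, layer choice, dimension, loss weight, and the per-frame formulation of the keyword task is an assumption made for illustration, not the paper's implementation.

```python
# Minimal sketch, assuming log-mel frame inputs and a per-frame keyword-tagging
# formulation of the auxiliary task; not the authors' code.
import torch
import torch.nn as nn


class SoftAttentionAdapter(nn.Module):
    """Per-task adapter: re-weights shared frame features with soft attention."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, dim) shared backbone features
        w = torch.softmax(self.score(h), dim=1)  # attention weights over time
        return h * w                             # task-specific re-weighting


class ToxicityMTL(nn.Module):
    def __init__(self, n_mels: int = 80, dim: int = 256,
                 n_toxicity_labels: int = 6, n_keywords: int = 100):
        super().__init__()
        # Hard parameter-sharing backbone over log-mel frames (assumed input).
        self.backbone = nn.GRU(n_mels, dim, num_layers=2, batch_first=True)
        self.cls_adapter = SoftAttentionAdapter(dim)
        self.kw_adapter = SoftAttentionAdapter(dim)
        # Task 1: multi-label toxicity classification at the utterance level.
        self.cls_head = nn.Linear(dim, n_toxicity_labels)
        # Task 2: auxiliary keyword detection, framed here as per-frame tagging
        # over a closed toxic-keyword vocabulary plus a "none" class.
        self.kw_head = nn.Linear(dim, n_keywords + 1)

    def forward(self, mels: torch.Tensor):
        # mels: (batch, time, n_mels)
        h, _ = self.backbone(mels)
        cls_logits = self.cls_head(self.cls_adapter(h).sum(dim=1))  # (B, labels)
        kw_logits = self.kw_head(self.kw_adapter(h))                # (B, T, kw+1)
        return cls_logits, kw_logits


# Joint training step: weighted sum of the two task losses (weight assumed).
model = ToxicityMTL()
mels = torch.randn(4, 200, 80)                  # dummy batch of log-mel frames
toxicity = torch.randint(0, 2, (4, 6)).float()  # multi-label toxicity targets
keywords = torch.randint(0, 101, (4, 200))      # per-frame keyword ids
cls_logits, kw_logits = model(mels)
loss = (nn.functional.binary_cross_entropy_with_logits(cls_logits, toxicity)
        + 0.5 * nn.functional.cross_entropy(kw_logits.transpose(1, 2), keywords))
loss.backward()
```

Sharing the backbone lets the keyword task act as a regularizer on the classifier and vice versa, which is consistent with the abstract's observation that the classification task improves keyword detection.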