Abstract: Content moderation (CM) systems have become essential following the monumental growth of multimodal and online social platforms; yet while an increasing body of published work focuses on text-based solutions, work on audio-based methods remains limited. In this study we explore relationships between speech emotions and toxic speech as part of a CM scenario. We first investigate an appropriate framework for combining speech emotion recognition (SER) and audio-based CM models. We then investigate which emotional aspects (i.e., attribute, sentiment, or attitude) could contribute the most to facilitating audio-based CM recognition platforms. Our experimental results indicate that conventional shared-feature-encoder approaches may fail to capture additional discriminative features for boosting audio-based CM tasks when leveraging SER learning. We further investigate the performance trade-offs of late-fusion frameworks for combining SER and CM information. We argue that these observations can be attributed to an emotionally biased distribution in the CM scenario, and we conclude that SER could indeed play a role in content moderation frameworks, given added application-specific emotional information.