Abstract: Computational recognition of verbal humour remains a challenging task, requiring an understanding of language, delivery style, emotions, and cultural context. Most existing approaches focus on binary classification and are limited by a lack of datasets that capture the psychological dimensions of humour alongside variations in its expression. We introduce MultiHuSE, a multimodal dataset comprising 2,407 high-definition videos of 50 demographically diverse actors performing 1,463 text samples across four psychological humour styles (affiliative, aggressive, self-enhancing, and self-deprecating), as well as neutral content. A subset is additionally annotated for underlying emotions. The dataset uniquely captures multiple actor interpretations of the same texts, enabling systematic analysis of expressive diversity. Baseline experiments show that multimodal fusion outperforms unimodal approaches (80.1% vs. 77.4% accuracy) in humour style classification, with particularly strong gains for affiliative humour (66% to 74%). While text provides the strongest individual signal, fusion models deliver meaningful improvements. We hope that MultiHuSE provides empirical support for psychological theories linking humour and emotion, while also opening new avenues for research in human communication, well-being, and AI-driven interaction. The dataset is available for academic use under an End-User Licence Agreement.
External IDs: dblp:conf/cbmi/KennethKE25