Abstract: As large language models (LLMs) are increasingly used in creative settings such as storytelling, journalism, and even comedy, ensuring they do not propagate harmful stereotypes or toxicity has become a central safety concern. While past research has focused on evaluating models' overt preferences for stereotypes and toxicity, we improve upon this by devising an evaluation task based on humor generation, which sets the stage for subtle attempts at injecting harmful elements. To understand how deeply such behaviours are embedded, we investigate how modern large language model pipelines and commonly used metrics prefer humor that leans on stereotypes and toxicity. We observe that LLMs can exploit stereotypes and toxicity to sound funnier when asked to create humor. Our evaluations show a rise of $10$--$21$ percent in mean humor score for stereotypical and toxic jokes, indicating that current metrics prefer such content. A separate, LLM-based metric showed stereotypical jokes to hold $11$ and $28$ percent higher relative proportions than harmless jokes within the subset of funniest jokes. We also observe a $5$ percentage point amplification of stereotypical and toxic generations from role-assigned LLMs asked to ``talk like a comedian'', for example, Robin Williams or Bill Cosby. Our findings highlight risks of LLM-driven humor generation, and of its broader use for engagement in the creative industry, and call for more nuanced safety interventions.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Humor, AI Safety
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 7961