Abstract: As large language models (LLMs) are increasingly used in creative settings such as storytelling, journalism, and even comedy, ensuring they do not propagate harmful stereotypes or toxicity has become a central safety concern. While past research has focused on evaluating models' overt preferences for stereotypes and toxicity, we improve upon this by devising an evaluation task based on humor generation, which sets the stage for subtle attempts at injecting harmful elements. To understand how deeply such behaviours are embedded, we investigate how modern large language model pipelines and commonly used metrics prefer humor that leans on stereotypes and toxicity. We observe that LLMs can exploit stereotypes and toxicity to sound funnier when asked to create humor. Our evaluations show a rise of $10$--$21$ percent in mean humor score for stereotypical and toxic jokes, indicating that current metrics prefer such content. A separate, LLM-based metric showed stereotypical jokes to hold $11$ and $28$ percent higher relative proportions than harmless jokes within the subset of funniest jokes. We also observe a $5$ percentage point amplification of stereotypical and toxic generations from role-assigned LLMs asked to ``talk like a comedian'', for example, Robin Williams or Bill Cosby. Our findings highlight risks of LLM-driven humor generation, and of its broader use for engagement in the creative industry, and call for more nuanced safety interventions.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Humor, AI Safety
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 7961