Not Funny Anymore: LLM Judges Confuse Literal Similarity for Humor in Translated Jokes

Published: 14 Dec 2025, Last Modified: 09 Jan 2026 · LM4UC@AAAI2026 · CC BY 4.0
Keywords: Translation Quality Evaluation, Humor Translation, LLM-as-a-Judge, Cross-Lingual Evaluation, Literalness Bias, Semantic Alignment, Multilingual NLP
Abstract: Automatic humor translation is a challenging task that is also difficult to evaluate. Reference-based metrics struggle to assess humor preservation in joke translation, often rewarding literal similarity over preserved comedic effect, and they require costly manually produced gold reference translations. In this work, we study reference-free humor translation evaluation, analyzing the performance of LLM judges across 7 models on 162 English-to-Chinese joke pairs with 5-point Likert scale human annotations. We find that these judges struggle, with strict agreement often near or even below the 20% random baseline. To better understand this limitation, we test the hypothesis that these judges over-attend to literalness as a signal of quality by introducing a correlation-based literalness metric computed in a multilingual embedding space. This analysis demonstrates quantitatively that poor LLM-evaluator performance is in fact driven by an over-literal bias, suggesting that future metrics which explicitly account for literalness could close this gap.
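The sketch below illustrates the kind of embedding-based literalness analysis the abstract describes: scoring each joke pair by cosine similarity in a multilingual embedding space and correlating those scores with judge ratings. The embedding model (LaBSE), the placeholder data, and all helper names are illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of a correlation-based literalness analysis.
# Assumptions (not from the paper): LaBSE as the multilingual encoder,
# Spearman correlation, and the toy placeholder data below.
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def literalness(source: str, translation: str) -> float:
    """Cosine similarity between source and translation embeddings,
    used here as a proxy for how literal the translation is."""
    emb = model.encode([source, translation], normalize_embeddings=True)
    return float(np.dot(emb[0], emb[1]))

# Placeholder joke pairs and 1-5 Likert scores from an LLM judge;
# in practice these would be the 162 English-to-Chinese pairs and judge outputs.
pairs = [
    ("source joke 1", "译文 1"),
    ("source joke 2", "译文 2"),
    ("source joke 3", "译文 3"),
]
judge_scores = [4, 2, 5]

lit_scores = [literalness(src, tgt) for src, tgt in pairs]
rho, p = spearmanr(judge_scores, lit_scores)
print(f"Spearman correlation between judge scores and literalness: {rho:.2f} (p={p:.3f})")
```

A strong positive correlation here would indicate that the judge's quality ratings track literal similarity rather than humor preservation, which is the over-literal bias the abstract refers to.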
Submission Number: 37