Keywords: Interactive dialogue evaluation, multilingual LLM dialogue, human-like LLM, evaluation and metrics
Abstract: Current Large Language Model (LLM) evaluations rely heavily on static benchmarks, often failing to capture the interaction essential for human-like communication in multi-turn, continuous human-LLM conversations. We introduce a novel evaluation framework grounded in the Common European Framework of Reference for Languages (CEFR) and in Social Relationship and Power Distance (SRPD) interaction in social communication to evaluate multilingual dialogue interactions. Unlike static metrics, our approach analyzes emergent behaviors, such as repair and alignment, in dynamic, multi-turn interactions without manual annotation. Validated across 18 diverse languages, from high-resource (e.g., Spanish, French, English) to low-resource (e.g., Bengali, Thai, Swahili), the framework aligns with established static baseline results while uncovering critical behavioral nuances in lower-resource settings that static evaluations miss. This work provides a scalable methodology for measuring how effectively models adapt to users' languages and domain-specific social contexts through more dynamic interaction evaluations.
Paper Type: Long
Research Area: Computational Social Science, Cultural Analytics, and NLP for Social Good
Research Area Keywords: human-like LLM, dialogue evaluation, multilingual LLM cultural analysis, multilingual dialogue evaluation
Contribution Types: Model analysis & interpretability, Reproduction study, Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Data resources, Data analysis
Languages Studied: Arabic, English, Portuguese, French, Italian, Turkish, Hindi, Mandarin, Japanese, Vietnamese, Thai, Swahili, Bengali, Indonesian, Spanish, Yoruba
Submission Number: 9321