Keywords: LLM reasoning, quiz games
TL;DR: We evaluate the ability of LLaMA3-405B to play the intellectual game "What?Where?When?" in Russian
Abstract: Quiz games are a type of intellectual competition well suited for testing the reasoning and problem-solving skills of LLMs. Indeed, a good quiz puzzle requires not only factual knowledge, but also the ability to analyse the clues given in the question,
generate hypotheses, and choose the best one using logical reasoning and subtle hints.
Recently, modern LLMs have made significant progress on general reasoning tasks, making this kind of evaluation extremely interesting. In this paper, we address a major limitation of current LLM assessment: models are usually evaluated in English, or on multilingual benchmarks that reflect English-centric culture, obtained by translation from English originals. In contrast, we test the ability of a modern LLM to handle questions from real human quiz games of a non-English-speaking society. Namely, we apply LLaMA3-405B to solve quiz tasks created by the Russian-speaking "What?Where?When?" intellectual gaming community.
First, we show that, although the LLM demonstrates strong reasoning and linguistic proficiency in Russian, its performance diminishes significantly because of poor knowledge of culture-specific facts. Second, we show the importance of the choice of reasoning strategy for answering medium-difficulty questions, for which the model possesses the necessary knowledge but cannot give the correct answer immediately. Evaluating several single- and multi-agent approaches, we obtain a 6\% improvement in overall accuracy compared to the baseline step-by-step reasoning.
Submission Number: 42