Injecting Knowledge from Social Science Journals to Improve Indonesian Cultural Understanding by LLMs

ACL ARR 2026 January Submission10462 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: Cultural understanding of LLM, Indonesia, Indonesian
Abstract: Efforts to improve the understanding of Indonesian cultures by large language models (LLMs) have recently intensified. An attractive but largely overlooked source of cultural knowledge is local social science journals, which likely contain substantial cultural studies written from a native perspective. We present IndoSoSci, a novel text dataset of journal article passages created from 151 open-source Indonesian social science journals. We demonstrate an effective recipe for injecting the Indonesian cultural knowledge in this dataset into LLMs: extracting the facts related to Indonesian culture, then applying retrieval-augmented generation (RAG) with LLM-generated hypothetical documents as queries during retrieval. The proposed recipe yields substantial performance gains over several strong baselines on the IndoCulture benchmark. Additionally, by combining IndoSoSci with Indonesian Wikipedia, we set a new state-of-the-art accuracy on the IndoCulture benchmark.
Paper Type: Long
Research Area: Multilinguality and Language Diversity
Research Area Keywords: multilingual benchmarks, less-resourced languages, resources for less-resourced languages, multilingualism
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: Indonesian
Submission Number: 10462
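
The retrieval step described in the abstract (using an LLM-generated hypothetical document as the query, rather than the raw question) can be illustrated with a minimal sketch. Everything below is an assumption for illustration and is not taken from the paper: the function names, the bag-of-words cosine similarity (a real system would more likely use a dense embedding model), the stubbed generator standing in for an LLM call, and the three toy facts, which are not drawn from IndoSoSci.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_with_hypothetical_doc(
    question: str,
    facts: List[str],
    generate: Callable[[str], str],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Rank facts by similarity to a generated hypothetical answer document."""
    # The hypothetical document, not the question itself, serves as the query.
    hypothetical = generate(question)
    query_vec = Counter(tokenize(hypothetical))
    scored = [(cosine(query_vec, Counter(tokenize(f))), f) for f in facts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question: str, retrieved: List[Tuple[float, str]]) -> str:
    """Assemble a RAG prompt from the retrieved facts and the question."""
    context = "\n".join(f"- {fact}" for _, fact in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy fact store (illustrative only, not from IndoSoSci).
FACTS = [
    "Nyepi is the Balinese day of silence, marking the Saka new year.",
    "Rendang is a slow-cooked meat dish from the Minangkabau people of West Sumatra.",
    "Batik is a wax-resist dyeing technique widely practised on Java.",
]


def fake_generate(question: str) -> str:
    """Hypothetical stand-in for an LLM call that drafts a plausible answer."""
    return (
        "On Nyepi, the Balinese day of silence, people in Bali stay at home "
        "and refrain from work and travel."
    )


QUESTION = "What do people in Bali do on Nyepi?"
top = retrieve_with_hypothetical_doc(QUESTION, FACTS, fake_generate, k=2)
prompt = build_prompt(QUESTION, top)
```

The intuition behind querying with a hypothetical document is that a drafted answer tends to share more vocabulary and phrasing with the stored facts than a short question does, even when the draft itself is partly wrong. The retrieved facts, not the draft, are what end up in the final prompt.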