Adapting Chat Language Models Using Only Target Unlabeled Language Data

Atsuki Yamaguchi; Terufumi Morishita; Aline Villavicencio; Nikolaos Aletras

Published: 12 Oct 2025, Last Modified: 12 Oct 2025. Accepted by TMLR. License: CC BY 4.0

Abstract: Vocabulary expansion (VE) is the de facto approach to language adaptation of large language models (LLMs): it adds new tokens and continues pre-training on target data. While this is effective for base models trained on unlabeled data, it poses challenges for chat models trained to follow instructions through labeled conversation data. Directly adapting a chat model with VE on target unlabeled data may cause it to forget its chat abilities. Target chat data would be ideal, but it is often unavailable or costly to create for low-resource languages, and machine-translated alternatives are not always effective. To address this issue, previous work proposed using a base and chat model from the same family. This method first adapts the base LLM with VE on target unlabeled data and then converts it to a chat model by adding a chat vector (CV) derived from the weight difference between the source base and chat models. We propose ElChat, a new language adaptation method for chat LLMs that adapts a chat model directly on target unlabeled data, without a base model. It elicits chat abilities by injecting information from the source chat model. ElChat offers more robust and competitive target language and safety performance while achieving superior English, chat, and instruction-following abilities compared to CV.
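
The chat vector (CV) baseline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: weights are represented as flat Python lists rather than tensors, the names `chat_vector` and `apply_chat_vector` are hypothetical, and leaving parameters whose size changed under VE (such as an expanded embedding matrix) untouched is an assumption made only for illustration. ElChat itself is not sketched here, because the abstract does not specify how information from the source chat model is injected.

```python
from typing import Dict, List

# Toy stand-in for model weights: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def chat_vector(base: Weights, chat: Weights) -> Weights:
    """Per-parameter difference between source chat and source base weights."""
    return {
        name: [c - b for c, b in zip(chat[name], base[name])]
        for name in base
    }


def apply_chat_vector(adapted_base: Weights, cv: Weights) -> Weights:
    """Add the chat vector to a base model adapted with VE."""
    result: Weights = {}
    for name, values in adapted_base.items():
        delta = cv.get(name)
        if delta is None or len(delta) != len(values):
            # Assumption: parameters resized by VE are copied unchanged.
            result[name] = list(values)
        else:
            result[name] = [v + d for v, d in zip(values, delta)]
    return result


# Toy weights; "embed" gains one entry after vocabulary expansion.
source_base = {"layer.w": [1.0, 2.0], "embed": [0.5, 0.5]}
source_chat = {"layer.w": [1.5, 1.0], "embed": [0.5, 1.0]}
adapted_base = {"layer.w": [2.0, 2.0], "embed": [0.25, 0.75, 0.125]}

cv = chat_vector(source_base, source_chat)
adapted_chat = apply_chat_vector(adapted_base, cv)
```

In this toy setup, `layer.w` receives the chat-minus-base difference, while the resized `embed` entry is carried over from the adapted base model as-is.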

Submission Length: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: Camera-ready submission with updated references. On the definition of the "chat model": the camera-ready version gives the definition in the Abstract (L3-L4), the second paragraph of the Introduction, and the caption of Figure 1. On the motivation for VE given LLMs with large vocabularies: the camera-ready version explicitly states that (i) recent LLMs often have large vocabularies (e.g., 152K for Qwen2.5), but (ii) they still require substantially more inference steps for underrepresented languages such as Amharic than for high-resource languages (e.g., English), so VE is needed to mitigate this issue and achieve inference speedups.

Code: https://github.com/gucci-j/chat-cve

Assigned Action Editor: ~Ruoyu_Sun1

Submission Number: 4876
