Stealing Brains: From English to Czech Language Model

Published: 01 Jan 2024 · Last Modified: 29 Jul 2025 · IJCCI 2024 · CC BY-SA 4.0
Abstract: We present a simple approach for efficiently adapting pre-trained English language models to generate text in a lower-resource language, specifically Czech. We propose a vocabulary swap method that leverages parallel corpora to map tokens between languages, allowing the model to retain much of its learned capabilities. Experiments conducted on a Czech translation of the TinyStories dataset demonstrate that our approach significantly outperforms baseline methods, especially when using small amounts of training data. With only 10% of the data, our method achieves a perplexity of 17.89, compared to 34.19 for the next best baseline. This work contributes to research on cross-lingual transfer in natural language processing by proposing a simple-to-implement, computationally efficient method evaluated in a controlled environment.
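The abstract does not spell out the vocabulary swap procedure, so the sketch below is only one plausible reading of it: learn a target-to-source token mapping from aligned sentence pairs via a crude co-occurrence heuristic, then initialise the target-language embedding matrix by copying the mapped source rows. The function names (`learn_token_map`, `swap_embeddings`), the whitespace tokenizers, and the co-occurrence heuristic are all illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch of a vocabulary swap; all details are assumptions.
from collections import Counter, defaultdict
import numpy as np

def learn_token_map(parallel_pairs, src_tokenize, tgt_tokenize):
    """Map each target-language token to the source token it most often
    co-occurs with across aligned sentence pairs (a crude alignment proxy)."""
    cooc = defaultdict(Counter)
    for src_sent, tgt_sent in parallel_pairs:
        src_tokens = set(src_tokenize(src_sent))
        for tok in tgt_tokenize(tgt_sent):
            cooc[tok].update(src_tokens)
    return {tok: counts.most_common(1)[0][0] for tok, counts in cooc.items()}

def swap_embeddings(src_emb, src_vocab, tgt_vocab, token_map):
    """Initialise the target embedding matrix by copying the row of each
    mapped source token; unmapped tokens fall back to the mean embedding."""
    tgt_emb = np.tile(src_emb.mean(axis=0), (len(tgt_vocab), 1))
    for tok, idx in tgt_vocab.items():
        mapped = token_map.get(tok)
        if mapped in src_vocab:
            tgt_emb[idx] = src_emb[src_vocab[mapped]]
    return tgt_emb

# Toy usage: whitespace "tokenizers" and a two-pair parallel corpus.
pairs = [("the dog runs", "pes běží"), ("the cat runs", "kočka běží")]
token_map = learn_token_map(pairs, str.split, str.split)
src_vocab = {w: i for i, w in enumerate("the dog cat runs".split())}
tgt_vocab = {w: i for i, w in enumerate("pes kočka běží".split())}
src_emb = np.random.default_rng(0).normal(size=(len(src_vocab), 8))
tgt_emb = swap_embeddings(src_emb, src_vocab, tgt_vocab, token_map)
print(token_map)
```

On toy data like this the co-occurrence counts are too sparse to be meaningful; in practice one would expect a sizeable parallel corpus and the model's real subword tokenizers, after which the swapped embeddings serve as the initialisation for continued training on the target language.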