Research Area: Data, Science of LMs, Engineering for large LMs
Keywords: Chinese LLM, pretrain, alignment
TL;DR: CT-LLM is a groundbreaking 2B large language model (LLM) pretrained from scratch primarily on Chinese data, illustrating a pivotal shift toward prioritizing the Chinese language in LLM development.
Abstract: In this study, we introduce $\textbf{CT-LLM}$, a groundbreaking 2B large language model (LLM) that illustrates a pivotal shift toward prioritizing the Chinese language in the development of LLMs. Trained from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, drawing on an extensive corpus of 1,200 billion tokens, including 800 billion Chinese tokens and 400 billion English tokens. This strategic composition gives the model exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques including supervised fine-tuning (SFT) and direct preference optimization (DPO). Demonstrating remarkable performance on the Chinese Hard Case Benchmark, CT-LLM not only excels in Chinese language tasks but also shows strong English ability after SFT. This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages, broadening the horizons for LLM training methodologies. By open-sourcing the full training process of CT-LLM, we aim to foster further exploration and innovation in both academia and industry, paving the way for more inclusive and versatile language models in the future.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 385