# TransLLM: Why Not Transform Chat Large Language Models to Non-English?

## Data
Because the maximum file size is 100 MB, we only provide the following data:
- Recovery KD data in English: ./code/train/distil_alpaca_en_52k_llama2-7b-chat.json
- Recovery KD data in Thai: ./code/train/distil_alpaca_en_52k_llama2-7b-chat_th_googlemt.json
- Alpaca-GPT-4 data in English: ./code/train/alpaca_gpt4_data_en.json
- Alpaca-GPT-4 data in Thai: ./code/train/alpaca_gpt4_data_th_googlemt.json
- MT-Bench in Thai: ./code/test/mt_bench_question.xlsx
- Alpaca-Eval in Thai: ./code/test/alpaca_eval.xlsx
- Example data format of experiments: ./code/train/example
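
Before training, it can help to sanity-check the JSON files. The following is a minimal sketch, assuming the files are JSON lists of Alpaca-style records with `instruction`, `input`, and `output` keys; that layout is an assumption and should be checked against the files themselves. The helper names `load_alpaca_records` and `find_malformed` are hypothetical and not part of this repository.

```python
import json

# Assumed Alpaca-style keys; verify against the actual files before relying on this.
EXPECTED_KEYS = {"instruction", "input", "output"}


def load_alpaca_records(path):
    """Load a JSON file that holds a list of instruction records."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(f"{path}: expected a JSON list of records")
    return records


def find_malformed(records):
    """Return the indices of records that lack any of the assumed keys."""
    return [
        i for i, rec in enumerate(records)
        if not isinstance(rec, dict) or not EXPECTED_KEYS <= rec.keys()
    ]
```

For example, `find_malformed(load_alpaca_records("./code/train/alpaca_gpt4_data_en.json"))` would list the positions of any records that do not match the assumed layout.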

## Training

### Model Extension
Use SentencePiece to learn the Thai vocabulary on mc4-TH. Merge the vocabulary as described in Chinese-LLaMA-Alpaca-2.
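
The idea behind the merge can be sketched on plain Python lists. This is only an illustration of the principle, assuming the merge appends Thai pieces that the base vocabulary lacks; the actual merge works on SentencePiece model files, and the function name `merge_vocab` and the toy pieces below are hypothetical.

```python
def merge_vocab(base_pieces, new_pieces):
    """Append pieces from new_pieces that are not already in base_pieces.

    Existing pieces keep their positions (and hence their token ids), so the
    base model's embeddings stay valid and only new rows need to be added.
    """
    seen = set(base_pieces)
    merged = list(base_pieces)
    for piece in new_pieces:
        if piece not in seen:
            seen.add(piece)
            merged.append(piece)
    return merged


# Toy vocabularies for illustration only.
base = ["<unk>", "<s>", "</s>", "▁the", "▁a"]
thai = ["▁the", "สวัสดี", "ครับ", "สวัสดี"]
merged = merge_vocab(base, thai)
print(len(merged) - len(base), "new pieces added")
```

Because the vocabulary grows, the model's embedding and output layers must be resized to the merged vocabulary size before pre-training.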

### Target Language Pre-Training

- Prepare mc4-TH in txt format, and the target chat model (such as llama2-chat-7b-hf).
- Change the data path and model path in ./train/run_pt_1.sh.
- Run run_pt_1.sh.

### Translation Pre-Training
- Prepare Pile data and EN-TH parallel data in txt format.
- Change the data path and model path in ./train/run_pt_2.sh.
- Run run_pt_2.sh.

### Transfer Fine-Tuning
- Translate the Recovery KD data to Thai, and organize the TCOT data and SFT Translation data.
- Change the data path and model path in ./train/run_sft.sh.
- Run run_sft.sh.
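
The three training stages above are meant to be run in order. A minimal sketch of a driver follows, assuming the scripts are invoked with `bash` from the directory that contains `./train` and that the data and model paths inside each script have already been edited by hand. The `run_stages` helper is hypothetical and not part of this repository.

```python
import subprocess

# Script paths as named in this README; the stage labels are illustrative.
STAGES = [
    ("target language pre-training", "./train/run_pt_1.sh"),
    ("translation pre-training", "./train/run_pt_2.sh"),
    ("transfer fine-tuning", "./train/run_sft.sh"),
]


def run_stages(stages=STAGES, dry_run=True):
    """Run each stage script in order, stopping at the first failure.

    With dry_run=True, only return the commands that would be executed.
    """
    commands = [["bash", script] for _, script in stages]
    if not dry_run:
        for (name, _), cmd in zip(stages, commands):
            print(f"Running stage: {name}")
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return commands
```

Calling `run_stages()` returns the planned commands without executing anything; `run_stages(dry_run=False)` actually launches the scripts.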

## Evaluation

We provide the following scripts for evaluation:
- Merge the LoRA model: ./Chinese-LLaMA-Alpaca-2/scripts/merge_llama2_with_chinese_lora_low_mem.py
- Generate output for mt_bench: ./eval/mt_bench_generate.py
- Generate output for alpaca_eval: ./eval/alpaca_eval_generate.py
- Generate GPT-4 evaluations: ./eval/gpt4_eval.py

## Notice
We have modified some files in ./Chinese-LLaMA-Alpaca-2/scripts/training.

## License
The code and data are released under the Apache License 2.0.