XLM supports multi-GPU and multi-node training, and contains code for:
Open the illustrative notebook in colab
Note : Most of the bash scripts used in this repository were written on the Windows operating system, and their carriage returns can generate errors (e.g. $'\r': command not found) on Linux platforms.
This problem can be corrected with the following command:
filename=my_file.sh
cat $filename | tr -d '\r' > $filename.new && rm $filename && mv $filename.new $filename
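Alternatively (assuming GNU sed or the dos2unix utility is available on your system), the same fix can be applied in place:
sed -i 's/\r$//' $filename   # strip the Windows carriage returns in place
# or, if dos2unix is installed
dos2unix $filename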
At this stage, if you already have pre-processed binary data in pth format (for example from an XLM experiment, or produced by yourself), group the files in a dedicated folder and pass that folder as a parameter when calling the train.py script.
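For example (a minimal sketch; the path is a placeholder, and the remaining train.py parameters are described further below):
# assuming your pre-processed .pth files are grouped in this folder
data_path=/content/data/processed
python train.py --data_path $data_path   # plus the other train.py parameters described below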
If this is not the case, we assume that you have txt files available for preprocessing. Look at the following example, in which we have three translation tasks: English-French, German-English and German-French.
We have the following files available for preprocessing:
- en-fr.en.txt and en-fr.fr.txt
- de-en.de.txt and de-en.en.txt
- de-fr.de.txt and de-fr.fr.txt
All these files must be in the same folder (PARA_PATH).
You can also (only or optionally) have monolingual data available (en.txt, de.txt and fr.txt, in the MONO_PATH folder).
Parallel and monolingual data can all be in the same folder.
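As an illustration (the paths below are placeholders), the expected layout is:
PARA_PATH=/content/data/para   # en-fr.en.txt, en-fr.fr.txt, de-en.de.txt, de-en.en.txt, de-fr.de.txt, de-fr.fr.txt
MONO_PATH=/content/data/mono   # en.txt, de.txt, fr.txt (optional)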
Note : Languages must be specified in alphabetical order (de-en and not en-de, fr-ru and not ru-fr, ...). If you submit them in another order you will have problems loading the data during training, because when you run the train.py script parameters such as the language pair are put back in alphabetical order before being processed. Don't worry about this alphabetical-order restriction: XLM for MT is naturally trained to translate sentences in both directions. See translate.py.
OPUS collections are a good source of datasets. The opus.sh script illustrates how to download data from OPUS and convert it to text files. Change the parameters ($PARA_PATH and $SRC) in opus.sh, then run:
cd meta_XLM
chmod +x ./scripts/opus.sh
./scripts/opus.sh de-fr
Another source for other_languages-english data is Anki's Tab-delimited Bilingual Sentence Pairs. Simply download the .zip file and unzip it to extract the other_language.txt file. This file usually contains data in the form sentence_en sentence_other_language other_information on each line. See anki.py and anki.sh for how to extract data from Anki. Example of how to download and extract de-en and en-fr pair data:
cd meta_XLM
output_path=/content/data/para
mkdir $output_path
chmod +x ./scripts/anki.sh
./scripts/anki.sh de,en deu-eng $output_path scripts/anki.py
./scripts/anki.sh en,fr fra-eng $output_path scripts/anki.py
After that you will have the following files in data/para : de-en.de.txt, de-en.en.txt, deu.txt, deu-eng.zip and _about.txt.
Move to the XLM folder before proceeding.
cd XLM
Install the following dependencies (fastBPE and Moses) if you have not already done so.
git clone https://github.com/moses-smt/mosesdecoder tools/mosesdecoder
git clone https://github.com/glample/fastBPE tools/fastBPE && cd tools/fastBPE && g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
Change the parameters in data.sh. Between lines 94 and 100 of data.sh you have two options, corresponding to two scripts to execute depending on how the folders containing your data are organised. Option 2 is chosen by default; kindly uncomment the lines corresponding to your option.
With too many BPE codes (relative to the size of the dataset) you may get an error when applying BPE. Decrease the number of codes (e.g. you can search dichotomously for the largest number of codes that makes the error disappear).
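For example (a sketch; the name of the variable holding the number of BPE codes inside data.sh may differ in your copy, so adapt it accordingly):
# hypothetical : halve the number of BPE codes until the error disappears
codes=50000   # then try 25000, 12500, ... and keep the largest value that works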
languages=de,en,fr
chmod +x ../data.sh
../data.sh $languages
If you stop execution while a file is being processed, please delete that erroneous file before continuing or restarting the processing; otherwise the processing will continue from the erroneous file and errors will certainly occur.
After this you will have the following (necessary) files in $OUTPATH (and in $OUTPATH/fine_tune, depending on the $sub_task parameter):
- monolingual data :
- training data : train.fr.pth, train.en.pth and train.de.pth
- test data : test.fr.pth, test.en.pth and test.de.pth
- validation data : valid.fr.pth, valid.en.pth and valid.de.pth
- parallel data :
- training data :
- train.en-fr.en.pth and train.en-fr.fr.pth
- train.de-en.en.pth and train.de-en.de.pth
- train.de-fr.de.pth and train.de-fr.fr.pth
- test data :
- test.en-fr.en.pth and test.en-fr.fr.pth
- test.de-en.en.pth and test.de-en.de.pth
- test.de-fr.de.pth and test.de-fr.fr.pth
- validation data
- valid.en-fr.en.pth and valid.en-fr.fr.pth
- valid.de-en.en.pth and valid.de-en.de.pth
- valid.de-fr.de.pth and valid.de-fr.fr.pth
- code and vocab
To use the biblical corpus, run bible.sh instead of data.sh. Here is the list of languages available (and to be specified as the $languages value) in this case : Francais, Anglais, Fulfulde_Adamaoua or Fulfulde_DC (formal name : Fulfulde), Bulu, KALATA_KO_SC_Gbaya or KALATA_KO_DC_Gbaya (formal name : Gbaya), BIBALDA_TA_PELDETTA (formal name : MASSANA), Guiziga, Kapsiki_DC (formal name : Kapsiki), Tupurri, Bafia, Ejagham, Ghomala, MKPAMAN_AMVOE_Ewondo (formal name : Ewondo), Ngiemboon, Dii, Vute, Limbum, Mofa, Mofu_Gudur, Doyayo, Guidar, Peere_Nt&Psalms, Samba_Leko, Du_na_sdik_na_wiini_Alaw.
The csv_path parameter must point to the folder containing a folder named csvs. Here is the drive link of its zipped version.
Note : bible.sh uses the first four letters of each language name when creating the files (Bafi instead of Bafia for example), except KALATA_KO_SC_Gbaya/KALATA_KO_DC_Gbaya which becomes Gbay (first letters of Gbaya), BIBALDA_TA_PELDETTA which becomes MASS (first letters of MASSANA), MKPAMAN_AMVOE_Ewondo which becomes Ewon (first letters of Ewondo), and Francais and Anglais which become respectively fr and en. Indeed, bible.sh uses these abbreviations to create the files and not the language names themselves.
In the case of a single language, use languages=Bafia,Bafia instead of languages=Bafia.
Install the following dependency (Apex) if you have not already done so.
git clone https://github.com/NVIDIA/apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex
Instead of passing all the parameters of train.py on the command line, you can put them in a json file and give the path to this file as a parameter (see the lm_template.json file for more details).
config_file=../configs/lm_template.json
python train.py --config_file $config_file
If you pass a parameter when calling the train.py script (example: python train.py --config_file $config_file --data_path my/data_path), it will overwrite the one given in $config_file.
Once training is finished you will find a file named train.log in the $dump_path/$exp_name/$exp_id folder containing information about the training. You will also find your checkpoints and best model in this same folder.
When "mlm_steps":"..."
, train.py automatically uses the languages to have "mlm_steps":"de,en,fr,de-en,de-fe,en-fr"
(give a precise value to mlm_steps if you don't want to do all MLM and TLM, example : "mlm_steps":"en,fr,en-fr"
). This also applies to "clm_steps":"..."
which deviates "clm_steps":"de,en,fr"
in this case.
Note :
- en means MLM on en, and requires the following three files in data_path : a.en.pth, a ∈ {train, test, valid} (monolingual data)
- en-fr means TLM on en and fr, and requires the following six files in data_path : a.en-fr.b.pth, a ∈ {train, test, valid} and b ∈ {en, fr} (parallel data)
- en,fr,en-fr means MLM on en and fr plus TLM on en-fr, and requires the following twelve files in data_path : a.b.pth and a.en-fr.b.pth, a ∈ {train, test, valid} and b ∈ {en, fr}
To train with multiple GPUs use:
export NGPU=8; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py --config_file $config_file
Tips: Even when the validation perplexity plateaus, keep training your model. The larger the batch size the better (so using multiple GPUs will improve performance). Tuning the learning rate (e.g. [0.0001, 0.0002]) should help.
## main parameters
exp_name # experiment name
exp_id # Experiment ID
dump_path # where to store the experiment (the model will be stored in $dump_path/$exp_name/$exp_id)
## data location / training objective
data_path # data location
lgs # considered languages/meta-tasks
clm_steps # CLM objective
mlm_steps # MLM objective
## transformer parameters
emb_dim # embeddings / model dimension
n_layers # number of layers
n_heads # number of heads
dropout # dropout
attention_dropout # attention dropout
gelu_activation # GELU instead of ReLU
## optimization
batch_size # sequences per batch
bptt # sequences length
optimizer # optimizer
epoch_size # number of sentences per epoch
max_epoch # maximum number of epochs
validation_metrics # validation metric (when to save the best model)
stopping_criterion # end experiment if stopping criterion does not improve
## dataset
#### These three parameters will always be rounded to an integer number of batches, so don't be surprised if you see different values than the ones provided.
train_n_samples # only consider train_n_samples training samples
valid_n_samples # only consider valid_n_samples validation samples
test_n_samples # only consider test_n_samples test samples
#### If you don't have enough RAM/GPU or swap memory, leave the following three parameters set to True, otherwise you may get an error like this when evaluating :
###### RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
remove_long_sentences_train # remove long sentences in train dataset
remove_long_sentences_valid # remove long sentences in valid dataset
remove_long_sentences_test # remove long sentences in test dataset
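For illustration (a minimal sketch with placeholder values to adapt to your setup; my_mlm is a hypothetical experiment name), the main parameters above map directly onto train.py command-line arguments:
python train.py --exp_name my_mlm --dump_path ./dumped --data_path $OUTPATH \
  --lgs de-en-fr --clm_steps '' --mlm_steps de,en,fr,de-en,de-fr,en-fr \
  --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true \
  --batch_size 32 --bptt 256 --optimizer adam,lr=0.0001 --epoch_size 200000 --max_epoch 100 \
  --validation_metrics _valid_mlm_ppl --stopping_criterion _valid_mlm_ppl,10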
See mt_template.json file for more details.
config_file=../configs/mt_template.json
python train.py --config_file $config_file
When only the ae_steps and bt_steps objectives are specified, this is unsupervised machine translation, and it requires only monolingual data. If parallel data is available, give mt_steps a value based on the language pairs for which the data is available.
The description made above remains valid here.
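For example (a sketch, using the objective notation of XLM with the en and fr languages above), unsupervised MT between en and fr can be requested with:
# denoising auto-encoding on each language plus online back-translation in both directions
python train.py --config_file $config_file --ae_steps en,fr --bt_steps en-fr-en,fr-en-fr
# if en-fr parallel data is available, add : --mt_steps en-fr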
## main parameters
reload_model # model to reload for encoder,decoder
## data location / training objective
ae_steps # denoising auto-encoder training steps
bt_steps # back-translation steps
mt_steps # parallel training steps
word_shuffle # noise for auto-encoding loss
word_dropout # noise for auto-encoding loss
word_blank # noise for auto-encoding loss
lambda_ae # scheduling on the auto-encoding coefficient
## transformer parameters
encoder_only # set to False to use a decoder for MT
## optimization
tokens_per_batch # use batches with a fixed number of words
eval_bleu # also evaluate the BLEU score
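As with the language model parameters, any of these can be overridden on the command line (a sketch; $MODEL_PATH is a placeholder for a pretrained checkpoint to reload as encoder and decoder):
python train.py --config_file ../configs/mt_template.json \
  --reload_model $MODEL_PATH,$MODEL_PATH --tokens_per_batch 2000 --eval_bleu true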
Trained on (rows) / Evaluated on (cols) | Bafi | Bulu | Ewon | Ghom | Limb | Ngie | Dii | Doya | Peer | Samb | Guid | Guiz | Kaps | Mofa | Mofu | Du_n | Ejag | Fulf | Gbay | MASS | Tupu | Vute |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Bafi | 15.155782/46.113990 | 3522.435230/12.694301 | 10532.574414/3.108808 | 3414.970521/10.103627 | 3662.233924/10.880829 | 4476.028980/2.072539 | 4594.588311/10.362694 | 3840.575574/13.989637 | 3111.148085/13.212435 | 4210.511141/8.031088 | 6607.939683/2.590674 | 7506.246899/3.108808 | 11121.594025/3.367876 | 3122.591005/13.212435 | 3183.283705/10.621762 | 5504.065998/8.549223 | 4127.620979/3.108808 | 9107.779213/6.994819 | 7440.762805/3.886010 | 4916.778213/12.176166 | 8239.932584/4.922280 | 3192.590598/10.362694 |
Bulu | 577.711688/9.585492 | 18.602898/43.264249 | 795.094593/17.357513 | 589.636415/13.471503 | 1482.709434/8.549223 | 1113.122905/12.435233 | 994.030274/11.658031 | 820.063393/10.103627 | 828.162228/11.658031 | 1519.449874/3.367876 | 1183.604483/9.326425 | 671.542857/13.989637 | 1427.515245/5.440415 | 657.031222/13.212435 | 1018.342338/6.217617 | 602.305603/10.880829 | 1066.765090/6.994819 | 1349.669421/6.476684 | 605.298410/13.989637 | 1615.328636/5.699482 | 2493.141092/8.290155 | 699.009937/13.730570 |
Ewon | 2930.433348/13.730570 | 784.556467/12.435233 | 439.343693/11.139896 | 8576.270483/3.886010 | 1408.305834/12.176166 | 6329.517824/5.181347 | 4374.527024/8.031088 | 5703.222147/4.922280 | 3226.438808/13.471503 | 5147.417352/9.585492 | 7383.547206/3.886010 | 2049.974847/13.730570 | 3458.765537/12.176166 | 1428.351000/11.139896 | 4890.406327/1.813472 | 2050.215975/11.917098 | 4693.132443/2.331606 | 3796.911033/9.844560 | 4985.892435/7.253886 | 3737.211837/11.658031 | 8497.461052/1.036269 | 8105.614715/2.590674 |
Ghom | 10826.769423/12.176166 | 7919.745037/10.621762 | 13681.624683/6.735751 | 112.759549/22.538860 | 8550.764036/13.212435 | 21351.213307/11.658031 | 5724.234345/11.917098 | 7638.186054/10.621762 | 8992.791640/6.735751 | 9870.502751/5.440415 | 8671.271306/14.248705 | 7952.305962/9.844560 | 17073.248866/7.253886 | 17507.383398/3.626943 | 6253.188979/12.435233 | 6616.060359/9.585492 | 31826.000072/3.108808 | 11636.816092/11.398964 | 6129.150512/14.507772 | 9667.854370/11.139896 | 14276.187678/8.031088 | 7152.109226/12.953368 |
Limb | 2348.605310/7.772021 | 5910.088736/10.103627 | 11640.836610/2.331606 | 2234.982947/8.031088 | 16.486114/48.186528 | 5240.029343/10.880829 | 3485.743598/11.139896 | 1744.289850/10.880829 | 2357.786346/11.658031 | 2829.453145/10.362694 | 6097.658965/6.735751 | 2806.032546/9.326425 | 2530.422427/11.139896 | 2234.037369/14.507772 | 3106.412553/9.067358 | 5675.990382/8.549223 | 4323.215519/10.880829 | 5303.094881/7.512953 | 3222.476499/10.362694 | 2619.771393/12.435233 | 6315.916126/12.435233 | 1965.282639/9.326425 |
Ngie | 2494.668579/10.621762 | 1683.610004/7.772021 | 645.074490/13.212435 | 2747.857945/10.621762 | 865.229192/8.031088 | 53.604331/32.642487 | 3487.877577/5.440415 | 2973.100164/9.844560 | 1694.041692/9.844560 | 2285.872589/8.808290 | 3555.658122/3.626943 | 2240.803918/4.663212 | 8214.745127/2.849741 | 2162.964776/8.290155 | 4130.931993/5.699482 | 1251.907556/9.585492 | 1406.624816/6.735751 | 1134.593481/8.031088 | 3484.481404/9.844560 | 1587.951832/9.326425 | 1786.015603/9.326425 | 2117.031454/10.103627 |
Dii | 5369.974508/5.181347 | 3526.951377/11.917098 | 4466.736657/2.590674 | 3468.181916/8.808290 | 1524.457754/10.880829 | 856.533233/10.362694 | 16.031832/47.150259 | 3570.945172/11.658031 | 1933.128270/11.139896 | 3086.805425/7.253886 | 5545.945984/3.626943 | 1592.451661/11.139896 | 7351.154713/2.331606 | 1430.511351/14.248705 | 4198.900876/4.145078 | 2587.338616/8.290155 | 3315.158358/2.590674 | 2903.721453/8.808290 | 4416.753252/3.886010 | 3044.769713/5.440415 | 3276.637193/10.362694 | 3551.309415/8.808290 |
Doya | 2413.178389/7.253886 | 2925.237118/9.326425 | 3035.126064/9.844560 | 6431.020717/4.404145 | 2888.802299/10.362694 | 4296.348738/9.585492 | 1963.357861/9.067358 | 225.399738/14.507772 | 2647.241446/4.663212 | 3559.797389/1.036269 | 3224.327707/8.549223 | 1628.560179/16.062176 | 7036.636934/2.072539 | 2378.384535/7.772021 | 2526.667089/10.103627 | 2560.562728/10.362694 | 3486.425933/7.253886 | 4898.016349/6.217617 | 1336.163366/12.176166 | 5378.777228/0.518135 | 2334.347220/9.585492 | 4210.426671/6.476684 |
Peer | 5417.812131/7.253886 | 3718.857566/8.290155 | 3921.429577/10.103627 | 8042.333854/2.590674 | 4744.329113/12.435233 | 2378.606152/7.772021 | 4297.265443/7.253886 | 7835.525318/3.108808 | 27.612503/46.113990 | 8547.481994/3.367876 | 7819.217930/4.922280 | 2009.553562/13.730570 | 7929.664487/2.590674 | 5227.466016/3.108808 | 2828.595071/10.103627 | 3109.933571/11.398964 | 3449.171674/7.512953 | 7517.809582/5.181347 | 3593.460649/9.326425 | 6490.444215/5.181347 | 8583.548031/6.994819 | 3640.649700/9.585492 |
Samb | 1921.203126/10.621762 | 2876.156252/8.808290 | 5222.268404/2.331606 | 2258.419159/8.808290 | 2940.603464/9.844560 | 757.885957/10.362694 | 2852.564926/3.886010 | 3568.046199/9.585492 | 3198.132105/11.658031 | 14.473909/45.336788 | 2135.946491/9.326425 | 1882.791510/12.435233 | 1380.449126/12.694301 | 2739.728389/6.217617 | 1114.151589/13.989637 | 2588.952886/10.362694 | 2408.673909/9.844560 | 1012.804391/13.471503 | 4310.704371/6.217617 | 2429.426652/3.108808 | 1681.603952/7.772021 | 2305.207465/4.404145 |
Guid | 11105.869490/11.917098 | 11350.393050/8.549223 | 24157.732815/2.331606 | 28800.139343/5.440415 | 9497.473893/11.139896 | 11941.642599/11.658031 | 26891.060403/2.072539 | 35288.834478/3.367876 | 11458.390164/9.326425 | 8581.012321/12.953368 | 669.152371/22.020725 | 8237.415053/12.953368 | 24641.309182/3.626943 | 12256.261503/6.735751 | 8329.239657/15.025907 | 18733.469719/2.590674 | 13013.633062/11.398964 | 22151.485850/4.922280 | 15139.079118/12.176166 | 12649.997596/11.139896 | 13526.708187/9.844560 | 14521.723680/13.471503 |
Guiz | 1900.984819/11.917098 | 3422.299591/5.440415 | 2920.779863/13.212435 | 2657.232975/3.886010 | 7763.772745/6.217617 | 2516.088934/11.398964 | 1556.474440/12.953368 | 1450.939238/12.694301 | 1852.263760/12.435233 | 3503.139397/5.440415 | 1957.981930/7.772021 | 5.612643/60.362694 | 2030.975178/10.621762 | 3100.456750/9.585492 | 3816.057439/9.067358 | 2527.372931/10.103627 | 2017.135324/9.585492 | 1771.010720/12.953368 | 2467.262902/9.067358 | 6465.542228/6.735751 | 4936.521836/5.181347 | 3251.372451/4.663212 |
Kaps | 4787.151015/7.772021 | 4026.495938/9.067358 | 2591.212157/13.730570 | 3963.789278/11.139896 | 4835.168698/9.844560 | 3738.018788/5.958549 | 3472.599548/9.067358 | 2846.824328/9.067358 | 3964.442923/6.217617 | 8248.174848/4.663212 | 3178.776910/9.326425 | 4521.187784/6.476684 | 6.392693/63.730570 | 4535.673748/6.476684 | 2285.708359/13.730570 | 5222.426332/5.699482 | 4409.982716/5.440415 | 2124.534904/10.362694 | 4863.209844/10.362694 | 4875.780156/3.886010 | 4278.744225/12.176166 | 4661.710772/9.067358 |
Mofa | 5555.267163/7.772021 | 5328.793555/11.658031 | 6064.913246/13.730570 | 8844.481560/5.181347 | 14355.051790/6.217617 | 10773.098216/8.290155 | 5702.554716/11.398964 | 11819.967712/5.958549 | 5810.652609/12.435233 | 10899.166334/6.476684 | 9606.038800/5.699482 | 4528.077873/13.471503 | 10261.988658/9.844560 | 38.718690/38.341969 | 7191.371927/8.290155 | 4847.594375/14.248705 | 8110.295270/9.844560 | 14375.814958/5.699482 | 10070.806870/3.626943 | 10826.318474/8.290155 | 10187.374717/7.772021 | 16953.170797/3.626943 |
Mofu | 2175.168540/11.658031 | 3005.393159/10.621762 | 2773.793897/7.253886 | 2257.313709/6.476684 | 1807.203325/13.471503 | 2481.194623/2.331606 | 1626.688315/12.435233 | 1473.207901/13.212435 | 3206.638463/8.290155 | 1358.112972/12.435233 | 2550.513183/10.880829 | 1867.275865/12.694301 | 2847.897967/4.145078 | 1645.699003/13.471503 | 50.399227/32.642487 | 3831.820284/3.108808 | 1679.421861/9.844560 | 1957.944241/13.989637 | 1655.398024/13.212435 | 3439.753108/6.735751 | 4164.392749/9.844560 | 2176.478824/10.103627 |
Du_n | 3358.977688/12.694301 | 8269.025689/5.958549 | 6784.926221/4.922280 | 4034.987828/10.362694 | 8317.977821/5.440415 | 4469.988388/9.326425 | 4581.242219/9.585492 | 4046.289387/10.880829 | 4587.843666/10.880829 | 4061.430238/12.435233 | 4116.231632/8.031088 | 4043.687467/11.658031 | 8587.884922/5.699482 | 2518.760103/13.989637 | 9252.838415/6.217617 | 38.646292/34.196891 | 2823.000209/11.658031 | 7688.259347/5.699482 | 4184.395191/9.844560 | 6460.323149/9.844560 | 12418.880207/5.699482 | 4394.753911/10.362694 |
Ejag | 878.221181/8.290155 | 2977.854246/10.362694 | 1122.454274/13.212435 | 4066.806240/3.626943 | 4401.408293/12.694301 | 1324.839235/11.139896 | 2760.972117/9.585492 | 802.718089/8.808290 | 1935.328428/6.735751 | 2456.134064/8.549223 | 948.726346/11.658031 | 1464.326862/6.994819 | 1999.633312/6.476684 | 2483.815842/4.663212 | 790.752998/11.917098 | 1436.471564/10.362694 | 27.125567/39.896373 | 2701.314483/8.549223 | 739.895562/13.989637 | 1119.207373/9.844560 | 2061.967307/3.367876 | 3116.635849/4.663212 |
Fulf | 3122.754082/11.139896 | 3172.412810/8.290155 | 2632.034499/10.103627 | 1803.237123/14.507772 | 3015.507576/12.953368 | 4697.430105/10.621762 | 2221.398811/11.917098 | 3338.511704/7.772021 | 5857.163684/4.663212 | 2631.329961/12.694301 | 1756.767457/14.248705 | 3965.216351/8.031088 | 2961.580251/10.362694 | 1850.532804/14.248705 | 2431.677037/8.808290 | 2688.040706/8.549223 | 6237.846441/3.108808 | 9.819160/53.108808 | 1794.314668/12.435233 | 2633.154009/4.922280 | 5899.732260/9.585492 | 6035.594459/5.440415 |
Gbay | 3537.010215/8.808290 | 2213.336729/9.326425 | 958.976958/14.766839 | 2170.105117/2.849741 | 2381.840897/8.549223 | 1092.011356/11.398964 | 989.079405/15.284974 | 2110.708219/12.953368 | 1212.493865/13.989637 | 1342.159428/12.953368 | 784.478130/16.321244 | 1404.757907/15.284974 | 1949.759014/13.730570 | 1165.979838/12.694301 | 1940.255308/5.699482 | 1073.951745/13.730570 | 2180.263932/7.253886 | 2639.229412/8.031088 | 4.503568/64.766839 | 2711.475687/5.440415 | 2879.142805/11.139896 | 2777.515280/3.626943 |
MASS | 2052.763675/6.476684 | 2123.090411/11.139896 | 1150.690864/11.398964 | 404.857470/19.170984 | 4114.380214/2.849741 | 1177.460159/10.880829 | 1553.261634/11.917098 | 767.332823/13.212435 | 1558.036793/6.217617 | 673.483311/13.730570 | 1308.799442/6.735751 | 2525.700131/5.440415 | 1157.282835/14.248705 | 1665.795367/8.031088 | 969.622799/11.139896 | 2236.251124/10.621762 | 1768.310288/9.585492 | 1530.460913/10.621762 | 703.513823/14.766839 | 9.311520/52.072539 | 3781.478640/5.440415 | 783.170102/16.580311 |
Tupu | 499.010245/24.611399 | 2789.182977/9.844560 | 1176.557896/16.062176 | 335.366353/21.243523 | 3759.854817/4.922280 | 1473.248900/8.290155 | 1637.969909/15.284974 | 444.487258/23.056995 | 729.184899/19.430052 | 326.348924/24.611399 | 530.140976/24.611399 | 834.757176/20.207254 | 1014.747872/11.398964 | 1361.103340/11.398964 | 447.754239/17.875648 | 1313.622745/15.803109 | 2020.767969/9.326425 | 1234.031067/13.730570 | 242.696296/29.533679 | 1209.709716/14.766839 | 5.328121/62.953368 | 678.820813/13.730570 |
Vute | 5247.001730/8.290155 | 2972.688386/11.398964 | 3141.040872/9.067358 | 4304.014532/12.435233 | 2981.350915/10.880829 | 7944.078280/2.331606 | 3013.186151/13.730570 | 2532.120943/12.176166 | 4688.069751/9.844560 | 8022.399859/3.886010 | 5315.095277/3.626943 | 2075.166168/12.694301 | 3794.597938/12.176166 | 2879.870276/13.212435 | 4364.837110/3.367876 | 3858.872867/8.549223 | 2749.070864/10.880829 | 9917.265191/3.367876 | 8091.176547/3.108808 | 5939.386425/4.404145 | 7670.501815/2.849741 | 43.658700/33.419689 |
If you want to evaluate the LM on a language lang, you must first have a file named lang.txt in the $src_path directory of eval_data.sh.
For example if you want to use the biblical corpus, you can run scripts/bible.py :
# folder containing the csvs folder
csv_path=
# folder in which the objective folders will be created (mono or para)
output_dir=
# monolingual one ("mono") or parallel one ("para")
data_type=mono
# list of languages to be considered in alphabetical order and separated by a comma
# case of one language
languages=lang,lang
# case of many languages
languages=lang1,lang2,...
# old_only : use only the old testament
# new_only : use only the new testament
new_only=True
python ../scripts/bible.py --csv_path $csv_path --output_dir $output_dir --data_type $data_type --languages $languages --new_only $new_only
See other parameters in scripts/bible.py
Modify parameters in eval_data.sh
# languages to be evaluated
languages=lang1,lang2,...
chmod +x ../eval_data.sh
../eval_data.sh $languages
We take the language to evaluate (say Bulu) and rename its file test.Bulu.pth (which was created with the VOCAB and CODE of Bafi, the evaluating language) to test.Bafi.pth (since Bafi is the evaluating language, the train.py script requires the dataset file names to contain the name given in lgs). Then we just run the evaluation; the results (acc and ppl) we get are the results of the Bafia LM on the Bulu language.
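A minimal sketch of this renaming step (assuming the evaluation files produced by eval_data.sh are in $src_path):
# make the Bulu test set look like a Bafia test set so that the Bafia LM (lgs=Bafi) can load it
mv $src_path/test.Bulu.pth $src_path/test.Bafi.pth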
# evaluating language
tgt_pair=
# folder containing the data to be evaluated (must match $tgt_path in eval_data.sh)
src_path=
# You have to change two parameters in the configuration file used to train the LM which evaluates ("data_path":"$src_path" and "eval_only": "True")
# You must also specify the "reload_model" parameter, otherwise the last checkpoint found will be loaded for evaluation.
config_file=../configs/lm_template.json
# languages to be evaluated
eval_lang=
chmod +x ../scripts/evaluate.sh
../scripts/evaluate.sh $eval_lang
When the evaluation is finished you will see a file named eval.log in the $dump_path/$exp_name/$exp_id folder containing the evaluation results.
Note : The description given above is only valid when the evaluating LM has been trained on only one language (and therefore without TLM). Now let's consider the case where the base LM has been trained on en-fr and we want to evaluate it on de or de-ru. $tgt_pair becomes en-fr, but lang varies depending on whether the evaluation is done on one language or on two:
- de : lang=de-de
- de-ru : lang=de-ru
Please cite [1] and [2] if you found the resources in this repository useful.
[1] G. Lample*, A. Conneau*, Cross-lingual Language Model Pretraining, and facebookresearch/XLM
* Equal contribution. Order has been determined with a coin flip.
@article{lample2019cross,
title={Cross-lingual Language Model Pretraining},
author={Lample, Guillaume and Conneau, Alexis},
journal={Advances in Neural Information Processing Systems (NeurIPS)},
year={2019}
}
See the LICENSE file for more details.