<h2 id="i-cross-lingual-language-model-pretraining-xlm-https-github-com-facebookresearch-xlm-">I. Cross-lingual language model pretraining (<a href="https://github.com/facebookresearch/XLM">XLM</a>)</h2>
<p>XLM supports multi-GPU and multi-node training, and contains code for:</p>
<ul>
<li><strong>Language model pretraining</strong>:<ul>
<li><strong>Causal Language Model</strong> (CLM)</li>
<li><strong>Masked Language Model</strong> (MLM)</li>
<li><strong>Translation Language Model</strong> (TLM)</li>
</ul>
</li>
<li><strong>GLUE</strong> fine-tuning</li>
<li><strong>XNLI</strong> fine-tuning</li>
<li><strong>Supervised / Unsupervised MT</strong> training:<ul>
<li>Denoising auto-encoder</li>
<li>Parallel data training</li>
<li>Online back-translation</li>
</ul>
</li>
</ul>
<h4 id="dependencies">Dependencies</h4>
<ul>
<li>Python 3</li>
<li><a href="http://www.numpy.org/">NumPy</a></li>
<li><a href="http://pytorch.org/">PyTorch</a> (currently tested on version 0.4 and 1.0)</li>
<li><a href="https://github.com/facebookresearch/XLM/tree/master/tools#fastbpe">fastBPE</a> (generate and apply BPE codes)</li>
<li><a href="https://github.com/facebookresearch/XLM/tree/master/tools#tokenizers">Moses</a> (scripts to clean and tokenize text only - no installation required)</li>
<li><a href="https://github.com/nvidia/apex#quick-start">Apex</a> (for fp16 training)</li>
</ul>
<h2 id="ii-train-your-own-model">II. Train your own model</h2>
<p><strong>Open the illustrative notebook in colab</strong><a href="https://colab.research.google.com/github/Tikquuss/meta_XLM/blob/master/notebooks/demo/tuto.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a></p>
<p><strong>Note</strong>: Most of the bash scripts in this repository were written on Windows and may trigger this <a href="https://prograide.com/pregunta/5588/configure--bin--sh--m-mauvais-interpreteur">error</a> on Linux because of their Windows line endings (carriage returns).<br>The following command strips the carriage returns from a script: </p>
<pre><code>filename=my_file.sh 
cat <span class="hljs-built_in">$filename</span> | tr -<span class="hljs-keyword">d</span> '\<span class="hljs-keyword">r</span>' &gt; <span class="hljs-built_in">$filename</span>.<span class="hljs-keyword">new</span> &amp;&amp; rm <span class="hljs-built_in">$filename</span> &amp;&amp; <span class="hljs-keyword">mv</span> <span class="hljs-built_in">$filename</span>.<span class="hljs-keyword">new</span> <span class="hljs-built_in">$filename</span>
</code></pre><h3 id="1-preparing-the-data">1. Preparing the data</h3>
<p>If you already have pre-processed binary data in pth format (for example from XLM experiments, or produced by yourself), put the files in a single folder and pass that folder as a parameter when calling the script <a href="XLM/train.py">train.py</a>.<br>Otherwise, we assume that you have txt files available for preprocessing. The following example covers three translation tasks: <code>English-French, German-English and German-French</code>.</p>
<p>We have the following files available for preprocessing: </p>
<pre><code>- en-fr<span class="hljs-selector-class">.en</span><span class="hljs-selector-class">.txt</span> and en-fr<span class="hljs-selector-class">.fr</span><span class="hljs-selector-class">.txt</span> 
- de-en<span class="hljs-selector-class">.de</span><span class="hljs-selector-class">.txt</span> and de-en<span class="hljs-selector-class">.en</span><span class="hljs-selector-class">.txt</span> 
- de-fr<span class="hljs-selector-class">.de</span><span class="hljs-selector-class">.txt</span> and de-fr<span class="hljs-selector-class">.fr</span><span class="hljs-selector-class">.txt</span>
</code></pre><p>All these files must be in the same folder (<code>PARA_PATH</code>).<br>You can also have monolingual data, either in addition to or instead of the parallel data (<code>en.txt, de.txt and fr.txt</code>; in the <code>MONO_PATH</code> folder).<br>Parallel and monolingual data can all be in the same folder.</p>
<p><strong>Note</strong> : Languages must be given in alphabetical order (<code>de-en and not en-de, fr-ru and not ru-fr...</code>). If you give them in another order you will have problems loading data during training, because the <a href="XLM/train.py">train.py</a> script puts parameters such as the language pair back in alphabetical order before processing them. This restriction is not a limitation in practice: XLM for MT is trained to translate sentences in both directions. See <a href="scripts/translate.py">translate.py</a>.</p>
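<p>As an illustration of the naming rule above, here is a minimal Python sketch that sorts the two codes of a pair alphabetically and derives the raw text file names expected in <code>PARA_PATH</code>. The helper names <code>normalize_pair</code> and <code>parallel_file_names</code> are hypothetical and are not part of the repository.</p>

```python
def normalize_pair(pair):
    """Return a pair such as 'en-de' with its two language codes sorted alphabetically."""
    first, second = pair.split("-")
    return "-".join(sorted([first, second]))


def parallel_file_names(pair):
    """Raw text files expected for one pair, following the <pair>.<lang>.txt pattern."""
    ordered = normalize_pair(pair)
    return [f"{ordered}.{lang}.txt" for lang in ordered.split("-")]


if __name__ == "__main__":
    for raw in ["en-de", "ru-fr", "de-fr"]:
        print(raw, "->", normalize_pair(raw), parallel_file_names(raw))
```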
<p><a href="http://opus.nlpl.eu/">OPUS collections</a> is a good source of datasets. The <a href="scripts/opus.sh">opus.sh</a> script illustrates how to download data from OPUS and convert it to text files.<br>Change the parameters ($PARA_PATH and $SRC) in <a href="scripts/opus.sh">opus.sh</a> before running it.</p>
<pre><code><span class="hljs-keyword">cd</span> meta_XLM
chmod +x ./scripts/opus.<span class="hljs-keyword">sh</span>
./scripts/opus.<span class="hljs-keyword">sh</span> <span class="hljs-keyword">de</span>-fr
</code></pre><p>Another source for <code>other_languages-english</code> data is <a href="http://www.manythings.org/anki/">anki Tab-delimited Bilingual Sentence Pairs</a>. Simply download the .zip file and unzip it to extract the <code>other_language.txt</code> file. Each line of this file usually has the form <code>sentence_en sentence_other_language other_information</code>. See <a href="scripts/anki.py">anki.py</a> and <a href="scripts/anki.sh">anki.sh</a> for how to extract data from <a href="http://www.manythings.org/anki/">anki</a>. The following example downloads and extracts the <code>de-en</code> and <code>en-fr</code> pairs.</p>
<pre><code>cd meta_XLM
output_path=/<span class="hljs-attribute">content</span>/data/para
mkdir <span class="hljs-variable">$output_path</span>
chmod +x ./scripts/anki<span class="hljs-selector-class">.sh</span>
./scripts/anki<span class="hljs-selector-class">.sh</span> de,en deu-eng <span class="hljs-variable">$output_path</span> scripts/anki<span class="hljs-selector-class">.py</span>
./scripts/anki<span class="hljs-selector-class">.sh</span> en,fr fra-eng <span class="hljs-variable">$output_path</span> scripts/anki.py
</code></pre><p>After that you will have the following files in <code>data/para</code> : <code>de-en.de.txt, de-en.en.txt, deu.txt, deu-eng.zip and _about.txt</code>  </p>
<p>Move to the <code>XLM</code> folder before continuing.  </p>
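<p>To make the conversion concrete, the sketch below splits a tab-delimited anki file into two aligned files with one sentence per line. It is not the code of <code>anki.py</code>: the function name <code>split_anki_file</code> is hypothetical, and it assumes the English sentence is in the first column and the other language in the second, as described above.</p>

```python
import os
import tempfile


def split_anki_file(anki_path, lang_pair, other_lang, output_dir):
    """Write <pair>.en.txt and <pair>.<other_lang>.txt from a tab-delimited anki file."""
    en_path = os.path.join(output_dir, f"{lang_pair}.en.txt")
    other_path = os.path.join(output_dir, f"{lang_pair}.{other_lang}.txt")
    with open(anki_path, encoding="utf-8") as src, \
            open(en_path, "w", encoding="utf-8") as en_out, \
            open(other_path, "w", encoding="utf-8") as other_out:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip empty or malformed lines
            en_out.write(fields[0].strip() + "\n")
            other_out.write(fields[1].strip() + "\n")
    return en_path, other_path


if __name__ == "__main__":
    # Tiny made-up sample, written to a temporary folder for the demonstration
    folder = tempfile.mkdtemp()
    sample = os.path.join(folder, "deu.txt")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("Hi.\tHallo!\tattribution\nGo.\tGeh.\tattribution\n")
    print(split_anki_file(sample, "de-en", "de", folder))
```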
<pre><code><span class="hljs-built_in">cd</span> XLM
</code></pre><p>Install the following dependencies (<a href="https://github.com/facebookresearch/XLM/tree/master/tools#fastbpe">fastBPE</a> and <a href="https://github.com/facebookresearch/XLM/tree/master/tools#tokenizers">Moses</a>) if you have not already done so. </p>
<pre><code>git <span class="hljs-keyword">clone</span> <span class="hljs-title">https</span>://github.com/moses-smt/mosesdecoder tools/mosesdecoder
git <span class="hljs-keyword">clone</span> <span class="hljs-title">https</span>://github.com/glample/fastBPE tools/fastBPE &amp;&amp; cd tools/fastBPE &amp;&amp; g++ -<span class="hljs-attr">std=</span>c++<span class="hljs-number">11</span> -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
</code></pre><p>Change the parameters in <a href="data.sh">data.sh</a>. Between lines 94 and 100 of <a href="data.sh">data.sh</a>, there are two options, each corresponding to a script to execute depending on how the folders containing your data are organized. Option 2 is selected by default; uncomment the lines corresponding to your option.<br>With too many BPE codes (relative to the size of the dataset) you may get this <a href="https://github.com/glample/fastBPE/issues/7">error</a>. In that case, decrease the number of codes (e.g. use a binary search to find the largest number of codes for which the error disappears).</p>
<pre><code>languages=<span class="hljs-keyword">de</span>,<span class="hljs-keyword">en</span>,fr
chmod +x ../data.<span class="hljs-keyword">sh</span> 
../data.<span class="hljs-keyword">sh</span> <span class="hljs-variable">$languages</span>
</code></pre><p>If you interrupt the execution while a file is being processed, delete that incomplete file before resuming or restarting; otherwise the processing will continue with the incomplete file and errors are likely to follow.  </p>
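<p>The search for a workable number of BPE codes mentioned above can be sketched as a binary search. This is only an illustration: <code>works</code> is a hypothetical callback that you would implement yourself (for example by running the BPE learning step with <code>n</code> codes and checking that it succeeds), and the sketch assumes that if a given number of codes works, every smaller number works too.</p>

```python
def max_working_codes(works, low, high):
    """Return the largest n in [low, high] for which works(n) is True, or None.

    Assumes monotonicity: if works(n) is True, works(m) is True for all m <= n.
    """
    best = None
    while low <= high:
        mid = (low + high) // 2
        if works(mid):
            best = mid        # mid is fine, try a larger number of codes
            low = mid + 1
        else:
            high = mid - 1    # mid fails, try a smaller number of codes
    return best


if __name__ == "__main__":
    # Stand-in predicate: pretend the error appears above an arbitrary limit
    pretend_limit = 12345
    print(max_working_codes(lambda n: n <= pretend_limit, 1, 50000))
```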
<p>After this you will have the following (necessary) files in <code>$OUTPATH</code> (and <code>$OUTPATH/fine_tune</code> depending on the parameter <code>$sub_task</code>):  </p>
<pre><code>- monolingual data :
    - training data   : train<span class="hljs-selector-class">.fr</span><span class="hljs-selector-class">.pth</span>, train<span class="hljs-selector-class">.en</span><span class="hljs-selector-class">.pth</span> and train<span class="hljs-selector-class">.de</span><span class="hljs-selector-class">.pth</span>
    - test data       : test<span class="hljs-selector-class">.fr</span><span class="hljs-selector-class">.pth</span>, test<span class="hljs-selector-class">.en</span><span class="hljs-selector-class">.pth</span> and test<span class="hljs-selector-class">.de</span><span class="hljs-selector-class">.pth</span>
    - validation data : valid<span class="hljs-selector-class">.fr</span><span class="hljs-selector-class">.pth</span>, valid<span class="hljs-selector-class">.en</span><span class="hljs-selector-class">.pth</span> and valid<span class="hljs-selector-class">.de</span><span class="hljs-selector-class">.pth</span>
- parallel data :
    - training data : 
        - train<span class="hljs-selector-class">.en-fr</span><span class="hljs-selector-class">.en</span><span class="hljs-selector-class">.pth</span> and train<span class="hljs-selector-class">.en-fr</span><span class="hljs-selector-class">.fr</span><span class="hljs-selector-class">.pth</span> 
        - train<span class="hljs-selector-class">.de-en</span><span class="hljs-selector-class">.en</span><span class="hljs-selector-class">.pth</span> and train<span class="hljs-selector-class">.de-en</span><span class="hljs-selector-class">.de</span><span class="hljs-selector-class">.pth</span>
        - train<span class="hljs-selector-class">.de-fr</span><span class="hljs-selector-class">.de</span><span class="hljs-selector-class">.pth</span> and train<span class="hljs-selector-class">.de-fr</span><span class="hljs-selector-class">.fr</span><span class="hljs-selector-class">.pth</span> 
    - test data :
        - test<span class="hljs-selector-class">.en-fr</span><span class="hljs-selector-class">.en</span><span class="hljs-selector-class">.pth</span> and test<span class="hljs-selector-class">.en-fr</span><span class="hljs-selector-class">.fr</span><span class="hljs-selector-class">.pth</span> 
        - test<span class="hljs-selector-class">.de-en</span><span class="hljs-selector-class">.en</span><span class="hljs-selector-class">.pth</span> and test<span class="hljs-selector-class">.de-en</span><span class="hljs-selector-class">.de</span><span class="hljs-selector-class">.pth</span>
        - test<span class="hljs-selector-class">.de-fr</span><span class="hljs-selector-class">.de</span><span class="hljs-selector-class">.pth</span> and test<span class="hljs-selector-class">.de-fr</span><span class="hljs-selector-class">.fr</span><span class="hljs-selector-class">.pth</span> 
    - validation data
        - valid<span class="hljs-selector-class">.en-fr</span><span class="hljs-selector-class">.en</span><span class="hljs-selector-class">.pth</span> and valid<span class="hljs-selector-class">.en-fr</span><span class="hljs-selector-class">.fr</span><span class="hljs-selector-class">.pth</span> 
        - valid<span class="hljs-selector-class">.de-en</span><span class="hljs-selector-class">.en</span><span class="hljs-selector-class">.pth</span> and valid<span class="hljs-selector-class">.de-en</span><span class="hljs-selector-class">.de</span><span class="hljs-selector-class">.pth</span>
        - valid<span class="hljs-selector-class">.de-fr</span><span class="hljs-selector-class">.de</span><span class="hljs-selector-class">.pth</span> and valid<span class="hljs-selector-class">.de-fr</span><span class="hljs-selector-class">.fr</span><span class="hljs-selector-class">.pth</span> 
 - <span class="hljs-selector-tag">code</span> and vocab
</code></pre><p>To use the biblical corpus, run <a href="bible.sh">bible.sh</a> instead of <a href="data.sh">data.sh</a>. Here is the list of languages available (and to be specified as <code>$languages</code> value) in this case : </p>
<ul>
<li><strong>Languages with data in the New and Old Testament</strong> : <code>Francais, Anglais, Fulfulde_Adamaoua or Fulfulde_DC (formal name : Fulfulde), Bulu, KALATA_KO_SC_Gbaya or KALATA_KO_DC_Gbaya (formal name :  Gbaya), BIBALDA_TA_PELDETTA (formal name : MASSANA), Guiziga, Kapsiki_DC (formal name : Kapsiki), Tupurri</code>.</li>
<li><strong>Languages with data in the New Testament only</strong> : <code>Bafia, Ejagham, Ghomala, MKPAMAN_AMVOE_Ewondo (formal name : Ewondo), Ngiemboon, Dii, Vute, Limbum, Mofa, Mofu_Gudur, Doyayo, Guidar, Peere_Nt&amp;Psalms, Samba_Leko, Du_na_sdik_na_wiini_Alaw</code>.<br>As specified in <a href="bible.sh">bible.sh</a>, <code>csv_path</code> must contain a folder named csvs. Here is the <a href="https://drive.google.com/file/d/1NuSJ-NT_BsU1qopLu6avq6SzUEf6nVkk/view?usp=sharing">drive link</a> to its zipped version.<br>For training, specify the first four letters of each language (<code>Bafi</code> instead of <code>Bafia</code> for example), except <code>KALATA_KO_SC_Gbaya/KALATA_KO_DC_Gbaya which becomes Gbay (first letters of Gbaya), BIBALDA_TA_PELDETTA which becomes MASS (first letters of MASSANA), MKPAMAN_AMVOE_Ewondo which becomes Ewon (first letters of Ewondo), Francais and Anglais which become respectively fr and en</code>. Indeed, <a href="bible.sh">bible.sh</a> uses these abbreviations, not the full language names, to create the files.<br>Finally, with the biblical corpus, when only one language is to be specified, it must be specified twice. For example: <code>languages=Bafia,Bafia</code> instead of <code>languages=Bafia</code>.</li>
</ul>
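<p>The abbreviation rule and the "specify a single language twice" rule described above can be summarised in a few lines of Python. This is a sketch of the convention as stated in this section, not code taken from <code>bible.sh</code>; the names <code>abbreviate</code> and <code>bible_languages_argument</code> are hypothetical.</p>

```python
# Exceptions to the "first four letters" rule, as listed in the text above
EXCEPTIONS = {
    "KALATA_KO_SC_Gbaya": "Gbay",
    "KALATA_KO_DC_Gbaya": "Gbay",
    "BIBALDA_TA_PELDETTA": "MASS",
    "MKPAMAN_AMVOE_Ewondo": "Ewon",
    "Francais": "fr",
    "Anglais": "en",
}


def abbreviate(language):
    """Abbreviation used at training time for a biblical-corpus language name."""
    return EXCEPTIONS.get(language, language[:4])


def bible_languages_argument(languages):
    """Build the $languages value, repeating the language when only one is given."""
    if len(languages) == 1:
        languages = languages * 2
    return ",".join(languages)


if __name__ == "__main__":
    for name in ["Bafia", "Francais", "BIBALDA_TA_PELDETTA", "Mofu_Gudur"]:
        print(name, "->", abbreviate(name))
    print(bible_languages_argument(["Bafia"]))
```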
<h3 id="2-pretrain-a-language-model">2. Pretrain a language model</h3>
<p>Install the following dependency (<a href="https://github.com/nvidia/apex#quick-start">Apex</a>) if you have not already done so.</p>
<pre><code>git clone https:<span class="hljs-comment">//github.com/NVIDIA/apex</span>
pip install -v --<span class="hljs-keyword">no</span>-cache-<span class="hljs-keyword">dir</span> --<span class="hljs-keyword">global</span>-option=<span class="hljs-string">"--cpp_ext"</span> --<span class="hljs-keyword">global</span>-option=<span class="hljs-string">"--cuda_ext"</span> ./apex
</code></pre><p>Instead of passing all the parameters of train.py on the command line, put them in a json file and pass the path to this file as a parameter (see the <a href="configs/lm_template.json">lm_template.json</a> file for more details).</p>
<pre><code>config_file=../configs/lm_template<span class="hljs-selector-class">.json</span>
python train<span class="hljs-selector-class">.py</span> --config_file <span class="hljs-variable">$config_file</span>
</code></pre><p>If you pass a parameter when calling the script <a href="XLM/train.py">train.py</a> (example: <code>python train.py --config_file $config_file --data_path my/data_path</code>), it overrides the value given in <code>$config_file</code>.<br>Once training is finished, the <code>$dump_path/$exp_name/$exp_id</code> folder contains a file named <code>train.log</code> with information about the training, along with your checkpoints and best model.<br>When <code>&quot;mlm_steps&quot;:&quot;...&quot;</code>, train.py automatically uses the languages to obtain <code>&quot;mlm_steps&quot;:&quot;de,en,fr,de-en,de-fr,en-fr&quot;</code> (give an explicit value to mlm_steps if you don&#39;t want to do all MLM and TLM, example : <code>&quot;mlm_steps&quot;:&quot;en,fr,en-fr&quot;</code>). This also applies to <code>&quot;clm_steps&quot;:&quot;...&quot;</code>, which becomes <code>&quot;clm_steps&quot;:&quot;de,en,fr&quot;</code> in this case.    </p>
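<p>The expansion of <code>"..."</code> described above can be sketched as follows. This is an illustration of the described behaviour rather than the actual code of train.py; the function name <code>expand_steps</code> and its <code>with_pairs</code> flag are hypothetical.</p>

```python
from itertools import combinations


def expand_steps(steps, languages, with_pairs=True):
    """Expand "..." into all languages (and, optionally, all language pairs).

    Any other value is returned unchanged.
    """
    if steps != "...":
        return steps
    langs = sorted(languages)
    parts = list(langs)
    if with_pairs:
        # Pairs are built from the sorted list, so they are in alphabetical order
        parts += [f"{a}-{b}" for a, b in combinations(langs, 2)]
    return ",".join(parts)


if __name__ == "__main__":
    lgs = ["fr", "en", "de"]
    print("mlm_steps:", expand_steps("...", lgs))                    # MLM + TLM
    print("clm_steps:", expand_steps("...", lgs, with_pairs=False))  # CLM only
    print("explicit :", expand_steps("en,fr,en-fr", lgs))
```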
<p>Note :<br>-<code>en</code> means MLM on <code>en</code>, and requires the following three files in <code>data_path</code>: <code>a.en.pth, a ∈ {train, test, valid} (monolingual data)</code><br>-<code>en-fr</code> means TLM on <code>en and fr</code>, and requires the following six files in <code>data_path</code>: <code>a.en-fr.b.pth, a ∈ {train, test, valid} and b ∈ {en, fr} (parallel data)</code><br>-<code>en,fr,en-fr</code> means MLM on <code>en</code> and on <code>fr</code> plus TLM on <code>en and fr</code>, and requires the following twelve files in <code>data_path</code>: <code>a.b.pth and a.en-fr.b.pth, a ∈ {train, test, valid} and b ∈ {en, fr}</code>  </p>
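<p>Before launching training, it can help to check that the files listed in the note above are present. The sketch below derives the expected file names from a steps value; <code>required_files</code> and <code>missing_files</code> are hypothetical helpers, not functions of the repository.</p>

```python
import os


def required_files(steps):
    """List the .pth files needed in data_path for a steps value such as 'en,fr,en-fr'."""
    files = []
    for step in steps.split(","):
        langs = step.split("-")
        for split in ("train", "valid", "test"):
            if len(langs) == 1:
                # Monolingual data: <split>.<lang>.pth
                files.append(f"{split}.{step}.pth")
            else:
                # Parallel data: <split>.<pair>.<lang>.pth for each side of the pair
                files.extend(f"{split}.{step}.{lang}.pth" for lang in langs)
    return files


def missing_files(data_path, steps):
    """Return the required files that do not exist in data_path."""
    return [name for name in required_files(steps)
            if not os.path.exists(os.path.join(data_path, name))]


if __name__ == "__main__":
    for value in ["en", "en-fr", "en,fr,en-fr"]:
        print(value, "->", len(required_files(value)), "files")
```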
<p>To <a href="https://github.com/facebookresearch/XLM#how-can-i-run-experiments-on-multiple-gpus">train with multiple GPUs</a> use:</p>
<pre><code>export NGPU=<span class="hljs-number">8</span>; <span class="hljs-keyword">python</span> -m torch.distributed.<span class="hljs-keyword">launch</span> --nproc_per_node=$NGPU train.py --config_file $config_file
</code></pre><p><strong>Tips</strong>: Even when the validation perplexity plateaus, keep training your model. The larger the batch size the better (so using multiple GPUs will improve performance). Tuning the learning rate (e.g. [0.0001, 0.0002]) should help.</p>
<h6 id="description-of-some-essential-parameters">Description of some essential parameters</h6>
<pre><code><span class="hljs-comment">## main parameters</span>
exp_name                     <span class="hljs-comment"># experiment name</span>
exp_id                       <span class="hljs-comment"># Experiment ID</span>
dump_path                    <span class="hljs-comment"># where to store the experiment (the model will be stored in $dump_path/$exp_name/$exp_id)</span>

<span class="hljs-comment">## data location / training objective</span>
data_path                    <span class="hljs-comment"># data location </span>
lgs                          <span class="hljs-comment"># considered languages/meta-tasks</span>
clm_steps                    <span class="hljs-comment"># CLM objective</span>
mlm_steps                    <span class="hljs-comment"># MLM objective</span>

<span class="hljs-comment">## transformer parameters</span>
emb_dim                      <span class="hljs-comment"># embeddings / model dimension</span>
n_layers                     <span class="hljs-comment"># number of layers</span>
n_heads                      <span class="hljs-comment"># number of heads</span>
dropout                      <span class="hljs-comment"># dropout</span>
attention_dropout            <span class="hljs-comment"># attention dropout</span>
gelu_activation              <span class="hljs-comment"># GELU instead of ReLU</span>

<span class="hljs-comment">## optimization</span>
<span class="hljs-keyword">batch_size </span>                  <span class="hljs-comment"># sequences per batch</span>
<span class="hljs-keyword">bptt </span>                        <span class="hljs-comment"># sequence length</span>
optimizer                    <span class="hljs-comment"># optimizer</span>
epoch_size                   <span class="hljs-comment"># number of sentences per epoch</span>
max_epoch                    <span class="hljs-comment"># maximum number of epochs</span>
validation_metrics           <span class="hljs-comment"># validation metric (when to save the best model)</span>
stopping_criterion           <span class="hljs-comment"># end experiment if stopping criterion does not improve</span>

<span class="hljs-comment">## dataset</span>
<span class="hljs-comment">#### The following three parameters are always rounded to a whole number of batches, so don't be surprised if you see different values than the ones provided.</span>
train_n_samples              <span class="hljs-comment"># use only train_n_samples training examples</span>
valid_n_samples              <span class="hljs-comment"># use only valid_n_samples validation examples</span>
test_n_samples               <span class="hljs-comment"># use only test_n_samples test examples</span>
<span class="hljs-comment">#### If you don't have enough RAM/GPU or swap memory, leave the following three parameters set to True, otherwise you may get an error like this when evaluating :</span>
<span class="hljs-comment">###### RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered</span>
remove_long_sentences_train <span class="hljs-comment"># remove long sentences in train dataset      </span>
remove_long_sentences_valid <span class="hljs-comment"># remove long sentences in valid dataset  </span>
remove_long_sentences_test  <span class="hljs-comment"># remove long sentences in test dataset</span>
</code></pre><h6 id="there-are-other-parameters-that-are-not-specified-here-see-train-py-xlm-train-py-">There are other parameters that are not specified here (see <a href="XLM/train.py">train.py</a>)</h6>
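<p>Since parameters such as those listed above can live in a json file and be overridden on the command line, the precedence rule can be pictured with the following sketch. It does not reproduce the logic of train.py: <code>merge_config</code> and <code>load_params</code> are hypothetical names, and the values in the example are placeholders.</p>

```python
import json


def merge_config(file_params, cli_params):
    """Combine parameters so that explicitly passed CLI values win over the file values."""
    merged = dict(file_params)
    # A CLI value of None means "not passed on the command line"
    merged.update({key: value for key, value in cli_params.items() if value is not None})
    return merged


def load_params(config_path, cli_params):
    """Read a json config file and apply command-line overrides on top of it."""
    with open(config_path, encoding="utf-8") as f:
        return merge_config(json.load(f), cli_params)


if __name__ == "__main__":
    # Placeholder values, for illustration only
    from_file = {"exp_name": "my_experiment", "data_path": "path/from/config", "emb_dim": 512}
    from_cli = {"data_path": "my/data_path", "emb_dim": None}
    print(merge_config(from_file, from_cli))
```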
<h3 id="3-train-a-unsupervised-supervised-mt-from-a-pretrained-model">3. Train a (unsupervised/supervised) MT from a pretrained model</h3>
<p>See <a href="configs/mt_template.json">mt_template.json</a> file for more details.</p>
<pre><code>config_file=../configs/mt_template<span class="hljs-selector-class">.json</span>
python train<span class="hljs-selector-class">.py</span> --config_file <span class="hljs-variable">$config_file</span>
</code></pre><p>When only <code>ae_steps</code> and <code>bt_steps</code> are specified, training is unsupervised machine translation and only requires monolingual data. If parallel data is available, give <code>mt_steps</code> a value based on the language pairs for which the data is available.  </p>
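<p>The relationship between these objectives and the data they need can be summarised by the small sketch below. <code>required_data</code> is a hypothetical helper, and the step values in the example are placeholders; refer to <a href="configs/mt_template.json">mt_template.json</a> for the actual format.</p>

```python
def required_data(params):
    """Return the kinds of data needed, given which MT objectives are set.

    ae_steps / bt_steps rely on monolingual data, mt_steps on parallel data.
    """
    needed = set()
    if params.get("ae_steps") or params.get("bt_steps"):
        needed.add("monolingual")
    if params.get("mt_steps"):
        needed.add("parallel")
    return needed


def is_unsupervised(params):
    """True when only denoising auto-encoding and/or back-translation are used."""
    return required_data(params) == {"monolingual"}


if __name__ == "__main__":
    # Placeholder step values, for illustration only
    unsupervised = {"ae_steps": "en,fr", "bt_steps": "en-fr-en,fr-en-fr", "mt_steps": ""}
    supervised = {"ae_steps": "", "bt_steps": "", "mt_steps": "en-fr"}
    print(is_unsupervised(unsupervised), required_data(unsupervised))
    print(is_unsupervised(supervised), required_data(supervised))
```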
<h6 id="description-of-some-essential-parameters">Description of some essential parameters</h6>
<p>The parameters described above remain valid here.</p>
<pre><code>## main parameters
reload_model     # model to reload for encoder,decoder
## data location / training objective
ae_steps          # denoising auto-encoder training steps
bt_steps          # back-translation steps
mt_steps          # parallel training steps
word_shuffle      # noise for auto-encoding loss
word_dropout      # noise for auto-encoding loss
word_blank        # noise for auto-encoding loss
lambda_ae         # scheduling on the auto-encoding coefficient

## transformer parameters
encoder_only      # set to false to use a decoder for MT

## optimization
tokens_per_batch  # use batches <span class="hljs-keyword">with</span> a fixed number <span class="hljs-keyword">of</span> words
eval_bleu         # also evaluate the BLEU score
</code></pre><h6 id="there-are-other-parameters-that-are-not-specified-here-see-train-py-xlm-train-py-">There are other parameters that are not specified here (see <a href="XLM/train.py">train.py</a>)</h6>
<h3 id="4-how-to-evaluate-a-language-model-trained-on-a-language-l-on-another-language-l-">4. How to evaluate a language model trained on a language L on another language L&#39;.</h3>
<h6 id="our">Our</h6>
<table class='table table-striped'><caption><b>?</b></caption><thead><tr><th scope='col'>Evaluated on (cols)---------<br/>Trained on (rows)</th><th scope='col'>Bafi</th><th scope='col'>Bulu</th><th scope='col'>Ewon</th><th scope='col'>Ghom</th><th scope='col'>Limb</th><th scope='col'>Ngie</th><th scope='col'>Dii</th><th scope='col'>Doya</th><th scope='col'>Peer</th><th scope='col'>Samb</th><th scope='col'>Guid</th><th scope='col'>Guiz</th><th scope='col'>Kaps</th><th scope='col'>Mofa</th><th scope='col'>Mofu</th><th scope='col'>Du_n</th><th scope='col'>Ejag</th><th scope='col'>Fulf</th><th scope='col'>Gbay</th><th scope='col'>MASS</th><th scope='col'>Tupu</th><th scope='col'>Vute</th></tr></thead><tbody><tr><th scope='row'>Bafi</th><td>15.155782/46.113990</td><td>3522.435230/12.694301</td><td>10532.574414/3.108808</td><td>3414.970521/10.103627</td><td>3662.233924/10.880829</td><td>4476.028980/2.072539</td><td>4594.588311/10.362694</td><td>3840.575574/13.989637</td><td><b>3111.148085/13.212435</b></td><td>4210.511141/8.031088</td><td>6607.939683/2.590674</td><td>7506.246899/3.108808</td><td>11121.594025/3.367876</td><td>3122.591005/13.212435</td><td>3183.283705/10.621762</td><td>5504.065998/8.549223</td><td>4127.620979/3.108808</td><td>9107.779213/6.994819</td><td>7440.762805/3.886010</td><td>4916.778213/12.176166</td><td>8239.932584/4.922280</td><td>3192.590598/10.362694</td></tr><tr><th 
scope='row'>Bulu</th><td><b>577.711688/9.585492</b></td><td>18.602898/43.264249</td><td>795.094593/17.357513</td><td>589.636415/13.471503</td><td>1482.709434/8.549223</td><td>1113.122905/12.435233</td><td>994.030274/11.658031</td><td>820.063393/10.103627</td><td>828.162228/11.658031</td><td>1519.449874/3.367876</td><td>1183.604483/9.326425</td><td>671.542857/13.989637</td><td>1427.515245/5.440415</td><td>657.031222/13.212435</td><td>1018.342338/6.217617</td><td>602.305603/10.880829</td><td>1066.765090/6.994819</td><td>1349.669421/6.476684</td><td>605.298410/13.989637</td><td>1615.328636/5.699482</td><td>2493.141092/8.290155</td><td>699.009937/13.730570</td></tr><tr><th scope='row'>Ewon</th><td>2930.433348/13.730570</td><td><b>784.556467/12.435233</b></td><td>439.343693/11.139896</td><td>8576.270483/3.886010</td><td>1408.305834/12.176166</td><td>6329.517824/5.181347</td><td>4374.527024/8.031088</td><td>5703.222147/4.922280</td><td>3226.438808/13.471503</td><td>5147.417352/9.585492</td><td>7383.547206/3.886010</td><td>2049.974847/13.730570</td><td>3458.765537/12.176166</td><td>1428.351000/11.139896</td><td>4890.406327/1.813472</td><td>2050.215975/11.917098</td><td>4693.132443/2.331606</td><td>3796.911033/9.844560</td><td>4985.892435/7.253886</td><td>3737.211837/11.658031</td><td>8497.461052/1.036269</td><td>8105.614715/2.590674</td></tr><tr><th 
scope='row'>Ghom</th><td>10826.769423/12.176166</td><td>7919.745037/10.621762</td><td>13681.624683/6.735751</td><td>112.759549/22.538860</td><td>8550.764036/13.212435</td><td>21351.213307/11.658031</td><td><b>5724.234345/11.917098</b></td><td>7638.186054/10.621762</td><td>8992.791640/6.735751</td><td>9870.502751/5.440415</td><td>8671.271306/14.248705</td><td>7952.305962/9.844560</td><td>17073.248866/7.253886</td><td>17507.383398/3.626943</td><td>6253.188979/12.435233</td><td>6616.060359/9.585492</td><td>31826.000072/3.108808</td><td>11636.816092/11.398964</td><td>6129.150512/14.507772</td><td>9667.854370/11.139896</td><td>14276.187678/8.031088</td><td>7152.109226/12.953368</td></tr><tr><th scope='row'>Limb</th><td>2348.605310/7.772021</td><td>5910.088736/10.103627</td><td>11640.836610/2.331606</td><td>2234.982947/8.031088</td><td>16.486114/48.186528</td><td>5240.029343/10.880829</td><td>3485.743598/11.139896</td><td><b>1744.289850/10.880829</b></td><td>2357.786346/11.658031</td><td>2829.453145/10.362694</td><td>6097.658965/6.735751</td><td>2806.032546/9.326425</td><td>2530.422427/11.139896</td><td>2234.037369/14.507772</td><td>3106.412553/9.067358</td><td>5675.990382/8.549223</td><td>4323.215519/10.880829</td><td>5303.094881/7.512953</td><td>3222.476499/10.362694</td><td>2619.771393/12.435233</td>
<td>6315.916126/12.435233</td><td>1965.282639/9.326425</td></tr><tr><th scope='row'>Ngie</th><td>2494.668579/10.621762</td><td>1683.610004/7.772021</td><td><b>645.074490/13.212435</b></td><td>2747.857945/10.621762</td><td>865.229192/8.031088</td><td>53.604331/32.642487</td><td>3487.877577/5.440415</td><td>2973.100164/9.844560</td><td>1694.041692/9.844560</td><td>2285.872589/8.808290</td><td>3555.658122/3.626943</td><td>2240.803918/4.663212</td><td>8214.745127/2.849741</td><td>2162.964776/8.290155</td><td>4130.931993/5.699482</td><td>1251.907556/9.585492</td><td>1406.624816/6.735751</td><td>1134.593481/8.031088</td><td>3484.481404/9.844560</td><td>1587.951832/9.326425</td><td>1786.015603/9.326425</td><td>2117.031454/10.103627</td></tr><tr><th scope='row'>Dii</th><td>5369.974508/5.181347</td><td>3526.951377/11.917098</td><td>4466.736657/2.590674</td><td>3468.181916/8.808290</td><td>1524.457754/10.880829</td><td><b>856.533233/10.362694</b></td><td>16.031832/47.150259</td><td>3570.945172/11.658031</td><td>1933.128270/11.139896</td><td>3086.805425/7.253886</td><td>5545.945984/3.626943</td><td>1592.451661/11.139896</td><td>7351.154713/2.331606</td><td>1430.511351/14.248705</td><td>4198.900876/4.145078</td><td>2587.338616/8.290155</td><td>3315.158358/2.590674</td><td>2903.721453/8.808290</td><td>4416.753252/3.886010</td><td>3044.769713/5.440415</td><td>3276.637193/10.362694</td><td>3551.309415/8.808290</td></tr><tr><th 
scope='row'>Doya</th><td>2413.178389/7.253886</td><td>2925.237118/9.326425</td><td>3035.126064/9.844560</td><td>6431.020717/4.404145</td><td>2888.802299/10.362694</td><td>4296.348738/9.585492</td><td>1963.357861/9.067358</td><td>225.399738/14.507772</td><td>2647.241446/4.663212</td><td>3559.797389/1.036269</td><td>3224.327707/8.549223</td><td>1628.560179/16.062176</td><td>7036.636934/2.072539</td><td>2378.384535/7.772021</td><td>2526.667089/10.103627</td><td>2560.562728/10.362694</td><td>3486.425933/7.253886</td><td>4898.016349/6.217617</td><td><b>1336.163366/12.176166</b></td><td>5378.777228/0.518135</td><td>2334.347220/9.585492</td><td>4210.426671/6.476684</td></tr><tr><th scope='row'>Peer</th><td>5417.812131/7.253886</td><td>3718.857566/8.290155</td><td>3921.429577/10.103627</td><td>8042.333854/2.590674</td><td>4744.329113/12.435233</td><td>2378.606152/7.772021</td><td>4297.265443/7.253886</td><td>7835.525318/3.108808</td><td>27.612503/46.113990</td><td>8547.481994/3.367876</td><td>7819.217930/4.922280</td><td><b>2009.553562/13.730570</b></td><td>7929.664487/2.590674</td><td>5227.466016/3.108808</td><td>2828.595071/10.103627</td><td>3109.933571/11.398964</td><td>3449.171674/7.512953</td><td>7517.809582/5.181347</td><td>3593.460649/9.326425</td><td>6490.444215/5.181347</td><td>8583.548031/6.994819</td><td>3640.649700/9.585492</td></tr><tr><th 
scope='row'>Samb</th><td>1921.203126/10.621762</td><td>2876.156252/8.808290</td><td>5222.268404/2.331606</td><td>2258.419159/8.808290</td><td>2940.603464/9.844560</td><td><b>757.885957/10.362694</b></td><td>2852.564926/3.886010</td><td>3568.046199/9.585492</td><td>3198.132105/11.658031</td><td>14.473909/45.336788</td><td>2135.946491/9.326425</td><td>1882.791510/12.435233</td><td>1380.449126/12.694301</td><td>2739.728389/6.217617</td><td>1114.151589/13.989637</td><td>2588.952886/10.362694</td><td>2408.673909/9.844560</td><td>1012.804391/13.471503</td><td>4310.704371/6.217617</td><td>2429.426652/3.108808</td><td>1681.603952/7.772021</td><td>2305.207465/4.404145</td></tr><tr><th scope='row'>Guid</th><td>11105.869490/11.917098</td><td>11350.393050/8.549223</td><td>24157.732815/2.331606</td><td>28800.139343/5.440415</td><td>9497.473893/11.139896</td><td>11941.642599/11.658031</td><td>26891.060403/2.072539</td><td>35288.834478/3.367876</td><td>11458.390164/9.326425</td><td>8581.012321/12.953368</td><td>669.152371/22.020725</td><td><b>8237.415053/12.953368</b></td><td>24641.309182/3.626943</td><td>12256.261503/6.735751</td><td>8329.239657/15.025907</td><td>18733.469719/2.590674</td><td>13013.633062/11.398964</td><td>22151.485850/4.922280</td><td>15139.079118/12.176166</td>
<td>12649.997596/11.139896</td><td>13526.708187/9.844560</td><td>14521.723680/13.471503</td></tr><tr><th scope='row'>Guiz</th><td>1900.984819/11.917098</td><td>3422.299591/5.440415</td><td>2920.779863/13.212435</td><td>2657.232975/3.886010</td><td>7763.772745/6.217617</td><td>2516.088934/11.398964</td><td>1556.474440/12.953368</td><td><b>1450.939238/12.694301</b></td><td>1852.263760/12.435233</td><td>3503.139397/5.440415</td><td>1957.981930/7.772021</td><td>5.612643/60.362694</td><td>2030.975178/10.621762</td><td>3100.456750/9.585492</td><td>3816.057439/9.067358</td><td>2527.372931/10.103627</td><td>2017.135324/9.585492</td><td>1771.010720/12.953368</td><td>2467.262902/9.067358</td><td>6465.542228/6.735751</td><td>4936.521836/5.181347</td><td>3251.372451/4.663212</td></tr><tr><th scope='row'>Kaps</th><td>4787.151015/7.772021</td><td>4026.495938/9.067358</td><td>2591.212157/13.730570</td><td>3963.789278/11.139896</td><td>4835.168698/9.844560</td><td>3738.018788/5.958549</td><td>3472.599548/9.067358</td><td>2846.824328/9.067358</td><td>3964.442923/6.217617</td><td>8248.174848/4.663212</td><td>3178.776910/9.326425</td><td>4521.187784/6.476684</td><td>6.392693/63.730570</td><td>4535.673748/6.476684</td><td>2285.708359/13.730570</td><td>5222.426332/5.699482</td><td>4409.982716/5.440415</td><td><b>2124.534904/10.362694</b></td><td>4863.209844/10.362694</td><td>4875.780156/3.886010</td><td>4278.744225/12.176166</td><td>4661.710772/9.067358</td></tr><tr><th 
scope='row'>Mofa</th><td>5555.267163/7.772021</td><td>5328.793555/11.658031</td><td>6064.913246/13.730570</td><td>8844.481560/5.181347</td><td>14355.051790/6.217617</td><td>10773.098216/8.290155</td><td>5702.554716/11.398964</td><td>11819.967712/5.958549</td><td>5810.652609/12.435233</td><td>10899.166334/6.476684</td><td>9606.038800/5.699482</td><td><b>4528.077873/13.471503</b></td><td>10261.988658/9.844560</td><td>38.718690/38.341969</td><td>7191.371927/8.290155</td><td>4847.594375/14.248705</td><td>8110.295270/9.844560</td><td>14375.814958/5.699482</td><td>10070.806870/3.626943</td><td>10826.318474/8.290155</td><td>10187.374717/7.772021</td><td>16953.170797/3.626943</td></tr><tr><th scope='row'>Mofu</th><td>2175.168540/11.658031</td><td>3005.393159/10.621762</td><td>2773.793897/7.253886</td><td>2257.313709/6.476684</td><td>1807.203325/13.471503</td><td>2481.194623/2.331606</td><td>1626.688315/12.435233</td><td>1473.207901/13.212435</td><td>3206.638463/8.290155</td><td><b>1358.112972/12.435233</b></td><td>2550.513183/10.880829</td><td>1867.275865/12.694301</td><td>2847.897967/4.145078</td><td>1645.699003/13.471503</td><td>50.399227/32.642487</td><td>3831.820284/3.108808</td><td>1679.421861/9.844560</td><td>1957.944241/13.989637</td><td>1655.398024/13.212435</td><td>3439.753108/6.735751</td><td>4164.392749/9.844560</td><td>2176.478824/10.103627</td></tr><tr><th 
scope='row'>Du_n</th><td>3358.977688/12.694301</td><td>8269.025689/5.958549</td><td>6784.926221/4.922280</td><td>4034.987828/10.362694</td><td>8317.977821/5.440415</td><td>4469.988388/9.326425</td><td>4581.242219/9.585492</td><td>4046.289387/10.880829</td><td>4587.843666/10.880829</td><td>4061.430238/12.435233</td><td>4116.231632/8.031088</td><td>4043.687467/11.658031</td><td>8587.884922/5.699482</td><td><b>2518.760103/13.989637</b></td><td>9252.838415/6.217617</td><td>38.646292/34.196891</td><td>2823.000209/11.658031</td><td>7688.259347/5.699482</td><td>4184.395191/9.844560</td><td>6460.323149/9.844560</td><td>12418.880207/5.699482</td><td>4394.753911/10.362694</td></tr><tr><th scope='row'>Ejag</th><td>878.221181/8.290155</td><td>2977.854246/10.362694</td><td>1122.454274/13.212435</td><td>4066.806240/3.626943</td><td>4401.408293/12.694301</td><td>1324.839235/11.139896</td><td>2760.972117/9.585492</td><td>802.718089/8.808290</td><td>1935.328428/6.735751</td><td>2456.134064/8.549223</td><td>948.726346/11.658031</td><td>1464.326862/6.994819</td><td>1999.633312/6.476684</td><td>2483.815842/4.663212</td><td>790.752998/11.917098</td><td>1436.471564/10.362694</td><td>27.125567/39.896373</td><td>2701.314483/8.549223</td>
<td><b>739.895562/13.989637</b></td><td>1119.207373/9.844560</td><td>2061.967307/3.367876</td><td>3116.635849/4.663212</td></tr><tr><th scope='row'>Fulf</th><td>3122.754082/11.139896</td><td>3172.412810/8.290155</td><td>2632.034499/10.103627</td><td>1803.237123/14.507772</td><td>3015.507576/12.953368</td><td>4697.430105/10.621762</td><td>2221.398811/11.917098</td><td>3338.511704/7.772021</td><td>5857.163684/4.663212</td><td>2631.329961/12.694301</td><td><b>1756.767457/14.248705</b></td><td>3965.216351/8.031088</td><td>2961.580251/10.362694</td><td>1850.532804/14.248705</td><td>2431.677037/8.808290</td><td>2688.040706/8.549223</td><td>6237.846441/3.108808</td><td>9.819160/53.108808</td><td>1794.314668/12.435233</td><td>2633.154009/4.922280</td><td>5899.732260/9.585492</td><td>6035.594459/5.440415</td></tr><tr><th scope='row'>Gbay</th><td>3537.010215/8.808290</td><td>2213.336729/9.326425</td><td>958.976958/14.766839</td><td>2170.105117/2.849741</td><td>2381.840897/8.549223</td><td>1092.011356/11.398964</td><td>989.079405/15.284974</td><td>2110.708219/12.953368</td><td>1212.493865/13.989637</td><td>1342.159428/12.953368</td><td><b>784.478130/16.321244</b></td><td>1404.757907/15.284974</td><td>1949.759014/13.730570</td><td>1165.979838/12.694301</td><td>1940.255308/5.699482</td><td>1073.951745/13.730570</td><td>2180.263932/7.253886</td><td>2639.229412/8.031088</td><td>4.503568/64.766839</td><td>2711.475687/5.440415</td><td>2879.142805/11.139896</td><td>2777.515280/3.626943</td></tr><tr><th 
scope='row'>MASS</th><td>2052.763675/6.476684</td><td>2123.090411/11.139896</td><td>1150.690864/11.398964</td><td><b>404.857470/19.170984</b></td><td>4114.380214/2.849741</td><td>1177.460159/10.880829</td><td>1553.261634/11.917098</td><td>767.332823/13.212435</td><td>1558.036793/6.217617</td><td>673.483311/13.730570</td><td>1308.799442/6.735751</td><td>2525.700131/5.440415</td><td>1157.282835/14.248705</td><td>1665.795367/8.031088</td><td>969.622799/11.139896</td><td>2236.251124/10.621762</td><td>1768.310288/9.585492</td><td>1530.460913/10.621762</td><td>703.513823/14.766839</td><td>9.311520/52.072539</td><td>3781.478640/5.440415</td><td>783.170102/16.580311</td></tr><tr><th scope='row'>Tupu</th><td>499.010245/24.611399</td><td>2789.182977/9.844560</td><td>1176.557896/16.062176</td><td>335.366353/21.243523</td><td>3759.854817/4.922280</td><td>1473.248900/8.290155</td><td>1637.969909/15.284974</td><td>444.487258/23.056995</td><td>729.184899/19.430052</td><td>326.348924/24.611399</td><td>530.140976/24.611399</td><td>834.757176/20.207254</td><td>1014.747872/11.398964</td><td>1361.103340/11.398964</td><td>447.754239/17.875648</td><td>1313.622745/15.803109</td><td>2020.767969/9.326425</td><td>1234.031067/13.730570</td><td><b>242.696296/29.533679</b></td><td>1209.709716/14.766839</td><td>5.328121/62.953368</td><td>678.820813/13.730570</td></tr><tr><th 
scope='row'>Vute</th><td>5247.001730/8.290155</td><td>2972.688386/11.398964</td><td>3141.040872/9.067358</td><td>4304.014532/12.435233</td><td>2981.350915/10.880829</td><td>7944.078280/2.331606</td><td>3013.186151/13.730570</td><td>2532.120943/12.176166</td><td>4688.069751/9.844560</td><td>8022.399859/3.886010</td><td>5315.095277/3.626943</td><td><b>2075.166168/12.694301</b></td><td>3794.597938/12.176166</td><td>2879.870276/13.212435</td><td>4364.837110/3.367876</td><td>3858.872867/8.549223</td><td>2749.070864/10.880829</td><td>9917.265191/3.367876</td><td>8091.176547/3.108808</td><td>5939.386425/4.404145</td><td>7670.501815/2.849741</td><td>43.658700/33.419689</td></tr></tbody></table>

<h6 id="prerequisite">Prerequisite</h6>
<p>To evaluate the LM on a language <code>lang</code>, you must first have a file named <code>lang.txt</code> in the <code>$src_path</code> directory set in <a href="eval_data.sh">eval_data.sh</a>.<br>For example, to use the biblical corpus, you can run <a href="scripts/bible.py">scripts/bible.py</a>:</p>
<pre><code># folder containing the csvs folder
csv_path=
# folder <span class="hljs-keyword">in</span> which the objective folders will be created (mono or para)
output_dir=
# monolingual one (<span class="hljs-string">"mono"</span>) or parallel one (<span class="hljs-string">"para"</span>)
data_type=mono
# list <span class="hljs-keyword">of</span> languages to be considered <span class="hljs-keyword">in</span> alphabetical order and separated by a comma
# <span class="hljs-keyword">case</span> <span class="hljs-keyword">of</span> one language
languages=lang,lang  
# <span class="hljs-keyword">case</span> <span class="hljs-keyword">of</span> many languages
languages=lang1,lang2,...   
# old_only : use only old testament
# new_only : use only new testament
new_only=<span class="hljs-literal">True</span>

python ../scripts/bible.py --csv_path $csv_path --output_dir $output_dir --data_type $data_type --languages $languages --new_only $new_only
</code></pre><p>See other parameters in <a href="scripts/bible.py">scripts/bible.py</a></p>
<h6 id="data-pre-processing">Data pre-processing</h6>
<p>Set the parameters in <a href="eval_data.sh">eval_data.sh</a>, then run it:</p>
<pre><code># languages to be evaluated
languages=lang1,lang2,... 
chmod +x ../eval_data.sh 
../eval_data.sh $languages
</code></pre><h6 id="evaluation">Evaluation</h6>
<p>Take the language to evaluate (say <code>Bulu</code>) and rename its file <code>test.Bulu.pth</code> (which was created with the <code>VOCAB</code> and <code>CODE</code> of <code>Bafi</code>, the evaluating language) to <code>test.Bafi.pth</code>: since <code>Bafi</code> evaluates, the <code>train.py</code> script requires the dataset file name to contain the language name given in <code>lgs</code>. Then run the evaluation; the results (acc and ppl) are those of the Bafia LM on the Bulu language.</p>
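<p>The file substitution described above can be sketched as a small shell helper. This is only an illustration: the function name <code>alias_test_set</code> is hypothetical and not part of the repository, it assumes the binarized test set sits directly in the data folder, and <code>scripts/evaluate.sh</code> may already take care of this step for you.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): expose the test set of the
# evaluated language under the name of the evaluating language, so that a model
# trained with lgs=<evaluating> finds a file called test.<evaluating>.pth.
# Usage: alias_test_set <data_dir> <evaluated_lang> <evaluating_lang>
alias_test_set() {
    local data_dir="$1" evaluated="$2" evaluating="$3"
    local src="$data_dir/test.$evaluated.pth"
    local dst="$data_dir/test.$evaluating.pth"

    # Fail early if the evaluated language has not been pre-processed yet.
    if [ ! -f "$src" ]; then
        echo "missing $src" >&2
        return 1
    fi

    # Copy rather than move, so the original file stays available.
    cp "$src" "$dst"
}
```

<p>For the example above, the call would be <code>alias_test_set "$src_path" Bulu Bafi</code>. Copying instead of renaming is a design choice of this sketch, made so the same folder can be reused with another evaluating language.</p>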
<pre><code># evaluating <span class="hljs-keyword">language</span>
tgt_pair=
# folder containing the data <span class="hljs-keyword">to</span> <span class="hljs-keyword">be</span> evaluated (must <span class="hljs-keyword">match</span> $tgt_path in eval_data.<span class="hljs-keyword">sh</span>)
src_path=
# You have <span class="hljs-keyword">to</span> <span class="hljs-keyword">change</span> two parameters in the configuration <span class="hljs-keyword">file</span> used <span class="hljs-keyword">to</span> train the LM which evaluates (<span class="hljs-string">"data_path"</span>:<span class="hljs-string">"$src_path"</span> <span class="hljs-built_in">and</span> <span class="hljs-string">"eval_only"</span>: <span class="hljs-string">"True"</span>)
# You must also specify the <span class="hljs-string">"reload_model"</span> parameter, otherwise the <span class="hljs-keyword">last</span> checkpoint found will <span class="hljs-keyword">be</span> loaded <span class="hljs-keyword">for</span> evaluation.
config_file=../configs/lm_template.json 
# languages <span class="hljs-keyword">to</span> <span class="hljs-keyword">be</span> evaluated
eval_lang= 
chmod +<span class="hljs-keyword">x</span> ../scripts/evaluate.<span class="hljs-keyword">sh</span>
../scripts/evaluate.<span class="hljs-keyword">sh</span> $eval_lang
</code></pre><p>When the evaluation is finished, you will find a file named <code>eval.log</code> in the <code>$dump_path/$exp_name/$exp_id</code> folder containing the evaluation results.<br><strong>Note</strong>: The description given above is only valid when the evaluating LM has been trained on a single language (and therefore without TLM). Now consider the case where the base LM has been trained on <code>en-fr</code> and we want to evaluate it on <code>de</code> or <code>de-ru</code>. <code>$tgt_pair</code> becomes <code>en-fr</code>, while <code>lang</code> depends on whether the evaluation is done on one language or two:  </p>
<ul>
<li>In the case of <code>de</code>: <code>lang=de-de</code>  </li>
<li>In the case of <code>de-ru</code>: <code>lang=de-ru</code></li>
</ul>
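<p>The two cases above follow one rule: a single language is repeated, and a pair is joined with a hyphen. A minimal sketch of that rule, assuming a hypothetical helper named <code>make_lang</code> that is not part of the repository:</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): build the lang value
# from one or two language codes.
# Usage: make_lang <lang1> [<lang2>]
make_lang() {
    if [ "$#" -eq 1 ]; then
        # Single language: repeat it, e.g. de -> de-de
        echo "$1-$1"
    else
        # Two languages: join them, e.g. de ru -> de-ru
        echo "$1-$2"
    fi
}
```

<p>For instance, <code>lang=$(make_lang de)</code> gives <code>de-de</code> and <code>lang=$(make_lang de ru)</code> gives <code>de-ru</code>.</p>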
<h2 id="iv-references">IV. References</h2>
<p>Please cite <a href="https://arxiv.org/abs/1901.07291">[1]</a> and <a href="https://arxiv.org/abs/1911.02116">[2]</a> if you find the resources in this repository useful.</p>
<h3 id="cross-lingual-language-model-pretraining">Cross-lingual Language Model Pretraining</h3>
<p>[1] G. Lample *, A. Conneau * <a href="https://arxiv.org/abs/1901.07291"><em>Cross-lingual Language Model Pretraining</em></a> and <a href="https://github.com/facebookresearch/XLM">facebookresearch/XLM</a></p>
<p>* Equal contribution. Order has been determined with a coin flip.</p>
<pre><code>@article{lample2019cross,
  <span class="hljs-attr">title={Cross-lingual</span> Language Model Pretraining},
  <span class="hljs-attr">author={Lample,</span> Guillaume <span class="hljs-literal">and</span> Conneau, Alexis},
  <span class="hljs-attr">journal={Advances</span> <span class="hljs-keyword">in</span> Neural Information Processing Systems (NeurIPS)},
  <span class="hljs-attr">year={2019}</span>
}
</code></pre><h2 id="license">License</h2>
<p>See the <a href="LICENSE">LICENSE</a> file for more details.</p>
