Abstract: As dataset availability has increased, so has the potential for learning from a variety of data sources. One particular method to improve learning from multiple data sources is to embed
the data source during training. This allows the model to learn generalizable features as well as
distinguishing features between datasets. However, these dataset embeddings have mostly been
used before contextualized transformer-based embeddings were introduced in the field of Natural
Language Processing. In this work, we compare two methods to embed datasets in a transformer-based multilingual dependency parser and perform an extensive evaluation. We show that: 1) embedding the dataset is still beneficial with these models; 2) performance increases are highest when embedding the dataset at the encoder level; 3) unsurprisingly, performance increases are highest for small datasets and for datasets with a low baseline score; and 4) training on the combination of all datasets performs similarly to designing smaller clusters based on language relatedness.
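As a rough illustration of the general idea (a minimal sketch, not the paper's exact implementation), the snippet below adds a learned dataset embedding to each token representation before it enters a transformer encoder, so the model can learn both shared and dataset-specific features. All names and hyperparameters (DatasetAwareEncoder, num_datasets, the embedding dimension) are hypothetical.

```python
# Minimal sketch, assuming PyTorch: a learned per-dataset embedding is summed
# with the token embeddings before the transformer encoder. Illustrative only.
import torch
import torch.nn as nn

class DatasetAwareEncoder(nn.Module):
    def __init__(self, vocab_size, num_datasets, dim=256, heads=4, layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        # One embedding vector per training dataset (e.g. per treebank).
        self.ds_emb = nn.Embedding(num_datasets, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, token_ids, dataset_id):
        # token_ids: (batch, seq_len); dataset_id: (batch,)
        x = self.tok_emb(token_ids)
        # Broadcast the dataset embedding over every token position.
        x = x + self.ds_emb(dataset_id).unsqueeze(1)
        return self.encoder(x)

# Usage: encode two sentences coming from dataset 0 and dataset 3.
model = DatasetAwareEncoder(vocab_size=1000, num_datasets=10)
tokens = torch.randint(0, 1000, (2, 12))
dataset_ids = torch.tensor([0, 3])
out = model(tokens, dataset_ids)  # (2, 12, 256) contextualized representations
```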