Quantifying and Evaluating Positive Transfer using Multi-task Learning for NLP

07 Mar 2022 · OpenReview Archive Direct Upload
Abstract: In this paper, our goal is to study the effects of multi-task learning (MTL) on NLP tasks from the GLUE benchmark. More specifically, we gauge the effectiveness of MTL for NLP by quantifying the amount of positive or negative transfer that exists between these tasks/datasets. We conduct a series of MTL experiments using GLUE tasks: we first run pairwise multi-task learning experiments, then jointly train on all GLUE tasks together, and finally perform multi-task learning only with tasks in the same GLUE category (NLI, single-sentence, and sentence-pair tasks). Our experiments show that, at least for GLUE task combinations, the majority of pairwise transfers are negative, and the same holds true for multi-task learning with all GLUE tasks. Furthermore, even when a target task is trained together only with other tasks from its own category, the transfer is not always positive, although it is more consistently better than multi-task learning with all GLUE tasks, showing that the category of the tasks a target task is combined with has an important impact. Our goal is therefore to empirically derive the key drivers behind the performance improvements that multi-task learning can provide. To this end, we undertake a large number of experiments and detail the key empirical trends we observe. To have more confidence in our insights, we repeat our experiments with different random seeds to make sure they are reproducible and our insights are not functions of initialization. Furthermore, we take care to sufficiently marginalize out the effects of model architecture and dataset size by repeating our experiments with three different sentence encoders and by conducting a downsampling analysis, respectively. Downsampling accounts for the effect of dataset size, which might bias the results towards higher positive transfer because generalization gained from the additional data in the non-target tasks might inaccurately be attributed to multi-task learning. We observe that the magnitude of transfer varies heavily across encoder types, but its direction remains reasonably consistent. Finally, we observe that some of the task pairs with the highest positive transfer actually derive most of that transfer from dataset size rather than from the distributional characteristics of the paired tasks. We hope that our results and conclusions serve as an instructive point of reference for future NLP researchers who look to improve performance on their own target task or group of tasks by bringing in other similar tasks to benefit from the perceived generalization capabilities of multi-task learning.
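
The abstract does not specify the exact transfer metric, so the following is only a minimal sketch under the assumption that transfer is measured as the relative change in a target task's score when trained jointly with an auxiliary task versus trained alone. All task names and scores below are hypothetical placeholders, not results from the paper.

```python
# Minimal sketch of quantifying pairwise transfer, assuming transfer is the
# relative change of a target task's score under MTL versus its single-task
# baseline. Scores and task names are placeholders, not the paper's results.

from typing import Dict, Tuple


def transfer(single_task_score: float, multi_task_score: float) -> float:
    """Relative transfer (%) of MTL over the single-task baseline.

    Positive values indicate positive transfer, negative values negative transfer.
    """
    return 100.0 * (multi_task_score - single_task_score) / single_task_score


def pairwise_transfer_matrix(
    single: Dict[str, float],
    paired: Dict[Tuple[str, str], float],
) -> Dict[Tuple[str, str], float]:
    """Map (target, auxiliary) -> relative transfer on the target task."""
    return {
        (target, aux): transfer(single[target], score)
        for (target, aux), score in paired.items()
    }


if __name__ == "__main__":
    # Hypothetical single-task validation scores for two GLUE tasks.
    single = {"RTE": 0.66, "MRPC": 0.84}
    # Hypothetical target-task scores when trained jointly with one auxiliary task.
    paired = {("RTE", "MNLI"): 0.71, ("MRPC", "QQP"): 0.83}
    for pair, t in pairwise_transfer_matrix(single, paired).items():
        direction = "positive" if t > 0 else "negative"
        print(f"{pair}: {t:+.1f}% ({direction} transfer)")
```

In the downsampling analysis the abstract describes, the auxiliary dataset would presumably be subsampled to a fixed size before re-running the same comparison, so that gains attributable purely to extra data can be separated from gains attributable to the auxiliary task's distribution.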