Keywords: Transformers, Multi-Task Learning
Abstract: Achieving state-of-the-art performance on natural language understanding tasks typically relies on fine-tuning a fresh model for every task. Consequently, this approach leads to a higher overall parameter cost, along with higher technical maintenance for serving multiple models. Learning a single multi-task model that is able to do well for all the tasks has been a challenging and yet attractive proposition. In this paper, we propose HyperGrid Transformers, a new Transformer architecture that leverages task-conditioned hyper networks for controlling its feed-forward layers. Specifically, we propose a decomposable hypernetwork that learns grid-wise projections that help to specialize regions in weight matrices for different tasks. In order to construct the proposed hypernetwork, our method learns the interactions and composition between a global (task-agnostic) state and a local task-specific state. We conduct an extensive set of experiments on GLUE/SuperGLUE. On the SuperGLUE test set, we match the performance of the state-of-the-art while being $16$ times more parameter efficient. Our method helps bridge the gap between fine-tuning and multi-task learning approaches.
One-sentence Summary: State-of-the-art multi-task NLU performance with only a single model
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Data: [CoLA](https://paperswithcode.com/dataset/cola), [GLUE](https://paperswithcode.com/dataset/glue), [MultiNLI](https://paperswithcode.com/dataset/multinli), [QNLI](https://paperswithcode.com/dataset/qnli), [ReCoRD](https://paperswithcode.com/dataset/record), [SST](https://paperswithcode.com/dataset/sst), [SuperGLUE](https://paperswithcode.com/dataset/superglue), [WSC](https://paperswithcode.com/dataset/wsc)