Paper Link: https://openreview.net/forum?id=fwriKA474EL
Paper Type: Long paper (up to eight pages of content + unlimited references and appendices)
Abstract: We introduce a new domain expert mixture (DEMix) layer that enables conditioning a language model (LM) on the domain of the input text. A DEMix layer contains a collection of expert feedforward networks, each specialized to a domain, which makes the LM modular: experts can be mixed, added, or removed after initial training. Extensive experiments with autoregressive transformer LMs (up to 1.3B parameters) show that DEMix layers reduce test-time perplexity (especially on out-of-domain data), increase training efficiency, and enable rapid adaptation. Mixing experts during inference, using a parameter-free weighted ensemble, enables better generalization to heterogeneous or unseen domains. We also show that experts can be added to adapt the LM to new domains without forgetting older ones, and removed to restrict access to unwanted domains. Overall, these results demonstrate the benefits of domain modularity in language models.
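To make the architecture in the abstract concrete, below is a minimal PyTorch sketch of a DEMix-style feedforward sublayer: one expert FFN per domain, with the output computed as a weighted mixture over experts. This is an illustration under assumptions, not the authors' released implementation; names such as `DEMixFFN` are hypothetical. In the paper, the test-time mixture weights come from a parameter-free posterior over domains estimated from the input prefix; here they are simply passed in as an argument.

```python
# Minimal sketch (assumed, not the authors' code) of a DEMix-style FFN sublayer.
import torch
import torch.nn as nn


class DEMixFFN(nn.Module):
    """One expert feedforward network per domain. The LM is modular:
    experts can be added or removed by editing `self.experts` after training."""

    def __init__(self, d_model: int, d_hidden: int, num_domains: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_domains)
        )

    def forward(self, h: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model); weights: (num_domains,) mixture weights.
        # Training with a known domain label corresponds to one-hot `weights`
        # (only that domain's expert is active); inference on heterogeneous
        # or unseen domains mixes all experts.
        out = torch.stack([expert(h) for expert in self.experts])  # (E, B, S, D)
        return torch.einsum("e,ebsd->bsd", weights, out)


# Usage: route to a single expert at train time, or ensemble at test time.
layer = DEMixFFN(d_model=16, d_hidden=64, num_domains=4)
h = torch.randn(2, 5, 16)
one_hot = torch.eye(4)[2]           # known training domain (expert 2 only)
mixed = torch.full((4,), 0.25)      # e.g., an estimated posterior over domains
print(layer(h, one_hot).shape, layer(h, mixed).shape)
```

Because the mixing weights are an input rather than learned router parameters, adding a new expert for a new domain leaves the existing experts untouched, which is what allows adaptation without forgetting, and removing an expert restricts access to its domain.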
Presentation Mode: This paper will be presented in person in Seattle
Copyright Consent Signature (type Name Or NA If Not Transferrable): Suchin Gururangan
Copyright Consent Name And Address: Paul G. Allen School of Computer Science, University of Washington, Box 352355, Seattle, WA 98195-2355