SuperShaper: Task-Agnostic Super Pre-training of BERT Models with Variable Hidden Dimensions

Anonymous

16 Nov 2021 (modified: 05 May 2023) · ACL ARR 2021 November Blind Submission
Abstract: Task-agnostic pre-training followed by task-specific fine-tuning is the default approach to training NLU models that need to be deployed on devices with varying resource and accuracy constraints. However, repeating pre-training and fine-tuning across tens of devices is prohibitively expensive. To address this, we propose SuperShaper, a task-agnostic pre-training approach wherein we pre-train a single model that subsumes a large number of Transformer models of varying shapes, i.e., with different hidden dimensions across layers. This is enabled by a backbone network with linear bottleneck matrices around each Transformer layer, which are sliced to generate differently shaped sub-networks. Despite its simple design space and efficient implementation, SuperShaper radically simplifies NAS for language models and discovers networks that effectively trade off accuracy and model size: the discovered networks are more accurate than a range of hand-crafted and automatically searched networks on GLUE benchmarks. Further, we find two critical advantages of shape as a design variable for Neural Architecture Search (NAS): (a) heuristics for good shapes can be derived, and networks found with these heuristics match and even improve on carefully searched networks across a range of parameter counts, and (b) the latency of networks across multiple CPUs and GPUs is insensitive to shape, thus enabling device-agnostic search.
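To make the core mechanism concrete, the sketch below illustrates how linear bottleneck matrices stored at a maximum width can be sliced at forward time to realize sub-networks with smaller hidden dimensions. This is a minimal, hypothetical PyTorch illustration, not the authors' released code: the class name, dimensions, and the omission of the enclosed Transformer layer (which, per the abstract, sits between the two bottlenecks and would operate at the sliced width) are assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SliceableBottleneck(nn.Module):
    """Pair of linear bottleneck matrices stored at the maximum hidden width.

    Slicing their rows/columns at forward time yields a sub-network with a
    smaller effective hidden dimension, without allocating new parameters.
    (Illustrative sketch; in the paper, a Transformer layer of the sliced
    width runs between the two projections.)
    """

    def __init__(self, backbone_dim: int = 768, max_hidden_dim: int = 768):
        super().__init__()
        self.down = nn.Linear(backbone_dim, max_hidden_dim)  # backbone -> hidden
        self.up = nn.Linear(max_hidden_dim, backbone_dim)    # hidden -> backbone

    def forward(self, x: torch.Tensor, hidden_dim: int) -> torch.Tensor:
        # Slice the shared weights down to the sampled hidden dimension.
        w_down = self.down.weight[:hidden_dim]        # (hidden_dim, backbone_dim)
        b_down = self.down.bias[:hidden_dim]
        w_up = self.up.weight[:, :hidden_dim]         # (backbone_dim, hidden_dim)

        h = F.linear(x, w_down, b_down)               # project into sliced width
        # ... a Transformer layer of width `hidden_dim` would run here ...
        return F.linear(h, w_up, self.up.bias)        # project back to backbone


bottleneck = SliceableBottleneck()
tokens = torch.randn(2, 16, 768)
# The same parameters serve every sampled shape, e.g. widths 768, 512, 256.
for width in (768, 512, 256):
    print(width, bottleneck(tokens, hidden_dim=width).shape)
```

Because every sampled shape reuses slices of the same weight matrices, a single pre-training run can cover the whole family of shapes, which is what makes the subsequent shape search inexpensive.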