SuperShaper: A Pre-Training Approach for Discovering Efficient Transformer Shapes
Keywords: Pre-training, efficiency, BERT, shapes
TL;DR: Train a model with different shapes, extract optimal shapes for different parameter budgets, fine-tune on downstream tasks, enjoy life!
Abstract: Task-agnostic pre-training followed by task-specific fine-tuning is a default approach for training NLU models, which need to be deployed on devices with varying resource and accuracy constraints. However, repeating pre-training and fine-tuning across tens of devices is prohibitively expensive. To address this, we propose SuperShaper, a task-agnostic approach in which we pre-train a single model that subsumes a large number of Transformer models. It does so via linear bottleneck matrices around each Transformer layer, which are sliced to generate differently shaped sub-networks. Despite its simplicity, SuperShaper radically simplifies neural architecture search (NAS) for language models and discovers networks, via an evolutionary algorithm, that effectively trade off accuracy and model size. Discovered networks are more accurate than a range of hand-crafted and automatically searched networks on GLUE benchmarks. Further, a critical advantage of shape as a design variable for NAS is that heuristics for good shapes can be derived, and networks found with these heuristics match and even improve on carefully searched networks across a range of parameter counts.
Submission Number: 52
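
The slicing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy dimensions, and the choice to keep the first d columns/rows are all assumptions made for illustration, and the Transformer layer that would sit between the two bottleneck matrices is omitted. The sketch only shows how one pair of full-size bottleneck matrices can yield differently shaped sub-networks, and how the bottleneck parameter count depends on the per-layer shape.

```python
import random


def make_matrix(rows, cols, rng):
    """Random rows x cols matrix as a list of row lists (toy stand-in for trained weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def slice_bottleneck(w_down, w_up, d):
    """Assumed slicing rule: keep the first d columns of the down-projection
    and the first d rows of the up-projection."""
    return [row[:d] for row in w_down], w_up[:d]


def bottleneck_params(backbone_dim, shape):
    """Weights in the bottleneck matrices alone, for a per-layer shape such as [6, 4, 2]."""
    return sum(2 * backbone_dim * d for d in shape)


rng = random.Random(0)
backbone_dim, max_dim = 8, 6                      # hypothetical toy sizes
w_down = make_matrix(backbone_dim, max_dim, rng)  # backbone -> layer dimension
w_up = make_matrix(max_dim, backbone_dim, rng)    # layer dimension -> backbone
x = make_matrix(2, backbone_dim, rng)             # two token vectors

for d in (6, 4, 2):
    sub_down, sub_up = slice_bottleneck(w_down, w_up, d)
    hidden = matmul(x, sub_down)   # shape 2 x d; the Transformer layer would act here
    out = matmul(hidden, sub_up)   # back to shape 2 x backbone_dim
    print(d, len(hidden[0]), len(out[0]), bottleneck_params(backbone_dim, [d]))
```

Because every sub-network reads from the same underlying matrices, a search procedure such as the evolutionary algorithm mentioned in the abstract could, under these assumptions, evaluate a candidate shape by slicing alone, without training a separate model for it.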