Towards Scalable Bayesian Transformers: Investigating stochastic subset selection for NLP

Published: 26 Apr 2024 · Last Modified: 15 Jul 2024 · UAI 2024 poster · CC BY 4.0
Keywords: Bayesian Machine Learning, Natural Language Processing, Deep Learning, Transformers
Abstract: Bayesian deep learning provides a framework for quantifying uncertainty. However, the scale of modern neural networks applied in Natural Language Processing (NLP) limits the usability of Bayesian methods. Subnetwork inference aims to approximate the posterior by selecting a stochastic parameter subset for inference, thereby allowing scalable posterior approximations. Determining the optimal parameter space for subnetwork inference is far from trivial. In this paper, we study partially stochastic Bayesian neural networks in the context of transformer models for NLP tasks, using the Laplace approximation (LA) and Stochastic Weight Averaging-Gaussian (SWAG). We propose heuristics for selecting which layers to include in the stochastic subset. We show that norm-based selection is promising for small subsets, while random selection is superior for larger subsets. Moreover, we propose Sparse-KFAC (S-KFAC), an extension of KFAC LA, which selects dense stochastic substructures of linear layers based on parameter magnitudes. S-KFAC retains performance while requiring substantially fewer stochastic parameters and therefore drastically reduces the memory footprint.
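The following is a minimal, hypothetical sketch of the magnitude-based selection idea described in the abstract (dense substructures of linear layers chosen by parameter magnitude), assuming a PyTorch model. The function names, the row-wise norm heuristic, and the fraction parameter are illustrative assumptions for exposition, not the authors' released implementation; see the Code Url below for the actual code.

# Illustrative sketch only: mark whole rows of each linear layer's weight
# matrix as "stochastic" based on their magnitude; all other parameters
# remain deterministic. This is an assumption-laden reconstruction of the
# selection step, not the paper's S-KFAC implementation.
import torch
import torch.nn as nn

def select_stochastic_rows(model: nn.Module, fraction: float = 0.1):
    """Return {parameter name: boolean mask} marking the largest-magnitude
    rows of each nn.Linear weight matrix as the stochastic subset."""
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            weight = module.weight.detach()
            row_magnitudes = weight.abs().sum(dim=1)        # magnitude per output row
            k = max(1, int(fraction * weight.shape[0]))     # number of rows to keep stochastic
            top_rows = torch.topk(row_magnitudes, k).indices
            mask = torch.zeros_like(weight, dtype=torch.bool)
            mask[top_rows] = True                           # selected rows stay dense
            masks[f"{name}.weight"] = mask
    return masks

if __name__ == "__main__":
    toy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
    for pname, mask in select_stochastic_rows(toy, fraction=0.25).items():
        print(pname, int(mask.sum().item()), "of", mask.numel(), "weights selected")

Restricting the stochastic subset to dense rows (rather than arbitrary scattered weights) keeps the resulting covariance blocks structured, which is what allows a KFAC-style factorization to remain tractable while shrinking the number of stochastic parameters.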
List Of Authors: Kampen, Peter Johannes Tejlgaard and Als, Gustav Ragnar Stoettrup and Andersen, Michael Riis
Latex Source Code: zip
Signed License Agreement: pdf
Code Url: https://github.com/GustavAls/PartialNLP
Submission Number: 364