Exploring the structure of BERT through Kernel Learning

IJCNN 2021
Abstract: Combining the internal representations of a pre-trained Transformer model, such as the popular BERT, is an interesting and challenging task. Usually, internal representations are combined with simple heuristics, e.g. concatenating or averaging a subset of layers, which requires calibrating multiple hyper-parameters during the fine-tuning phase. Inspired by the recent literature, we propose a principled approach to optimally combine the internal representations of a Transformer model via Multiple Kernel Learning strategies. Broadly speaking, the proposed system consists of two elements. The first is a canonical Transformer model fine-tuned on the target task. The second is a Multiple Kernel Learning algorithm that extracts the representations developed in the internal layers of the Transformer, combines them, and performs predictions. Most importantly, we use the system as a powerful tool to inspect the information encoded in the Transformer network, highlighting the limits of state-of-the-art models.
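
To make the pipeline described in the abstract concrete, here is a minimal sketch (not the authors' code) of the second element: one kernel is built per BERT layer from its [CLS] embeddings, the per-layer kernels are combined, and a precomputed-kernel SVM performs the prediction. The linear kernels, the uniform combination weights (standing in for the weights an MKL solver such as EasyMKL would learn), the model name bert-base-uncased, and the toy data are all assumptions made for illustration.

```python
# Sketch: per-layer kernels from BERT hidden states + a simple kernel combination.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.svm import SVC

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def layer_embeddings(texts):
    """Return an array of shape (num_layers, num_texts, hidden_size)
    holding the [CLS] vector of every hidden layer for every text."""
    with torch.no_grad():
        enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        out = model(**enc)
        # hidden_states: tuple with the embedding layer plus every Transformer layer
        return np.stack([h[:, 0, :].numpy() for h in out.hidden_states])

def linear_kernels(X_layers, Y_layers=None):
    """One normalized linear kernel per layer."""
    if Y_layers is None:
        Y_layers = X_layers
    kernels = []
    for Xl, Yl in zip(X_layers, Y_layers):
        K = Xl @ Yl.T
        # cosine-style normalization so layers live on a comparable scale
        nx = np.linalg.norm(Xl, axis=1, keepdims=True)
        ny = np.linalg.norm(Yl, axis=1, keepdims=True)
        kernels.append(K / (nx * ny.T + 1e-12))
    return np.stack(kernels)

# Toy usage with placeholder data (not from the paper).
X_train = ["a positive example", "a negative example",
           "another positive one", "clearly negative"]
y_train = [1, 0, 1, 0]
X_test = ["is this positive?"]

E_train = layer_embeddings(X_train)
E_test = layer_embeddings(X_test)

K_train = linear_kernels(E_train)         # shape (L, n_train, n_train)
K_test = linear_kernels(E_test, E_train)  # shape (L, n_test, n_train)

# Uniform weights stand in for the combination an MKL algorithm would learn;
# inspecting the learned weights is what reveals which layers carry information.
weights = np.ones(K_train.shape[0]) / K_train.shape[0]
K_tr = np.tensordot(weights, K_train, axes=1)
K_te = np.tensordot(weights, K_test, axes=1)

clf = SVC(kernel="precomputed").fit(K_tr, y_train)
print(clf.predict(K_te))
```

Under this reading, swapping the uniform weights for ones optimized by an MKL solver yields both the final classifier and a per-layer importance profile that can be used to inspect what the network encodes.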