A Mutual Information Maximization Perspective of Language Representation Learning

Lingpeng Kong; Cyprien de Masson d'Autume; Lei Yu; Wang Ling; Zihang Dai; Dani Yogatama

A Mutual Information Maximization Perspective of Language Representation Learning

Lingpeng Kong, Cyprien de Masson d'Autume, Lei Yu, Wang Ling, Zihang Dai, Dani Yogatama

Published: 20 Dec 2019, Last Modified: 22 Jun 2025ICLR 2020 Conference Blind SubmissionReaders: Everyone

Abstract: We show state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence). Our formulation provides an alternative perspective that unifies classical word embedding models (e.g., Skip-gram) and modern contextual embeddings (e.g., BERT, XLNet). In addition to enhancing our theoretical understanding of these methods, our derivation leads to a principled framework that can be used to construct new self-supervised tasks. We provide an example by drawing inspirations from related methods based on mutual information maximization that have been successful in computer vision, and introduce a simple self-supervised objective that maximizes the mutual information between a global sentence representation and n-grams in the sentence. Our analysis offers a holistic view of representation learning methods to transfer knowledge and translate progress across multiple domains (e.g., natural language processing, computer vision, audio processing).

Data: [SQuAD](https://paperswithcode.com/dataset/squad)

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/a-mutual-information-maximization-perspective/code)

Original Pdf: pdf

12 Replies

Loading