JOINTLY LEARNING TOPIC SPECIFIC WORD AND DOCUMENT EMBEDDING

Published: 28 Jan 2022, Last Modified: 13 Feb 2023 · ICLR 2022 Submitted · Readers: Everyone
Keywords: Language modeling · Document embedding · Natural language processing · Machine learning
Abstract: Document embedding methods generally ignore underlying topics and therefore fail to capture polysemous terms, which can lead to improper thematic representations. Moreover, embedding a new document at test time requires a complex and expensive inference procedure. Some models first learn word embeddings and later recover underlying topics with a clustering algorithm for document representation; such methods miss the mutual interaction between the two paradigms. To this end, we propose a novel document-embedding method, TDE: Topical Document Embedding, which forms document vectors by weighted averaging of jointly learned topic-specific word embeddings and efficiently captures syntactic and semantic properties by utilizing three levels of knowledge, i.e., word, topic, and document. TDE obtains document vectors on the fly during the joint learning of the topical word embeddings. Experiments demonstrate better topical word embeddings when the document vector is used as a global context, and better document-classification results on the obtained document embeddings, compared with recent related models.
One-sentence Summary: The proposed model learns probabilistic topical word embeddings using a document vector as a global context, where each word can be part of different underlying topics of the document instead of being tied to just a single paragraph vector.
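
To make the mechanism concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes per-token topic assignments come from an external topic model (e.g., LDA), gives each (word, topic) pair its own vector, forms the document vector on the fly as a weighted average of the topical word vectors in the document (uniform weights here, for simplicity), and uses that vector as a global context when predicting a target word from its local context. All names (TDESketch, doc_vector) and hyperparameters are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

V, T, D = 1000, 20, 50  # vocabulary size, number of topics, embedding dimension

class TDESketch(nn.Module):
    def __init__(self, vocab=V, topics=T, dim=D):
        super().__init__()
        # One vector per (word, topic) pair; row index = word * topics + topic.
        self.topical = nn.Embedding(vocab * topics, dim)
        self.out = nn.Linear(dim, vocab, bias=False)  # output (softmax) layer
        self.topics = topics

    def doc_vector(self, words, z, weights):
        # Document embedding as a weighted average of topic-specific word vectors.
        vecs = self.topical(words * self.topics + z)       # (n_tokens, dim)
        w = weights / weights.sum()
        return (w.unsqueeze(-1) * vecs).sum(dim=0)         # (dim,)

    def forward(self, words, z, weights, context_idx, target):
        d = self.doc_vector(words, z, weights)             # global document context
        ctx = self.topical(words[context_idx] * self.topics + z[context_idx])
        h = d + ctx.mean(dim=0)                            # global + local context
        return F.cross_entropy(self.out(h).unsqueeze(0), target.unsqueeze(0))

# Toy usage: one document of 8 tokens with externally assigned topics.
model = TDESketch()
words = torch.randint(0, V, (8,))
z = torch.randint(0, T, (8,))   # per-token topic assignments (e.g., from LDA)
weights = torch.ones(8)         # uniform for illustration; TDE derives real weights
loss = model(words, z, weights,
             context_idx=torch.tensor([0, 1, 3]), target=words[2])
loss.backward()  # gradients flow through the on-the-fly document vector
                 # into the shared topical word embeddings

Keeping all topical vectors in a single embedding table indexed by word * topics + topic means the document vector is a differentiable function of the word embeddings, so one backward pass updates the word, topic, and document levels jointly, matching the interaction the abstract describes.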
Supplementary Material: zip