Sentence level matrix representation for document spectral clustering

Victor Mijangos, Gerardo Sierra, Azucena Montes

12 Jan 2022 (modified: 12 Jan 2022), OpenReview Archive Direct Upload, Readers: Everyone

Abstract: Representing a document as a single vector in R^n is the traditional approach in vector space models. However, this representation tends to ignore the discourse and syntactic structure of texts. A matrix representation, such as the one offered by the Doc2Vec word embedding method, preserves these characteristics. To integrate a sentence-level matrix representation of documents into a clustering algorithm, we use a Frobenius-based inner product that allows us to define kernel functions for spectral clustering. We show that this methodology provides advantages over traditional clustering algorithms and performs better than the bag-of-words (BoW) representations used in Information Retrieval (IR).
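To make the idea of a Frobenius-based kernel concrete, the following is a minimal sketch, not the authors' implementation. It assumes every document has already been turned into a matrix of the same shape (sentences by embedding dimensions, for instance by padding or truncating), it uses a Gaussian kernel on the Frobenius-induced distance as one possible choice of kernel, and it splits the documents into only two clusters using the sign of the second eigenvector of the normalized Laplacian. The function names, the toy data, and the value of `sigma` are all hypothetical.

```python
import numpy as np


def frobenius_inner(A, B):
    """Frobenius inner product <A, B>_F = trace(A^T B) = sum of elementwise products."""
    return float(np.sum(A * B))


def frobenius_gaussian_kernel(docs, sigma=1.0):
    """Gaussian kernel on the distance induced by the Frobenius inner product.

    Uses ||A - B||_F^2 = <A, A>_F + <B, B>_F - 2 <A, B>_F.
    Assumption: all matrices in `docs` share the same shape.
    """
    n = len(docs)
    gram = np.array(
        [[frobenius_inner(docs[i], docs[j]) for j in range(n)] for i in range(n)]
    )
    sq_norms = np.diag(gram)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def spectral_bipartition(K):
    """Two-way spectral split from a kernel (affinity) matrix K.

    Builds the symmetric normalized Laplacian L = I - D^{-1/2} K D^{-1/2}
    and labels each item by the sign of the eigenvector belonging to the
    second-smallest eigenvalue. This is a simplification of full spectral
    clustering, which would run k-means on several eigenvectors.
    """
    degrees = K.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    laplacian = np.eye(len(K)) - d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    return (eigvecs[:, 1] > 0).astype(int)


# Toy data (hypothetical): 6 "documents", each 4 sentences x 5 dimensions,
# drawn around two well-separated means.
rng = np.random.default_rng(0)
group_a = [rng.normal(0.0, 0.1, size=(4, 5)) for _ in range(3)]
group_b = [rng.normal(3.0, 0.1, size=(4, 5)) for _ in range(3)]
docs = group_a + group_b

K = frobenius_gaussian_kernel(docs, sigma=5.0)
labels = spectral_bipartition(K)
print(labels)
```

On this toy data the first three documents should receive one label and the last three the other; which group is labeled 0 or 1 depends on the arbitrary sign of the eigenvector.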
