Sentence level matrix representation for document spectral clustering
Abstract: Documents are traditionally represented in vector spaces as a single vector in $\mathbb{R}^n$. However, this representation tends to ignore the discourse and syntactic structure of texts. A matrix representation, such as the one obtained with the Doc2Vec embedding method, preserves these characteristics. To integrate a sentence-level matrix representation of documents into a clustering algorithm, we use a Frobenius-based inner product that allows kernel functions to be defined for spectral clustering. We show that this methodology provides advantages over traditional clustering algorithms and performs better than the bag-of-words (BoW) representations used in Information Retrieval (IR).
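As an illustration of the idea, the following minimal sketch shows how a Frobenius inner product on sentence-level matrices can feed a kernel for spectral clustering. It is not the authors' implementation. Several choices here are assumptions that the abstract does not specify: every document matrix has the same shape (sentences × embedding dimensions), the kernel is a Gaussian on the Frobenius distance, the documents are random toy matrices rather than Doc2Vec output, and the clustering step is a simple two-way split on the second eigenvector of the normalized Laplacian. All function names are hypothetical.

```python
import numpy as np

def frobenius_inner(A, B):
    """Frobenius inner product <A, B>_F = trace(A^T B)."""
    return float(np.sum(A * B))

def gaussian_frobenius_kernel(A, B, sigma=1.0):
    """Gaussian kernel on the Frobenius distance (assumed kernel choice)."""
    diff = A - B
    return float(np.exp(-frobenius_inner(diff, diff) / (2.0 * sigma ** 2)))

def spectral_bipartition(docs, sigma=1.0):
    """Split documents into two clusters using the normalized graph Laplacian."""
    n = len(docs)
    # Affinity matrix from pairwise kernel evaluations.
    W = np.array([[gaussian_frobenius_kernel(docs[i], docs[j], sigma)
                   for j in range(n)] for i in range(n)])
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    # Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    L = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt
    # eigh returns eigenvalues in ascending order; column 1 is the
    # eigenvector of the second-smallest eigenvalue.
    _, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1]
    return (fiedler > 0).astype(int)

# Toy data (assumption): each document is a 4-sentence x 5-dimension matrix.
rng = np.random.default_rng(0)
group_a = [rng.normal(0.0, 0.1, size=(4, 5)) for _ in range(3)]
group_b = [rng.normal(3.0, 0.1, size=(4, 5)) for _ in range(3)]
docs = group_a + group_b

labels = spectral_bipartition(docs, sigma=5.0)
print(labels)
```

Documents with different numbers of sentences would need padding, truncation, or another alignment step before the Frobenius inner product is defined; that step is left out of this sketch.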