Abstract: We investigate clustering documents based on automatically annotated, potentially sensitive information extracted from a large collection of organizational data. In this use case, clustering helps users visualize and navigate groups of documents with related content. However, the effectiveness and efficiency of document clustering are limited, mainly because of the high dimensionality of the document vectors. To alleviate this problem, we propose a dimensionality reduction approach that selects terms with high tf-idf scores from the context of the automatically annotated sensitive regions of a document. Because real organizational data is unavailable for research purposes, we evaluate our approach on the standard 20 Newsgroups dataset. For evaluation purposes, the only sensitive information that we use from the documents of this dataset is the named entities, e.g. the names of persons and organizations. Experimental results show that our approach achieves an almost perfect clustering with a purity of 0.998, a relative improvement of 22.60% over the purity of 0.814 obtained without dimensionality reduction.
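The two technical ingredients named in the abstract, term selection from the context of annotated regions and the purity measure, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the token-window definition of "context" (`window`), the number of terms kept per document (`k`), and the particular tf-idf variant (relative term frequency times log of inverse document frequency) are all assumptions made for illustration. Purity is computed in the usual way, as the fraction of documents that belong to the majority class of their assigned cluster.

```python
import math
from collections import Counter


def context_terms(tokens, entity_positions, window=2):
    """Keep tokens lying within `window` positions of any annotated entity token.

    `entity_positions` holds the indices of tokens marked as sensitive
    (e.g. named entities). The window-based context is an assumption.
    """
    return [
        tok
        for i, tok in enumerate(tokens)
        if any(abs(i - p) <= window for p in entity_positions)
    ]


def top_tfidf_terms(docs, k=3):
    """For each document (a list of tokens), return its k highest tf-idf terms."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    selected = []
    for doc in docs:
        tf = Counter(doc)
        scores = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected.append(ranked[:k])
    return selected


def purity(cluster_ids, class_labels):
    """Fraction of documents that fall in the majority class of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, class_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(class_labels)
```

In such a pipeline, each document would first be reduced with `context_terms`, the reduced documents would be passed to `top_tfidf_terms` to obtain a compact vocabulary, and the resulting clustering would be scored with `purity` against the known newsgroup labels.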