Abstract: We examine how document properties (word count, term frequency, cohesiveness, and genre) affect the quality of two unsupervised document relatedness measures: the Google trigram model and the vector space model. We use three genres of documents: aviation safety reports, medical equipment failure descriptions, and text from the Biodiversity Heritage Library. The quality of a relatedness measure is assessed by the accuracy of a kNN classification task that uses it. Our experiments reveal correlations between document property values and relatedness quality, and we discuss which approach may perform better depending on the property values of a dataset.
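The evaluation idea in the abstract (score a relatedness measure by how well kNN classification works when it is used as the similarity function) can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' setup: the relatedness measure is a plain term-frequency vector space model with cosine similarity, tokenization is whitespace splitting, accuracy is estimated leave-one-out, and the documents in `TOY_DOCS` are invented examples, not data from the study.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_predict(query, labeled_docs, k=3):
    """Majority label among the k documents most related to the query."""
    q = tf_vector(query)
    ranked = sorted(labeled_docs,
                    key=lambda d: cosine(q, tf_vector(d[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(labeled_docs, k=3):
    """Leave-one-out kNN accuracy: a proxy for relatedness quality."""
    correct = 0
    for i, (text, label) in enumerate(labeled_docs):
        rest = labeled_docs[:i] + labeled_docs[i + 1:]
        if knn_predict(text, rest, k) == label:
            correct += 1
    return correct / len(labeled_docs)

# Invented toy documents (hypothetical, for illustration only).
TOY_DOCS = [
    ("engine failure during takeoff climb", "aviation"),
    ("pilot reported engine vibration during climb", "aviation"),
    ("runway incursion reported by pilot", "aviation"),
    ("infusion pump alarm failure", "medical"),
    ("pump battery failure during infusion", "medical"),
    ("ventilator alarm did not sound", "medical"),
]

if __name__ == "__main__":
    print(knn_predict("pilot noticed engine smoke during climb", TOY_DOCS))
    print(loo_accuracy(TOY_DOCS))
```

Replacing `cosine` with a different relatedness function (for example, one based on trigram co-occurrence statistics) and comparing the resulting accuracies is one way to compare measures on the same dataset.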