Compositional data analysis (CoDA) approaches to distance in information retrieval

2014 (modified: 12 Nov 2022), SIGIR 2014
Abstract: Many techniques in information retrieval produce counts from a sample, and it is common to analyse these counts as proportions of the whole; term frequencies are a familiar example. Proportions carry only relative information and are not free to vary independently of one another: for the proportion of one term to increase, the proportion of one or more others must decrease. These constraints are hallmarks of compositional data. While there has long been discussion in other fields of how such data should be analysed, to our knowledge, Compositional Data Analysis (CoDA) has not been considered in IR. In this work we explore compositional data in IR through the lens of distance measures, and demonstrate that common measures, which are naive to compositions, have some undesirable properties that can be avoided with composition-aware measures. As a practical example, these measures are shown to improve clustering.
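
The contrast between composition-naive and composition-aware distances can be illustrated with a small sketch. The abstract does not name the measures it evaluates, so the example below assumes the Aitchison distance (Euclidean distance after a centred log-ratio transform), a standard choice in the CoDA literature. The function names and the term proportions are hypothetical, and all parts are assumed to be strictly positive (zero counts would need a replacement strategy first).

```python
import math

def closure(counts):
    """Rescale positive counts so they sum to 1 (a composition)."""
    total = sum(counts)
    return [c / total for c in counts]

def clr(parts):
    """Centred log-ratio transform: log of each part minus the mean log.
    Assumes every part is strictly positive."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def euclidean(x, y):
    """Ordinary Euclidean distance, naive to the compositional constraint."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def aitchison(x, y):
    """Aitchison distance: Euclidean distance between clr-transformed vectors."""
    return euclidean(clr(x), clr(y))

# Hypothetical term proportions for two pairs of documents.
# In the first pair a rare term doubles (0.01 -> 0.02);
# in the second pair a common term barely changes (0.30 -> 0.31).
rare_a, rare_b = [0.01, 0.49, 0.50], [0.02, 0.48, 0.50]
common_a, common_b = [0.30, 0.20, 0.50], [0.31, 0.19, 0.50]

print("Euclidean:", euclidean(rare_a, rare_b), euclidean(common_a, common_b))
print("Aitchison:", aitchison(rare_a, rare_b), aitchison(common_a, common_b))
```

Under these assumptions the Euclidean distance treats both pairs as equally far apart, because the absolute shifts are the same, whereas the Aitchison distance is larger for the pair in which the rare term doubles. The Aitchison distance is also unchanged if raw counts are used in place of proportions, since the log-ratio transform depends only on relative information.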