Keywords: scientometrics, NLP, metascience
TL;DR: We find that publications more textually similar to pre-existing publications are cited more frequently.
Abstract: We report a novel observation about which scientific publications are cited more frequently: those that are more textually similar to pre-existing publications. Using bag-of-words document embeddings, we analyze quantitative trends in a large sample of publication abstracts from astrophysics ($N \sim 300{,}000$). When new publications are ranked by how many similar publications already exist in their neighborhood, the median number of citations per year received by the upper half of the ranking is $\sim 1.6$ times that of the lower half. When new publications are instead ranked by an alternative metric of dissimilarity to their neighbors, the median number of citations per year received by the upper half is $\sim 0.74$ times that of the lower half. We discuss several hypotheses, relevant to the science of science, that could explain these citation-similarity relationships.
Submission Track: Original Research
Submission Number: 93
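
The comparison summarized in the abstract (rank publications by how many similar earlier publications exist, then compare median citations per year between the upper and lower halves) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the whitespace tokenizer, the cosine-similarity measure, the `threshold` value, and the function names `bag_of_words`, `prior_neighbor_counts`, and `median_ratio` are all hypothetical choices made here, and the four toy records are invented purely to exercise the code.

```python
import math
from collections import Counter
from statistics import median


def bag_of_words(text):
    """Lowercased token counts; a simple stand-in for a bag-of-words embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prior_neighbor_counts(papers, threshold=0.5):
    """For each paper, count strictly earlier papers with similarity >= threshold.

    The threshold is an assumed definition of "neighborhood", not the paper's.
    """
    vectors = [bag_of_words(p["abstract"]) for p in papers]
    counts = []
    for i, p in enumerate(papers):
        counts.append(sum(
            1
            for j, q in enumerate(papers)
            if q["year"] < p["year"] and cosine(vectors[i], vectors[j]) >= threshold
        ))
    return counts


def median_ratio(scores, citations_per_year):
    """Median citations/year of the upper half of the ranking over the lower half."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    half = len(order) // 2
    lower = median(citations_per_year[k] for k in order[:half])
    upper = median(citations_per_year[k] for k in order[half:])
    return upper / lower if lower else math.inf


# Hypothetical toy records, invented for illustration only.
papers = [
    {"year": 2000, "abstract": "galaxy cluster mass", "cpy": 1.0},
    {"year": 2001, "abstract": "galaxy cluster mass function", "cpy": 4.0},
    {"year": 2002, "abstract": "stellar flare spectra", "cpy": 2.0},
    {"year": 2003, "abstract": "galaxy cluster mass profile", "cpy": 6.0},
]

counts = prior_neighbor_counts(papers)
ratio = median_ratio(counts, [p["cpy"] for p in papers])
print(counts, ratio)
```

On real data, a ratio above 1 would correspond to the first trend the abstract reports. Ranking by a dissimilarity score instead only requires passing a different `scores` list to `median_ratio`.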