Two-dimensional visualization of large document libraries using t-SNE

Published: 25 Mar 2022, Last Modified: 05 May 2023, GTRL 2022 Poster, Readers: Everyone
Abstract: We benchmarked different approaches for creating 2D visualizations of large document libraries, using the MEDLINE (PubMed) database of the entire biomedical literature as a use case (19 million scientific papers). Our optimal pipeline is based on log-scaled TF-IDF representation of the abstract text, SVD preprocessing, and t-SNE with uniform affinities, early exaggeration annealing, and extended optimization. The resulting embedding distorts local neighborhoods but shows meaningful organization and rich structure on the level of narrow academic fields.
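The abstract names the pipeline's ingredients without giving formulas, so the following is only a minimal sketch of three of them on a toy corpus, not the authors' implementation. It assumes that "log-scaled TF-IDF" means a term weight of (1 + ln tf) times a smoothed inverse document frequency, that "uniform affinities" means each document gives equal weight 1/k to its k nearest neighbours (here by cosine similarity) before symmetrisation, and that "early exaggeration annealing" means lowering the exaggeration factor linearly to 1 rather than switching it off abruptly. All function names and the values `start=12.0`, `hold=100`, `anneal=100` are hypothetical. The SVD step and the t-SNE optimisation itself are omitted.

```python
import math
from collections import Counter


def log_tfidf(docs):
    """Return one L2-normalised sparse vector (dict term -> weight) per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Assumed weighting: (1 + ln tf) * (ln((1 + N) / (1 + df)) + 1).
        row = {
            t: (1.0 + math.log(c)) * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in row.values()))
        rows.append({t: w / norm for t, w in row.items()} if norm > 0 else row)
    return rows


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def uniform_knn_affinities(rows, k):
    """Symmetric affinities in which every document weights its k neighbours equally."""
    n = len(rows)
    sym = {}
    for i in range(n):
        sims = sorted(
            ((cosine(rows[i], rows[j]), j) for j in range(n) if j != i),
            reverse=True,
        )
        for _, j in sims[:k]:
            # Directed weight is 1/k; symmetrise as (p_j|i + p_i|j) / (2n).
            share = (1.0 / k) / (2 * n)
            sym[(i, j)] = sym.get((i, j), 0.0) + share
            sym[(j, i)] = sym.get((j, i), 0.0) + share
    return sym


def annealed_exaggeration(step, start=12.0, hold=100, anneal=100):
    """Hold the exaggeration at `start`, then decrease it linearly to 1."""
    if step < hold:
        return start
    if step >= hold + anneal:
        return 1.0
    frac = (step - hold) / anneal
    return start + frac * (1.0 - start)


toy_docs = [
    "protein folding protein structure",
    "protein structure prediction",
    "clinical trial of cancer therapy",
    "cancer therapy outcome trial",
]
vectors = log_tfidf(toy_docs)
affinities = uniform_knn_affinities(vectors, k=1)
```

In this sketch the affinity matrix is stored as a dictionary keyed by index pairs, which keeps only the nonzero entries; with uniform weights the affinities depend solely on which documents are neighbours, not on how similar they are.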