Two-dimensional visualization of large document libraries using t-SNE

02 Mar 2022, 12:21 (edited 29 Apr 2022) | GTRL 2022 Poster | Readers: Everyone
  • Abstract: We benchmarked different approaches for creating 2D visualizations of large document libraries, using the MEDLINE (PubMed) database of the entire biomedical literature as a use case (19 million scientific papers). Our optimal pipeline is based on log-scaled TF-IDF representation of the abstract text, SVD preprocessing, and t-SNE with uniform affinities, early exaggeration annealing, and extended optimization. The resulting embedding distorts local neighborhoods but shows meaningful organization and rich structure on the level of narrow academic fields.
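The first stage of the pipeline, log-scaled TF-IDF, can be illustrated with a minimal standard-library sketch. The abstract does not give the exact weighting formula, so this assumes the common sublinear convention `1 + log(tf)` for term frequency and a smoothed inverse document frequency, followed by L2 normalization. The function name `log_tfidf` and the whitespace tokenizer are hypothetical choices made for illustration, not the authors' implementation. The SVD and t-SNE stages are not shown.

```python
import math
from collections import Counter


def log_tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized.

    Assumed weighting (not taken from the paper):
      tf  -> 1 + ln(tf)                      (log-scaled term frequency)
      idf -> ln((1 + N) / (1 + df)) + 1      (smoothed IDF)
    """
    # Naive whitespace tokenization; a real pipeline would use a proper tokenizer.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {}
        for term, tf in counts.items():
            log_tf = 1.0 + math.log(tf)
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            vec[term] = log_tf * idf
        # L2-normalize so that document length does not dominate.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors


if __name__ == "__main__":
    toy_abstracts = [
        "gene expression in cancer cells",
        "cancer cancer cancer therapy",
        "protein folding",
    ]
    for vec in log_tfidf(toy_abstracts):
        print(sorted(vec.items(), key=lambda kv: -kv[1]))
```

The log scaling dampens the influence of terms repeated many times within a single abstract: in the second toy document, "cancer" occurs three times, but its term-frequency factor is `1 + ln(3)` rather than `3`.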