Abstract: Taxonomies group synonymous terms into concepts, which are arranged hierarchically through “broader than” semantic relations. However, creating and maintaining taxonomies is labour-intensive, especially when they reach a scale of hundreds of thousands or millions of terms. Here, we present an automated solution that supports taxonomy editors in identifying synonymous terms in scientific literature by leveraging semantic search techniques. Our method first encodes all taxonomy terms or phrases using a pre-trained BERT-based model. We then use FAISS vector search to efficiently discover synonyms for each term. We evaluate the method by comparing the terms it considers synonymous against a manually curated taxonomy of more than 770,000 terms. By integrating state-of-the-art NLP and search methodologies, our approach offers a practical and efficient solution that achieves up to 0.79 precision and 0.25 recall for synonym discovery. This automation scales to large taxonomies and can be used at runtime in large taxonomy-driven document retrieval systems.
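The pipeline described above (embed every term, retrieve nearest neighbours, score the proposed synonyms against a curated taxonomy) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-dimensional vectors in `EMBEDDINGS` are hypothetical stand-ins for BERT embeddings, a brute-force cosine comparison stands in for the FAISS index, and the term list, the similarity thresholds, and the pairwise precision and recall definitions are not taken from the paper.

```python
import math

# Hypothetical toy vectors standing in for BERT-based term embeddings.
EMBEDDINGS = {
    "heart attack": [0.90, 0.10, 0.05],
    "myocardial infarction": [0.88, 0.12, 0.07],
    "cardiac arrest": [0.60, 0.60, 0.10],
    "neoplasm": [0.05, 0.10, 0.95],
    "tumour": [0.07, 0.08, 0.93],
}

# Hypothetical curated taxonomy: each set is one concept (a group of synonyms).
GOLD_CONCEPTS = [
    {"heart attack", "myocardial infarction"},
    {"cardiac arrest"},
    {"neoplasm", "tumour"},
]


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def find_synonym_pairs(embeddings, threshold):
    """Return unordered term pairs whose cosine similarity is >= threshold.

    This brute-force loop is a stand-in for a vector index such as FAISS,
    which would return the same neighbours far more efficiently at scale.
    """
    terms = sorted(embeddings)
    unit = {term: normalize(embeddings[term]) for term in terms}
    pairs = set()
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            similarity = sum(x * y for x, y in zip(unit[a], unit[b]))
            if similarity >= threshold:
                pairs.add(frozenset((a, b)))
    return pairs


def gold_pairs(concepts):
    """Expand each concept into the set of unordered synonym pairs it implies."""
    pairs = set()
    for concept in concepts:
        members = sorted(concept)
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                pairs.add(frozenset((a, b)))
    return pairs


def precision_recall(predicted, gold):
    """Pairwise precision and recall of predicted pairs against gold pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    gold = gold_pairs(GOLD_CONCEPTS)
    for threshold in (0.95, 0.75):
        predicted = find_synonym_pairs(EMBEDDINGS, threshold)
        precision, recall = precision_recall(predicted, gold)
        print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy setup, lowering the similarity threshold admits related but non-synonymous terms (here, "cardiac arrest" alongside "heart attack"), which lowers precision. The sketch is meant only to show how a threshold on embedding similarity interacts with the evaluation, not to reproduce the reported figures.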