Harnessing and Distilling ChatGPT's Ability to Bridge Semantic Variance for Precise Query-Document Alignment in Encephalitis Research: Surpassing Keyword-Based Search Engines

Published: 29 Feb 2024, Last Modified: 01 Mar 2024
Venue: AAAI 2024 SSS on Clinical FMs
Readers: Everyone
License: CC BY 4.0
Track: Non-traditional track
Keywords: Transformers, Information Retrieval, Embeddings, GPT, Encephalitis
TL;DR: A dataset of semantically varied queries for each document; a model trained on this data outperformed traditional PubMed searches.
Abstract: Keyword-based search engines often fail to retrieve information that matches the intent of a user's query, because keywords and phrasing vary across the scientific literature. This paper introduces an encephalitis query-document dataset characterized by high semantic variability. The dataset comprises thousands of query-document pairs. To represent the diverse linguistic expressions found in encephalitis research, we used the language understanding capabilities of GPT-4 to generate queries that are conceptually aligned with the information in the documents but differ significantly from them in phrasing and terminology. This approach addresses a critical need in scientific literature search: retrieving pertinent information that conventional keyword-based search may overlook. To evaluate the dataset, we trained a specialized transformer model that converts these query-document pairs into embeddings. Our results demonstrate a significant improvement in retrieving relevant encephalitis research papers, especially those that are not surfaced by traditional search engines such as PubMed. This improved retrieval performance underscores the potential of embedding-based retrieval in medical literature search and opens new avenues for comprehensive literature exploration. The implications of our findings extend beyond encephalitis studies, suggesting that similar methods may apply in other specialized fields of research.
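Illustration: The embedding-based retrieval described in the abstract can be sketched, under loud assumptions, as ranking documents by cosine similarity between a query embedding and each document embedding. This is a minimal sketch and not the authors' model: the three-dimensional vectors, the document identifiers, and the helper names `cosine` and `rank` are all hypothetical, chosen by hand for illustration. In practice the vectors would come from a trained transformer encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query_vec, doc_vecs):
    """Return document ids sorted from most to least similar to the query."""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)

# Hypothetical hand-picked embeddings; a real system would obtain these
# from an encoder trained on query-document pairs.
doc_vecs = {
    "doc_autoimmune": [0.9, 0.1, 0.0],
    "doc_viral": [0.1, 0.9, 0.1],
    "doc_unrelated": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of a query that is phrased differently from
# "doc_autoimmune" but lies close to it in the embedding space.
query_vec = [0.8, 0.2, 0.1]

ranking = rank(query_vec, doc_vecs)
print(ranking)
```

Because ranking depends on vector proximity rather than shared keywords, a query can retrieve a document with which it has no terms in common, provided the encoder places them near each other.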
Presentation And Attendance Policy: I have read and agree with the symposium's policy on behalf of myself and my co-authors.
Ethics Board Approval: No, our research does not involve datasets that need IRB approval or its equivalent.
Data And Code Availability: Yes, we will make data and code available upon acceptance.
Primary Area: Datasets and benchmarks
Student First Author: No, the primary author of the manuscript is NOT a student.
Submission Number: 10