Natural Language Querying on Domain-Specific NoSQL Database with Large Language Models

Wenlong Zhang, Chengyang He, Guanqun Yang, Dipankar Bandyopadhyay, Tian Shi, Ping Wang

Published: 2024, Last Modified: 06 Jan 2026, BIBM 2024, CC BY-SA 4.0

Abstract: Efficiently and accurately retrieving specific information from healthcare datasets, such as the Vaccine Adverse Event Reporting System (VAERS), presents significant challenges. A promising solution to this problem is the Text-to-ESQ approach, which is akin to Text-to-SQL but uses the NoSQL database Elasticsearch to explore VAERS data thoroughly. Non-relational databases are particularly adept at managing complex and dynamic data formats, thereby enabling the extraction of more valuable insights. However, generating executable NoSQL queries remains challenging because few NoSQL query datasets are available, which constrains model training. One potential remedy is to use large language models (LLMs), which can be applied in few-shot and even zero-shot learning scenarios. Nonetheless, this novel task has not been evaluated before, and the absence of a comprehensive, unbiased assessment of existing LLMs and prompting strategies impedes the development of a robust architecture. Motivated by these challenges, we introduce a new Instruction-Enhanced Explainable (InstructEx) Chain-of-Thought (CoT) prompting method that integrates existing CoT prompts, and we conduct a comprehensive investigation of LLMs and CoT prompting. The extensive experimental analysis demonstrates the effectiveness of using LLMs for Text-to-ESQ when combined with InstructExCoT prompting. It also sheds light on the strengths and weaknesses of these methods from multiple perspectives.

External IDs: dblp:conf/bibm/ZhangHYBSW24