Keywords: Large language models, low resources, domain adaptation, data augmentation
TL;DR: This survey reviews QA data augmentation in low-resource biomedical and legal domains, focusing on methods, challenges, and LLM adaptation.
Abstract: Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP), but their success remains largely confined to high-resource, general-purpose domains. In contrast, applying LLMs to low-resource domains poses significant challenges due to limited training data, domain drift, and strict terminology constraints. This survey provides an overview of the current landscape in domain-specific, low-resource QA with LLMs. We begin by analyzing the coverage and representativeness of specialized-domain QA datasets against large-scale reference datasets, which we refer to as ParentQA. Building on this analysis, we survey data-centric strategies to enhance input diversity, including data augmentation techniques. We further discuss evaluation metrics for specialized tasks and consider ethical concerns. By mapping current methodologies and outlining open research questions, this survey aims to guide future efforts in adapting LLMs for robust and responsible use in resource-constrained, domain-specific environments.
Archival Status: Archival
Paper Length: Long Paper (up to 8 pages of content)
Submission Number: 329