Enhancing Alzheimer's Disease Detection Using LLM-Generated Synthetic Data and Multi-Level Embeddings
Abstract: Alzheimer's disease represents a growing global health concern, emphasizing the need for early diagnosis to mitigate neurocognitive decline. Speech analysis has emerged as a promising, non-invasive approach, yet limited data availability hinders the development of robust Machine Learning (ML) models. To address this challenge, this study exploits the potential of Large Language Models (LLMs)—both their ability to generate synthetic data and their capacity to extract complex linguistic features from speech. We employ GPT-4 to generate synthetic transcripts, thus expanding the ADReSS2020 dataset and enhancing its diversity while preserving semantic and structural coherence. Moreover, we propose a novel multi-level feature extraction framework that integrates Bidirectional Encoder Representations from Transformers (BERT) embeddings, fine-tuned, with linguistic features obtained through Computerized Language Analysis (CLAN). The study involved two experiments: the first to identify the optimal feature extraction strategy, and the second to evaluate the impact of synthetic data generated by GPT-4. In both experiments, the performance of five classifiers was evaluated to determine the most effective configuration. Our results demonstrated that fine-tuned BERT embeddings slightly improve classification performance compared to pre-trained models, highlighting the value of domain-specific fine-tuning. Although adding CLAN-like linguistic features yielded limited benefits, GPT-4-generated synthetic data demonstrated promising potential, particularly when combined with sentence embeddings. Classifiers such as Random Forest showed an improvement in accuracy, increasing from 0.79 to 0.88 when using the augmented dataset. This study paves the way for the use of LLMs to expand the diversity of datasets and improve the robustness of ML models in clinical applications.
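As a rough sketch of the classification stage the abstract describes (sentence embeddings fed to classifiers such as Random Forest), the following minimal example uses scikit-learn with random stand-in vectors in place of real fine-tuned BERT embeddings of ADReSS transcripts; the dimensions, class sizes, and mean shift are illustrative assumptions, not values from the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-ins for 768-dim sentence embeddings of transcripts.
# In the real pipeline these would come from a fine-tuned BERT encoder
# applied to control and Alzheimer's speech transcripts.
X_control = rng.normal(0.0, 1.0, size=(60, 768))
X_ad = rng.normal(0.5, 1.0, size=(60, 768))  # shifted mean: artificially separable toy classes
X = np.vstack([X_control, X_ad])
y = np.array([0] * 60 + [1] * 60)  # 0 = control, 1 = Alzheimer's

# Random Forest over the embedding vectors, scored with 5-fold cross-validation.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(round(scores.mean(), 2))
```

Augmenting `X` and `y` with embeddings of synthetic GPT-4 transcripts before fitting would mirror the data-expansion experiment the abstract reports.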
External IDs: dblp:conf/ijcnn/MutalaPPXSBS25