How Far Can LLMs Go On Cognitive Health Prediction? A Study On EMA Data

Published: 04 Mar 2026, Last Modified: 11 Mar 2026 · ICLR 2026 Workshop LMRL Poster · CC BY 4.0
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Track: long paper (4–8 pages excluding references)
Keywords: LLMs; EMA; Ambulatory Cognition; Cognitive Health; ML; Statistical Models; Brain Science; Senior Health
Abstract: Cognitive health is a growing public-health burden in an increasingly aging world, and scalable methods for monitoring cognitive state are urgently needed. Smartphone-based intensive measurement studies can capture short-term cognitive variability in daily life via repeated ecological momentary assessment (EMA) surveys and brief cognitive micro-tests, but the resulting datasets are small, heterogeneous, and predominantly tabular, raising a basic question: can large language models (LLMs) serve as effective predictors in this structured regime, or do specialized tabular learners remain necessary? We study a proprietary healthy aging EMA dataset of 115 older adults (ages 65–91) collected over a two-week measurement burst, and predict an ordinal cognitive severity septile derived from a composite cognition score using participant-level splits to evaluate generalization to unseen individuals. We benchmark state-of-the-art LLMs under a unified, machine-parseable structured-output interface against strong tabular machine-learning and statistical ordinal baselines using exact accuracy, tolerance accuracy, and macro-F1. Across models and settings, tree-based tabular methods achieve the best boundary-accurate performance, while LLMs capture coarse ordinal ordering but are less reliable at exact septile decisions; moreover, structured-output failures emerge as a practical deployment constraint that must be included in end-to-end evaluation. Our results delineate the current limits of prompting-based LLM inference for structured ambulatory cognition prediction and provide a reproducible framework for comparing LLMs to tabular baselines in small-sample clinical monitoring tasks.
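The abstract evaluates models with exact accuracy, tolerance accuracy, and macro-F1 on an ordinal septile target (classes 1–7). A minimal sketch of these three metrics is given below; the ±1 septile tolerance band and the pure-Python metric definitions are illustrative assumptions, not the paper's actual implementation.

```python
def exact_accuracy(y_true, y_pred):
    """Fraction of predictions that hit the exact septile (1..7)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def tolerance_accuracy(y_true, y_pred, tol=1):
    """Fraction of predictions within `tol` septiles of the truth
    (tol=1 is an assumed tolerance band for illustration)."""
    return sum(abs(t - p) <= tol for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, classes=range(1, 8)):
    """Unweighted mean of per-class F1 over all septile classes."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: a model that captures coarse ordering but misses exact septiles
y_true = [1, 2, 3, 4, 5, 6, 7, 4]
y_pred = [1, 3, 3, 4, 6, 6, 5, 4]
print(exact_accuracy(y_true, y_pred))      # exact septile hits
print(tolerance_accuracy(y_true, y_pred))  # hits within +/-1 septile
print(macro_f1(y_true, y_pred))
```

The gap between exact and tolerance accuracy on the toy labels mirrors the abstract's finding that LLMs capture coarse ordinal ordering while remaining less reliable at exact septile decisions.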
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 47