The Sparsity Trap: Clinical Context as Systematic Noise in Subject-Independent Glucose Forecasting

Vedant Thakkar

Published: 05 Feb 2026, Last Modified: 06 Feb 2026 | Submitted to CHIL 2026 | Everyone | CC BY 4.0

Abstract: Deep learning models for blood glucose forecasting are increasingly built as multivariate systems, with the working assumption that incorporating exogenous clinical variables such as insulin boluses and carbohydrate intake will generally improve predictive accuracy. In this work, we challenge that assumption in a strict subject-independent setting on the public Shanghai Diabetes Registry (112 inpatients with Type 2 Diabetes). We benchmark a state-of-the-art univariate Transformer (PatchTST) against a persistence baseline and a late-fusion multivariate variant that ingests standardized insulin and carbohydrate logs. The univariate Transformer significantly outperforms persistence at long horizons (120-minute RMSE 38.43 vs. 44.20 mg/dL), establishing a strong forecasting baseline under subject-wise splits. In contrast, adding clinical features consistently degrades performance. Stratifying 64,381 test windows by treatment activity reveals a "sparsity trap": in high-activity windows dominated by meal intake, the multivariate model underperforms the univariate baseline by 2.36 mg/dL (25.07 vs. 22.71 mg/dL RMSE, p<10^-4), accompanied by a 4.6 percentage-point drop in the fraction of predictions within 20% of the reference value, a proxy for clinically safe accuracy. We further show that a two-channel CGM+Carbs Transformer and a gradient boosting baseline on hand-crafted clinical features both reproduce this sparsity trap—often with poorer clinical accuracy in meal-driven windows—indicating that the failure is not specific to a single architecture. A synthetic injection experiment, in which future glucose is defined as a simple linear function of insulin history, shows that the same architecture rapidly learns the induced fusion rule, implicating the sparsity and noise of real-world logs in this small cohort rather than model capacity as the primary cause of failure. 
These findings suggest that, for such cohorts, robust univariate modeling may be a safer and more accurate default than naive multivariate fusion of sparse clinical streams.
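
As an illustration of the stratified evaluation described in the abstract, the following minimal sketch computes RMSE and the fraction of predictions within 20% of the reference value, separately for high- and low-activity windows. It is not the authors' code: the function names, the boolean `high_activity` flag, and the toy values are assumptions made for illustration, and the paper's actual rule for labeling a window as high-activity is not given here.

```python
import math

def rmse(pred, ref):
    """Root-mean-square error between two equal-length sequences (mg/dL)."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))

def within_20pct(pred, ref):
    """Fraction of predictions whose absolute error is at most 20% of the reference."""
    hits = sum(abs(p - r) <= 0.2 * r for p, r in zip(pred, ref))
    return hits / len(ref)

def stratified_metrics(pred, ref, high_activity):
    """Compute both metrics separately for high- and low-activity windows.

    `high_activity` is a hypothetical per-window boolean flag; how a window
    is assigned to a stratum is an assumption, not taken from the paper.
    """
    out = {}
    for label, flag in (("high", True), ("low", False)):
        idx = [i for i, h in enumerate(high_activity) if h == flag]
        if idx:  # skip empty strata to avoid division by zero
            p = [pred[i] for i in idx]
            r = [ref[i] for i in idx]
            out[label] = {
                "rmse": rmse(p, r),
                "within_20pct": within_20pct(p, r),
                "n": len(idx),
            }
    return out

# Toy values for illustration only (not data from the study).
ref = [100.0, 150.0, 200.0, 120.0]
pred = [110.0, 150.0, 150.0, 120.0]
high = [True, False, True, False]
metrics = stratified_metrics(pred, ref, high)
```

Running the same function on the predictions of a univariate and a multivariate model over identical test windows would give the kind of per-stratum comparison the abstract reports.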