Scent of Health (S-O-H): Olfactory Multivariate Time-Series Dataset for Non-Invasive Disease Screening
Keywords: eNose, dataset, medicine, olfactory
TL;DR: A multivariate time-series dataset from an eNose sensor for non-invasive disease screening, with data from over 1000 unique patients.
Abstract: Exhaled breath analysis has emerged as an advantageous alternative to traditional medical diagnostic methods. Electronic nose (eNose) sensors can enable low-cost, non-invasive disease screening from exhaled breath. However, progress is limited by small, site-specific datasets and sensor-specific temporal artifacts (e.g., baseline drift). In this paper, we introduce Scent of Health (S-O-H), the largest printed-metal-oxide eNose clinical dataset with curated temporal splits. We also formulate breath diagnosis as a realistic multivariate time-series classification task with temporally stratified splits that mimic deployment. We provide a reproducible benchmark that covers classical algorithms with handcrafted features, convolutional neural networks with data augmentation, and specialized time-series classification methods. We show that, while these methods offer useful inductive biases, substantial gaps remain in robustness and generalization under drift and limited labels. Our findings demonstrate that machine learning on eNose data can achieve clinically relevant performance in detecting malignant lung neoplasms and differentiating respiratory diseases. The substantial sample size of this dataset addresses a critical gap in research and provides a valuable resource for developing and validating disease classification models and olfactory data representations.
Primary Area: datasets and benchmarks
Submission Number: 24960
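The abstract refers to temporally stratified splits that mimic deployment. As a rough illustration only, the sketch below shows one simple way a chronological split could be made: everything recorded before a cutoff date is used for training, and later recordings are held out. The record layout (`patient_id`, `date`, `label`), the cutoff dates, and the function name `temporal_split` are all hypothetical and are not taken from the paper; the authors' actual splitting procedure may differ.

```python
from datetime import date

def temporal_split(records, val_start, test_start):
    """Split records chronologically by two cutoff dates.

    Hypothetical helper: records before `val_start` go to train,
    records from `val_start` up to (not including) `test_start` go to
    validation, and records on or after `test_start` go to test.
    """
    train = [r for r in records if r["date"] < val_start]
    val = [r for r in records if val_start <= r["date"] < test_start]
    test = [r for r in records if r["date"] >= test_start]
    return train, val, test

# Made-up example records (not real S-O-H data): one per month in 2023.
records = [
    {"patient_id": i, "date": date(2023, month, 1), "label": i % 2}
    for i, month in enumerate(range(1, 13))
]

train, val, test = temporal_split(
    records,
    val_start=date(2023, 9, 1),    # assumed cutoff, for illustration
    test_start=date(2023, 11, 1),  # assumed cutoff, for illustration
)
```

Splitting on dates rather than at random means the model is always evaluated on recordings made after its training data was collected, which is where sensor effects such as baseline drift would show up.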