[Short] Towards Large-Scale Heterogeneous Data Organization for Scientific Foundation Models: A Nuclear Fusion Case Study
Keywords: Multi-modal, Nuclear Fusion, Heterogeneous Data Loading, Time Series, Spectrograms, Tokamak
TL;DR: To enable tokamak foundation models, we analyze 23 diagnostics with mixed tensor structures and distinct physics, proposing a data loading pipeline that optimizes windowing, spectral resolution, and sparsity.
Abstract: Training effective foundation models requires massive, well-organized datasets, yet scientific domains such as nuclear fusion pose unique challenges because their data are highly heterogeneous and sparse. Here we characterize the data used in developing such a model: over 20 sensor types spanning 5 orders of magnitude in sampling rate, mixed tensor structures (point measurements, spectrograms, images), and nonstationary physics. We analyze the complexity of these inputs and discuss trade-offs between temporal context and frequency resolution. Our analysis provides a template for representing multi-modal fluctuation data at scale, with implications for both multi-modal control systems and nuclear fusion.
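The trade-off between temporal context and frequency resolution mentioned in the abstract can be illustrated with the standard short-time Fourier transform relation: a window of N samples at sampling rate f_s lasts N / f_s seconds and yields frequency bins spaced f_s / N apart, so the two quantities always multiply to 1. The sketch below is only an illustration of that relation, not the submission's pipeline; the function name `stft_tradeoff`, the sampling rates, and the window lengths are all hypothetical values chosen to span roughly 5 orders of magnitude.

```python
def stft_tradeoff(sample_rate_hz, window_samples):
    """Return (window duration [s], frequency bin spacing [Hz], Nyquist frequency [Hz]).

    Standard STFT relations; not taken from the submission itself.
    """
    if sample_rate_hz <= 0 or window_samples <= 0:
        raise ValueError("sample_rate_hz and window_samples must be positive")
    duration_s = window_samples / sample_rate_hz      # temporal context of one window
    bin_spacing_hz = sample_rate_hz / window_samples  # frequency resolution
    nyquist_hz = sample_rate_hz / 2                   # highest resolvable frequency
    return duration_s, bin_spacing_hz, nyquist_hz


# Hypothetical sampling rates (assumed for illustration only).
for rate_hz in (1e2, 1e4, 1e7):
    for n_samples in (256, 4096):
        duration_s, bin_hz, nyquist_hz = stft_tradeoff(rate_hz, n_samples)
        print(
            f"fs={rate_hz:.0e} Hz, N={n_samples}: "
            f"window={duration_s:.3e} s, bin spacing={bin_hz:.3e} Hz, "
            f"Nyquist={nyquist_hz:.3e} Hz"
        )
```

Under these assumed values, a longer window sharpens frequency resolution but averages over more of the nonstationary signal, and a fixed sample count covers very different durations for slow and fast diagnostics.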
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 106