Advancing Web Science through a Foundation Model for Tabular Data

Published: 01 Jan 2024 · Last Modified: 27 Sept 2024 · WebSci (Companion) 2024 · CC BY-SA 4.0
Abstract: As the landscape of web science expands, handling the vast datasets collected from the Web while preserving computational efficiency and privacy remains a significant challenge. Data distillation offers a compelling solution by condensing a large dataset into a small synthetic set that retains its essential characteristics. My ongoing thesis work on tabular data distillation has shown that autoencoders and clustering algorithms can effectively distill tabular datasets, offering a promising approach to handling large collections of web data. Building on this, my next step is to develop a versatile pre-trained model, analogous to BERT and RoBERTa for text, that can distill arbitrary tabular datasets. Such a foundation model would not be limited to downstream classification: it would support reducing dataset sizes for efficient large-scale analysis, generating privacy-preserving synthetic data, and enhancing reproducibility through shared distilled datasets. By developing a foundation model for tabular data distillation, I aim to open new avenues in web science and improve computational accessibility, privacy protection, and reproducibility, providing a versatile tool for handling the large amounts of data generated on the Web while preserving their essential structure.
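The abstract does not specify the exact distillation pipeline, but a minimal sketch of the autoencoder-plus-clustering idea might look like the following, assuming a purely numeric tabular dataset: an autoencoder learns a compact latent space, k-means finds prototype points in that space, and decoding the centroids yields a small synthetic dataset that stands in for the original. All names and hyperparameters here (`TabularAutoencoder`, `distill`, `latent_dim=8`, `n_prototypes=100`) are illustrative assumptions, not the author's published method.

```python
# Illustrative sketch of tabular data distillation via an autoencoder's
# latent space plus k-means clustering. Assumes a fully numeric dataset;
# all class/function names and hyperparameters are hypothetical.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


class TabularAutoencoder(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


def distill(X: np.ndarray, n_prototypes: int = 100, epochs: int = 200) -> np.ndarray:
    """Condense X into n_prototypes synthetic rows."""
    scaler = StandardScaler()
    Xs = torch.tensor(scaler.fit_transform(X), dtype=torch.float32)

    # Train the autoencoder to reconstruct the (standardized) data.
    model = TabularAutoencoder(X.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(Xs), Xs)
        loss.backward()
        opt.step()

    # Encode every row, then cluster the latent space; each centroid
    # stands in for many original rows.
    with torch.no_grad():
        Z = model.encoder(Xs).numpy()
    centroids = KMeans(n_clusters=n_prototypes, n_init=10).fit(Z).cluster_centers_

    # Decode the centroids back to feature space and undo the scaling.
    with torch.no_grad():
        proto = model.decoder(torch.tensor(centroids, dtype=torch.float32)).numpy()
    return scaler.inverse_transform(proto)


# Example: condense 10,000 rows into 100 synthetic prototypes.
X = np.random.default_rng(0).normal(size=(10_000, 20))
X_distilled = distill(X)
print(X_distilled.shape)  # (100, 20)
```

Because the distilled rows are decoded centroids rather than copies of real records, a sketch like this also hints at the privacy angle the abstract raises: downstream analyses can run on the synthetic prototypes without exposing individual original rows, though formal privacy guarantees would require additional machinery.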