We study temporal tabular data streams (TTD), in which each observation carries both categorical and numerical values and the universe of distinct categorical items is not known upfront and can grow unboundedly over time. Such data is common in large-scale systems, for example user activity in computer system logs and scientific experiment records. Feature hashing is commonly used as a pre-processing step that maps categorical items into a known universe before representation learning (Coleman et al., 2024; Desai et al., 2022), but these methods have been developed and evaluated for the offline or batch setting. In this paper, we consider hashing as a pre-processing step before representation learning in the online setting for TTD. We show that deterministic embeddings suffer from forgetting in online learning with TTD, leading to deteriorating performance. To mitigate this issue, we propose a probabilistic hash embedding (PHE) model that treats hash embeddings as stochastic and applies Bayesian online learning to learn incrementally from data. Exploiting the structure of PHE, we derive a scalable inference algorithm that learns model parameters and infers/updates the posteriors of hash embeddings and other latent variables. Our algorithm (i) handles an evolving vocabulary of categorical items, (ii) adapts to new items without forgetting old ones, (iii) uses a bounded set of parameters that does not grow with the number of distinct items observed on the stream, and (iv) is efficient in both the offline and online streaming settings. Experiments on classification, sequence modeling, and recommendation tasks with TTD demonstrate the superior performance of PHE over baselines.
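To make the core mechanism concrete, the sketch below illustrates hashing an open-ended vocabulary into a bounded embedding table whose entries are treated as random variables with Bayesian online updates. It is not the paper's algorithm: the class name `PHETable`, the use of k SHA-256-based hash functions, the diagonal-Gaussian posteriors, and the precision-weighted update toward a noisy target embedding are all illustrative assumptions.

```python
# Minimal sketch of probabilistic hash embeddings, under assumptions not
# specified by the abstract: k independent hash functions, a fixed table of
# B buckets, diagonal-Gaussian posteriors per bucket, and a conjugate
# Gaussian online update. Names (PHETable, observe) are hypothetical.
import hashlib
import numpy as np

class PHETable:
    def __init__(self, num_buckets=1024, dim=16, num_hashes=2, prior_var=1.0):
        self.B, self.d, self.k = num_buckets, dim, num_hashes
        # Posterior mean and variance for each bucket's embedding vector.
        # Parameter count is O(B * d): bounded, independent of how many
        # distinct categorical items appear on the stream.
        self.mean = np.zeros((num_buckets, dim))
        self.var = np.full((num_buckets, dim), prior_var)

    def _buckets(self, item: str):
        # Map an item from an open-ended vocabulary to k bucket indices.
        return [
            int(hashlib.sha256(f"{j}:{item}".encode()).hexdigest(), 16) % self.B
            for j in range(self.k)
        ]

    def embed(self, item: str):
        # Aggregate the item's k bucket posteriors; summing the means and
        # variances of independent Gaussians is one simple choice.
        idx = self._buckets(item)
        return self.mean[idx].sum(axis=0), self.var[idx].sum(axis=0)

    def observe(self, item: str, noisy_target, obs_var=0.1):
        # Bayesian-style online update: move each bucket posterior toward a
        # noisy target embedding, weighting by precision. A stand-in for the
        # paper's scalable inference, which the abstract does not detail.
        for b in self._buckets(item):
            prec = 1.0 / self.var[b] + 1.0 / obs_var
            self.mean[b] = (self.mean[b] / self.var[b]
                            + noisy_target / obs_var) / prec
            self.var[b] = 1.0 / prec

# Usage: embeddings are defined even for never-before-seen items.
table = PHETable()
mu, var = table.embed("user_42")
table.observe("user_42", np.random.randn(16) * 0.1)
```

Because every item, old or new, maps to existing buckets, the parameter count stays fixed regardless of vocabulary growth (property (iii) above), and the per-bucket variances let an update weigh new evidence against what a bucket has already absorbed, which a deterministic table cannot do.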