Probabilistic Hash Embeddings for Temporal Tabular Data Streams

19 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: hash embedding, Bayesian online learning, tabular data, dynamic vocabulary
TL;DR: We propose a novel probabilistic model that applies to a new tabular setting where the categorical vocabulary expands over time.
Abstract:

We study temporal tabular data streams (TTD), where each observation has both categorical and numerical values, and where the universe of distinct categorical items is not known upfront and can even grow unboundedly over time. Such data is common in many large-scale systems, such as user activity in computer system logs and scientific experiment records. Feature hashing is commonly used as a pre-processing step to map the categorical items into a known universe before representation learning (Coleman et al., 2024; Desai et al., 2022). However, these methods have been developed and evaluated in offline or batch settings. In this paper, we consider the pre-processing step of hashing before representation learning in the online setting for TTD. We show that deterministic embeddings suffer from forgetting in online learning with TTD, leading to performance deterioration. To mitigate this issue, we propose a probabilistic hash embedding (PHE) model that treats hash embeddings as stochastic and applies Bayesian online learning to learn incrementally from data. Based on the structure of PHE, we derive a scalable inference algorithm to learn model parameters and infer/update the posteriors of hash embeddings and other latent variables. Our algorithm (i) can handle an evolving vocabulary of categorical items, (ii) is adaptive to new items without forgetting old items, (iii) is implementable with a bounded set of parameters that does not grow with the number of distinct items observed on the stream, and (iv) is efficiently implementable in both the offline and the online streaming setting. Experiments in classification, sequence modeling, and recommendation systems with TTD demonstrate the superior performance of PHE compared to baselines.
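To make the idea concrete, below is a minimal, illustrative sketch of a probabilistic hash embedding. This is not the authors' released implementation or their derived inference algorithm: the bucket count, embedding dimension, and the conjugate-Gaussian online update are assumptions chosen for illustration. Each categorical item (even a never-before-seen one) is hashed into a fixed set of buckets, and each bucket stores a diagonal-Gaussian posterior over its embedding rather than a point estimate, so the parameter count stays bounded while the shrinking posterior variance protects well-learned buckets from being overwritten.

```python
# Illustrative sketch only: a generic Bayesian-filtering-style recursion,
# not the paper's derived inference algorithm.
import hashlib
import numpy as np

class ProbabilisticHashEmbedding:
    def __init__(self, num_buckets: int = 2**16, dim: int = 32,
                 prior_var: float = 1.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.num_buckets = num_buckets
        # Posterior mean and (diagonal) variance per hash bucket.
        self.mean = rng.normal(0.0, 0.01, size=(num_buckets, dim))
        self.var = np.full((num_buckets, dim), prior_var)

    def _bucket(self, item: str) -> int:
        # Hash an arbitrary categorical item into a bounded universe of
        # buckets; the parameter count never grows with the vocabulary.
        h = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(h[:8], "little") % self.num_buckets

    def embed(self, item: str, sample: bool = True) -> np.ndarray:
        b = self._bucket(item)
        if not sample:
            return self.mean[b]
        # Reparameterized draw from the bucket's Gaussian posterior.
        eps = np.random.default_rng().normal(size=self.mean[b].shape)
        return self.mean[b] + np.sqrt(self.var[b]) * eps

    def bayes_update(self, item: str, target: np.ndarray,
                     obs_precision: float = 1.0):
        # Hypothetical conjugate-Gaussian online update: treat the incoming
        # signal as a noisy observation of the embedding and combine it
        # with the current posterior. Buckets with low variance (learned
        # from many past items) change little, mitigating forgetting.
        b = self._bucket(item)
        post_prec = 1.0 / self.var[b] + obs_precision
        self.mean[b] = (self.mean[b] / self.var[b]
                        + obs_precision * target) / post_prec
        self.var[b] = 1.0 / post_prec
```

In this toy version, `embed` would feed a downstream model on the stream and `bayes_update` would be called per observation; the actual PHE model couples these posteriors with the other latent variables described in the abstract.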

Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1985
