The Need for Tabular Representation Learning: An Industry Perspective

Joyce Cahoon; Alexandra Savelieva; Andreas C Mueller; Avrilia Floratou; Carlo Curino; Hiren Patel; Jordan Henkel; Markus Weimer; Nellie Gustafsson; Richard Wydrowski; Roman Batoukov; Shaleen Deep; Venkatesh Emani

Published: 21 Oct 2022, Last Modified: 16 May 2023

Venue: TRL @ NeurIPS 2022 Poster

Readers: Everyone

Keywords: Table Representation Learning

TL;DR: An industry perspective on open challenges in the domain of TRL

Abstract: The total addressable market for data applications has been estimated at \$70B. This includes the \$11B market for data integration, which is estimated to grow at 25% in the coming year; the \$35B market for analytics, growing at 11%; and the \$19B market for business intelligence, growing at 8% [1]. Given this data-driven future and the scale at which Microsoft operates, we survey PMs, engineers and researchers and synthesize their opinions on extracting insights from tabular data at scale. We see three main areas where tabular representation learning (TRL) can be leveraged:

Data insights. Enabling real-time analytics is one of the key priorities for Microsoft’s new intelligence platform [6], now that a converged environment exists to house any type of data. TRL models can help expose column- and table-level semantic annotations, relationships between columns and between tables, and advanced data patterns such as semantic-aware denial constraints [5].

Data management. From our internal workload telemetry, we know that 17.8% of tabular data across our virtual clusters remains unaccessed [9]. From an external perspective, leveraging telemetry from the Azure Observability Platform, we observed that of the 10B+ metrics generated, less than 0.1% are used [3]. Bringing this data to light requires sophisticated data discovery, data understanding and data integration capabilities. We believe TRL models can play an important role in tasks such as entity detection and deduplication, schema mapping, and data imputation.

Data movement. It is well known that data movement remains a key bottleneck in analytics [10]. To ensure that our users receive the best performance possible, investments have been made in smart caching policies, like those involving materialized views [4], as well as predicate operator pushdown [2]. Recent work [8] predicts various structural and performance properties of queries by pre-training encoder models on database workloads, but these strategies fail to consider the underlying tabular data. With TRL models, we can jointly pre-train on tables and their query plans to enhance our ability to understand and characterize workloads, and thus further efforts to reduce data movement.

Challenges and opportunities. Existing tabular models are mostly trained on Wikipedia tables and/or spreadsheets. However, in an enterprise setting, both customer data and its associated schema are often industry-specific. Access to customer data is typically not possible due to privacy regulations [11, 7], so training TRL models on such data is often not possible either. It remains an open question whether existing TRL models can be used successfully on domain-specific data. Along the same lines, the existence of large language models (LLMs), and Microsoft’s exclusive license to them, allows rapid prototyping of many applications, even on top of tabular data. We encourage the community to provide studies that compare the performance of LLMs and TRL models on some of the tasks mentioned above. Such systematic studies will be useful to application developers and product teams that are looking to incorporate more ML-based capabilities.
