Keywords: structural embedding, data preparation, tabular data file, csv
TL;DR: Learning to represent the structure of tabular data files
Abstract: Large amounts of tabular data are encoded in plain-text files, e.g., CSV, TSV and
TXT. Plain-text formats allow freedom of expression and encoding, fostering the
use of non-standard syntaxes and dialects. Before analyzing the content of such
files, it is necessary to understand their structure, e.g., recognize their dialect,
extract metadata, or detect tables. Previous work on table representation
has focused on learning the semantics of data cells, under the assumption that
the syntactic properties of a file are known to end users.
We propose MAGRiTTE, an approach to synthetically represent the structural features
of a data file. MAGRiTTE is a self-supervised machine learning model trained
to learn structural embeddings from data files. The architecture of MAGRiTTE
is composed of two components. The first is a transformer-encoder architecture,
based on BERT and pre-trained to learn row embeddings. The second is a
DCGAN-autoencoder trained to produce file-level embeddings. To pre-train the
transformer architecture on structural features, we propose two core adaptations: a
novel tokenization stage and specialized training objectives. To abstract the data
content of a file, we introduce “pattern tokenization”: assuming that structural
properties are identifiable through special characters, we reduce all
alphanumeric characters to a small set of
general patterns. After tokenization, input files are split into rows on newline
characters, and a percentage of the special-character tokens in each row is masked
before the row is fed to the row encoder model. The row transformer is then trained
on two objectives: reconstructing the masked tokens and identifying whether pairs of
rows belong to the same file. The row embeddings produced by this model are then
used as the input for the file embedding stage of MAGRiTTE. In this stage, the
generator and discriminator models are trained in an adversarial fashion on feature
maps built from the row embeddings. To obtain a file-wise embedding vector, we
concatenate the output features produced by all convolutional stages of the discriminator.
We evaluate the effectiveness of our learned structural representations on three
tasks to analyze unseen data files: (1) fine-grained dialect detection, i.e., identifying
the structural role of characters within rows, (2) line and cell classification, i.e.,
identifying metadata, comments, and data within a file, (3) table extraction, i.e.,
identifying the boundaries of tabular regions. We compare the use of MAGRiTTE
encodings with state-of-the-art approaches that were specifically designed for these
tasks. In future work, we aim to use MAGRiTTE embeddings to automatically
perform structural data preparation, e.g., extracting tables, removing unwanted
rows, or changing file dialects.
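
To make the pattern tokenization and masking steps concrete, here is a minimal Python sketch. The pattern vocabulary (L for letter runs, D for digit runs, S for whitespace runs), the masking ratio, and the [MASK] token are illustrative assumptions; the abstract only states that alphanumeric content is reduced to a small set of general patterns and that a percentage of the special-character tokens is masked.

```python
import re
import random

# Hypothetical pattern tokenization: content characters are abstracted into a
# small set of patterns, while special characters (delimiters, quotes, etc.)
# are kept verbatim. The concrete vocabulary (L, D, S) is an assumption.
def pattern_tokenize(line: str) -> list[str]:
    tokens = []
    for match in re.finditer(r"[A-Za-z]+|[0-9]+| +|.", line):
        run = match.group()
        if run[0].isalpha():
            tokens.append("L")      # run of letters
        elif run[0].isdigit():
            tokens.append("D")      # run of digits
        elif run[0] == " ":
            tokens.append("S")      # run of spaces
        else:
            tokens.append(run)      # special character, kept as-is
    return tokens

def mask_special_tokens(tokens: list[str], ratio: float = 0.15) -> list[str]:
    # Mask a percentage of the special-character tokens only, mirroring the
    # masked-token reconstruction objective described in the abstract.
    special = [i for i, t in enumerate(tokens) if t not in {"L", "D", "S"}]
    for i in random.sample(special, int(len(special) * ratio)):
        tokens[i] = "[MASK]"
    return tokens

row = '"Smith, John";1987;"New York"'
print(pattern_tokenize(row))
# ['"', 'L', ',', 'S', 'L', '"', ';', 'D', ';', '"', 'L', 'S', 'L', '"']
print(mask_special_tokens(pattern_tokenize(row), ratio=0.3))
```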
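The row encoder's two pre-training objectives (masked-token reconstruction and same-file pair classification) can be sketched as follows. The encoder size, the use of position 0 as a [CLS]-like token, the equal loss weighting, and computing the reconstruction loss over all positions (rather than only the masked ones) are simplifying assumptions for illustration, not MAGRiTTE's actual configuration.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL, MAX_LEN = 64, 256, 128  # illustrative sizes

class RowEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.mlm_head = nn.Linear(D_MODEL, VOCAB_SIZE)   # masked-token reconstruction
        self.pair_head = nn.Linear(D_MODEL, 2)           # same-file / different-file

    def forward(self, token_ids):
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.encoder(self.tok_emb(token_ids) + self.pos_emb(pos))
        # Position 0 plays the role of a [CLS] token for the row-pair objective.
        return self.mlm_head(h), self.pair_head(h[:, 0])

model = RowEncoder()
token_ids = torch.randint(0, VOCAB_SIZE, (2, MAX_LEN))    # two (row-pair) sequences
mlm_logits, pair_logits = model(token_ids)

# In real pre-training the reconstruction loss would be restricted to masked
# positions; random targets are used here only to show the shapes involved.
mlm_targets = torch.randint(0, VOCAB_SIZE, (2, MAX_LEN))
pair_targets = torch.tensor([1, 0])                       # 1 = rows from the same file
loss = nn.functional.cross_entropy(mlm_logits.transpose(1, 2), mlm_targets) \
     + nn.functional.cross_entropy(pair_logits, pair_targets)
loss.backward()
```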
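For the file-level embedding stage, the sketch below shows one way to concatenate features from all convolutional stages of a DCGAN-style discriminator. Arranging the row embeddings of a file as a 2-D feature map, the layer sizes, and the use of global average pooling to obtain fixed-length per-stage features are assumptions for illustration, not details taken from MAGRiTTE.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, channels=(1, 32, 64, 128)):
        super().__init__()
        # DCGAN-style stack of strided convolutions over the row-embedding map.
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
            )
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )
        self.head = nn.LazyLinear(1)  # real/fake score for adversarial training

    def forward(self, x):
        features = []
        for stage in self.stages:
            x = stage(x)
            # Global average pooling gives a fixed-length feature per stage.
            features.append(x.mean(dim=(2, 3)))
        score = self.head(x.flatten(1))
        # Concatenating per-stage features yields the file-wise embedding vector.
        file_embedding = torch.cat(features, dim=1)
        return score, file_embedding

# Example: a "file" of 128 rows, each with a 768-dimensional row embedding.
rows = torch.randn(1, 1, 128, 768)
score, emb = Discriminator()(rows)
print(emb.shape)  # torch.Size([1, 224])  (32 + 64 + 128 pooled features)
```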