Keywords: structural embedding, data preparation, tabular data file, csv
TL;DR: Learning to represent the structure of tabular data files
Abstract: Large amounts of tabular data are encoded in plain-text files, e.g., CSV, TSV and
TXT. Plain-text formats allow freedom of expression and encoding, fostering the
use of non-standard syntaxes and dialects. Before analyzing the content of such
files, it is necessary to understand their structure, e.g., recognize their dialect,
extract metadata, or detect tables. Previous work on table representation
has focused on learning the semantics of data cells, under the assumption that
the syntactic properties of a file are known to end users.
We propose MAGRiTTE, an approach to synthetically represent the structural features
of a data file. MAGRiTTE is a self-supervised machine learning model trained
to learn structural embeddings from data files. The architecture of MAGRiTTE
is composed of two components. The first is a transformer-encoder architecture,
based on BERT and pre-trained to learn row embeddings. The second is a
DCGAN-autoencoder trained to produce file-level embeddings. To pre-train the
transformer architecture on structural features, we propose two core adaptations: a
novel tokenization stage and specialized training objectives. To abstract the data
content of a file, we introduce “pattern tokenization”: assuming that structural
properties are identifiable through special characters, we reduce all
alphanumeric characters to a small set of
general patterns. After tokenization, input files are split into rows on newline
characters, and a percentage of the special-character tokens in each row is masked
before the row is fed to the row encoder model. The row transformer is then trained
on two objectives: reconstructing the masked tokens and identifying whether pairs of
rows belong to the same file. The row embeddings produced by this model are then
used as the input for the file embedding stage of MAGRiTTE. In this stage, the
generator and discriminator models are trained in an adversarial fashion on feature
maps built from the row embeddings. To obtain a file-wise embedding vector, we
concatenate the output features produced by all convolutional stages of the discriminator.
We evaluate the effectiveness of our learned structural representations on three
tasks to analyze unseen data files: (1) fine-grained dialect detection, i.e., identifying
the structural role of characters within rows, (2) line and cell classification, i.e.,
identifying metadata, comments, and data within a file, (3) table extraction, i.e.,
identifying the boundaries of tabular regions. We compare the use of MAGRiTTE
encodings with state-of-the-art approaches that were specifically designed for these
tasks. In future work, we aim to use MAGRiTTE embeddings to automatically
perform structural data preparation, e.g., extracting tables, removing unwanted
rows, or changing file dialects.
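
To make the pattern tokenization and masking steps concrete, here is a minimal Python sketch. The pattern vocabulary (L for letter runs, D for digit runs, S for whitespace runs), the masking ratio, and the [MASK] token are illustrative assumptions; the abstract only states that alphanumeric content is reduced to a small set of general patterns and that a percentage of the special-character tokens is masked.

```python
import re
import random

# Hypothetical pattern tokenization: content characters are abstracted into a
# small set of patterns, while special characters (delimiters, quotes, etc.)
# are kept verbatim. The concrete vocabulary (L, D, S) is an assumption.
def pattern_tokenize(line: str) -> list[str]:
    tokens = []
    for match in re.finditer(r"[A-Za-z]+|[0-9]+| +|.", line):
        run = match.group()
        if run[0].isalpha():
            tokens.append("L")      # run of letters
        elif run[0].isdigit():
            tokens.append("D")      # run of digits
        elif run[0] == " ":
            tokens.append("S")      # run of spaces
        else:
            tokens.append(run)      # special character, kept as-is
    return tokens

def mask_special_tokens(tokens: list[str], ratio: float = 0.15) -> list[str]:
    # Mask a percentage of the special-character tokens only, mirroring the
    # masked-token reconstruction objective described in the abstract.
    special = [i for i, t in enumerate(tokens) if t not in {"L", "D", "S"}]
    for i in random.sample(special, int(len(special) * ratio)):
        tokens[i] = "[MASK]"
    return tokens

row = '"Smith, John";1987;"New York"'
print(pattern_tokenize(row))
# ['"', 'L', ',', 'S', 'L', '"', ';', 'D', ';', '"', 'L', 'S', 'L', '"']
print(mask_special_tokens(pattern_tokenize(row), ratio=0.3))
```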
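The row encoder's two pre-training objectives (masked-token reconstruction and same-file pair classification) can be sketched as follows. The encoder size, the use of position 0 as a [CLS]-like token, the equal loss weighting, and computing the reconstruction loss over all positions (rather than only the masked ones) are simplifying assumptions for illustration, not MAGRiTTE's actual configuration.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL, MAX_LEN = 64, 256, 128  # illustrative sizes

class RowEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.mlm_head = nn.Linear(D_MODEL, VOCAB_SIZE)   # masked-token reconstruction
        self.pair_head = nn.Linear(D_MODEL, 2)           # same-file / different-file

    def forward(self, token_ids):
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.encoder(self.tok_emb(token_ids) + self.pos_emb(pos))
        # Position 0 plays the role of a [CLS] token for the row-pair objective.
        return self.mlm_head(h), self.pair_head(h[:, 0])

model = RowEncoder()
token_ids = torch.randint(0, VOCAB_SIZE, (2, MAX_LEN))    # two (row-pair) sequences
mlm_logits, pair_logits = model(token_ids)

# In real pre-training the reconstruction loss would be restricted to masked
# positions; random targets are used here only to show the shapes involved.
mlm_targets = torch.randint(0, VOCAB_SIZE, (2, MAX_LEN))
pair_targets = torch.tensor([1, 0])                       # 1 = rows from the same file
loss = nn.functional.cross_entropy(mlm_logits.transpose(1, 2), mlm_targets) \
     + nn.functional.cross_entropy(pair_logits, pair_targets)
loss.backward()
```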
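For the file-level embedding stage, the sketch below shows one way to concatenate features from all convolutional stages of a DCGAN-style discriminator. Arranging the row embeddings of a file as a 2-D feature map, the layer sizes, and the use of global average pooling to obtain fixed-length per-stage features are assumptions for illustration, not details taken from MAGRiTTE.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, channels=(1, 32, 64, 128)):
        super().__init__()
        # DCGAN-style stack of strided convolutions over the row-embedding map.
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
            )
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )
        self.head = nn.LazyLinear(1)  # real/fake score for adversarial training

    def forward(self, x):
        features = []
        for stage in self.stages:
            x = stage(x)
            # Global average pooling gives a fixed-length feature per stage.
            features.append(x.mean(dim=(2, 3)))
        score = self.head(x.flatten(1))
        # Concatenating per-stage features yields the file-wise embedding vector.
        file_embedding = torch.cat(features, dim=1)
        return score, file_embedding

# Example: a "file" of 128 rows, each with a 768-dimensional row embedding.
rows = torch.randn(1, 1, 128, 768)
score, emb = Discriminator()(rows)
print(emb.shape)  # torch.Size([1, 224])  (32 + 64 + 128 pooled features)
```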