- Keywords: Web tables, table type classification, table vectors, table embedding, web table clustering
- TL;DR: We introduce a method for generating table vector embeddings that enables clustering of web tables by structure and facilitates table type classification.
- Abstract: There are hundreds of millions of tables in web pages that contain useful information for many applications. Leveraging data within these tables is difficult because of the wide variety of structures, formats, and data encoded in them. We propose a weakly supervised method to classify web tables from a specific domain into five common categories: relational, entity, matrix, list, and non-data. In our method, we first calculate table vector embeddings on the table corpus in an unsupervised manner. This embedding space is then used to form meaningful clusters of tables, where each cluster represents a single category of tables. We evaluate our method on the table classification task in three scenarios: weakly supervised, supervised with a small training set, and supervised with a large training set. Our evaluations in four real-world domains show that the table vectors from our method perform well on table clustering and can leverage training data to achieve performance comparable to state-of-the-art systems.
- Archival status: Archival
- Subject areas: Machine Learning, Information Extraction
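
The weakly supervised pipeline outlined in the abstract (embed tables, cluster the embeddings, then treat each cluster as one table category) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the 2-D vectors are toy values standing in for real table embeddings, plain k-means stands in for whatever clustering the method actually uses, and the majority vote over a few labeled seed tables is one possible way to name clusters, not necessarily the authors' procedure.

```python
from collections import Counter
import math

def kmeans(vectors, init_centroids, iterations=10):
    """Plain Lloyd's k-means; returns (assignments, centroids)."""
    centroids = [list(c) for c in init_centroids]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid (Euclidean distance).
        assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(v, centroids[k]))
            for v in vectors
        ]
        # Move each centroid to the mean of its current members.
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignments) if a == k]
            if members:
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids

def label_clusters(assignments, seed_labels):
    """Map each cluster id to the majority label among its seed tables."""
    votes = {}
    for idx, label in seed_labels.items():
        votes.setdefault(assignments[idx], Counter())[label] += 1
    return {cluster: c.most_common(1)[0][0] for cluster, c in votes.items()}

# Toy 2-D "table vectors" (hypothetical values, not taken from the paper).
table_vectors = [
    (0.1, 0.2), (0.0, 0.1), (0.2, 0.0),   # one tight group
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),   # a second tight group
]
# Weak supervision: only two tables carry a category label.
seed_labels = {0: "relational", 3: "entity"}

assignments, _ = kmeans(
    table_vectors, init_centroids=[table_vectors[0], table_vectors[3]]
)
cluster_to_label = label_clusters(assignments, seed_labels)
predicted = [cluster_to_label.get(a, "unknown") for a in assignments]
print(predicted)
```

With only two labeled tables, all six toy tables receive a category because the label propagates through cluster membership; that is the sense in which clustering quality in the embedding space drives weakly supervised classification.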