
\documentclass{article} % For LaTeX2e
\usepackage{iclr2022_conference,times}

% Optional math commands from https://github.com/goodfeli/dlbook_notation.
\input{math_commands.tex}

\usepackage{hyperref}
\usepackage{url}
\usepackage{graphicx}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{lipsum}
\usepackage{hyperref}
\usepackage{amsmath}
% \usepackage{gensymb}
\usepackage{adjustbox}
\usepackage{soul}
\usepackage{multirow}
% \usepackage[table]{xcolor}


\title{DCoM: A Deep Column Mapper for Semantic Data Type Detection}

% Authors must not appear in the submitted version. They should be hidden
% as long as the \iclrfinalcopy macro remains commented out below.
% Non-anonymous submissions will be rejected without review.

\author{Antiquus S.~Hippocampus, Natalia Cerebro \& Amelie P. Amygdale \thanks{ Use footnote for providing further information
about author (webpage, alternative address)---\emph{not} for acknowledging
funding agencies.  Funding acknowledgements go at the end of the paper.} \\
Department of Computer Science\\
Cranberry-Lemon University\\
Pittsburgh, PA 15213, USA \\
\texttt{\{hippo,brain,jen\}@cs.cranberry-lemon.edu} \\
\And
Ji Q. Ren \& Yevgeny LeNet \\
Department of Computational Neuroscience \\
University of the Witwatersrand \\
Joburg, South Africa \\
\texttt{\{robot,net\}@wits.ac.za} \\
\AND
Coauthor \\
Affiliation \\
Address \\
\texttt{email}
}

% The \author macro works with any number of authors. There are two commands
% used to separate the names and addresses of multiple authors: \And and \AND.
%
% Using \And between authors leaves it to \LaTeX{} to determine where to break
% the lines. Using \AND forces a linebreak at that point. So, if \LaTeX{}
% puts 3 of 4 authors names on the first line, and the last on the second
% line, try using \AND instead of \And before the third author name.

\newcommand{\fix}{\marginpar{FIX}}
\newcommand{\new}{\marginpar{NEW}}

%\iclrfinalcopy % Uncomment for camera-ready version, but NOT for submission.
\begin{document}


\maketitle

\begin{abstract}
Detection of semantic data types is a very crucial task in data science for automated data cleaning, schema matching, data discovery, semantic data type normalization and sensitive data identification. Existing methods include regular expression-based or dictionary lookup-based methods that are not robust to dirty as well unseen data and are limited to a very less number of semantic data types to predict. Existing Machine Learning methods extract a large number of engineered features from data and build logistic regression, random forest or feedforward neural network for this purpose. In this paper, we introduce DCoM, a collection of multi-input NLP-based deep neural networks to detect semantic data types where instead of extracting a large number of features from the data, we feed the raw values of columns (or instances) to the model as texts. We train DCoM on 686,765 data columns extracted from the VizNet corpus with 78 different semantic data types. DCoM outperforms other contemporary results with a quite significant margin on the same dataset achieving a support-weighted F1 score of 0.925. 
\end{abstract}

\section{Introduction}
Robust data preprocessing pipeline enables organizations to make data-driven business decisions. Most unstructured and semi-structured data after initial pre-processing is available in tabular format for further processing. These tabular datasets have to go through a process of data curation and quality check before consumption. Companies apply standard data quality checks/rules on business-critical columns to access and maintain the quality of data. Additionally, some of these datasets may contain sensitive information e.g. Protected Health Information (PHI), Personally Identifiable Information(PII), etc and need to be masked/de-identified i.e. identifying all columns which are Social Security Numbers, First Names, telephone numbers, etc. The first step in this data curation and quality check process is column level semantic tagging/mapping. There have been many attempts by \cite{Ramnandan}, \cite{Limaye}, \cite{Puranik2012ASA}, \cite{Hulsebos}, and \cite{sato} to automate this process. 
Traditionally, Semantic tagging is usually done using handcrafted rule-based systems e.g. tools from \cite{google}, \cite{powerbi}, etc. In some cases, column descriptions are also available. However, as the data grows exponentially, the cost associated with maintaining a rule-based system also increases. The semantic data type tagging is still largely manual where data stewards manually scan through databases and map columns of interest.
Most ML-based approaches e.g. \cite{sato}, \cite{Hulsebos} and \cite{Pham} use corresponding metadata, atomic datatypes, column description along with handcrafted features as input data for training the model.

We herein, propose \textbf{DCoM}-\textbf{D}eep \textbf{Co}lumn \textbf{M}apper; a generic collection of Deep learning based Semantic mappers to map the columns to semantic data types.
These models take the raw values of the columns (or instances) as inputs considering those as texts and build NLP-based deep learning architectures to predict the semantic type of the columns (or instances). We also extracted 19 engineered features and used those as auxiliary features to DCoM models. We refrained from extracting a large number of features because we wanted to make the DCoM models learn those features along with some interesting high-level features on their own, for better semantic data type detection. As we are exploring NLP in this problem, we make the leverage of using advanced NLP-based deep learning layers or models such as Bi-LSTM, BERT, etc.


\section{Related Work}
\cite{powerbi} Power BI and \cite{google} Data Studio use some regular expression based patterns for dictionary looks ups of column headers and values to detect a limited number of semantic data types. \cite{Venetis} build a database of value-type mappings, then assign semantic types using a maximum likelihood estimator based on column values. \cite{Syed2010ExploitingAW} use column values and headers to build a Wikitology query to map columns to semantic classes. \cite{Ramnandan} use heuristics to first separate numerical and textual types, then describe those types using the Kolmogorov-Smirnov (K-S) test and Term Frequency-Inverse Document Frequency (TF-IDF), respectively. \cite{Pham} use slightly more features, including the Mann-Whitney test for numerical data and Jaccard similarity for textual data, to train logistic regression and random forest models. \cite{Goel12k.:exploiting} use conditional random fields to predict the semantic type of each value within a column, then combine these predictions into a prediction for the whole column. \cite{Limaye} use probabilistic graphical models to annotate values with entities, columns with types, and column pairs with relationships. \cite{Puranik2012ASA} proposes a “specialist approach” combining the predictions of regular expressions, dictionaries, and machine learning models. More recently, \cite{yan} introduced a system that, given a search keyword and set of positive examples, synthesizes type detection logic from open source GitHub repositories. \cite{Hulsebos} built a multi-input deep neural network model for detecting 78 semantic data types, extracting 686, 765 data columns from VizNet corpus from \cite{viznet}. They extracted a total of 1,588 features for each column to train the model, thus resulting in a support-weighted F1 score of 0.89, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations. \cite{sato} introduced SATO, a hybrid machine learning model to automatically detect the semantic types of columns in tables, exploiting the signals from the context as well as the column values. It combines a deep learning model trained on a large-scale table corpus with topic modelling and structured prediction to achieve support-weighted and macro average F1 scores of 0.925 and 0.735, respectively, exceeding the state-of-the-art performance by a significant margin. Recently \citet{turl} used a structure-aware Transformer encoder to capture semantics and knowledge in large-scale data. \citet{tcn} proposed Table Convolution Network and \citet{doduo} proposed a multi-task learning approach (called Doduo) which outperform the existing results in semantic data type detection. \citet{tabbie} devised a simple pre-training objective (corrupt cell detection) that learns exclusively from tabular data and reaches the state-of-the-art on a suite of table-based prediction tasks.

While in this paper we do not experiment with the context information of the column values, our work is mostly aligned to \cite{Hulsebos}. Keeping that in mind, we are planning to research and implement the context of the column values in tables in our DCoM models for our future works. The paper is organized as follows: Section \ref{sec:data} describes the data used for DCoM models. In Section \ref{sec:method} we present the details of data preparation and architecture for DCoM models. Section \ref{sec:training} contains the training, evaluation and inference procedures of DCoM models while Section \ref{sec:results} discussed about the results of extensive experiments. In Section \ref{sec:limitations}, we talk about some known limitations from the data as well as application point of view and finally, in Section \ref{sec:conclusion}, we present the concluding remarks and some future directions.

\section{Data}
\label{sec:data}

We have used the dataset prepared by \cite{Hulsebos} and compared our model performance with them considering their results to be baselines. This dataset contains 686,765 instances with 78 unique classes (or semantic data types). It is divided into 60/20/20 training/validation/testing splits. The instance count for classes varies from 584 (affiliate class) to 9088 (type class). The count distribution for classes is shown in Figure \ref{fig:class_distribution} of Appendix \ref{sec:appendix_dataset}. To provide a more clear picture of the data a sample of the dataset is also shown in Table \ref{tab:sample_data} of Appendix \ref{sec:appendix_dataset}.



\section{Proposed  Method}
\label{sec:method}
Prior approaches \citep{Venetis, Limaye} to semantic type detection trained models, such as logistic regression, decision trees, feedforward neural network \citep{Hulsebos}, extracting various features from the data. We, on the other hand, treated the data like natural language (text) and used the data itself as the input to the model. We used a very small number of hand-engineered features unlike \cite{Hulsebos}. Therefore, our DCoM models have two inputs, the values of the instances as text or natural language and hand-engineered features. We present two types of DCoM models based on the way we feed the text input to the model.

\subsection{DCoM with Single Sequence Input}
In this subsection, we discuss how the values of each instance can be fed to the model as a single sequence input to the DCoM. Considering the scenario, we can not simply pass the inputs separating values of each input with any separator token. This gives the model wrong information about the sequence of the data resulting in faulty training and poor performance on the unseen data. For example, if we consider the last example from Table \ref{tab:sample_data} and create the input, \texttt{Deletes the property <SEP> Lets you edit the value of the property <SEP> Script execution will be stopped} for the model, the model gets to learn that the value \texttt{Lets you edit the value of the property} has a relative position between \texttt{Deletes the property} and \texttt{Script execution will be stopped}, which is quite wrong because the values in an instance do not have any relative position between them. 

To mitigate the problem we introduce permutations from mathematics. With permutation, we order $r$ items from the set of $n_i$ items. Here $n_i$ items are all the values of an instance $i$, and $r \in [1, n]$. Doing so, the above instance can be broken down into multiple subsets. A sample of the subset is shown in Table \ref{tab:permutation}. If we feed all the new instances of the subset to the model, it does not learn any relative positional information of \texttt{Lets you edit the value of the property} with respect to other values in that instance unlike earlier. Therefore, for the value \texttt{Deletes the property}, the model only learns the relative positions of the tokens e.g. \texttt{Deletes, the, property,} etc in a value, but not the relative position of the values in an instance. This helps the model getting the actual information present in the data for accurate prediction on the output. This permutation method also helps in augmenting new instances which help in training the data-hungry deep learning models with enriched data. It is not feasible to generate all the possible permutations before the training because of the huge number of subsets. Instead, during training, we sample $r$ between $1$ and $n$ for each of the instances and generate one permutation for each instance. More than one instance can be generated, but training the model for multiple epochs will result same for both cases.


\begin{table}[ht]
	\centering
	\begin{adjustbox}{width=0.9\columnwidth,center}
		\begin{tabular}{cc}
			\hline
			\textbf{New Instance} & \textbf{class} \\ \hline
			\texttt{Deletes the property} & description \\ 
			\texttt{Lets you edit the value of the property} & description \\ 
			\texttt{Script execution will be stopped} & description \\ 
			\texttt{Deletes the property <SEP> Lets you edit the value of the property} & description \\ 
			\texttt{Lets you edit the value of the property <SEP> Script execution will be stopped} & description \\ 
			\texttt{Deletes the property <SEP> Lets you edit the value of the property <SEP> Script execution will be stopped} & description \\ 
			\texttt{Deletes the property <SEP> Script execution will be stopped <SEP> Lets you edit the value of the property} & description \\ 
			\texttt{Lets you edit the value of the property <SEP> Deletes the property <SEP> Script execution will be stopped} & description \\ \hline
			
		\end{tabular}
	\end{adjustbox}
	\caption{Permutations on the values of a single instance from class \texttt{description}.}
	\label{tab:permutation}
\end{table}

\subsection{DCoM with Multiple Sequences Inputs}
Input preparation with this method is straightforward compared to the earlier method. In this method, we also use permutations to generate new instances, but the value of $r$ is fixed during training and it is used as a hyper-parameter, where $r \in [1, \infty)$. Once we decide the value of $r$, $r$ number of inputs are used as text inputs to the model. Aggregation of embedding vectors of the inputs are performed once they are generated using the shared embedding weights. If $r > n$, where $n$ is the number of values of an instance, then this scenario can be handled in two ways. The first way is to pad $r-n$ inputs and aggregate the embedding vectors only for the non-padded inputs. Another way is while generating new instances use permutation with replacement to always sample $r$ values out of $n$ values. Therefore, padding is not required for the latter method. We tried both approaches and did not observe any significant difference in the result.

We used 19 out of 27 global statistical features from \cite{Hulsebos} as our engineered features for the DCoM models because we are only considering the unique values of columns. These features are normalized before feeding to the model. The complete list of these features is shown in Table \ref{tab:feat_imp}. Once the text and engineered inputs are prepared, we attach LSTM/Transformer/BERT layers to the text input. The output of the earlier layers is aggregated with the engineered inputs. We use some feed-forward layer after that. Finally, we use one \texttt{softmax} layer with 78 units to get the probability of each of the classes as output. This is our generalized architecture design of DCoM models. Extensive hyper-parameter tuning is performed to finalize the number of layers, number of units in a layer and many other hyper-parameters in the model. This topic is discussed in detail in section \ref{sec:training}. To name the DCoM models, the type of the text input and the name of the deep learning layers are used as suffixes with the name \texttt{DCoM}, e.g. DCoM model with single instance input and LSTM layers are named as \texttt{DCoM-Single-LSTM}. The architecture design of DCoM models for both single sequence and multiple sequences inputs are shown in Figure \ref{fig:arch}.

\begin{figure}[ht]
	\centering
	\includegraphics[width=0.8\linewidth]{images/arch.png}
	\caption{Architecture diagram of (a) \texttt{DCoM-Single} and (b) \texttt{DCoM-Multi} models.}
	\label{fig:arch}
\end{figure}


\section{Training, Evaluation and Inference}
\label{sec:training}

We trained our model on the train dataset, validated it on the validation dataset and finally reported our results on the test dataset. As class imbalance is present, like \cite{Hulsebos}, we also evaluated our model performance using the average F1-score, weighted by the number of columns per class in the test set (i.e., the support). 

The DCoM models are trained in Tensorflow 2 \citep{tensorflow2015-whitepaper} and the hyperparameters are tuned using \texttt{keras-tuner} \citep{kerastuner}. For inputs of DCoM models, we tried permutations with replacement as well as without replacement, but we did not observe any significant difference in outputs. The value of $r$ for which we observed best performance for \texttt{DCoM-Multi} models is $45$. For tokenizing text inputs, we experimented with character-based, word-based and BERT Wordpiece \citep{wordpiece} tokenizers, and we found out BERT Wordpiece tokenizer stood out to be working better with respect to the other tokenizers because of the obvious reasons stated in the paper \citep{wordpiece}. We compared our result with and without pre-trained embedding weights. It is observed that, though in the final output there is no considerable difference, training with pre-trained embedding weights take approximately 40\% less amount of time to converge. We experimented with the small, base and large variations of several BERT architectures and finalized \texttt{DistilBERT-base} \citep{distilbert} and \texttt{Electra-small} \citep{electra} based on their performances. We used Bi-directional LSTM layers in all the \texttt{DCoM-LSTM} variants. For \texttt{Dropout} layers we used $0.3$ as our dropout rate. We experimented \texttt{mean, sum, concatenation} and \texttt{weighted sum} functions for aggregation, but there was not any significant difference in outputs based on these variations. We used \texttt{Adam} \citep{adam} optimizer with initial learning rate $10^{-4}$. Along with this, we implement a learning rate reducer with a factor of 0.5, which reduces the learning rate by 50\% if the model performance does not improve on the validation dataset after consecutive 5 epochs. Assigning class weights does not have much effect on the test results. As this is a multi-class (78 classes) classification problem, the DCoM models are trained with \texttt{categorical-crossentropy} loss and validated with \texttt{accuracy} and \texttt{average F1} score metric.

We tried two approaches for inference on the test dataset. In the first approach, we performed single time inference on each of the instances of the test dataset. While doing so, we set $r$ to be the total number of values ($n_i$) for each instance for \texttt{DCoM-Single} models to perform inference. For \texttt{DCoM-Multi}, as $r$ values if prefixed, some values will be truncated for inputs with total values $n > r$. The other approach is to generate $k$ (where, $k > 1$) instances with permutation, sampling $r$ values $k$ times between 1 and $n_i$, where $n_i$ is the total number of values for instance $i$. Therefore, we get $k$ class predictions for each of the instances and finally with majority voting we pick the prediction class for each instance. We used $k$ value to be 10 in our case and it is observed that for both \texttt{DCoM-Single} and \texttt{DCoM-Multi}, we observed $0.2-0.5\%$ improvement in the test \texttt{average F1} score, but this improvement comes with the price of increased inference time by approximately $k$ times. 

\section{Results}
\label{sec:results}

We compared DCoM against Sherlock and other types of models shown by \cite{Hulsebos} along with SATO \citep{sato}, TURL \citep{turl}, TABBIE \citep{tabbie}, DODUO \citep{doduo} and TCN \citep{tcn} assuming those to be our baseline models. All the models are compared on the dataset used by \citet{Hulsebos} in Sherlock paper. Table \ref{tab:results} presents the comparison of results on the test dataset for each of the models. Columns \texttt{Engineered Features} and \texttt{k} are specific to DCoM models. \texttt{Engineered Features} says whether the 19 engineered features are used with the text inputs while training the model. \texttt{k} column tells the number of times inference is performed on a single test instance. It is discussed in detail in section \ref{sec:training}. \texttt{Runtime} column is the average time in seconds needed to extract features and generate a prediction for a single sample, and \texttt{Size} column reports the space required by the trained models in MB. From the table, it is seen that several DCoM models outperform all the baseline models in both F1 score and inference run time with significant margins. 

\cite{Hulsebos} extracted various features e.g. global statistics, character-level distributions, word embeddings, paragraph vectors from the data and used those features to fit a feedforward neural network model. On the other hand, we treated the data as texts and feed those to more advanced NLP-based models. This allows the DCoM models to extract and learn more useful features that \cite{Hulsebos} were unable to extract using hand-engineering. Along with this, hand-engineering takes a considerable amount of time to calculate the features which increase the inference time of Sherlock \citep{Hulsebos} by 3 to 20 times with respect to DCoM models.

\begin{table}[htp]
	\centering
	\begin{adjustbox}{width=0.9\columnwidth,center}
	\begin{tabular}{lcrcrr}
		\hline
		\textbf{Method} & \textbf{Engineered Features} & \textbf{k} & \textbf{F1 Score} & \textbf{Runtime (s)} & \textbf{Size (MB)} \\
		\hline
		\texttt{DCoM-Single-LSTM} & Yes & 1 & 0.895 & 0.019 $\pm$ 0.01 & 112.1\\
		\texttt{DCoM-Single-LSTM} & Yes & 10 & 0.898 & 0.152 $\pm$ 0.01 & 112.1\\
		\texttt{DCoM-Single-LSTM} & No & 1 & 0.871 & 0.018 $\pm$ 0.01 & 97.8\\
		\texttt{DCoM-Single-LSTM} & No & 10 & 0.877 & 0.141 $\pm$ 0.01 & 97.8\\
		\texttt{DCoM-Multi-LSTM} & Yes & 1 & 0.878 & 0.046 $\pm$ 0.01 & 4.7\\
		\texttt{DCoM-Multi-LSTM} & Yes & 10 & 0.881 & 0.416 $\pm$ 0.01 & 4.7\\
		\texttt{DCoM-Multi-LSTM} & No & 1 & 0.869 & 0.044 $\pm$ 0.01 & 4.6\\
		\texttt{DCoM-Multi-LSTM} & No & 10 & 0.871 & 0.401 $\pm$ 0.01 & 4.6\\
		\texttt{DCoM-Single-DistilBERT} & Yes & 1 & \textbf{0.922} & 0.162 $\pm$ 0.01 & 268.2\\
		\texttt{DCoM-Single-DistilBERT} & Yes & 10 & \textbf{0.925} & 1.552 $\pm$ 0.01 & 268.2\\
		\texttt{DCoM-Single-DistilBERT} & No & 1 & 0.901 & 0.158 $\pm$ 0.01 & 202.3\\
		\texttt{DCoM-Single-DistilBERT} & No & 10 & \textbf{0.904} & 1.492 $\pm$ 0.01 & 202.3\\
		\texttt{DCoM-Single-Electra} & Yes & 1 & \textbf{0.907} & 0.093 $\pm$ 0.01 & 53.1\\
		\texttt{DCoM-Single-Electra} & Yes & 10 & \textbf{0.909} & 0.894 $\pm$ 0.01 & 53.1\\
		\texttt{DCoM-Single-Electra} & No & 1 & 0.890 & 0.092 $\pm$ 0.01 & 45.7\\
		\texttt{DCoM-Single-Electra} & No & 10 & 0.892 & 0.887 $\pm$ 0.01 & 45.7\\
		\hline
		SATO \citep{sato} & - & - & 0.891 & 0.51 $\pm$ 0.01 & 25.7 \\
		TABBIE \citep{tabbie} & - & - & 0.897 & 0.058 $\pm$ 0.01 & 613.4 \\
		DODUO \citep{doduo} & - & - & 0.904 & 0.020 $\pm$ 0.01 & 155.8 \\
		TCN \citep{tcn} & - & - & 0.897 & 0.017 $\pm$ 0.01 & 108.7 \\
		TURL \citep{turl} & - & - & 0.894 & 0.017 $\pm$ 0.01 & 110.2 \\
		Sherlock \citep{Hulsebos} & - & - & 0.890 & 0.42 $\pm$ 0.01 & 6.2 \\
		Decision tree \citep{Hulsebos} & - & - & 0.760 &  0.26 $\pm$ 0.01 &  59.1 \\
		Random Forest \citep{Hulsebos} & - & - & 0.840 &  0.26 $\pm$ 0.01 &  760.4 \\
		Dictionary \citep{Hulsebos} & - & - & 0.160 &  0.01 $\pm$ 0.03 &  0.5 \\
		Regular expression \citep{Hulsebos} & - & - & 0.040 &  0.01 $\pm$ 0.03 &  0.01 \\
		Consensus \citep{Hulsebos} & - & - & 0.320 &  33.74 $\pm$ 0.86 &  - \\
		\hline
	\end{tabular}
	\end{adjustbox}
	\caption{Support-weighted F1 score, runtime at prediction, and size of DCoM and other benchmarks}
	\label{tab:results}
\end{table}


\subsection{Performance for Individual Types}
Following \cite{Hulsebos}, we also prepared Table \ref{tab:top_bottom_f1} which displays the top and bottom five types, as measured by the F1 score achieved by the best performing DCoM model, \texttt{DCoM-Single-DistilBERT} for single inference ($k=1$) for that type. . High performing classes such as grades, industry, ISBN, etc. contain a finite set of valid values which help DCoM models to extract distinctive features with respect to the other classes. To understand types for which \texttt{DCoM-Single-DistilBERT} performs poorly, we include incorrectly predicted examples for the lowest precision type (Rank) and the lowest recall type (Ranking) in Table \ref{tab:low_precision_recall}. From the table, it is observed that purely numerical values or values appearing in multiple classes, cause a challenge for the DCoM models to correctly classify the semantic type of the data. This issue is discussed in detail in section \ref{sec:limitations}.

%\begin{table}
%	\centering
%	\begin{tabular}{lcccc}
%		\hline
%		\textbf{Type} & \textbf{F1 Score} & \textbf{Precision} & \textbf{Recall} & \textbf{Support}\\
%		\hline
%		ISBN & 0.988 & 0.999 & 0.993 & 1430 \\
%		Grades  & 0.988 & 0.995 & 0.992 & 1765 \\
%		Birth Date & 0.987 & 0.985 & 0.986 & 479 \\
%		Jockey & 0.983 & 0.988 & 0.986 & 2817 \\
%		Industry & 0.984 & 0.984 & 0.984 & 2958 \\
%		\hline
%	\end{tabular}
%	\caption{Top five types by F1 score on the test dataset for \texttt{DCoM-Single-DistilBERT} model variant.}
%	\label{tab:top_f1}
%\end{table}


%\begin{table}
%	\centering
%	\begin{tabular}{lcccc}
%		\hline
%		\textbf{Type} & \textbf{F1 Score} & \textbf{Precision} & \textbf{Recall} & \textbf{Support}\\
%		\hline
%		Person & 0.813 & 0.596 & 0.688 & 579 \\
%		Rank  & 0.586 & 0.764 & 0.663 & 2978 \\
%		Director & 0.702 & 0.549 & 0.616 & 225 \\
%		Sales & 0.667 & 0.435 & 0.526 & 322 \\
%		Ranking & 0.695 & 0.348 & 0.464 & 439 \\
%		\hline
%	\end{tabular}
%	\caption{Bottom five types by F1 score on the test dataset for \texttt{DCoM-Single-DistilBERT} model variant.}
%	\label{tab:bottom_f1}
%\end{table}


\begin{table}
	\centering
	\begin{tabular}{l|lcccc}
		\hline
		&\textbf{Type} & \textbf{F1 Score} & \textbf{Precision} & \textbf{Recall} & \textbf{Support}\\
		\hline
		\multirow{5}{*}{Top 5 Types} & ISBN & 0.988 & 0.999 & 0.993 & 1430 \\
		& Grades  & 0.988 & 0.995 & 0.992 & 1765 \\
		& Birth Date & 0.987 & 0.985 & 0.986 & 479 \\
		& Jockey & 0.983 & 0.988 & 0.986 & 2817 \\
		& Industry & 0.984 & 0.984 & 0.984 & 2958 \\
		\hline
		\multirow{5}{*}{Bottom 5 Types} & Person & 0.813 & 0.596 & 0.688 & 579 \\
		& Rank  & 0.586 & 0.764 & 0.663 & 2978 \\
		& Director & 0.702 & 0.549 & 0.616 & 225 \\
		& Sales & 0.667 & 0.435 & 0.526 & 322 \\
		& Ranking & 0.695 & 0.348 & 0.464 & 439 \\
		\hline
	\end{tabular}
	\caption{Top and bottom five types by F1 score on the test dataset for \texttt{DCoM-Single-DistilBERT} model variant.}
	\label{tab:top_bottom_f1}
\end{table}



%\begin{table}
%	\centering
%	\begin{tabular}{lll}
%		\hline
%		\textbf{Examples} & \textbf{True Type} & \textbf{Predicted Type}\\
%		\hline
%		1, 5, 4, 3, 2 & Day & Rank\\
%		1, 2, 3, 4, 5, 6, 7, 8, 9, 10  & Region & Rank \\
%		1, 2, 3 & Position & Rank \\
%		\hline
%	\end{tabular} 
%	\caption{Examples of low precision types on the test dataset for \texttt{DCoM-Single-DistilBERT} model variant.}
%	\label{tab:low_precision}
%\end{table}



%\begin{table}
%	\centering
%	\begin{tabular}{lll}
%		\hline
%		\textbf{Examples} & \textbf{True Type} & \textbf{Predicted Type}\\
%		\hline
%		41, 2, 36 & Ranking & Rank\\
%		0, 2, 4  & Ranking & Plays \\
%		1, 2, 3, 4, 5, 6, 7, 8, 9, 10 & Ranking & Rank \\
%		\hline
%	\end{tabular}
%	\caption{Examples of low recall types on the test dataset for \texttt{DCoM-Single-DistilBERT} model variant.}
%	\label{tab:low_recall}
%\end{table}


\begin{table}
	\centering
	\begin{tabular}{l|lcc}
		\hline
		&\textbf{Examples} & \textbf{True Type} & \textbf{Predicted Type}\\
		\hline
		\multirow{3}{*}{Low Precision} & 1, 5, 4, 3, 2 & Day & Rank\\
		&1, 2, 3, 4, 5, 6, 7, 8, 9, 10  & Region & Rank \\
		&1, 2, 3 & Position & Rank \\
		\hline
		\multirow{3}{*}{Low Recall} & 41, 2, 36 & Ranking & Rank\\
		& 0, 2, 4  & Ranking & Plays \\
		& 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 & Ranking & Rank \\
		\hline
	\end{tabular} 
	\caption{Examples of low precision and low recall types on the test dataset for \texttt{DCoM-Single-DistilBERT} model variant.}
	\label{tab:low_precision_recall}
\end{table}




\subsection{Feature Importance}
To calculate the feature importance of 19 engineered features used in trained \texttt{DCoM-Single-DistilBERT} model, we extracted the learned feature weight matrix, $W$ from the \texttt{dense} layer used after the engineered features input, where $W \in R^{19 \times D}$. Here $D$ is the number of units used in the \texttt{dense} layer. $W$ contains the weights/contribution of each of the 19 features on each of the $D$ units, thus forming a $19 \times D$ array, which contains both positive and negative values based on the direction of contribution. We take the absolute value of all the elements of $W$, as we are only interested in the amount of contribution of each of the engineered features, not the direction. After that, we take the mean across the 19 features that result in a 19-dimensional array. We normalized the array by dividing the maximum value of the array by each of the elements. Table \ref{tab:feat_imp} enlists the importance of the engineered features in decreasing order. The feature importance score of the engineered features changes with the different variants of DCoM models but the rank of the top 10 features remain intact. %From the table it is seen that \texttt{Std of \# of   Numeric Characters in Cells, Std of \# of   Alphabetic Characters in Cells} and \texttt{Entropy} are the top 3 important features, where \texttt{Mean \# Alphabetic   Characters in Cells, Median Length of   Values} and \texttt{Mode Length of   Values} are the least 3 important features for the \texttt{DCoM-Single-DistilBERT} model.


\begin{table}[htbp]
	\centering
	\begin{tabular}{lll}
		\hline
		\textbf{Rank} & \textbf{Aux Features}                                     & \textbf{Score} \\
		\hline
		1    & Std of \# of   Numeric Characters in Cells       & 1.00  \\
		2    & Std of \# of   Alphabetic Characters in Cells    & 0.63  \\
		3    & Entropy                                   & 0.63  \\
		4    & Std of \# of   Special Characters in Cells       & 0.62  \\
		5    & Std of \# of Words   in Cells                    & 0.53  \\
		6    & Mean \# Words in   Cells                         & 0.50  \\
		7    & Mean \# of Numeric   Characters in Cells         & 0.48  \\
		8    & Minimum Value   Length                           & 0.45  \\
		9    & Kurtosis of the   Length of Values               & 0.41  \\
		10   & Mean \# Special   Characters in Cells            & 0.39  \\
		11   & Number of Values                                 & 0.34  \\
		12   & Fraction of Cells   with Alphabetical Characters & 0.33  \\
		13   & Fraction of Cells   with Numeric Characters      & 0.31  \\
		14   & Sum of the Length   of Values                    & 0.31  \\
		15   & Maximum Value   Length                           & 0.30  \\
		16   & Skewness of the   Length of Values               & 0.28  \\
		17   & Mean \# Alphabetic   Characters in Cells         & 0.28  \\
		18   & Median Length of   Values                        & 0.27  \\
		19   & Mode Length of   Values                          & 0.20 \\
		\hline
	\end{tabular}
	\caption{Feature importance of 19 engineered features used in \texttt{DCoM-Single-DistilBERT} model.}
	\label{tab:feat_imp}
\end{table}





\section{Known Limitations}
\label{sec:limitations}

The major limitation of the process is with the pre-processed data prepared by \cite{Hulsebos} used to train DCoM models. The same values of instances are present in multiple classes, thus resulting in strong class overlapping among some classes. It makes the models confusing during training to learn distinctive features for proper classification. Therefore, for some classes in the test dataset, the models perform very poorly. Table \ref{tab:overlap} presents a sample of examples of class overlapping in the data prepared by \cite{Hulsebos}. It is to be noted that the proportion of this class overlap is considerably large in numbers with respect to the total number of instances. Therefore, it affects the training as well as the model performance on the test dataset by a significant margin. Besides the class overlap, faulty or wrong values are present in some classes.

From the application point of view of semantic data type detection models, identifying and preparing a good dataset for training is very challenging as well as time-consuming. Non-standardized column names are a major challenge large organizations face in doing semantic detection. For example, \texttt{social security numbers} can be called as \texttt{ssn, ssn\_id, soc\_sec\_bnr} etc. Non-standardized data columns also pose a major challenge in information ambiguity. For example, \texttt{Gender} column can have values \texttt{Male, Female, unknown}, or \texttt{0,1,2}. Some organization can have mixed attributes which poses a major challenge in data security. For example, use of PII data like \texttt{Social security Number} in \texttt{Customer ID} columns. Some other challenges like corrupt/missing metadata also exist.
\begin{table}
	\centering
	\begin{tabular}{ll}
		\hline
		\textbf{Values} & \textbf{Class} \\
		\hline
		F, M & gender \\
		M & sex \\
		%F & sex \\
		M & gender \\
		M, F & sex\\
		\hline
		1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 & ranking \\
		1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14  & position \\
		1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 & rank \\
		%1, 3, 7, 10, 12, 15, 18 & position \\
		%130, 109, 50, 55, 29 & rank \\
		%65, 1, 39, 7, 41, 9, 106, 76, 12, 116 & ranking \\
		\hline
		Fresno State, Oregon, UCLA, South Carolina & team \\
		%Southern Maryland, yucatan & team name \\
		Michigan, Indiana, Wisconsin, Purdue & team name \\
		%Austria & team \\
		\hline
		%Terry Cater, Dave Hacker, Kevin Claborn & owner \\
		%Ella Jenkins, Joel Mabus, Andrew Bird & artist \\
		%Santi Mora, Chino Copello, Nacho La Selva & name \\
		%Carol James, Pauline Bishop, Linda Garand & person \\
		%\hline
	\end{tabular}
	\caption{A sample example of class overlapping of the data used to build up models}
	\label{tab:overlap}
\end{table} 

\section{Conclusion}
\label{sec:conclusion}
DCoM presents a novel permutation-based method by which the instances can be fed to the deep learning models directly as natural language. This takes the leverage of using more advanced NLP-based layers/models unlike feedforward neural networks in Sherlock \citep{Hulsebos}. The permutation-based method also helps in generating a large number of new instances from the existing ones, helping DCoM models to boost their performance effectively, thus outperforming Sherlock \citep{Hulsebos} and other recent works by quite a significant margin in both F1 scores as well as inference time. We also present an ensembling approach during inference time which improves the test \texttt{average F1} score by $0.2 - 0.5\%$. 

As mentioned earlier, the next step of it to introduce the context of the column values to DCoM models while predicting the semantic types in relational tables.



\bibliography{iclr2022_conference}
\bibliographystyle{iclr2022_conference}

\clearpage
\appendix
\section{Dataset Details}
\label{sec:appendix_dataset}

\begin{figure}[ht]
	\centering
	\includegraphics[width=\linewidth]{images/class_distribution.png}
	\caption{Count of instances for each of the 78 classes in the dataset \cite{Hulsebos}.}
	\label{fig:class_distribution}
\end{figure}

\begin{table}[ht]
	\centering
	\begin{adjustbox}{width=0.9\columnwidth,center}
	\begin{tabular}{ll}
		\hline
		\textbf{Instance}  & \textbf{Class} \\ \hline
		1, 2, 3, 4, 5, 6, 7, 8 & day \\
		4:59, 2:44, 2:04, 2:05, 1:13, 3:14 & duration \\ 
		Education, Poverty, Unemployment, Employment & category \\ 
		c;, end-code, code name, label name & command \\ 
		31 years, 22 years, --, 39 years, 24 years & age \\ 
		1, 2 & position \\ 
		LA, CA, AL, Warwickshire] & state \\
		Deletes the property, Lets you edit the value of the property, Script execution will be stopped & description \\ \hline
		
	\end{tabular}
	\end{adjustbox}
	\caption{A sample of the dataset.}
	\label{tab:sample_data}
\end{table}


\end{document}
