% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
 % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{subfigure}
\newcommand{\theHalgorithm}{\arabic{algorithm}}

\usepackage{colortbl}



\usepackage{amsmath,amsfonts,amsthm}
% % \usepackage[algo2e,linesnumbered]{algorithm2e},amssymb
\usepackage[ruled,linesnumbered]{algorithm2e}
\usepackage{comment}
\usepackage{wrapfig}
\usepackage[]{caption}
\usepackage{xcolor}         % colors
\usepackage{minitoc}






\definecolor{alizarin}{rgb}{0.82, 0.1, 0.26}
\hypersetup{linkcolor=alizarin}
\hypersetup{citecolor=[rgb]{0.0,0.0,0.95}}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}

\newcommand \inner[2]{\langle #1, #2 \rangle}
\newcommand{\etal}{\textit{et al.}}
\newcommand{\ie}{\textit{i}.\textit{e}.}
\DeclareMathOperator*{\minimize}{minimize}

\def\blkbx{\raisebox{-1mm}{\rule{2mm}{4mm}}}
%\usepackage{amsthm}
%\usepackage{paralist}

% \newcommand{\hy}[1]{\textcolor{purple}{\bf [Hanyu: #1] }}
% \newcommand\cl[1]{[{\color{red}chenglin: #1}]}

% Space saving measures                                                     
\usepackage[toc,page,header]{appendix}
\usepackage{todonotes}     


% for cross referencing the main text
% PLEASE ONLY USE xr IN THE SUPPLEMENTARY MATERIAL. 
% In the main paper, hard code any cross-reference to the supplementary material. 
\usepackage{xr} 
\externaldocument{uai2023-template}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Copula for Instance-wise Feature Selection and Ranking\\(Supplementary Material)}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author{\textbf{Hanyu Peng}, \textbf{Guanhua Fang}, \textbf{Ping Li} \\
Cognitive Computing Lab\\ Baidu Research\\
No.10 Xibeiwang East Road, Beijing 100193, China\\
10900 NE 8th St. Bellevue, Washington 98004, USA\\
{\tt \{hanyu.peng0510, fanggh2018, pingli98\}@gmail.com}
}
  \begin{document}
  
\onecolumn %% Turn this off if single column is desired for the supplement
\maketitle






\section{Complexity Analysis}
The computational complexity is $O(d^3)$, where $d$ is the number of features, because the method requires a matrix decomposition of the $d \times d$ matrix $\boldsymbol \Sigma$.
Additional techniques can reduce this cost. Following one of the solutions proposed in~\cite{lee2022self}, we can threshold the correlation matrix
$\boldsymbol \Sigma$ and group features by agglomerative clustering, using the correlation matrix as the similarity measure. Block-wise matrix multiplication then reduces the number of computations, since its cost scales quadratically with the largest block size (i.e., the number of features grouped in the largest block). Because the correlation structure within each group is maintained, the gates generated for features within the same group can still be used to select features that are highly correlated with the target variable, at a lower computational cost. If the largest block size stays fixed, the complexity of generating the correlated gate vectors increases only linearly with the feature dimension, since the number of blocks increases linearly.
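
As a rough illustration of the block-wise argument, suppose (as a simplifying assumption, not a setting taken from our experiments) that the $d$ features are partitioned into $B$ groups of sizes $b_1,\ldots,b_B$ with $\sum_{k=1}^{B} b_k = d$ and $b_{\max} = \max_k b_k$, so that the thresholded matrix is block diagonal. The symbols $B$, $b_k$, and $b_{\max}$ are introduced here only for this sketch. The cost of multiplying each block with its part of the noise vector is then
\begin{equation*}
\sum_{k=1}^{B} O\big(b_k^2\big) \;\le\; O\Big(b_{\max} \sum_{k=1}^{B} b_k\Big) \;=\; O\big(b_{\max}\, d\big),
\end{equation*}
which is linear in $d$ for fixed $b_{\max}$, compared with $O(d^2)$ for a dense matrix-vector product. Under the same assumption, factorizing each block separately costs $\sum_{k=1}^{B} O(b_k^3) \le O(b_{\max}^2\, d)$ rather than $O(d^3)$.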

The CPU time of copula-based feature selection also depends on the implementation, hardware, and software used. Techniques such as parallelization and approximation methods can reduce it. For example, GPU computation can significantly speed up the feature selection process, particularly for large datasets.

In summary, copula-based feature selection can be computationally expensive, particularly for large datasets with many features. The computational complexity can be estimated as $O(d^3)$, and the CPU calculation time depends on the implementation, hardware, and software used. However, optimization techniques such as parallelization and approximation methods can be used to reduce the computational complexity and CPU calculation time.

\section{Visualization of Chosen Pixels} 
\label{sec:visual}


\begin{figure}[t!]
\centering
\includegraphics[width=4in]{../fig/vis.pdf}
\caption{Visualization of chosen pixels. Our method selects meaningful and intuitive image features.}
\label{fig:mnist_visual}
\end{figure}


We now visualize, for each sample of the MNIST dataset, the 120 features with the largest values in $\boldsymbol \alpha$, i.e., the most informative features of each image. As Figure~\ref{fig:mnist_visual} shows, the selected pixels clearly trace the shape of the object being classified, indicating that our method identifies the most important and meaningful features. This visualization illustrates the interpretability of our method.




\section{Neural Reparameterization of Correlated Uniform Noise}
\label{sec:neural_para}

In Algorithm~\ref{algo:neural_corr_noise}, we present the workflow for producing correlated uniform noise $\boldsymbol u \in \mathbb{R}^d$. Here we describe its neural reparameterization in more detail. We denote the output of the hidden layer in ChoiceNet $f(\boldsymbol{w};\boldsymbol{x})$ as $\boldsymbol{O}_h \in \mathbb{R}^{h_c}$, and the weight parameter of the layer that produces $\sigma$ as $\boldsymbol{W}_{\sigma}\in \mathbb{R}^{h_c\times d}$. Additionally, $\boldsymbol{W}_{\boldsymbol L}$ denotes the weight parameter of the layer that generates the matrix $\boldsymbol L$. If we opt for the low-rank approximation, the shape of $\boldsymbol{W}_{\boldsymbol L}$ is $h_c \times (p\times d)$; otherwise it is $h_c \times (d\times d)$. The concrete implementation via neural reparameterization is as follows:

\begin{algorithm}
\caption{Neural Reparameterization of Correlated Uniform Noise}
\label{algo:neural_corr_noise}
\KwIn{Activation $\boldsymbol{O}_h$ of the hidden layer in ChoiceNet $f(\boldsymbol{w};\boldsymbol{x})$, weight parameters $\boldsymbol{W}_{\sigma}$ and $\boldsymbol{W}_{\boldsymbol L}$}
\KwOut{Correlated uniform noise $\boldsymbol u$.}
$\boldsymbol L =\textrm{ReLU}(\boldsymbol{O}_h\boldsymbol{W}_{\boldsymbol L})$\;

$\sigma =\textrm{Tanh}(\boldsymbol{O}_h\boldsymbol{W}_{\sigma})$\;

Obtain the covariance matrix via low-rank approximation $\boldsymbol \Sigma =\boldsymbol L^T \boldsymbol L + \sigma^2 \boldsymbol I$\ or full-rank approximation $\boldsymbol \Sigma =\boldsymbol L^T \boldsymbol L$\;

Perform Cholesky factorization on $\boldsymbol \Sigma$ to get Cholesky factor $\boldsymbol V$ \;

Generate a Gaussian noise vector $\boldsymbol \zeta$ from standard normal distribution $\boldsymbol{\zeta} \sim \mathcal{N}(0,\,\boldsymbol{I} )$\;

Calculate the Gaussian vector $\boldsymbol{q}=\boldsymbol V \boldsymbol{\zeta}$\;

Apply Gaussian copula to obtain $\boldsymbol u$ as ${u}_i =\Phi_{\boldsymbol R}(q_i), \forall i=1,\ldots,d $\;
\end{algorithm}
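
To see why the last three steps of Algorithm~\ref{algo:neural_corr_noise} yield correlated noise with uniform marginals, a short sketch may help. It assumes that $\boldsymbol V$ is the lower-triangular factor satisfying $\boldsymbol V \boldsymbol V^T = \boldsymbol \Sigma$, that $\Sigma_{ii} > 0$, and that each coordinate is standardized by its own marginal standard deviation before the standard normal CDF $\Phi$ is applied. The symbols $s_i$ and $\Phi$ are introduced here for illustration only. Since $\boldsymbol \zeta \sim \mathcal{N}(\boldsymbol 0, \boldsymbol I)$,
\begin{align*}
\mathrm{Cov}(\boldsymbol q) &= \boldsymbol V\, \mathbb{E}\big[\boldsymbol \zeta \boldsymbol \zeta^T\big]\, \boldsymbol V^T = \boldsymbol V \boldsymbol V^T = \boldsymbol \Sigma, \\
q_i &\sim \mathcal{N}(0, \Sigma_{ii}), \qquad s_i = \sqrt{\Sigma_{ii}}, \\
u_i &= \Phi\big(q_i / s_i\big) \sim \mathrm{Uniform}(0,1),
\end{align*}
where the last line is the probability integral transform. Under these assumptions, the dependence between the coordinates of $\boldsymbol u$ is governed by the correlation matrix $\boldsymbol R$ with entries $R_{ij} = \Sigma_{ij} / (s_i s_j)$.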






\section{Details of Baseline Methods}
\label{sec:baseline}
The baseline methods for binary feature selection are summarized as follows.

\begin{itemize}
    \item \textbf{Xgboost} We used the Gini index as the splitting criterion. Specifically, we used the DecisionTreeClassifier function from the scikit-learn library with default parameters. The DecisionTreeClassifier function builds a decision tree by recursively splitting the data based on the feature with the highest Gini importance score. The Gini importance score measures the total reduction of impurity brought by a feature in the decision tree, and features with higher Gini importance scores are considered more important. After building the decision tree, we selected the top-$k$ features with the highest Gini importance scores as the selected features. The value of $k$ was determined using the same experimental setup and evaluation protocol as our proposed method. 

    \item \textbf{LASSO} is a linear-model-based feature selection method in which $L_1$ regularization encourages sparsity in the model coefficients. Specifically, we used the LogisticRegression function from the scikit-learn library with an $L_1$ penalty and default parameters. The $L_1$ penalty adds a term to the loss function that is proportional to the sum of the absolute values of the coefficients. This term drives many coefficients to zero, so that the model retains only a subset of the most important features, effectively performing feature selection. After fitting the logistic regression model with $L_1$ regularization, we selected the top-$k$ features with the largest absolute coefficients. The value of $k$ was determined using the same experimental setup and evaluation protocol as our proposed method, including a hold-out strategy and 5-fold cross-validation for hyperparameter tuning.
 
    \item \textbf{L2X}~\citep{chen2018learning} approaches model interpretation from the feature selection perspective. It learns a feature selection network that maximizes the mutual information between the selected feature subsets and the corresponding outputs. We use the official implementation, available at \url{https://github.com/Jianbo-Lab/L2X}.
    \item \textbf{INVASE}~\citep{yoon2019invase} is an instance-wise feature selection algorithm based on the actor-critic framework. It selects the most relevant features by minimizing the Kullback-Leibler (KL) divergence between the full conditional distribution and the suppressed-feature distribution. We use the official implementation, available at \url{https://github.com/jsyoon0823/INVASE}.
    \item \textbf{LIME}~\citep{lime} is a model-agnostic explanation algorithm. It learns an interpretable model locally in a non-redundant and faithful manner by formulating the task as a submodular optimization problem. We use the official implementation, available at \url{https://github.com/marcotcr/lime}.
    \item \textbf{Shap}~\citep{lundberg2017unified} is a framework that employs the Shapley value to calculate feature importance. We use the official implementation, available at \url{https://github.com/slundberg/shap}.
    \item \textbf{Knockoff}~\citep{barber2015controlling} identifies which variables are important to the response by comparing knockoff variables with the original variables. We use the official implementation, available at \url{http://web.stanford.edu/group/candes/knockoffs/software/knockoff/}.
    
  \end{itemize}
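
For concreteness, the two scoring criteria used by the tree-based and LASSO baselines listed above can be written in their standard textbook forms. The notation ($p_c$, $\boldsymbol w$, $b$, $C$) is introduced here for illustration and is not taken from the main paper. The Gini impurity of a tree node whose samples fall into class $c$ with proportion $p_c$ is
\begin{equation*}
G = 1 - \sum_{c} p_c^2 ,
\end{equation*}
and the $L_1$-penalized logistic regression objective, for labels $y_i \in \{-1,+1\}$ and inverse regularization strength $C > 0$, is
\begin{equation*}
\min_{\boldsymbol w,\, b} \; \|\boldsymbol w\|_1 + C \sum_{i=1}^{n} \log\Big(1 + \exp\big(-y_i (\boldsymbol w^T \boldsymbol x_i + b)\big)\Big).
\end{equation*}
A smaller $C$ corresponds to stronger regularization and typically drives more coefficients exactly to zero.
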
In our experiments, we used a neural network as the predictive model in conjunction with Shap and LIME. The neural network had the same size and architecture as the one in our proposed method, in order to ensure fairness and exclude the influence of other factors such as network architecture and size. Regarding the cutoffs, we used a threshold of 0.5 for the neural network to predict the binary class labels. We applied the same threshold when generating the Shap and LIME explanations, to ensure consistency in the interpretation of feature importance across different methods.

  
   The baseline methods for top-$k$ feature ranking are described as follows:
   
   \begin{itemize}
     \item \textbf{STG}~\citep{yamada2020feature} is an algorithm that relies on a Gaussian-based relaxation of the Bernoulli distribution to select relevant features. We use the official implementation, available at \url{https://github.com/runopti/stg}.
     \item \textbf{CAE}~\citep{abid2019concrete} is an auto-encoder architecture that performs global feature selection while reconstructing the input. We use the official implementation, available at \url{https://github.com/mfbalin/Concrete-Autoencoders}.
 \end{itemize}
 \subsection{Discussions of Baseline Methods}
Our method is capable of discerning pertinent features both globally (Syn1, Syn2, and Syn3) and on an instance-wise basis (Syn4, Syn5, and Syn6). Notably, our approach surpasses prior neural network-based approaches (INVASE, L2X) in instance-wise performance, with the improvement being more pronounced on Syn4 and Syn5 than on Syn6. Random Forests (RFs) select global features and thus perform better on Syn1, Syn2, and Syn3 than on Syn4, Syn5, and Syn6. Shapley-based methods compute variable importance to explain the linear dependency for each sample; however, they have difficulty capturing the non-linear relationships in the synthetic data, which also makes them less effective on high-dimensional data. LIME uses simple functions to interpret a complex model locally; however, it can only explain a particular instance and may not generalize to unseen instances. Knockoff filters features according to a chosen criterion, yet there is no assurance that this criterion is optimal, so its performance may vary across datasets.



 \section{Implementation Details}
 \label{sec:impleme}
 \subsection{Synthetic Datasets}
We employed the same datasets and network structure as~\citet{yoon2019invase}. For all experiments on the six synthetic datasets, the hyperparameters $h_c$ and $h_p$ were set to 100 and 200, respectively. The activation function of the last layer was a sigmoid. The entire network (ChoiceNet and PredictNet) was trained for 1,000 epochs using Adam with a batch size of 1,000, a weight decay of 0.001, and coefficients of 0.9 and 0.999 for computing running averages of the gradient and its square. The constant learning rate was set to 0.0001, and the temperature parameter $t$ was set to either 3 or 5. Finally, cross-validation was employed to tune the hyperparameter $\lambda$.
\subsection{Real Datasets}
When performing top-$k$ feature ranking, $h_c$ and $h_p$ were both set to 16. In all the experiments, we found that the Cholesky decomposition to obtain the Cholesky factor matrix $\boldsymbol{L}$ was time-consuming, so we employed the full-rank scheme. We trained the network for 100 epochs using Adam, with the coefficients for computing running averages of the gradient and its square set to 0.9 and 0.999. The constant learning rate was set to 0.001 for Fashion-MNIST and MNIST, and 0.0001 for ISOLET. The batch size was set to 1,000 for MNIST and Fashion-MNIST, and 64 for ISOLET, while the temperature $t$ was set to $1$ for all the experiments on real datasets. All the parameters of the neural networks were randomly initialized. For a fair comparison with neural network-based baselines such as INVASE, L2X, STG, and CAE, we employed the same hyperparameters and architecture to evaluate their performance.



\bibliography{peng_153}

\end{document}
