\documentclass[accepted]{uai2024} % for initial submission
%\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{amssymb,amsfonts,amsmath,amsthm}
\usepackage{algorithm}
\usepackage{algorithmicx}
\usepackage{algpseudocode}
\usepackage{graphicx,subcaption}
\usepackage{multirow}
\usepackage{stmaryrd}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\newtheorem{defi}{Definition}
\newtheorem{lemm}{Lemma}
\newtheorem{thrm}{Theorem}
\newtheorem{coro}{Corollary}
\newtheorem{prop}{Proposition}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
\newcommand\twintablefontsize{\fontsize{5.5pt}{5.5pt}\selectfont}

\title{Unsupervised Feature Selection towards \\ Maximizing Pattern Discrimination Power}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<seowangduk@gmail.com>?Subject=Your UAI 2024 paper}{Wangduk~Seo}}
\author[1,2\thanks{Corresponding author}]{\href{mailto:<curseor@cau.ac.kr>?Subject=Your UAI 2024 paper}{Jaesung~Lee}}
% Add affiliations after the authors
\affil[1]{%
    AI/ML Innovation Research Center\\
    Chung-Ang University\\
    Seoul, South Korea
}
\affil[2]{%
    Department of Artificial Intelligence\\
    Chung-Ang University\\
    Seoul, South Korea
}
  
\begin{document}
\maketitle

\begin{abstract}
The goal of unsupervised feature selection is to identify a feature subset based on the intrinsic characteristics of a given dataset without user-guided information such as class variables.
To achieve this, score functions based on information measures can be used to identify essential features.
The major research direction of conventional information-theoretic unsupervised feature selection is to minimize the entropy of the final feature subset.
Although the opposite way, i.e., maximization of the joint entropy, can also lead to novel insights, studies in this direction are rare.
For example, in the field of information retrieval, selected features that maximize the joint entropy of a feature subset can be effective discriminators for reaching the target tuple in the database.
Thus, in this work, we first demonstrate on a toy dataset how the two feature subsets obtained by minimizing and by maximizing the joint entropy differ.
By comparing these two feature subsets, we show that the maximization of the joint entropy enhances the pattern discrimination power of the feature subset.
Then, we derive a score function that circumvents the calculation of high-dimensional joint entropy by using a low-order approximation.
The experimental results on 30 public datasets indicate that the proposed method yields superior performance in terms of pattern discrimination power-related measures.
\end{abstract}

\section{Introduction}

Recent advancements in storage technology have led to exponential growth of data, such as the web ecosystem \citep{brickley2019google}.
This proliferation of data poses significant challenges in distinguishing meaningful patterns within the vast complexity and volume of information \citep{puerto2023feature}.
In particular, when the number of features becomes excessively large, patterns in the dataset can lose their discriminative power because the similarities among all pairs of patterns become nearly identical, making subsequent analysis unreliable \citep{watanabe1969knowing}.
In such contexts, Unsupervised Feature Selection (UFS) emerges as an effective approach, offering a method to select core features from the original dataset.

The goal of UFS is to reduce the number of features needed for data representation while maintaining the essential information \citep{li2012unsupervised,shang2022feature,karami2023unsupervised,feng2016unsupervised,wang2015embedded}.
Because there is no user-guided information, such as class variables, UFS methods should identify a feature subset based on the intrinsic characteristics of the dataset.
By eliminating unnecessary features, the high dimensionality of the dataset can be remedied, and hence novel patterns \citep{wang2023outliers} in the dataset can easily be identified.
Regarding this, information entropy is known as a popular tool for measuring the information content of a variable set.

Conventional studies of information-theoretic UFS predominantly focus on measuring the relevancy or redundancy of features in the original dataset and selecting essential features \citep{hu2022feature,wang2022feature}.
In this framework, the algorithms are often designed to minimize the entropy of feature pairs in candidate feature subsets\footnote{It should be noted that, in information theory, minimizing the joint entropy of feature pairs is equivalent to maximizing the mutual information of feature pairs.}, which is a major research direction of current UFS studies \citep{zhu2023unsupervised,zhang2023possibilistic}.
By comparison, studies on maximizing the entropy of feature subsets are rare, even though this strategy can also lead to novel insights in applications where the discrimination power of patterns is important.
To pursue this direction, we can consider a UFS process that maximizes the information content, or the pattern discrimination power, of the selected feature subset for the given dataset, which can be viewed as the opposite direction of conventional studies.

In this work, we propose an information-theoretic UFS method that identifies a compact yet effective feature subset by maximizing entropy.
First, we demonstrate on a simple toy dataset how the two feature subsets obtained by minimizing and by maximizing the joint entropy of the features to be selected differ.
Our example shows that maximizing the joint entropy enhances the pattern discrimination power for the given dataset; all the patterns in the dataset become discriminable with a minimal number of features.
Next, we derive our score function for information-theoretic UFS based on the joint entropy maximization.
Unfortunately, it is well-known that the estimation of high-dimensional joint entropy is demanding in practice due to a limited number of patterns \citep{seo2019generalized,lee2013feature}.
As a possible solution, we remedy the computation of high-dimensional joint entropy from a large number of features by decomposing it into the sum of low-dimensional joint entropy terms instead of introducing heuristic approaches \citep{yuan2021novel,yuan2021unsupervised}.
Finally, we validated the performance of the proposed UFS method using 30 public datasets and confirmed its superiority in terms of pattern discrimination power-related measures.
The main contributions of this work are summarized as follows:
\begin{itemize}
    \item An information-theoretic UFS method is introduced, which maximizes entropy to identify an effective feature subset, thereby significantly enhancing pattern discrimination power within datasets.
    \item A comparative analysis demonstrates our approach using a toy dataset, illustrating the differences between feature subsets obtained through entropy minimization and maximization, and highlighting the enhanced pattern discrimination power achieved with entropy maximization.
    \item To tackle the computational challenge of high-dimensional joint entropy, we introduce a novel score function for UFS based on joint entropy decomposition. 
    \item The efficacy of the proposed method is validated through extensive testing on 30 public datasets, which confirms its superiority in improving pattern discrimination power over existing UFS methods.
\end{itemize}


\section{Related Work}

The primary aim of UFS is to reduce the dimensionality of data while preserving the inherent structure useful for subsequent tasks.
These UFS methods can be roughly categorized according to their feature selection (FS) strategy as filter, wrapper, and hybrid methods \citep{solorio2020review}.
Among them, the filter methods that do not assume a fixed learning algorithm can be further divided into two groups: univariate and multivariate approaches.
Typically, the multivariate unsupervised feature filters yield better performance than the univariate feature filters because the multivariate approach is able to consider the relation among features to be selected.
In this work, we focus on the multivariate feature filter approach.

A majority of UFS methods are concentrated on preserving local structures and identifying latent labels of the given dataset, relying on similarity among data patterns.
In this regard, \citet{he2006laplacian} introduced the Laplacian score, a method that ranks features based on how well they preserve the local structure of data. 
Similarly, \citet{cai2010unsupervised} proposed a multi-cluster FS technique designed to maintain the multi-cluster characteristics of the data.
Other works tried to incorporate additional information for better identification of local structures.
For instance, \citet{yang2011l2} suggested an unsupervised discriminative FS method that integrates both discriminative information and intrinsic data structure using an \(l_{2,1}\)-norm.
\citet{li2012unsupervised} devised a UFS algorithm with a non-negative constraint, utilizing non-negative matrix factorization to construct the projection matrix for FS.
\citet{zhu2023unsupervised} leveraged an \(l_{2,0}\)-norm constraint to perform group UFS.
Lastly, \citet{villa2021utility} proposed a radial basis kernel (U2) to manage non-linearity within the conventional UFS framework of \(l_{2,1}\)-norm regularization.

In another group of studies, UFS is embedded into a subsequent unsupervised learner, such as a clustering algorithm.
\citet{wang2015embedded} combined clustering algorithms with a UFS process in their embedded UFS method.
\citet{miao2022graph} also presented an embedded technique based on graph regularization, which is capable of preserving the local reconstruction relationships among neighboring data points.
\citet{shang2022feature} proposed a UFS approach that utilizes non-negative spectral feature learning with an adaptive rank constraint.
This adaptive constraint enables the algorithm to update the local structure more accurately during the UFS process.
\citet{zhang2020unsupervised} incorporated an adaptive graph learning constraint to integrate a similarity matrix into the existing UFS framework.
Recently, \citet{karami2023unsupervised} introduced a variance-covariance distance (VCSD) to tackle both dimensionality reduction and subspace learning.
% However, those methods often require a combination of multiple hyperparameters, which can be challenging to tune.

Numerous UFS methods have been developed to account for the information within the data. 
\citet{faivishevsky2012unsupervised} introduced a UFS method that estimates mutual information (MI) between features.
Instead of relying on a parametric model for this calculation, their approach uses statistical dependencies between features. 
\citet{yuan2021unsupervised} developed a UFS method grounded in fuzzy rough set theory for handling mixed data.
Specifically, their method constructs a fuzzy information system from the original data and selects features based on a fuzzy dependency function, thereby maximizing feature relevance. 
Another notable extension is Fuzzy MI (FMI), which reformulates MI under the fuzzy theory \citep{yuan2021novel}.
However, these methods often entail a high computational cost for calculating MI between features and additionally require the tuning of hyperparameters. 
\citet{feng2016unsupervised} proposed an efficient UFS method for hyperspectral image datasets by employing heuristic high-dimensional MI estimation.
Similarly, \citet{lim2021pairwise} incorporated pairwise MI into the spectral learning framework.
A common drawback of those approaches is that the analysis of the final feature subset may not be theoretically supported because the score function is devised heuristically.
In addition, they are unsuitable for maximizing the pattern discrimination power of data, which is the primary focus of this work.




\section{Proposed Method}

\subsection{Motivation}

\begin{table}[!t]
\centering
\caption{\label{tb:toydataset_b} A toy dataset composed of six patterns and four categorical features ($f_1$, $f_2$, $f_3$, and $f_4$), and two selected feature subsets guided by minimizing ($min$) and maximizing ($max$) joint entropy, respectively.}
\begin{tabular}{l|cccc|cc|cc}
\toprule
 & \multicolumn{4}{c|}{\multirow{2}{*}{Original Features}} & \multicolumn{4}{c}{Entropy-based UFS} \\
Pattern & \multicolumn{4}{c|}{} & \multicolumn{2}{c|}{$min$} & \multicolumn{2}{c}{$max$} \\
 & $f_1$ & $f_2$ & $f_3$ & $f_4$ & $f_1$ & $f_2$ & $f_3$ & $f_4$ \\
\midrule
$p_1$ & A & A & A & A & A & A & A & A \\
$p_2$ & B & A & B & A & B & A & B & A \\
$p_3$ & A & B & C & A & A & B & C & A \\
$p_4$ & A & B & A & B & A & B & A & B \\
$p_5$ & A & B & B & B & A & B & B & B \\
$p_6$ & A & B & C & B & A & B & C & B \\
\bottomrule
\end{tabular}
\end{table}

Conventional information-theoretic UFS methods often concentrate on selecting features that decrease the dissimilarity among patterns, thereby enhancing the performance of subsequent learners, such as clustering algorithms.
In this context, the objective function can be designed to minimize the joint entropy of the final feature subset $S$, which can be represented as $\argmin_{S} H(S)$, where $H(S) = -\sum_{s} P(s) \log P(s)$ and the sum runs over the joint value combinations $s$ of the features in $S$. The entropy values reported in this section correspond to base-2 logarithms. In contrast, we may search for a feature subset by optimizing $\argmax_{S} H(S)$.

To demonstrate how the two objective functions lead to different final feature subsets, we created a toy dataset consisting of six patterns and four categorical features ($f_1$, $f_2$, $f_3$, and $f_4$) as shown in Table~\ref{tb:toydataset_b}.
The toy dataset shows that no single feature can distinguish all the patterns; for example, patterns $p_1, p_3, p_4, p_5$, and $p_6$ are indistinguishable from the viewpoint of $f_1$ because they are assigned to the same category, $A$.
Suppose that we want to identify the optimal feature subset where $\vert S \vert = 2$ that minimizes $H(S)$.
This problem can be solved by instantiating all the feature pairs, measuring the joint entropy values, and then choosing a feature subset that yields a minimal $H(S)$ value.

The $min$ columns of Table~\ref{tb:toydataset_b} show the final feature subset of this case, where $f_1$ and $f_2$ are selected with a joint entropy of 1.252.
In this case, there are three groups, $\{ p_1 \} = (A,A)$, $\{ p_2 \} = (B, A)$, and $\{ p_3, p_4, p_5, p_6 \} = (A, B)$, that are discriminable from each other.
In other words, patterns $p_3, p_4, p_5$, and $p_6$ are indistinguishable from each other.
Next, the $max$ columns of Table~\ref{tb:toydataset_b} show the final feature subset $S = \{ f_3, f_4 \}$ obtained by maximizing $H(S)$, with a joint entropy of 2.585.
In this case, there are six discriminable groups $\{ p_1 \} = (A,A)$, $\{ p_2 \} = (B, A)$, $\{ p_3 \} = (C, A)$, $\{ p_4 \} = (A, B)$, $\{ p_5 \} = (B, B)$, and $\{ p_6 \} = (C, B)$.
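
As a check on the two values quoted above, the joint entropies can be worked out by hand; the calculation below is a sketch that assumes empirical frequencies over the six patterns as probability estimates and base-2 logarithms, which reproduces the reported numbers.
For $S = \{f_1, f_2\}$, the value combinations $(A,A)$, $(B,A)$, and $(A,B)$ occur with frequencies $1/6$, $1/6$, and $4/6$, whereas for $S = \{f_3, f_4\}$ each of the six combinations occurs exactly once:
\begin{equation*}
\begin{split}
H(f_1, f_2) & = 2 \cdot \tfrac{1}{6} \log_2 6 + \tfrac{4}{6} \log_2 \tfrac{6}{4} \approx 0.862 + 0.390 = 1.252, \\
H(f_3, f_4) & = 6 \cdot \tfrac{1}{6} \log_2 6 = \log_2 6 \approx 2.585.
\end{split}
\end{equation*}
Because $\log_2 6$ is the largest entropy attainable with six patterns, no other feature subset can separate this dataset further.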

Our example shows that the strategy of maximizing $H(S)$ leads to the selection of features that create as many discriminable groups as possible until all the patterns in the dataset become discriminable, i.e., the maximum pattern discrimination power is reached. The concept of maximizing $H(S)$ can be used to build an effective taxonomy of a system because it can reduce the number of discriminators in the system and hence shorten the retrieval time.


\subsection{Score Function}
\label{sec:score}

Let $W \in \mathcal{R}^{|F|}$ be the original dataset with $|F|$ features $F = \{f_1, f_2, \cdots, f_{|F|}\}$. The goal of UFS is to identify a feature subset $S$ consisting of $n$ features with the optimal pattern discrimination power, where $n$ is the number of features to be selected.
Because there are $2^{|F|}$ possible feature subsets, it is impractical to identify the optimal feature subset by an exhaustive search over all of them.
To address this, UFS methods often employ an incremental search strategy to efficiently instantiate candidate feature subsets.
The incremental search starts with an empty set and iteratively adds a new feature to the subset of features until $\vert S \vert$ reaches $n$.

Owing to the monotonicity of entropy, the original feature set $F$ should have the largest entropy $H(F)$, i.e., the largest pattern discrimination power. However, because only $n \ll |F|$ features can be included in $S$, the original pattern discrimination power is degraded by the FS process.
To maintain the pattern discrimination power as much as possible, the difference between $H(F)$ and $H(S)$ should be minimized.
Thus, the objective function can be written as
\begin{equation}
  \label{eq:objective}
  \argmin_{S} \left(H(F) - H(S) \right).
\end{equation}

Because $H(F)$ is constant, Equation~(\ref{eq:objective}) can be rewritten as
\begin{equation}
  \label{eq:objective2}
  \begin{split}
  \argmin_{S} \{H(F) - H(S)\} & = \argmin_{S} \{- H(S)\} \\
  & = \argmax_{S} H(S).
  \end{split}
\end{equation}

Based on the incremental search strategy, the algorithm must identify a new feature $f^+$ from $F \setminus S$ that maximally increases the entropy of feature subset $S$.
Thus, the objective function $J$ can be represented as
\begin{equation}
  \label{eq:objective3}
  J = \argmax_{f^+ \in F \setminus S} H(S, f^+).
\end{equation}

Because $H(S, f^+)$ can be a high-dimensional joint entropy term due to $S$, an accurate estimation may not be achieved in practice because of an insufficient number of patterns.
To circumvent this issue, in this work, we estimate $H(S, f^+)$ by using low-order joint entropy terms involving only $k \ll \vert S \vert$ features, which is a frequently used strategy in the field of information-theoretic FS.
For brevity, we define the $k$-cardinality entropy \citep{lee2015mutual} in Definition~\ref{defi:d1}.
\begin{defi} Sum of the $k$-cardinality entropy.
	\label{defi:d1}
	\begin{equation}
	  U_k(X) = \sum_{Y \in X_{k}^{'}} H(Y),
	\end{equation}
\end{defi}
where $X^{'}$ is the power set of $X$ and $X_{k}^{'} = \{e|e \in X^{'}, \vert e \vert = k\}$.  
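
For illustration, consider a hypothetical three-variable set $X = \{f_1, f_2, f_3\}$ (the choice of three variables is only an example). Under Definition~\ref{defi:d1}, the sums for $k = 1, 2, 3$ would read
\begin{equation*}
\begin{split}
U_1(X) & = H(f_1) + H(f_2) + H(f_3), \\
U_2(X) & = H(f_1, f_2) + H(f_1, f_3) + H(f_2, f_3), \\
U_3(X) & = H(f_1, f_2, f_3),
\end{split}
\end{equation*}
so that $U_k(X)$ contains $\binom{3}{k}$ terms and $U_{|X|}(X)$ coincides with the full joint entropy $H(X)$.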

Based on Definition~\ref{defi:d1}, the upper bound of Han's inequality \citep{te1978nonnegative} can be rewritten as Proposition~\ref{eq:han}.
\begin{prop} $k$-cardinality representation of Han's inequality
\label{eq:han}
\begin{equation}
\label{eq:han2}
  H(X) \leq \frac{1}{n-1} U_{n - 1}(X'),
\end{equation}
\end{prop}
where $n$ is the number of variables in $X$.
Based on Proposition~\ref{eq:han}, we obtain Lemma~\ref{lemm1} by applying the upper bound to its subsequent joint entropy terms.
\begin{lemm} Let $U_k(S')$ be the $k$-cardinality entropy of a given variable set $S$. Then the lower bound and the upper bound of $U_k(S')$ can be defined as
\label{lemm1}
\begin{equation}
\label{lemm111}
\begin{split}
\frac{1}{k-1} \left( kU_k(S') - {{n-1}\choose{k-1}} U_1(S') \right) \leq U_k(S') \\ 
\leq \left( \frac{n-k+1}{k-1} \right) U_{k-1}(S').
\end{split}
\end{equation}
\begin{proof}
The detailed proof is provided in the work of \citet{lee2015mutual}.
\end{proof}
\end{lemm}

Lemma \ref{lemm1} indicates that the upper bound of $U_k(S')$ is determined by the $(k-1)$-cardinality entropy term. Thus, by recursively applying Lemma \ref{lemm1}, we can obtain the $k$-cardinality approximation of the high-dimensional joint entropy $H(X)$, as stated in Theorem \ref{eq:t1}.
\begin{thrm} Upper bound of the $H(X)$ with $k$-cardinality entropy is
\label{eq:t1}
\begin{equation}
\label{eq:tt1}
H(X) \leq \left(\prod_{i=1}^{b} \frac{i}{n-i}\right) U_{k} (X'),
\end{equation}
where $b=\min(n-k, k-1)$.
\begin{proof}
The detailed proof is provided in the work of \citet{seo2019generalized}.
\end{proof}
\end{thrm}
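
As a numerical sanity check of Theorem~\ref{eq:t1}, consider the features $f_1$, $f_2$, and $f_3$ of the toy dataset in Table~\ref{tb:toydataset_b}, with $n = 3$ and $k = 2$, so that $b = \min(1, 1) = 1$ and the coefficient is $1/(n-1) = 1/2$. The values below are an illustrative hand calculation, assuming empirical frequencies as probability estimates and base-2 logarithms:
\begin{equation*}
\begin{split}
H(f_1, f_2, f_3) & \approx 2.252, \\
\tfrac{1}{2} \left( H(f_1, f_2) + H(f_1, f_3) + H(f_2, f_3) \right) & \approx \tfrac{1}{2} (1.252 + 1.918 + 2.252) = 2.711.
\end{split}
\end{equation*}
The bound holds ($2.252 \leq 2.711$) but is not tight in this example, which is consistent with treating the right-hand side as an approximation rather than the exact value of $H(X)$.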

Theorem~\ref{eq:t1} indicates that the upper bound becomes tighter when $k$ in Equation~(\ref{eq:tt1}) is set to a large value, and hence a better estimation of $H(X)$ can be obtained.
From Theorem~\ref{eq:t1}, $H(S, f^+)$ can be approximated by the sum of the $k$-cardinality entropy if the upper bound is accepted as the estimator.
In our experiments, we set $k$ to two because it is the minimum value for which the score function acts as a multivariate feature filter\footnote{If $k$ is set to one, a score function for a univariate feature filter is instantiated because at most one feature is considered at a time; thus, the relations among features, for example, feature pairs, cannot be considered.}, which is the main focus of this work.
As a result, the objective function $J$ is approximated as
\begin{equation}
  \label{eq:objective4}
  \begin{split}
    J & \approx \argmax_{f^+} \left( \prod_{i=1}^{b} \frac{i}{\vert S \vert + 1 - i} \right) U_{2} (\{S', f^+\}') \\
    & = \argmax_{f^+} \frac{1}{\vert S \vert} U_{2} (\{S', f^+\}'),
  \end{split}
\end{equation}
where $b = \min(\vert S \vert + 1 - 2, 2 - 1) = 1$ for $\vert S \vert \geq 2$.
Because the elements of the power set $\{S, f^+\}'$ can be divided into two parts according to whether or not they contain $f^+$, Equation~(\ref{eq:objective4}) can be rewritten as
\begin{equation}
\label{eq:objective5}
  J \approx \argmax_{f^+} \frac{1}{\vert S \vert} \left( U_{2} (S') + U_{2} (f^+ \times S') \right),
\end{equation}
where $\times$ denotes the Cartesian product between two sets.
Because $U_{2} (S')$ does not depend on $f^+$ and the factor $1/\vert S \vert$ is a positive constant, Equation (\ref{eq:objective5}) can be rewritten as
\begin{equation}
\label{eq:objective6}
  J \approx \argmax_{f^+} U_{2} (f^+ \times S').
\end{equation}

Finally, $J$ can be approximated as
\begin{equation}
\label{eq:objective7}
  J \approx \argmax_{f^+} \sum_{f \in S} H(f^+, f).
\end{equation}
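
To make the step from Equation~(\ref{eq:objective5}) to Equation~(\ref{eq:objective7}) concrete, suppose, purely as an illustration, that two features have already been selected, $S = \{f_a, f_b\}$ (the labels $f_a$ and $f_b$ are hypothetical). The pairwise entropy terms over $S \cup \{f^+\}$ then split as
\begin{equation*}
\underbrace{H(f_a, f_b)}_{\text{independent of } f^+} + \underbrace{H(f^+, f_a) + H(f^+, f_b)}_{\sum_{f \in S} H(f^+, f)},
\end{equation*}
so only the second group changes across candidates and needs to be compared when choosing $f^+$.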

In the case of $S = \emptyset$, where no features have been selected yet, Equation~(\ref{eq:objective7}) is replaced by
\begin{equation}
\label{eq:objective8}
  J \approx \argmax_{f^+} H(f^+).
\end{equation}

It is worth noting that Equation~(\ref{eq:objective8}) corresponds to the $k$-cardinality entropy with $k = 1$.
However, the algorithm may instead start with the optimal feature pair (please refer to Section~\ref{sec:exr}).

\begin{algorithm}[!t]
\caption{Incremental Search for the Proposed Method}
\begin{algorithmic}[1]
\State $f^+ \gets \argmax_{f^+ \in F} H(f^+)$
\State $S \gets \{f^+\}$
  \While{$\vert S \vert < n$}
    \State $f^+ \gets \argmax_{f^+ \in F \setminus S} \sum_{f \in S} H(f^+, f)$
    \State $S \gets S \cup \{f^+\}$
  \EndWhile
\end{algorithmic}
\label{alg:prop}
\end{algorithm}



\subsection{Incremental Search}

The proposed method is model-free and non-parametric; it only requires information-theoretic quantities obtained from joint entropy calculations.
Specifically, the proposed method incrementally selects a new feature $f^+$ from $F \setminus S$ and adds it to the feature subset $S$.
Algorithm~\ref{alg:prop} depicts the incremental search process of the proposed method.
First, $f^+$ is selected by Equation~(\ref{eq:objective8}), and $S$ is initialized as $\{f^+\}$ (Lines 1--2).
Then, the algorithm iteratively selects the new feature $f^+$ from $F \setminus S$ according to Equation~(\ref{eq:objective7}) and adds it to $S$ (Lines 3--6).
The algorithm terminates when the number of selected features $|S|$ reaches $n$, the user-defined maximum number of features to be selected.
The computational complexity of Algorithm~\ref{alg:prop} is $O(n + n^2) = O(n^2)$ because $n$ and $n^2$ unit times are consumed for calculating the entropy values of single features and those of feature pairs $f^+$ and $f \in S$, respectively.
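
As an illustrative trace (a hand calculation assuming empirical frequencies as probability estimates and base-2 logarithms), Algorithm~\ref{alg:prop} can be applied to the toy dataset in Table~\ref{tb:toydataset_b} with $n = 2$. Line 1 compares the single-feature entropies,
\begin{equation*}
H(f_1) \approx 0.650, \quad H(f_2) \approx 0.918, \quad H(f_3) \approx 1.585, \quad H(f_4) = 1.000,
\end{equation*}
so $S$ is initialized as $\{f_3\}$. Line 4 then compares the pairwise entropies with $f_3$,
\begin{equation*}
H(f_1, f_3) \approx 1.918, \quad H(f_2, f_3) \approx 2.252, \quad H(f_4, f_3) \approx 2.585,
\end{equation*}
so $f_4$ is added and the search stops with $S = \{f_3, f_4\}$, which matches the $max$ subset in Table~\ref{tb:toydataset_b}.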

\begin{table}[!t]
\caption{\label{tb:datasets} Summary of the datasets used in the experiments}
\centering
\resizebox{\columnwidth}{!}{
\begin{tabular}{l rrr l}
\toprule
Dataset & $\vert W \vert$ & $\vert F \vert$ & $\vert F' \vert$ & Domain \\
\midrule
ALLAML & 72 & 7,129 & 7,129 & Biology \\
Alzheimer & 174 & 450 & 450 & Biology \\
Arcene & 100 & 9,920 & 9,920 & Biology \\
Audiology & 226 & 71 & 93 & Biology \\
Ba & 1,404 & 320 & 320 & Image \\
Chess & 3,196 & 37 & 38 & Game \\
CLL\_SUB\_111 & 111 & 11,340 & 11,340 & Biology \\
Coil20 & 1,440 & 1,024 & 1,024 & Image \\
Colon & 62 & 2,001 & 5,994 & Biology \\
Leukemia & 72 & 7,070 & 7,070 & Biology \\
LSVT & 126 & 310 & 310 & Biology \\
Lung & 203 & 3,312 & 3,312 & Biology \\
Lymphoma & 96 & 4,026 & 4,026 & Biology \\
Madelon & 2,600 & 500 & 500 & Artificial \\
Mushrooms & 8,124 & 23 & 98 & Biology \\
Nci9 & 60 & 9,712 & 9,712 & Biology \\
Nursery & 12,960 & 9 & 26 & Biology \\
Pdspeech & 756 & 752 & 752 & Audio \\
Promoters & 105 & 58 & 228 & Biology \\
Prostate\_GE & 102 & 5,966 & 5,966 & Biology \\
SCADI & 70 & 206 & 206 & Biology \\
Semeion & 1,593 & 256 & 256 & Image \\
SPECT & 265 & 22 & 23 & Biology \\
Splice & 3,190 & 61 & 287 & Biology \\
Tox171 & 171 & 5,748 & 5,748 & Biology \\
Tic-Tac-Toe & 958 & 10 & 27 & Game \\
Umist & 575 & 644 & 644 & Image \\
WarpAR10P & 130 & 2,400 & 2,400 & Image \\
WarpPIE10P & 210 & 2,420 & 2,420 & Image \\
Yaleb & 2,414 & 1,024 & 1,024 & Image \\
\bottomrule
\end{tabular}
}
\end{table}

\begin{table*}[!t]
\caption{\label{tb:res1} Comparison results of five UFS methods in terms of $Entropy$, $PDP$, and the minimum number of features that can ensure all patterns are discriminable.}
\resizebox{\textwidth}{!}{
\begin{tabular}{l ccccc|ccccc|ccccc}
\toprule
\multicolumn{1}{c}{\multirow{2}{*}{Dataset}} & \multicolumn{5}{c|}{$Entropy$} & \multicolumn{5}{c|}{$PDP$} & \multicolumn{5}{c}{Minimum Number of features} \\
\multicolumn{1}{c}{} & Proposed & EGC & U2 & VCSD & FMI & Proposed & EGC & U2 & VCSD & FMI & \multicolumn{1}{c}{Proposed} & \multicolumn{1}{c}{EGC} & \multicolumn{1}{c}{U2} & \multicolumn{1}{c}{VCSD} & \multicolumn{1}{c}{FMI} \\
\midrule
ALLAML & \textbf{6.17} & 6.06 & 6.09 & 6.11 & \textbf{6.17} & \textbf{1.00} & 0.94 & 0.96 & 0.97 & \textbf{1.00} & \textbf{4} & 5 & 6 & 5 & \textbf{4} \\
Alzheimer & 7.43 & 2.59 & \textbf{7.44} & 7.40 & 7.40 & 0.99 & 0.22 & \textbf{1.00} & 0.98 & 0.98 & 18 & 71 & \textbf{10} & 95 & 23 \\
Arcene & \textbf{6.64} & 4.97 & 5.60 & 0.48 & 6.56 & \textbf{1.00} & 0.54 & 0.66 & 0.07 & 0.96 & \textbf{5} & 26 & 20 & 78 & 10 \\
Audiology & \textbf{7.11} & 4.85 & 4.05 & 6.49 & 6.39 & \textbf{0.72} & 0.26 & 0.17 & 0.56 & 0.54 & -- & -- & -- & -- & -- \\
Ba & \textbf{10.23} & 7.03 & 6.92 & 8.66 & 7.98 & \textbf{0.90} & 0.25 & 0.20 & 0.49 & 0.40 & -- & -- & -- & -- & -- \\
Chess & 11.63 & 11.49 & 11.54 & \textbf{11.64} & 11.63 & 1.00 & 0.92 & 0.95 & \textbf{1.00} & 1.00 & 38 & 38 & 37 & \textbf{36} & 38 \\
CLL\_SUB\_111 & \textbf{6.79} & 6.74 & 6.78 & 6.74 & \textbf{6.79} & \textbf{1.00} & 0.97 & 0.99 & 0.97 & \textbf{1.00} & \textbf{5} & 7 & 7 & 8 & \textbf{5} \\
Coil20 & \textbf{10.49} & 4.89 & 10.38 & 10.49 & 10.49 & \textbf{1.00} & 0.30 & 0.97 & 1.00 & 1.00 & \textbf{87} & -- & 119 & 107 & 100 \\
Colon & \textbf{5.95} & 5.72 & 5.08 & 4.57 & 4.03 & \textbf{1.00} & 0.89 & 0.65 & 0.65 & 0.42 & \textbf{10} & 17 & 23 & 176 & -- \\
Leukemia & \textbf{6.17} & 5.98 & 6.09 & 6.11 & 6.11 & \textbf{1.00} & 0.90 & 0.96 & 0.97 & 0.97 & \textbf{7} & 11 & 20 & 8 & 8 \\
LSVT & \textbf{6.98} & 2.10 & 3.24 & 6.13 & 6.91 & \textbf{1.00} & 0.09 & 0.31 & 0.67 & 0.97 & \textbf{5} & 27 & 42 & 12 & 6 \\
Lung & \textbf{7.67} & 5.52 & \textbf{7.67} & 7.60 & \textbf{7.67} & \textbf{1.00} & 0.48 & \textbf{1.00} & 0.97 & \textbf{1.00} & \textbf{5} & 14 & \textbf{5} & 7 & \textbf{5} \\
Lymphoma & \textbf{6.58} & 6.43 & 6.45 & 6.52 & 6.42 & \textbf{1.00} & 0.93 & 0.94 & 0.97 & 0.93 & \textbf{8} & 11 & 16 & 11 & 14 \\
Madelon & 11.34 & 11.19 & 11.34 & 11.34 & \textbf{11.34} & 1.00 & 0.93 & 1.00 & 1.00 & \textbf{1.00} & 10 & 14 & 10 & 12 & \textbf{9} \\
Mushrooms & \textbf{8.49} & 5.42 & 5.42 & 7.26 & 5.22 & \textbf{0.09} & 0.01 & 0.01 & 0.03 & 0.01 & -- & -- & -- & -- & -- \\
Nci9 & \textbf{5.91} & 5.74 & 5.81 & 5.59 & 5.73 & \textbf{1.00} & 0.92 & 0.95 & 0.85 & 0.92 & \textbf{7} & 10 & 9 & 9 & 12 \\
Nursery & 12.71 & \textbf{13.66} & 12.08 & \textbf{13.66} & 12.26 & 0.60 & \textbf{1.00} & 0.33 & \textbf{1.00} & 0.40 & 25 & \textbf{23} & 25 & \textbf{23} & 25 \\
Pdspeech & \textbf{9.56} & 0.89 & 9.55 & \textbf{9.56} & \textbf{9.56} & \textbf{1.00} & 0.05 & 0.99 & \textbf{1.00} & \textbf{1.00} & -- & -- & -- & -- & -- \\
Promoters & \textbf{6.71} & 6.68 & 6.55 & 6.68 & 6.65 & \textbf{1.00} & 0.98 & 0.92 & 0.98 & 0.97 & \textbf{17} & 22 & 29 & 23 & 33 \\
Prostate\_GE & \textbf{6.67} & 6.37 & 4.80 & 5.91 & 6.44 & \textbf{1.00} & 0.85 & 0.54 & 0.74 & 0.89 & \textbf{4} & 8 & 12 & 44 & 9 \\
SCADI & \textbf{6.13} & 5.89 & 5.09 & 5.22 & 5.19 & \textbf{1.00} & 0.89 & 0.59 & 0.71 & 0.64 & \textbf{51} & 98 & 124 & 85 & 116 \\
Semeion & \textbf{10.64} & 10.63 & 10.51 & 10.63 & 10.60 & \textbf{1.00} & 1.00 & 0.95 & 1.00 & 0.98 & \textbf{51} & 71 & 123 & 82 & 97 \\
SPECT & \textbf{7.33} & 6.96 & 6.96 & 7.15 & 7.18 & \textbf{0.79} & 0.72 & 0.72 & 0.74 & 0.76 & -- & -- & -- & -- & -- \\
Splice & \textbf{11.13} & 9.97 & 7.75 & 11.09 & 11.08 & \textbf{0.79} & 0.46 & 0.08 & 0.78 & 0.78 & -- & -- & -- & -- & -- \\
Tox171 & \textbf{9.90} & 8.46 & 8.46 & \textbf{9.90} & \textbf{9.90} & \textbf{1.00} & 0.45 & 0.45 & \textbf{1.00} & \textbf{1.00} & \textbf{17} & 23 & 23 & \textbf{17} & \textbf{17} \\
Tic-Tac-Toe & \textbf{7.42} & 7.34 & 7.29 & 7.35 & 7.37 & \textbf{1.00} & 0.96 & 0.95 & 0.96 & 0.98 & \textbf{4} & 7 & 5 & 5 & 5 \\
Umist & 9.12 & 8.86 & 9.10 & 9.11 & \textbf{9.13} & 0.98 & 0.88 & 0.97 & 0.97 & \textbf{0.98} & -- & -- & -- & -- & -- \\
WarpAR10P & \textbf{7.02} & 6.78 & 6.76 & 6.45 & 6.84 & \textbf{1.00} & 0.91 & 0.88 & 0.78 & 0.94 & \textbf{5} & 10 & 49 & 26 & 122 \\
WarpPIE10P & \textbf{7.71} & 7.29 & 6.89 & 7.52 & \textbf{7.71} & \textbf{1.00} & 0.85 & 0.77 & 0.94 & \textbf{1.00} & \textbf{10} & 27 & 66 & 84 & \textbf{10} \\
Yaleb & \textbf{11.12} & 8.54 & 8.72 & 10.55 & 11.06 & \textbf{0.97} & 0.70 & 0.71 & 0.88 & 0.96 & -- & -- & -- & -- & -- \\
\midrule
Avg. Rank & \textbf{1.20} & 3.87 & 3.90 & 2.80 & 2.50 & \textbf{1.20} & 3.87 & 3.90 & 2.77 & 2.43 & \textbf{1.20} & 2.67 & 2.80 & 2.53 & 2.27 \\
\bottomrule
\end{tabular}
}
\end{table*}

\section{Experimental Results}

\subsection{Experimental Settings}

To validate the performance of the proposed method, we employed 30 public datasets from two sources: the UCI Machine Learning Repository and Kaggle.
The datasets were selected to cover a wide range of domains, including biology, image, game, and audio, and to include various data types, such as numerical and nominal data.
Table~\ref{tb:datasets} summarizes the datasets used in the experiments.
The table lists the number of instances $\vert W \vert$, the number of original features $\vert F \vert$, the number of preprocessed features $\vert F' \vert$, and the domain of each dataset.
As a preprocessing step, each nominal feature in the original feature set $F$ that has more than two categories was converted into binary features of $F'$ by one-hot encoding.

Four state-of-the-art UFS methods were selected for performance comparison: EGC \citep{zhang2020unsupervised}, U2 \citep{villa2021utility}, VCSD \citep{karami2023unsupervised}, and FMI \citep{yuan2021novel}.
The parameter settings for each compared method are as follows.
\begin{itemize}
  \item \textbf{EGC} incorporates the between-class scatter matrix and an adaptive graph structure into the traditional UFS framework. It requires two hyperparameters, $\alpha$ and $\lambda$, which were set to 0.001 and 0.1, respectively.
  \item \textbf{U2} uses a radial basis kernel function to address non-linearity within the conventional UFS framework and does not require any hyperparameters.
  \item \textbf{VCSD} introduces a variance-covariance subspace distance to leverage feature correlations, requiring a hyperparameter $\rho$ to adjust a term in the objective function, which was set to 100.
  \item \textbf{FMI} integrates fuzzy mutual information into the UFS framework and requires a hyperparameter $\lambda$ for fuzzy-based entropy calculations, set to 0.1.
\end{itemize}

Because the proposed method relies on entropy calculations that require a discrete probability distribution for each feature, all numerical features were discretized.
Specifically, discretization was performed by the equal-width binning method \citep{talukdar2018kernel} with the number of bins set to ten.
The maximum number of selected features was set to 300, following the conventional UFS setting \citep{lim2021pairwise}.
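
As an illustration only, one common formulation of equal-width binning (the boundary handling below is an assumption and is not specified above) assigns a value $x$ of a numerical feature with observed minimum $x_{\min}$ and maximum $x_{\max} > x_{\min}$ to the bin index
\begin{equation*}
  b(x) = \min\left( \left\lfloor \frac{10 \cdot (x - x_{\min})}{x_{\max} - x_{\min}} \right\rfloor,\; 9 \right) \in \{0, 1, \ldots, 9\},
\end{equation*}
so that a feature ranging over $[0, 50]$ would, under this assumption, map $x = 12$ to bin $b(12) = 2$ and $x = 50$ to bin $9$.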

To evaluate the UFS methods, three evaluation measures are considered.
First, we measured the entropy of the feature subsets selected by the proposed and compared methods ($Entropy$), which can be represented as
\begin{equation}
  \label{eq:entropy}
  Entropy = H(S).
\end{equation}

Next, we employed the pattern discrimination power test ($PDP$), which measures the proportion of discriminable data patterns in the dataset based on the feature subset.
The $PDP$ can be represented as
\begin{equation}
  PDP(W) = \frac{1}{|W|} \cdot \sum_{i=1}^{|W|} \Biggl\llbracket \left(\sum_{j=1}^{i-1} \bigl\llbracket w_i = w_j \bigr\rrbracket \right) = 0  \Biggr\rrbracket,
\end{equation}
where $W$ is the dataset comprised of the feature subset $S$, $w_i$ is the $i$-th pattern, and $\bigl\llbracket \cdot \bigr\rrbracket$ yields one if the proposition stated in the brackets is true and returns zero otherwise.
The $PDP$ ranges from $\frac{1}{\vert W \vert}$, when all data patterns are indiscriminable, to 1, when all data patterns are discriminable.
Both the $PDP$ and $Entropy$ measures are monotonically non-decreasing as additional features are included \citep{artstein2004solution}.
The relationship between the $PDP$ and $Entropy$ measures is analyzed in the supplementary material.
Finally, the minimum number of features required for all patterns to become discriminable is measured based on the feature subsets selected by the proposed and compared methods.
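
As a small illustrative example that is not drawn from the experimental datasets, consider four patterns restricted to $S$ with $w_1 = w_3$ and with $w_1$, $w_2$, and $w_4$ mutually distinct.
The outer bracket in the $PDP$ definition is zero only for $i=3$, because $w_3$ coincides with the earlier pattern $w_1$, so that
\begin{equation*}
  PDP(W) = \frac{1}{4} \cdot (1 + 1 + 0 + 1) = 0.75 .
\end{equation*}
Assuming that the probabilities are estimated by empirical frequencies and that the logarithm is taken to base 2, the corresponding entropy is
\begin{equation*}
  H(S) = -\frac{2}{4} \log_2 \frac{2}{4} - 2 \cdot \frac{1}{4} \log_2 \frac{1}{4} = 0.5 + 1 = 1.5,
\end{equation*}
which is below the value $\log_2 4 = 2$ attained when all four patterns are discriminable.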

\begin{figure*}[!t]
  \centering
  \begin{tabular}{cccc} 
    & & & \\
    \includegraphics[width=0.235\linewidth]{./figs/uni_ALLAML.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_alzheimer.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_arcene.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_audiology.eps} \\
    \includegraphics[width=0.235\linewidth]{./figs/uni_ba.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_chess.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_CLL_SUB_111.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_coil20.eps} \\
    \includegraphics[width=0.235\linewidth]{./figs/uni_colon.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_leukemia.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_lsvt.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_lung.eps} \\
    \includegraphics[width=0.235\linewidth]{./figs/uni_lymphoma.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_madelon.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_mushrooms.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_nci9.eps} \\
    \includegraphics[width=0.235\linewidth]{./figs/uni_nursery.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_pdspeech.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_promoters.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_Prostate_GE.eps} \\
    \includegraphics[width=0.235\linewidth]{./figs/uni_SCADI.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_semeion.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_splice.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_Tic-Tac-Toe.eps} \\
    \includegraphics[width=0.235\linewidth]{./figs/uni_umist.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_warpAR10P.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_warpPIE10P.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/uni_yaleb.eps} \\
    \\
  \end{tabular}
  \caption{\label{fig:uni} Comparison results of $PDP$ performance according to the number of features selected by the five UFS methods.}
\end{figure*}

\begin{table}[!t]
  \centering
  \caption{Comparison results of $Entropy$ and $PDP$ performance based on maximization/minimization of $H(S)$.}
  \label{tb:minmax}
\begin{tabular}{lcc cc}
  \toprule
  \multicolumn{1}{c}{\multirow{2}{*}{Dataset}} & \multicolumn{2}{c}{$Entropy$} & \multicolumn{2}{c}{$PDP$} \\
  & $max$ & $min$ & $max$ & $min$ \\
  \midrule
ALLAML & \textbf{6.17} & 0.32 & \textbf{1.00} & 0.06 \\
Alzheimer & \textbf{7.44} & 2.18 & \textbf{1.00} & 0.18 \\
Arcene & \textbf{6.62} & 0.08 & \textbf{0.99} & 0.02 \\
CLL\_SUB\_111 & \textbf{6.79} & 0.37 & \textbf{1.00} & 0.05 \\
Coil20 & \textbf{10.49} & 3.67 & \textbf{1.00} & 0.25 \\
Colon & \textbf{5.95} & 1.81 & \textbf{1.00} & 0.19 \\
Leukemia & \textbf{6.17} & 0.73 & \textbf{1.00} & 0.11 \\
LSVT & \textbf{6.96} & 0.13 & \textbf{0.99} & 0.02 \\
Lung & \textbf{7.67} & 1.20 & \textbf{1.00} & 0.14 \\
Nci9 & \textbf{5.91} & 0.37 & \textbf{1.00} & 0.07 \\
Nursery & \textbf{13.66} & 12.66 & \textbf{1.00} & 0.50 \\
Prostate\_GE & \textbf{6.67} & 0.08 & \textbf{1.00} & 0.02 \\
Tic-Tac-Toe & \textbf{7.42} & 4.52 & \textbf{1.00} & 0.26 \\
WarpAR10P & \textbf{7.02} & 2.96 & \textbf{1.00} & 0.31 \\
WarpPIE10P & \textbf{7.70} & 3.67 & \textbf{0.99} & 0.28 \\
\midrule
Avg. Rank & \textbf{1.00} & 2.00 & \textbf{1.00} & 2.00 \\
  \bottomrule
\end{tabular}
\end{table}

\begin{table}[!t]
  \centering
  \caption{Comparison results of $Entropy$ and $PDP$ performance based on different initial settings, $H(f^+)$ and $H(f^+,f)$, for the proposed method.}
  \label{tb:cardinality}
  \resizebox{\columnwidth}{!}{
  \begin{tabular}{lcc cc}
  \toprule
  \multicolumn{1}{c}{\multirow{2}{*}{Dataset}} & \multicolumn{2}{c}{$Entropy$} & \multicolumn{2}{c}{$PDP$} \\
\multicolumn{1}{c}{} & $H(f^+)$ & $H(f^+,f)$ & $H(f^+)$ & $H(f^+,f)$ \\
  \midrule
ALLAML & \textbf{6.17} & \textbf{6.17} & \textbf{1.00} & \textbf{1.00} \\
Alzheimer & \textbf{7.44} & \textbf{7.44} & \textbf{1.00} & \textbf{1.00} \\
Arcene & 6.62 & \textbf{6.64} & 0.99 & \textbf{1.00} \\
Audiology & \textbf{7.11} & \textbf{7.11} & \textbf{0.72} & \textbf{0.72} \\
Ba & 10.23 & \textbf{10.23} & \textbf{0.90} & 0.90 \\
Chess & \textbf{11.64} & \textbf{11.64} & \textbf{1.00} & \textbf{1.00} \\
CLL\_SUB\_111 & \textbf{6.79} & \textbf{6.79} & \textbf{1.00} & \textbf{1.00} \\
Coil20 & 10.49 & \textbf{10.49} & 1.00 & \textbf{1.00} \\
Colon & \textbf{5.95} & 5.92 & \textbf{1.00} & 0.98 \\
Leukemia & \textbf{6.17} & 6.14 & \textbf{1.00} & 0.99 \\
LSVT & 6.96 & \textbf{6.98} & 0.99 & \textbf{1.00} \\
Lung & \textbf{7.67} & \textbf{7.67} & \textbf{1.00} & \textbf{1.00} \\
Lymphoma & \textbf{6.58} & \textbf{6.58} & \textbf{1.00} & \textbf{1.00} \\
Madelon & \textbf{11.34} & \textbf{11.34} & \textbf{1.00} & \textbf{1.00} \\
Mushrooms & \textbf{8.49} & \textbf{8.49} & \textbf{0.09} & \textbf{0.09} \\
Nci9 & \textbf{5.91} & \textbf{5.91} & \textbf{1.00} & \textbf{1.00} \\
Nursery & \textbf{13.66} & \textbf{13.66} & \textbf{1.00} & \textbf{1.00} \\
Pdspeech & \textbf{9.56} & \textbf{9.56} & \textbf{1.00} & \textbf{1.00} \\
Promoters & \textbf{6.71} & \textbf{6.71} & \textbf{1.00} & \textbf{1.00} \\
Prostate\_GE & \textbf{6.67} & \textbf{6.67} & \textbf{1.00} & \textbf{1.00} \\
SCADI & \textbf{6.13} & \textbf{6.13} & \textbf{1.00} & \textbf{1.00} \\
Semeion & \textbf{10.64} & \textbf{10.64} & \textbf{1.00} & \textbf{1.00} \\
SPECT & \textbf{7.33} & \textbf{7.33} & \textbf{0.79} & \textbf{0.79} \\
Splice & \textbf{11.13} & \textbf{11.13} & \textbf{0.79} & \textbf{0.79} \\
Tox171 & \textbf{9.90} & \textbf{9.90} & \textbf{1.00} & \textbf{1.00} \\
Tic-Tac-Toe & \textbf{7.42} & \textbf{7.42} & \textbf{1.00} & \textbf{1.00} \\
Umist & 9.12 & \textbf{9.12} & 0.98 & \textbf{0.98} \\
WarpAR10P & \textbf{7.02} & \textbf{7.02} & \textbf{1.00} & \textbf{1.00} \\
WarpPIE10P & 7.70 & \textbf{7.71} & 0.99 & \textbf{1.00} \\
Yaleb & \textbf{11.12} & 11.11 & \textbf{0.97} & 0.97 \\
\midrule
Avg. Rank & 1.20 & \textbf{1.10} & 1.17 & \textbf{1.13} \\
\bottomrule
  \end{tabular}
  }
\end{table}

\subsection{Experimental Results}
\label{sec:exr}

Table~\ref{tb:res1} presents experimental results on 30 datasets in terms of $Entropy$, $PDP$, and the minimum number of features required to make all patterns discriminable.
Because the maximum number of selected features was set to 300, if some patterns within a dataset remain indiscriminable even with 300 selected features, the corresponding entries in the table are filled with ``--'' because of the excessive computation time.
For each dataset, the entries for $Entropy$ and $PDP$ report the results obtained at the smallest number of features with which any of the UFS methods makes all patterns discriminable.
For the datasets where all UFS methods failed to discriminate all patterns within 300 features, the results for $Entropy$ and $PDP$ are reported when the number of selected features is 17, which reflects the average of the smallest number of features required on the other datasets.
In the table, the best results are highlighted in bold, and the average rank (Avg. Rank) of the proposed and compared methods is reported in the last row.

Experimental results demonstrate that the proposed method outperforms the compared methods in terms of $Entropy$ on 25 out of 30 datasets.
These results indicate that the incremental search of the proposed method, along with its derived objective function, effectively selects a feature subset that maximizes the entropy of the dataset restricted to the selected features.
Consequently, the features selected by the proposed method carry more information than those selected by the other methods.
Next, the proposed method outperforms the compared methods in terms of $PDP$ on 25 out of 30 datasets, indicating the superiority of the proposed method over the compared methods.
In the experiments on the minimum number of features required to make all patterns discriminable, the superiority of the proposed method is observed again because it yields a considerably more compact feature subset than the other methods.
Notably, on average, the proposed method selects $31\%$ fewer features than the second-best method on datasets where the proposed method outperforms the compared methods.
Overall, the proposed method outperforms compared methods in terms of three pattern discrimination power-related measures: $Entropy$, $PDP$, and the minimum number of features.

Figure~\ref{fig:uni} illustrates the $PDP$ performance as the number of selected features increases until all patterns are discriminable, with a maximum of 17 features, which is the same experimental setting as in Table~\ref{tb:res1}.
The results for 28 out of 30 datasets are shown as line plots, where the $x$- and $y$-axes represent the number of features and the $PDP$ performance, respectively.
Figure~\ref{fig:uni} shows that the proposed method consistently outperformed the compared methods on most of the datasets.
Specifically, on Arcene, Audiology, and Coil20 datasets, the proposed method demonstrated superior performance regardless of the number of selected features.
Moreover, on the SCADI, Semeion, and LSVT datasets, the $PDP$ performance of the proposed method increased more rapidly than that of the other methods, indicating the compactness of the feature subset selected by the proposed method.

The proposed method selects a feature that maximizes the entropy, which can be viewed as an opposite concept to the conventional information-theoretic UFS methods.
To investigate this aspect, we conducted additional experiments on the maximization and the minimization of the entropy.
In contrast to Algorithm~\ref{alg:prop}, the minimization approach starts with a feature with the smallest entropy and then adds a new feature that preserves the entropy of $S$.
Table~\ref{tb:minmax} presents the comparison results on 15 datasets.
Experimental results indicate that the two approaches lead to significantly different results.
For example, in the case of the Coil20 dataset, the feature subset based on the maximization approach yields a 1.00 $PDP$ value, indicating that all patterns are discriminable.
In contrast, 75\% of patterns are indiscriminable when the minimization approach is applied because the corresponding $PDP$ value is 0.25.

In Section~\ref{sec:score}, we explained that the algorithm may start by maximizing either $H(f^+)$ or $H(f^+,f)$, where $f \in F \setminus \{f^+\}$.
Because the initial choice affects all subsequent iterations, we conducted additional experiments to clarify whether the initial setting makes a significant difference.
Table~\ref{tb:cardinality} shows the experimental results of different initial settings in terms of $Entropy$ and $PDP$.
Overall, the results for $H(f^+)$ and $H(f^+,f)$ are largely consistent with one another, indicating that the initial setting affects the FS process insignificantly in terms of $Entropy$ and $PDP$.

Finally, we evaluate the classification accuracy and the MI based on the feature subset because these measures are frequently used in traditional UFS studies.
The experimental results show that the feature subset identified by the proposed method can yield better classification results.
In addition, we visualize the entropy value according to the number of features selected by the proposed and compared methods.
Detailed experimental results are provided in the supplementary material.


\section{Conclusion}

In this work, we proposed a UFS method based on maximizing entropy. 
This approach aims to produce a subset of features that maximizes the discriminability among patterns, thus serving the need for identifying novel patterns.
With a simple example demonstrating the consequences of the minimization and maximization approaches,
we provided a rigorous formulation of the score function based on theoretical derivation.
The experimental results showed that the proposed method outperforms compared methods in terms of pattern discrimination power-related measures.

Future work can explore the potential for real-world applications of the proposed method, including but not limited to, real-time recommendation systems, search engines, and dynamic content optimization. Moreover, the proposed method could be extended to handle different types of data and computational frameworks, providing a more universal solution to the challenges of high-dimensional data.

%------------------------------------------------------------------------------
\begin{acknowledgements} % will be removed in pdf for initial submission,
This work was supported by the Institute of Information \& communications Technology Planning \& Evaluation (IITP) grant funded by the Korea government (MSIT) (2021-0-01341, Artificial Intelligence Graduate School Program (Chung-Ang University)).
\end{acknowledgements}

%	BIBLIOGRAPHY
%------------------------------------------------------------------------------
\bibliography{sample-base}

\onecolumn

\title{Unsupervised Feature Selection towards\\ Maximizing Pattern Discrimination Power \\(Supplementary Material)}
\maketitle

\appendix
\section{Relationship between $PDP$ and $Entropy$ Measures}
In this section, we provide a theoretical proof to establish the relationship between the $Entropy$, the joint entropy of a feature subset, and the $PDP$ measure.
$PDP$ is defined as
\begin{equation}
  PDP(W) = \frac{1}{|W|} \cdot \sum_{i=1}^{|W|} \Biggl\llbracket \left(\sum_{j=1}^{i-1} \bigl\llbracket w_i = w_j \bigr\rrbracket \right) = 0  \Biggr\rrbracket,
\end{equation}
where $W$ is the dataset comprised of the feature subset $S$, $w_i$ is the $i$-th pattern, and $\bigl\llbracket \cdot \bigr\rrbracket$ yields one if the proposition stated in the brackets is true and returns zero otherwise.

We first provide a proposition to establish the relationship between the joint entropy of a feature subset and the $PDP$ measure.
\begin{prop}
  \label{prop:entropy}
  The joint entropy $H(S)$ is maximized when all patterns in the dataset $W$ are discriminable.
\end{prop}

\begin{proof}
The joint entropy $H(S)$ of the dataset quantifies the level of uncertainty within the feature subset $S$. It is defined as $H(S) = -\sum_{i=1}^{|W|} P(w_i) \log P(w_i)$, where $P(w_i)$ represents the probability of the $i$-th pattern occurring in the dataset. To simplify, let $y_i = P(w_i)$ and define $f(y) = -y \log y$, so that $H(S) = \sum_{i=1}^{|W|} f(y_i)$.

Because the function $f(y) = -y \log y$ is concave, we can employ Jensen's inequality \citep{jensen1906fonctions} to establish an upper bound for the sum $\sum_{i=1}^{|W|} f(y_i)$. Since the probabilities sum to one, their mean is $\frac{1}{|W|}$, which gives

\begin{equation}
    \sum_{i=1}^{|W|} f(P(w_i)) = \sum_{i=1}^{|W|} -P(w_i) \log P(w_i) \leq |W| \cdot f\left(\frac{1}{|W|}\right).
\end{equation}

Because $|W| \cdot f\left(\frac{1}{|W|}\right) = \log |W|$, the joint entropy is bounded as
\begin{equation}
    -\sum_{i=1}^{|W|} P(w_i) \log P(w_i) \leq \log |W|.
\end{equation}

When all patterns in $W$ are discriminable, this condition indicates a uniform distribution of occurrences, with $P(w_i) = \frac{1}{|W|}$ for all $i$, thereby transforming our inequality into the equality as
\begin{equation}
    H(S) = \log |W|.
\end{equation}

Thus, when all patterns in the dataset $W$ are discriminable, the joint entropy $H(S)$ of the dataset is maximized.
\end{proof}

Because the proposed score function aims to maximize $H(S)$, we next establish the relationship between the entropy of a feature subset and the $PDP$ by showing that a decrease in $PDP$ leads to a decrease in the upper bound of $H(S)$.
First, let $m = (1 - PDP(W)) \cdot |W|$.
For instance, consider $|W| = 10$ with a $PDP$ value of 0.6, so that $m=4$.
In this case, at least $m+1=5$ patterns are indiscriminable, as illustrated by the sequence $\{1, 2, 3, 4, 5, 6, 6, 6, 6, 6\}$, where the last five patterns are indiscriminable from each other.
Conversely, at most $2m=8$ patterns are indiscriminable, as exemplified by $\{1, 2, 3, 3, 4, 4, 5, 5, 6, 6\}$.
Based on this example, the upper bound of $H(S)$ can be stated as the following lemma.

\begin{lemm}
\label{lem:ent_bound}
  The joint entropy $H(S)$ of a dataset $W$ is bounded above as
\begin{equation}
  H(S) \leq -\frac{|W|-2m}{|W|} \cdot \log \frac{1}{|W|} - \frac{2m}{|W|} \cdot \log \frac{2}{|W|}, \label{eq:upperBound}
\end{equation}
where $m = (1 - PDP(W)) \cdot |W|$.
\end{lemm}

\begin{proof}
The number of indiscriminable patterns in $W$ ranges from $m+1$ to $2m$, as illustrated in the provided example.
For a fixed $m$, the entropy is largest when the patterns are spread as uniformly as possible, as mentioned in Proposition~\ref{prop:entropy}, which occurs when $2m$ patterns are indiscriminable and form $m$ pairs of identical patterns.
In that case, the $|W|-2m$ discriminable patterns each occur with probability $\frac{1}{|W|}$ and contribute $-\frac{|W|-2m}{|W|} \cdot \log \frac{1}{|W|}$, whereas the $m$ pairs each occur with probability $\frac{2}{|W|}$ and contribute $-\frac{2m}{|W|} \cdot \log \frac{2}{|W|}$.
Summing the two contributions yields the upper bound of the joint entropy $H(S)$ in Equation~(\ref{eq:upperBound}).
\end{proof}
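
As a worked check of Lemma~\ref{lem:ent_bound}, assuming base-2 logarithms and probabilities estimated by empirical frequencies, take $|W| = 10$ and $m = 4$ as in the example above. The bound evaluates to
\begin{equation*}
  -\frac{2}{10} \cdot \log_2 \frac{1}{10} - \frac{8}{10} \cdot \log_2 \frac{2}{10} = 0.2 \cdot \log_2 10 + 0.8 \cdot \log_2 5 \approx 2.52 .
\end{equation*}
The sequence $\{1, 2, 3, 3, 4, 4, 5, 5, 6, 6\}$ contains two values with probability $\frac{1}{10}$ and four values with probability $\frac{2}{10}$, so its entropy attains this bound, whereas the sequence $\{1, 2, 3, 4, 5, 6, 6, 6, 6, 6\}$ has entropy
\begin{equation*}
  5 \cdot \frac{1}{10} \cdot \log_2 10 + \frac{5}{10} \cdot \log_2 2 \approx 2.16,
\end{equation*}
which lies below the bound.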

According to Lemma~\ref{lem:ent_bound}, the relationship between the $PDP$ and the joint entropy $H(S)$ can be established by demonstrating that a decrease in $PDP$, which corresponds to an increase in $m$, leads to a decrease in the upper bound of $H(S)$.
Because the precise number of indiscriminable patterns is not determined by $m$, we construct our proof by comparing the upper bounds of $H(S)$ for $m$ and $m+1$.

\begin{thrm}
\label{thm:entropy}
The upper bound of $H(S)$ for $m$ is greater than the upper bound of $H(S)$ for $m+1$.
\end{thrm}
\begin{proof}
By applying Lemma~\ref{lem:ent_bound}, the upper bound of $H(S)$ for $m+1$ is given by
\begin{equation}
  -\frac{|W|-2(m+1)}{|W|} \cdot \log \frac{1}{|W|} -\frac{2(m+1)}{|W|} \cdot \log \frac{2}{|W|}.
\end{equation}
Subtracting the upper bound of $H(S)$ for $m+1$ from the upper bound of $H(S)$ for $m$, as detailed in Equation~(\ref{eq:upperBound}), we obtain
\begin{equation}
  \begin{split}
    &-\frac{|W|-2m}{|W|} \cdot \log \frac{1}{|W|} -\frac{2m}{|W|} \cdot \log \frac{2}{|W|} \\
    &- \left( -\frac{|W|-2(m+1)}{|W|} \cdot \log \frac{1}{|W|} - \frac{2(m+1)}{|W|} \cdot \log \frac{2}{|W|} \right) \\
    &= \frac{2}{|W|} \cdot \left( \log \frac{2}{|W|} - \log \frac{1}{|W|} \right) = \frac{2}{|W|} \cdot \log 2 > 0,
  \end{split}
\end{equation}
which equals $\frac{2}{|W|}$ when the logarithm is taken to base 2.
\end{proof}
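
As a numerical check of Theorem~\ref{thm:entropy}, assuming base-2 logarithms, let $|W| = 10$. For $m = 4$, the bound in Equation~(\ref{eq:upperBound}) is $0.2 \cdot \log_2 10 + 0.8 \cdot \log_2 5 \approx 2.52$, and for $m + 1 = 5$ it is $\log_2 5 \approx 2.32$. Their difference is
\begin{equation*}
  0.2 \cdot \left( \log_2 10 - \log_2 5 \right) = 0.2 \cdot \log_2 2 = 0.2 = \frac{2}{|W|} .
\end{equation*}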

Because the $PDP$ and the upper bound of $H(S)$ are both decreasing functions of $m$, Theorem \ref{thm:entropy} indicates that the proposed method, which aims to maximize $H(S)$, can effectively select a feature subset that maximizes the pattern discrimination power.

\section{Additional Experimental Results}

\begin{table}[h]
\centering
\caption{Comparison results of execution time and MI performance.}
\label{tb:timemi}
  \resizebox{\columnwidth}{!}{
\begin{tabular}{lccccc|ccccc}
  \toprule
\multicolumn{1}{c}{\multirow{2}{*}{Dataset}} & \multicolumn{5}{c|}{Execution Time in Seconds} & \multicolumn{5}{c}{MI} \\
\multicolumn{1}{c}{} & Proposed & EGC & U2 & VCSD & FMI & Proposed & EGC & U2 & VCSD & FMI \\
\midrule
ALLAML & 375.65 & \textbf{221.69} & 5978.10 & 1138.32 & 1486.96 & \textbf{0.93} & 0.46 & 0.54 & 0.04 & 0.50 \\
Alzheimer & 1165.01 & \textbf{662.63} & 23696.24 & 3816.60 & 5357.24 & \textbf{0.97} & 0.03 & 0.08 & 0.34 & 0.71 \\
Arcene & 303.85 & \textbf{149.41} & 4204.49 & 741.45 & 1587.23 & \textbf{0.95} & 0.34 & 0.22 & 0.06 & 0.17 \\
Audiology & 0.18 & \textbf{0.01} & 1.96 & 0.46 & 0.74 & \textbf{3.04} & 1.32 & 2.08 & 2.03 & 2.01 \\
Ba & 0.06 & 0.06 & 1.05 & \textbf{0.04} & 0.17 & \textbf{4.17} & 1.60 & 1.41 & 2.97 & 3.14 \\
Chess & 2.28 & \textbf{0.10} & 7.33 & 3.87 & 13.82 & \textbf{0.58} & 0.32 & 0.26 & 0.26 & 0.22 \\
CLL\_SUB\_111 & 775.53 & \textbf{440.78} & 15112.69 & 2839.68 & 3555.26 & \textbf{1.37} & 0.93 & 0.87 & 0.41 & \textbf{1.37} \\
Coil20 & 0.08 & \textbf{0.07} & 0.07 & 0.27 & 4.89 & \textbf{4.32} & 0.57 & 2.87 & 4.32 & 4.27 \\
Colon & 7.83 & \textbf{3.59} & 16.22 & 6.59 & 650.64 & \textbf{0.94} & 0.87 & 0.81 & 0.66 & 0.41 \\
Leukemia & \textbf{0.20} & 27.34 & 9.36 & 12.26 & 514.60 & \textbf{0.93} & \textbf{0.93} & \textbf{0.93} & 0.88 & 0.76 \\
LSVT & 92.26 & \textbf{4.93} & 49.98 & 20.90 & 8496.83 & \textbf{0.89} & 0.05 & 0.05 & 0.41 & 0.13 \\
Lung & 266.99 & \textbf{136.40} & 3663.49 & 768.40 & 911.85 & \textbf{1.49} & 1.38 & \textbf{1.49} & \textbf{1.49} & \textbf{1.49} \\
Lymphoma & 388.49 & \textbf{218.35} & 5872.40 & 1102.46 & 1600.35 & \textbf{2.45} & \textbf{2.45} & \textbf{2.45} & \textbf{2.45} & \textbf{2.45} \\
Madelon & 0.99 & \textbf{0.15} & 6.03 & 2.46 & 4.38 & \textbf{0.35} & 0.00 & 0.23 & 0.01 & 0.24 \\
Mushrooms & 162.93 & \textbf{22.47} & 569.18 & 166.40 & 1085.49 & \textbf{0.93} & 0.76 & 0.76 & 0.67 & 0.86 \\
Nci9 & 148.46 & \textbf{40.09} & 954.09 & 263.30 & 660.76 & \textbf{3.08} & 2.64 & \textbf{3.08} & 0.58 & 2.98 \\
Nursery & 34.69 & 21.69 & 60.47 & \textbf{19.09} & 12031.08 & \textbf{1.13} & 0.17 & 0.38 & 1.06 & 1.11 \\
Pdspeech & \textbf{2.06} & 32.62 & 28.36 & 52.79 & 5279.50 & \textbf{0.65} & 0.01 & 0.09 & 0.10 & 0.14 \\
Promoters & 659.48 & \textbf{441.33} & 16076.21 & 2555.64 & 3328.74 & \textbf{1.00} & 0.71 & 0.41 & 0.97 & 0.96 \\
Prostate\_GE & \textbf{0.40} & 1389.47 & 184.80 & 194.09 & 22083.95 & \textbf{0.98} & 0.71 & 0.86 & 0.15 & 0.20 \\
SCADI & 18.67 & \textbf{1.17} & 19.37 & 10.07 & 712.49 & \textbf{2.18} & 1.66 & 1.62 & 1.63 & 1.54 \\
Semeion & 0.51 & \textbf{0.03} & 3.97 & 1.28 & 1.85 & \textbf{2.62} & 1.94 & 0.82 & 1.29 & 2.13 \\
SPECT & 5.58 & \textbf{3.64} & 11.89 & 5.70 & 992.32 & 0.40 & 0.37 & 0.37 & \textbf{0.42} & 0.39 \\
Splice & \textbf{14.00} & 29.64 & 55.27 & 21.94 & 7801.44 & \textbf{1.32} & 0.03 & 0.02 & 1.27 & 1.31 \\
Tox171 & 426.87 & \textbf{120.82} & 3212.45 & 637.41 & 2417.68 & \textbf{2.00} & 1.97 & \textbf{2.00} & 1.90 & \textbf{2.00} \\
Tic-Tac-Toe & \textbf{0.04} & 0.66 & 1.16 & 0.68 & 5.77 & \textbf{0.93} & 0.19 & 0.19 & 0.62 & \textbf{0.93} \\
Umist & 16.96 & \textbf{0.59} & 14.52 & 7.02 & 479.49 & \textbf{4.28} & 4.27 & \textbf{4.28} & \textbf{4.28} & \textbf{4.28} \\
WarpAR10P & 68.45 & \textbf{7.59} & 191.55 & 78.93 & 264.45 & \textbf{3.32} & 3.25 & 3.17 & 3.18 & 3.20 \\
WarpPIE10P & 99.19 & \textbf{7.64} & 207.88 & 80.65 & 491.89 & \textbf{3.32} & \textbf{3.32} & 3.02 & 3.14 & \textbf{3.32} \\
Yaleb & 185.28 & \textbf{18.93} & 127.98 & 31.47 & 35409.53 & \textbf{5.24} & 5.18 & 4.30 & 5.13 & 5.23 \\
\midrule
  Avg. Rank & 2.20 & \textbf{1.47} & 4.07 & 2.70 & 4.57 & \textbf{1.03} & 3.47 & 3.33 & 3.33 & 2.60 \\
  \bottomrule
\end{tabular}}
\end{table}

\begin{table*}[!t]
  \centering
  \caption{Comparison results of classification accuracy performance based on the feature subsets selected by the five UFS methods.}
  \label{tb:res2}
    \resizebox{\textwidth}{!}{
  \begin{tabular}{l ccccc|ccccc}
  \toprule
  \multicolumn{1}{c}{\multirow{2}{*}{Dataset}} & \multicolumn{5}{c|}{Na{\"i}ve Bayes} & \multicolumn{5}{c}{Decision Tree} \\
  \multicolumn{1}{c}{} & Proposed & EGC & U2 & VCSD & FMI & Proposed & EGC & U2 & VCSD & FMI \\
  \midrule
  ALLAML & 0.57 $\pm$ 0.22 & \textbf{0.72 $\pm$ 0.20} & 0.67 $\pm$ 0.11 & 0.65 $\pm$ 0.24 & 0.71 $\pm$ 0.10 & 0.56 $\pm$ 0.23 & \textbf{0.68 $\pm$ 0.21} & 0.61 $\pm$ 0.13 & 0.65 $\pm$ 0.24 & 0.59 $\pm$ 0.19 \\
  Alzheimer & \textbf{0.76 $\pm$ 0.12} & 0.41 $\pm$ 0.08 & 0.52 $\pm$ 0.14 & 0.71 $\pm$ 0.09 & 0.68 $\pm$ 0.13 & \textbf{0.75 $\pm$ 0.09} & 0.42 $\pm$ 0.09 & 0.45 $\pm$ 0.13 & 0.65 $\pm$ 0.08 & 0.69 $\pm$ 0.12 \\
  Arcene & 0.60 $\pm$ 0.14 & 0.45 $\pm$ 0.16 & \textbf{0.62 $\pm$ 0.15} & 0.56 $\pm$ 0.14 & 0.58 $\pm$ 0.12 & 0.58 $\pm$ 0.08 & 0.49 $\pm$ 0.09 & 0.56 $\pm$ 0.15 & 0.55 $\pm$ 0.14 & \textbf{0.60 $\pm$ 0.12} \\
  Audiology & \textbf{0.63 $\pm$ 0.14} & 0.52 $\pm$ 0.10 & 0.50 $\pm$ 0.10 & 0.35 $\pm$ 0.12 & 0.30 $\pm$ 0.12 & \textbf{0.66 $\pm$ 0.10} & 0.52 $\pm$ 0.11 & 0.49 $\pm$ 0.10 & 0.33 $\pm$ 0.11 & 0.38 $\pm$ 0.12 \\
  Ba & \textbf{0.25 $\pm$ 0.03} & 0.07 $\pm$ 0.02 & 0.07 $\pm$ 0.03 & 0.21 $\pm$ 0.03 & 0.24 $\pm$ 0.05 & 0.21 $\pm$ 0.04 & 0.08 $\pm$ 0.02 & 0.08 $\pm$ 0.02 & 0.21 $\pm$ 0.03 & \textbf{0.25 $\pm$ 0.03} \\
  Chess & \textbf{0.75 $\pm$ 0.03} & 0.71 $\pm$ 0.02 & 0.67 $\pm$ 0.01 & 0.63 $\pm$ 0.02 & 0.62 $\pm$ 0.03 & \textbf{0.82 $\pm$ 0.02} & 0.73 $\pm$ 0.02 & 0.70 $\pm$ 0.02 & 0.67 $\pm$ 0.01 & 0.66 $\pm$ 0.02 \\
  CLL\_SUB\_111 & 0.62 $\pm$ 0.19 & 0.32 $\pm$ 0.12 & 0.46 $\pm$ 0.12 & 0.57 $\pm$ 0.16 & \textbf{0.65 $\pm$ 0.16} & 0.56 $\pm$ 0.18 & 0.38 $\pm$ 0.14 & 0.33 $\pm$ 0.15 & \textbf{0.59 $\pm$ 0.12} & 0.49 $\pm$ 0.13 \\
  Coil20 & \textbf{0.85 $\pm$ 0.03} & 0.10 $\pm$ 0.02 & 0.36 $\pm$ 0.05 & 0.78 $\pm$ 0.04 & 0.82 $\pm$ 0.04 & 0.81 $\pm$ 0.04 & 0.10 $\pm$ 0.02 & 0.45 $\pm$ 0.03 & 0.79 $\pm$ 0.03 & \textbf{0.81 $\pm$ 0.04} \\
  Colon & 0.57 $\pm$ 0.20 & 0.53 $\pm$ 0.22 & \textbf{0.66 $\pm$ 0.22} & 0.59 $\pm$ 0.15 & 0.53 $\pm$ 0.22 & 0.42 $\pm$ 0.15 & 0.50 $\pm$ 0.24 & 0.44 $\pm$ 0.21 & 0.32 $\pm$ 0.10 & \textbf{0.60 $\pm$ 0.13} \\
  Leukemia & \textbf{0.80 $\pm$ 0.15} & 0.54 $\pm$ 0.17 & 0.67 $\pm$ 0.19 & 0.65 $\pm$ 0.14 & 0.69 $\pm$ 0.15 & 0.44 $\pm$ 0.04 & 0.13 $\pm$ 0.02 & 0.28 $\pm$ 0.03 & 0.29 $\pm$ 0.04 & \textbf{0.47 $\pm$ 0.04} \\
  LSVT & \textbf{0.77 $\pm$ 0.06} & 0.66 $\pm$ 0.15 & 0.66 $\pm$ 0.15 & 0.70 $\pm$ 0.12 & 0.60 $\pm$ 0.10 & \textbf{0.73 $\pm$ 0.13} & 0.48 $\pm$ 0.22 & 0.63 $\pm$ 0.19 & 0.69 $\pm$ 0.12 & 0.67 $\pm$ 0.16 \\
  Lung & 0.80 $\pm$ 0.07 & 0.74 $\pm$ 0.08 & 0.75 $\pm$ 0.08 & 0.80 $\pm$ 0.08 & \textbf{0.86 $\pm$ 0.06} & 0.69 $\pm$ 0.11 & 0.66 $\pm$ 0.15 & 0.67 $\pm$ 0.15 & \textbf{0.79 $\pm$ 0.07} & 0.60 $\pm$ 0.13 \\
  Lymphoma & 0.58 $\pm$ 0.18 & 0.51 $\pm$ 0.22 & 0.43 $\pm$ 0.11 & 0.75 $\pm$ 0.15 & \textbf{0.80 $\pm$ 0.13} & 0.80 $\pm$ 0.12 & 0.66 $\pm$ 0.10 & 0.62 $\pm$ 0.09 & 0.76 $\pm$ 0.07 & \textbf{0.83 $\pm$ 0.08} \\
  Madelon & 0.56 $\pm$ 0.02 & 0.47 $\pm$ 0.02 & 0.49 $\pm$ 0.02 & 0.47 $\pm$ 0.02 & \textbf{0.62 $\pm$ 0.03} & 0.50 $\pm$ 0.17 & 0.47 $\pm$ 0.22 & 0.48 $\pm$ 0.16 & 0.58 $\pm$ 0.16 & \textbf{0.64 $\pm$ 0.13} \\
  Mushrooms & 0.89 $\pm$ 0.01 & \textbf{0.93 $\pm$ 0.01} & \textbf{0.93 $\pm$ 0.01} & 0.80 $\pm$ 0.02 & 0.90 $\pm$ 0.03 & 0.54 $\pm$ 0.02 & 0.47 $\pm$ 0.02 & 0.51 $\pm$ 0.03 & 0.47 $\pm$ 0.02 & \textbf{0.69 $\pm$ 0.03} \\
  Nci9 & \textbf{0.17 $\pm$ 0.16} & 0.07 $\pm$ 0.09 & \textbf{0.17 $\pm$ 0.14} & 0.10 $\pm$ 0.09 & \textbf{0.17 $\pm$ 0.16} & \textbf{0.98 $\pm$ 0.00} & 0.94 $\pm$ 0.01 & 0.94 $\pm$ 0.01 & 0.86 $\pm$ 0.01 & 0.96 $\pm$ 0.01 \\
  Nursery & 0.76 $\pm$ 0.01 & 0.42 $\pm$ 0.01 & 0.51 $\pm$ 0.01 & \textbf{0.76 $\pm$ 0.01} & 0.76 $\pm$ 0.01 & 0.18 $\pm$ 0.12 & 0.10 $\pm$ 0.12 & \textbf{0.20 $\pm$ 0.15} & 0.15 $\pm$ 0.12 & 0.13 $\pm$ 0.17 \\
  Pdspeech & \textbf{0.75 $\pm$ 0.03} & \textbf{0.75 $\pm$ 0.03} & 0.74 $\pm$ 0.03 & 0.74 $\pm$ 0.04 & 0.69 $\pm$ 0.06 & \textbf{0.79 $\pm$ 0.01} & 0.41 $\pm$ 0.01 & 0.54 $\pm$ 0.01 & 0.76 $\pm$ 0.01 & 0.78 $\pm$ 0.01 \\
  Promoters & \textbf{0.93 $\pm$ 0.11} & 0.67 $\pm$ 0.19 & 0.48 $\pm$ 0.16 & 0.92 $\pm$ 0.13 & 0.88 $\pm$ 0.13 & 0.37 $\pm$ 0.07 & 0.36 $\pm$ 0.05 & 0.15 $\pm$ 0.07 & 0.35 $\pm$ 0.08 & \textbf{0.45 $\pm$ 0.06} \\
  Prostate\_GE & 0.74 $\pm$ 0.10 & \textbf{0.79 $\pm$ 0.14} & 0.64 $\pm$ 0.13 & 0.57 $\pm$ 0.15 & 0.57 $\pm$ 0.24 & 0.70 $\pm$ 0.08 & 0.75 $\pm$ 0.03 & 0.74 $\pm$ 0.03 & 0.75 $\pm$ 0.03 & \textbf{0.79 $\pm$ 0.04} \\
  SCADI & \textbf{0.80 $\pm$ 0.17} & 0.66 $\pm$ 0.14 & \textbf{0.80 $\pm$ 0.15} & 0.74 $\pm$ 0.11 & 0.73 $\pm$ 0.16 & \textbf{0.90 $\pm$ 0.13} & 0.72 $\pm$ 0.13 & 0.53 $\pm$ 0.12 & 0.88 $\pm$ 0.11 & 0.86 $\pm$ 0.13 \\
  Semeion & 0.50 $\pm$ 0.03 & 0.53 $\pm$ 0.07 & 0.28 $\pm$ 0.03 & 0.27 $\pm$ 0.04 & \textbf{0.54 $\pm$ 0.04} & 0.72 $\pm$ 0.13 & \textbf{0.79 $\pm$ 0.13} & 0.54 $\pm$ 0.17 & 0.61 $\pm$ 0.13 & 0.55 $\pm$ 0.20 \\
  SPECT & \textbf{0.81 $\pm$ 0.06} & 0.76 $\pm$ 0.06 & 0.76 $\pm$ 0.06 & 0.81 $\pm$ 0.05 & 0.80 $\pm$ 0.07 & \textbf{0.80 $\pm$ 0.15} & 0.73 $\pm$ 0.18 & \textbf{0.80 $\pm$ 0.15} & 0.76 $\pm$ 0.17 & 0.77 $\pm$ 0.10 \\
  Splice & \textbf{0.93 $\pm$ 0.02} & 0.52 $\pm$ 0.02 & 0.52 $\pm$ 0.02 & 0.91 $\pm$ 0.02 & 0.91 $\pm$ 0.02 & 0.53 $\pm$ 0.05 & 0.54 $\pm$ 0.07 & 0.31 $\pm$ 0.04 & 0.28 $\pm$ 0.03 & \textbf{0.56 $\pm$ 0.05} \\
  Tox171 & 0.45 $\pm$ 0.15 & 0.30 $\pm$ 0.08 & 0.46 $\pm$ 0.11 & \textbf{0.51 $\pm$ 0.15} & 0.47 $\pm$ 0.08 & 0.81 $\pm$ 0.08 & 0.80 $\pm$ 0.08 & 0.80 $\pm$ 0.08 & \textbf{0.84 $\pm$ 0.08} & 0.81 $\pm$ 0.07 \\
  Tic-Tac-Toe & 0.72 $\pm$ 0.06 & 0.65 $\pm$ 0.04 & 0.65 $\pm$ 0.04 & 0.69 $\pm$ 0.05 & \textbf{0.72 $\pm$ 0.05} & 0.93 $\pm$ 0.02 & 0.52 $\pm$ 0.02 & 0.52 $\pm$ 0.02 & 0.93 $\pm$ 0.01 & \textbf{0.93 $\pm$ 0.01} \\
  Umist & 0.63 $\pm$ 0.08 & 0.69 $\pm$ 0.06 & 0.83 $\pm$ 0.03 & 0.60 $\pm$ 0.07 & \textbf{0.83 $\pm$ 0.07} & \textbf{0.52 $\pm$ 0.12} & 0.36 $\pm$ 0.11 & 0.44 $\pm$ 0.13 & 0.40 $\pm$ 0.06 & 0.48 $\pm$ 0.11 \\
  WarpAR10P & 0.40 $\pm$ 0.14 & 0.18 $\pm$ 0.09 & 0.27 $\pm$ 0.15 & 0.26 $\pm$ 0.17 & \textbf{0.46 $\pm$ 0.15} & 0.94 $\pm$ 0.02 & 0.69 $\pm$ 0.04 & 0.69 $\pm$ 0.04 & 0.82 $\pm$ 0.04 & \textbf{0.95 $\pm$ 0.02} \\
  WarpPIE10P & \textbf{0.49 $\pm$ 0.11} & 0.31 $\pm$ 0.06 & 0.45 $\pm$ 0.10 & 0.24 $\pm$ 0.09 & 0.37 $\pm$ 0.10 & 0.64 $\pm$ 0.10 & 0.69 $\pm$ 0.03 & \textbf{0.80 $\pm$ 0.07} & 0.67 $\pm$ 0.05 & 0.77 $\pm$ 0.08 \\
  Yaleb & 0.08 $\pm$ 0.02 & \textbf{0.09 $\pm$ 0.02} & 0.06 $\pm$ 0.01 & 0.07 $\pm$ 0.02 & 0.07 $\pm$ 0.01 & 0.38 $\pm$ 0.08 & 0.20 $\pm$ 0.07 & 0.33 $\pm$ 0.13 & 0.41 $\pm$ 0.19 & \textbf{0.47 $\pm$ 0.09} \\
  \midrule
  Avg. Rank & \textbf{1.97} & 3.70 & 3.17 & 3.30 & 2.57 & \textbf{2.09} & 3.73 & 3.61 & 3.09 & 2.27 \\
  \bottomrule
  \end{tabular}
  }
\end{table*}

\begin{figure*}[!t]
  \centering
  \resizebox{\textwidth}{!}{
  \begin{tabular}{cccc} 
    \multicolumn{2}{c}{Na{\"i}ve Bayes} & \multicolumn{2}{c}{Decision Tree} \\ 
    \includegraphics[width=0.235\linewidth]{./figs/nbacc_alzheimer.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/nbacc_audiology.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/tree_alzheimer.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/tree_audiology.eps} \\
    \includegraphics[width=0.235\linewidth]{./figs/nbacc_chess.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/nbacc_SCADI.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/tree_chess.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/tree_SCADI.eps} \\
  \end{tabular}
  }
  \caption{\label{fig:acc} Classification accuracy as a function of the number of features selected by the five UFS methods.}
\end{figure*}

\begin{figure*}[!t]
  \centering
  \begin{tabular}{cccc} 
    & & & \\
    \includegraphics[width=0.235\linewidth]{./figs/ent_ALLAML.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_alzheimer.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_arcene.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_audiology.eps} \\
    \includegraphics[width=0.235\linewidth]{./figs/ent_ba.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_chess.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_CLL_SUB_111.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_coil20.eps} \\
    \includegraphics[width=0.235\linewidth]{./figs/ent_colon.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_leukemia.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_lsvt.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_lung.eps} \\
    \includegraphics[width=0.235\linewidth]{./figs/ent_lymphoma.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_madelon.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_mushrooms.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_nci9.eps} \\
    \includegraphics[width=0.235\linewidth]{./figs/ent_nursery.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_pdspeech.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_promoters.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_Prostate_GE.eps} \\
    \includegraphics[width=0.235\linewidth]{./figs/ent_SCADI.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_semeion.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_splice.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_Tic-Tac-Toe.eps} \\
    \includegraphics[width=0.235\linewidth]{./figs/ent_umist.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_warpAR10P.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_warpPIE10P.eps} &
    \includegraphics[width=0.235\linewidth]{./figs/ent_yaleb.eps} \\
  \end{tabular}
  \caption{\label{fig:ent} Entropy of the selected feature subset as a function of the number of features selected by the five UFS methods.}
\end{figure*}
Table~\ref{tb:timemi} compares the execution times and MI of the evaluated UFS methods.
The best performance is highlighted in bold, and the average rank is presented at the bottom of the table.
Execution times are reported in seconds, reflecting the computational effort required to process each dataset.
The experiments were conducted on a system equipped with a 13th Generation Intel\textsuperscript{\textregistered} Core\texttrademark{} i9-13900K processor clocked at 3.00 GHz.
On average, the proposed method ranks as the second fastest, following EGC, which highlights its efficiency and potential applicability to a wide range of datasets.
Furthermore, for algorithms such as EGC that rely on specific parameter settings, the time needed to reach optimal results can vary significantly with the number of parameters adjusted.
In such cases, the execution time can increase substantially with the complexity of the parameter space being searched.

In traditional UFS studies, the superiority of a devised method has been validated by measuring the MI between the selected feature subset and the class variable, or by directly using classifiers such as na{\"i}ve Bayes or decision trees.
Although this kind of validation is possible only when the dataset contains a class variable $C$, it is reasonable from the viewpoint of information theory for two reasons.
First, an increase in the MI between the input features and the class variable leads to a reduction of the na{\"i}ve Bayes error owing to Fano's inequality.
Second, the $k$-cardinality approximation of the MI between $S$ and $C$ with $k=1$ leads to the unsupervised score function \citep{seo2019generalized}; i.e., maximizing $H(S)$ can contribute to an increase in the MI $M(S,C) = H(S) + H(C) - H(S,C)$.
Here, we report the performance of the proposed and compared methods in terms of MI and classification accuracy.
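
As a brief illustration of the relation between $H(S)$ and $M(S,C)$, the identity above can be rewritten using only standard information-theoretic relations for discrete variables; this is a sketch for intuition rather than part of the original derivation:
\begin{equation*}
  \begin{split}
    M(S,C) & = H(S) + H(C) - H(S,C) \\
    & = H(S) - H(S \mid C) \\
    & \leq \min\left(H(S), H(C)\right).
  \end{split}
\end{equation*}
Thus, $H(S)$ acts as an upper bound on the attainable MI, so a subset with small entropy cannot be highly informative about $C$.
Note that a larger $H(S)$ raises this bound but does not by itself guarantee a larger $M(S,C)$, because $H(S \mid C)$ may grow as well.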

In terms of MI, Table~\ref{tb:timemi} shows that the proposed method achieved the best performance on 29 of the 30 datasets, with an average rank of 1.03.
These results suggest that the proposed method could potentially improve classification performance more effectively than the compared methods.
To verify this, experiments regarding the classification accuracy were performed using the features selected by the proposed and compared methods.

Table~\ref{tb:res2} summarizes the classification accuracy of features selected by both the proposed and compared methods, evaluated using na{\"i}ve Bayes and decision tree classifiers.
Specifically, the classification accuracy was measured by 10-fold cross-validation with the na{\"i}ve Bayes and decision tree classifiers, each trained on the data composed of the selected features.
In the case of the na{\"i}ve Bayes classifier, the proposed method achieved the highest classification accuracy on 14 out of 30 datasets, with an average rank of 1.97.
This outperformed the next most effective method, FMI, which obtained an average rank of 2.57.
For the decision tree classifier, the proposed method also ranked first, achieving the highest classification accuracy on 9 out of the 30 datasets and an average rank of 2.09.
This surpassed the second-best method, EGC, which obtained an average rank of 2.27.


Figure~\ref{fig:acc} illustrates the classification accuracy achieved by the proposed and compared methods across varying feature sizes on the Alzheimer, Audiology, Chess, and SCADI datasets.
In the figure, the blue line represents the proposed method, while the black lines represent the compared methods.
Notably, the classification accuracy of the proposed method surpassed that of the compared methods across all four datasets as the feature size increased.
This superior performance can be attributed to the feature subset that is optimized towards pattern discrimination power, thereby contributing to improved classification accuracy.

Figure~\ref{fig:ent} compares the entropy of the feature subsets selected by the five UFS methods as the number of selected features varies.
This result is expected because the proposed method directly maximizes the entropy of the feature subset, whereas the compared methods do not; nevertheless, it confirms that the proposed method significantly outperforms the compared methods in terms of entropy.

\begin{figure}[t]
  \centering
  \begin{tabular}{c} 
    \includegraphics[width=0.6\linewidth]{./figs/ent_min.eps} \\
    (a) Entropy Minimization $S=\{f_7, f_{13}, f_{55}\}$ \\
     \includegraphics[width=0.6\linewidth]{./figs/ent_max.eps} \\
    (b) Entropy Maximization $S=\{f_{90}, f_{11}, f_{76}\}$\\
  \end{tabular}
  \caption{\label{fig:ent_comp} Taxonomic trees constructed from the features selected by (a) the entropy minimization method and (b) the proposed entropy maximization method.}
\end{figure}

\section{Application: Taxonomic Tree Construction}
The motivation behind maximizing entropy is to fully utilize the discrimination power of patterns, leading to improved information retrieval \citep{zhong2011energy}, such as enhancing the process of taxonomy construction for each pattern.
This method aligns with information theory principles by selecting features that offer maximal informational value, thereby improving pattern discrimination power.
Specifically, constructing the taxonomic tree \citep{reiman2020popphy} starting from the root node and distributing patterns into the tree based on prioritized features with the highest entropy enables a significant reduction in the maximum depth of the tree.
This effectiveness arises because features selected in order of descending entropy are arranged from the root node downwards, allowing the values of each pattern associated with these features to be evenly distributed across the tree.
Such an organization ensures that the tree expands in a balanced manner, simplifies analytical processes by making the data structure more compact, and enhances the clarity and efficiency of data interpretation. 
The proposed method selects features iteratively based on their scores from the score function that maximizes the entropy of the feature subset. 
Considering that the highest entropy is observed in a uniform distribution when patterns are distributed using the selected features in a tree structure, this strategy enables the construction of a tree that closely approximates a balanced tree.
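
As a small illustration (a hypothetical example, not taken from the experiments), suppose that $S$ consists of three binary features, so that the patterns are distributed over at most $2^3 = 8$ leaves, and let $p_\ell$ denote the fraction of patterns assigned to leaf $\ell$ (a symbol introduced only for this example).
Assuming that the entropy is estimated from these empirical fractions, with the convention $0 \log_2 0 = 0$,
\begin{equation*}
  H(S) = -\sum_{\ell=1}^{8} p_\ell \log_2 p_\ell \leq \log_2 8 = 3 \text{ bits},
\end{equation*}
with equality if and only if $p_\ell = 1/8$ for every leaf, i.e., when the tree is perfectly balanced.
At the other extreme, $H(S) = 0$ when all patterns fall into a single leaf.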

We conducted experiments on the binary Mushrooms dataset to assess the effectiveness of taxonomic trees constructed via the proposed entropy maximization method compared to the conventional entropy minimization approach.
For the entropy minimization method, features were selected using the same algorithm as the proposed method, except the objective was to minimize entropy within the feature subset.
Both strategies selected three features from the dataset and constructed the trees starting from the root node in the selection order, resulting in trees with a depth of three levels.
Patterns were then assigned to each node based on their feature values.

Figure~\ref{fig:ent_comp} depicts the constructed taxonomic trees.
The numbers in nodes indicate the number of assigned patterns at each node, whereas the circles highlight the feature at each division.
To visually represent the quantity of patterns per node, the width of each node was adjusted logarithmically in proportion to the number of patterns it contains.
The tree derived from the entropy minimization method revealed a significantly skewed structure, with the majority of patterns assigned to the leftmost node.
In contrast, the tree resulting from the proposed entropy maximization method exhibited a more balanced structure, approximating a uniform distribution of patterns across the nodes.
Given the skewed nature of the tree resulting from the entropy minimization method, which may frequently require additional comparisons to locate a novel pattern, the proposed entropy maximization method presents itself as a potentially more efficient solution for applications such as information retrieval systems.

\begin{algorithm}[!t]
\caption{Incremental Search for the Proposed Method when $k=3$}
\begin{algorithmic}[1]
\State $f^+ \gets \argmax_{f^+ \in F} H(f^+)$ \Comment{Select the first feature}
\State $S \gets \{f^+\}$
\State $f^+ \gets \argmax_{f^+ \in F - S} \sum_{f \in S} H(f^+, f)$ \Comment{Select the second feature}
\State $S \gets S \cup \{f^+\}$
  \While{$\vert S \vert < n$}
    \State $f^+ \gets \argmax_{f^+ \in F - S} \sum_{f_i \in S}\sum_{f_j \in S} H(f^+, f_i, f_j)$
    \State $S \gets S \cup \{f^+\}$
  \EndWhile
\end{algorithmic}
\label{alg:prop_k3}
\end{algorithm}

\section{Score Function with Varying $k$-Cardinality}
We explain the implications of using values of $k$ larger than 2 through a concrete example, namely the score function instantiated for $k=3$.
First, the score function of the proposed method for $k=3$ can be rewritten as

\begin{equation}
  \label{eq:objective_k31}
  \begin{split}
    J & \approx \argmax_{f^+} \left(\sum_{i=1}^{b} \frac{i}{\vert S \vert + 1 -i}\right) U_{3} (\{S', f^+\}') \\
    & = \argmax_{f^+} \frac{1}{\vert S \vert \cdot \left(\vert S \vert -1 \right)} U_{3} (\{S', f^+\}'),
  \end{split}
\end{equation}
where $b = \min(\vert S \vert + 1 -3 , 3-1) = 2$.
Equation~(\ref{eq:objective_k31}) can be rewritten as
\begin{equation}
  \label{eq:objective_k32}
  J \approx \argmax_{f^+} \sum_{f_i \in S} \sum_{f_j \in S} H(f^+, f_i, f_j),
\end{equation}
by the same process as for $k=2$.
Because this newly instantiated score function requires at least two features in $S$, the first and second features are selected based on the score function for $k=2$.
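
To make the double sum in Equation~(\ref{eq:objective_k32}) more concrete, the following sketch assumes that it runs over all ordered pairs $(f_i, f_j)$, including $i=j$; this reading is an assumption made only for illustration.
Because $H(f^+, f_i, f_i) = H(f^+, f_i)$ and the joint entropy is symmetric in its arguments, the sum then decomposes as
\begin{equation*}
  \sum_{f_i \in S} \sum_{f_j \in S} H(f^+, f_i, f_j) = 2 \sum_{i<j} H(f^+, f_i, f_j) + \sum_{f_i \in S} H(f^+, f_i).
\end{equation*}
For example, with $S = \{f_1, f_2\}$, the score of a candidate $f^+$ would be $2H(f^+, f_1, f_2) + H(f^+, f_1) + H(f^+, f_2)$, so under this reading the pairwise terms of the $k=2$ score are retained alongside the new three-feature terms.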

Algorithm~\ref{alg:prop_k3} depicts the incremental search process of the proposed method when $k=3$.
The computational complexity of the proposed method increases to $O(n+n^2+n^3) = O(n^3)$ owing to the $n$, $n^2$, and $n^3$ unit times required for calculating the entropy values.
With $k=3$, the proposed method demands additional computational resources to calculate the entropy values, as Equation~(\ref{eq:objective_k32}) involves joint entropies among three features.
This new score function captures more complex relationships among features by calculating joint entropies of candidate features with all pairs of selected features.
However, this results in a significant increase in computational complexity compared to the $k=2$ method, which has a complexity of $O(n^2)$.
Furthermore, estimating the joint entropy between high-dimensional features often requires a large number of patterns to achieve reliable approximations \citep{lee2015mutual}.
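
As a rough illustration of both costs, consider the following count under notation introduced only for this example: $d = \vert F \vert$ is the total number of features, $s = \vert S \vert$ is the current subset size, and each feature is assumed to take $v$ distinct values.
If only distinct pairs of selected features are evaluated, one selection step with $k=3$ requires
\begin{equation*}
  (d - s) \binom{s}{2} = \frac{(d - s)\, s\, (s - 1)}{2}
\end{equation*}
three-feature joint entropies, compared with $(d - s)\, s$ pairwise joint entropies for $k=2$.
Moreover, the joint distribution of three features has up to $v^3$ cells rather than $v^2$; for instance, with $v = 10$ the same number of patterns must populate up to $1000$ cells instead of $100$, which suggests why more patterns are needed for a reliable estimate.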
\end{document}
