%\documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}

% \usepackage[british]{babel}
\usepackage{amsthm}
%\usepackage{algorithm}
\usepackage[ruled]{algorithm2e}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
%\title{Learning Sparse Representations of Preferences within Choquet Expected Utility Theory}
\title{Learning Sparse Representations of Preferences\\ within Choquet Expected Utility Theory}
%\title{Elicitation of Sparse Preference Models within Choquet Expected Utility Theory}
%\title{Learning Sparse Representations of the Choquet Expected Utility Model\\ for Decision Making Under Uncertainty}
%\title{Learning Sparse Preference Models within Choquet Expected Utility Theory for Decision Making Under Uncertainty}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<margo.herin@lip6.fr>?Subject=Your UAI 2022 paper}{Margot Herin}{}}
%\author[1]{Margot Herin}
\author[2]{Patrice Perny}
\author[3]{Nataliya Sokolovska}
% Add affiliations after the authors
\affil[1,2]{%
Sorbonne University\\ CNRS, LIP6, UMR 7606 \\
4 place Jussieu, 75005 Paris, France}
\affil[3]{%
    Sorbonne University\\ INSERM, NutriOmics, UMR S 1269\\  91, Boulevard de l’Hôpital, 75013 Paris, France
}


\usepackage{amsmath,amsfonts,amssymb}
\newtheorem{defi}{\bf Definition}
\newtheorem{thrm}{\bf Theorem}
\newtheorem{prop}{\bf Proposition}
\newtheorem{exam}{\bf Example}

\DeclareMathOperator{\argmin}{argmin} \DeclareMathOperator{\argmax}{argmax}


  \begin{document}
\maketitle

\begin{abstract}
This paper deals with preference elicitation within Choquet Expected Utility (CEU) theory for decision making under uncertainty.
We consider Savage's framework with a finite set of states and assume that the preferences of the Decision Maker over acts are observable. The CEU model involves two parameters that must be tuned to the value system of the Decision Maker: a set function (capacity) modeling weights attached to events, of size exponential in the number of states, and a utility function defined on the space of outcomes. Our aim is to learn a sparse representation of the CEU model from preference data. We propose and test a preference learning approach based on a spline representation of utilities and the sparse learning of capacities to obtain CEU models achieving a good tradeoff between sparsity and the expressivity required by preference data.
\end{abstract}


\section{Introduction}\label{sec:intro}

Decision theory has developed an entire stream of theoretical works
on the axiomatic foundations of preference models either for
descriptive, normative or prescriptive purposes \citep{vonnM47,savageLJ54,fishburn70,gilboa08,quiggin12,wakker13}. The mathematical models used to describe preferences include parameters that can be fitted to the value system of the decision maker (DM). The
role of these preference parameters is well understood and the decision behavior of an individual can be interpreted by analyzing the values of these parameters. For example, in expected utility theory, risk aversion is equivalent to the concavity of the utility function and the level of risk aversion of an individual can be measured
from the curvature and the slope of his/her utility function \citep{arrow71,pratt78}.

In the framework of decision under uncertainty (i.e., no probability of the events needs to be given) and risk (i.e., probabilities of the events are known), various models have been proposed, involving an increasing number of preference parameters to cover an ever larger
class of decision behaviors. For example, \citet{tversky79} observed a frequent violation of the {\em sure thing principle} of \citet{savageLJ54} in their experiments on preferences. Such violations preclude any representation of the observed preferences by Expected Utility (EU). Rank-dependent models were then introduced, relying on a weakened version of the sure thing principle. Among them, Choquet Expected Utility (CEU) \citep{schmeidler89} has received much attention due to its descriptive power and the fact that it boils down to well-known simpler models for some well-identified subclasses of capacities or utilities. These simpler models include rank-dependent utility (RDU) \citep{quiggin12}, Yaari's model \citep{yaari87} and EU \citep{savageLJ54}.

CEU can easily explain standard preference inversions such as those observed in the Allais paradox (in the context of risk) and in the Ellsberg paradox (in the context of uncertainty) \citep{ellsberg61,wakker01}. However, the CEU model requires the definition of a not necessarily additive set function (named capacity) that assigns a weight to every event that may occur in the problem, in addition to the utility function. In a problem where uncertainty is represented by a finite set $S$ of $n$ states of nature, the set of events under consideration is the set of all subsets of $S$. Hence the definition of the capacity, and therefore of CEU, requires the determination, in the general case, of $2^n$ coefficients, in addition to those that are necessary for the definition of the utility function.

The growing number of preference parameters, justified by descriptive objectives, comes at a cost: sophisticated decision models are harder to learn and need larger bases of preference data to make reliable predictions on new data. Since preference data are usually scarce in practical applications and may be costly to obtain (preference queries must be put to the DM or derived from a history of previous decisions), there is a need for flexible approaches that adapt the number of preference parameters used in the model to the expressivity required by preference data. This question is of particular relevance for CEU due to its expressivity but also to its lack of compactness in the general case. For this reason, we study in this paper the potential of sparse learning to determine compact instances of CEU from small preference databases.

One possible source of difficulties here is the interplay of utilities and capacities in the computation of CEU values, making the learning of these two types of parameters interdependent. Another difficulty comes from the fact that utilities and capacities are not assumed to be directly observable and should be derived from preference statements over pairs of alternatives. Taking these specificities into account, we propose a learning approach that proceeds in two steps: using preference questions specially designed for utility elicitation we learn a spline representation of the utility function and then derive a sparse representation of the capacity from further preference examples.

The paper is organized as follows: Section 2 introduces the CEU model and some basic concepts and properties related to Choquet integrals. In Section 3 we review some works on learning decision models based on Choquet integrals and introduce the premises of a two-phase approach to learn utilities and then capacities from preference data. Then the learning of the utility function is presented in Section 4 and the learning of the capacity in Section 5. In the two latter sections we present numerical tests to show the effectiveness of the proposed approach.


%\subsection*{Related works}

%One popular approach to deal with compact capacity functions in decision theory is to restrict the model by allowing only k-additive capacities (Grabisch). Such capacities can be entirely characterized by xx coefficients, one for every subset of $S$ of cardinality smaller or equal to $k$. For instance people often restrict the use of Choquet integral to 2-additive capacities that can be described by only $n + n(n-1)/2$ coefficients. But fixing $k=2$ prior to preference analysis may drastically weaken the descriptive potential of CEU. Moreover, when at least a parameter is required for a subset of size $k>2$ the fact of considering a $k$-additive capacity may lead to consider many other

\section{Background on CEU}

We adopt the standard setting of \citet{savageLJ54} for decision making under uncertainty.  We have to compare acts the outcomes of which depend on the (unknown) state of nature. Here we consider a finite set of states $S =\{1, \ldots, n\}$ that is supposed to include all relevant possible futures. Any subset $A \subseteq S$ must be interpreted as an event. For instance, if $S= \{1, 2, 3\}$ the set $A =\{1,3\}$ represents the event ``$s = 1$ or $s = 3$'' where $s$ is the actual state of nature.

The acts to be compared are seen as functions defined from $S$ to the outcome space $X$. For simplicity we assume here that outcomes are payoffs and that $X$ is the real line. Any possible act $x$ is characterized by an outcome vector $(x_1, \ldots, x_n)$ where $x_i$ is the outcome of $x$ in state $i$, for $i = 1, \ldots, n$. We denote by $\mathcal{X} = X \times \ldots \times X$ the homogeneous Cartesian product containing all possible acts given $S$ and $X$. Within $\mathcal{X}$ we distinguish constant acts denoted $\bar{x}=(x, \ldots, x)$ for any $x \in X$ (their outcome does not depend on the state of nature). We also define a mixture of acts as follows: for any $A \subset S$ and for any two acts $x, y \in \mathcal{X}$, let $x A y$ denote the act of $\mathcal{X}$ defined by:$$
(xAy)_i= \left\{\begin{array}{ll}x_i & \mbox{if}~ i \in A\\ y_i &\mbox{otherwise} \end{array} \right.~~i = 1, \ldots, n.
$$
It represents an act whose possible outcomes are those of $x$ if event $A$ occurs and those of $y$ otherwise. Mixtures of constant acts of type $\bar{x}A\bar{y}$, $x, y \in X$, are binary acts whose outcome is $x$ if $A$ occurs and $y$ otherwise. They are useful for designing informative preference queries in the elicitation of utilities, as we will see later in the paper.

Now, we introduce the Choquet Expected Utility model in the context of Savage. It is defined from two parameters: a utility function $u$ modeling the sensitivity of the DM with respect to outcomes and a capacity $v$, which is a set function defined on $2^S$, monotonic w.r.t. set inclusion (i.e., $v(A) \leq v(B)$ whenever $A \subseteq B \subseteq S$) and normalized (i.e., $v(\emptyset) = 0$ and $v(S) = 1$), modeling the sensitivity of the decision maker towards uncertainty (their chance attitude, e.g., optimism or pessimism, see \citet{wakker01}). Given these two parameters $u$ and $v$, the CEU model assigns to every act $x$ an overall value $f_v^u(x)$ defined as the discrete Choquet integral of the utility of the outcome vector, which reads as follows:
\begin{eqnarray}  f_v^u(x) &= & \textstyle\sum_{i=1}^n \big [ v(X_{(i)}) - v(X_{(i+1)}) \big ] u(x_{(i)}) \label{ch1}\\
      &= &    \textstyle  \sum_{i=1}^n \big [ u(x_{(i)}) - u(x_{(i-1)}) \big ] v(X_{(i)})  \label{ch2}
\end{eqnarray}
\noindent where $(\cdot)$ is any permutation of $S$ such that $x_{(1)} \leq \ldots \leq x_{(n)}$, and $X_{(i)}=\{{(i)}, \ldots, {(n)}\}$ is the event ``the outcome of $x$ is greater than or equal to $x_{(i)}$'' for $i =1, \ldots, n$. Furthermore we assume that $u(x_{(0)}) = 0$ and $X_{(n+1)} = \emptyset$.
For example, if $S=\{1, 2, 3\}$ then $f_v^u(100, 10, 60) = u(10) v(\{1,2,3\}) + [u(60) - u(10)] v(\{1,3\}) + [u(100) - u(60)] v(\{1\})$ by Equation \ref{ch2}.
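As a quick consistency check (a sketch that only reuses the act and the notation of the example above), Equation \ref{ch1} applied to the same act gives
\begin{align*}
f_v^u(100, 10, 60) ={}& [v(\{1,2,3\}) - v(\{1,3\})]\, u(10) \\
 &+ [v(\{1,3\}) - v(\{1\})]\, u(60) \\
 &+ [v(\{1\}) - v(\emptyset)]\, u(100),
\end{align*}
and regrouping the terms by capacity value, with $v(\emptyset) = 0$, recovers the expression obtained from Equation \ref{ch2}.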

CEU theory provides an axiomatic framework under which the DM's preferences $\succsim$ over acts are represented by $f$ \citep{schmeidler89,gilboa08}. Formally we have: $x \succsim y ~ \mbox{iff}~ f_v^u(x) \ge f_v^u(y)$. Let us briefly recall some key properties that illustrate the role of the capacity in the model:
\begin{itemize}
\item the monotonicity of $v$ is required to make sure that $f_v^u(x) \ge f_v^u(y)$ when $x_i \ge y_i$ for all $i \in S$.
\item CEU boils down to Savage's expected utility when $v$ is additive (i.e., $v(A \cup B)+v(A \cap B) = v(A) + v(B)$ for all $A, B \subseteq S$).
\item the preference induced by $f$ satisfies {\em uncertainty aversion} if and only if $u$ is concave and $v$ supermodular (i.e., $v(A \cup B)+v(A \cap B) \ge v(A) + v(B)$ for all $A, B \subseteq S$) \citep{chateauneuf02}. Uncertainty aversion (a.k.a. convexity of preferences) reads as follows: if the DM is indifferent between $x$ and $y$ (denoted $x \sim y$) then $\alpha x + (1-\alpha) y$ will be preferred to $x$ (and also to $y$ by symmetry) for any $\alpha \in [0, 1]$. The convex mixture of $x$ and $y$ reduces the uncertainty of outcomes w.r.t. $x$ and $y$ and makes the DM better off.
\item CEU boils down to the rank-dependent utility for decision making under risk whenever $v(A) = w(p(A))$ where $p$ is the probability measure on events and $w$ is a monotonic weighting function such that $w(0) = 0$ and $w(1) = 1$ \citep{quiggin12}. If in addition $u$ is linear then CEU boils down to Yaari's model \citep{yaari87}.
\end{itemize}
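
To make the reduction to expected utility concrete, here is a short derivation sketch; the notation $p_i = v(\{i\})$ is introduced here for illustration only. When $v$ is additive, the events $\{(i)\}$ and $X_{(i+1)}$ are disjoint, so $v(X_{(i)}) - v(X_{(i+1)}) = v(\{(i)\}) = p_{(i)}$ and Equation \ref{ch1} becomes
\begin{equation*}
f_v^u(x) = \sum_{i=1}^n p_{(i)}\, u(x_{(i)}) = \sum_{i=1}^n p_i\, u(x_i),
\end{equation*}
where $p_i \ge 0$ by monotonicity and $\sum_{i=1}^n p_i = v(S) = 1$ by normalization. The decision weights then no longer depend on the ranking of the outcomes.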

Another useful formulation of the CEU model relies on the Möbius inverse of the capacity. The Möbius inverse of $v$ is another set function $m$ defined on $2^S$ by: $m(A) = \sum_{ B \subseteq A} (-1)^{|A \backslash B|} v(B)$ for all $A \subseteq S$. The coefficients $m(A)$ are called Möbius masses; they completely characterize $v$. We indeed have $v(A) = \sum_{B \subseteq A} m(B)$. The values of $m$ can be positive or negative but add up to 1 since $\sum_{B \subseteq S} m(B) = v(S) = 1$. When $v$ is additive the only non-null Möbius masses are those of singletons.
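As a small illustration (a sketch for the hypothetical case $n=2$, which is not one of the examples of this paper), the definition gives $m(\emptyset)=0$, $m(\{1\}) = v(\{1\})$, $m(\{2\}) = v(\{2\})$ and
\begin{equation*}
m(\{1,2\}) = v(\{1,2\}) - v(\{1\}) - v(\{2\}) = 1 - v(\{1\}) - v(\{2\}),
\end{equation*}
so the mass of the pair is negative exactly when $v(\{1\}) + v(\{2\}) > 1$ and vanishes exactly when $v$ is additive.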

Interestingly enough, the CEU model can be directly expressed from the Möbius inverse \citep{chateauneuf89} by:
\begin{equation} \label{mobchoq}
\textstyle f_v^u(x) = \sum_{B \subseteq S} m(B) \min_{i \in B}\{u(x_i)\}
\end{equation}
This formulation shows that $f_v^u(x)$ might admit a compact representation whenever the Möbius inverse is sparse.
A frequent option used to handle capacities with a sparse representation is to require that Möbius masses vanish for all subsets of states of cardinality larger than a given $k$ smaller than $n$. In this case, the resulting capacity is said to be $k$-additive \citep{grabisch97} and admits a more compact representation than in the general case. For instance, when the capacity is 1-additive, all Möbius masses are null except for singletons, where they are non-negative due to monotonicity. However, in this case, Equation \ref{mobchoq} shows that $f_v^u$ boils down to an expected utility, with a significant loss of expressivity.
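For illustration, and as a sketch obtained by simply truncating Equation \ref{mobchoq}, a 2-additive capacity yields
\begin{align*}
f_v^u(x) ={}& \sum_{i=1}^n m(\{i\})\, u(x_i) \\
 &+ \sum_{1 \le i < j \le n} m(\{i,j\}) \min\{u(x_i), u(x_j)\},
\end{align*}
which involves at most $n + n(n-1)/2$ Möbius masses instead of the $2^n - 1$ masses attached to the non-empty subsets of $S$ in the general case.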

A more interesting tradeoff could be obtained with $k$-additivity for some small value of $k$ larger than 1 but it seems difficult to select a suitable value of $k$ without looking at preference data. Moreover it may happen that very sparse but still $n$-additive capacities perfectly match preference data as illustrated in the following:
\begin{exam}
Assume that the DM is pessimistic and behaves according to the min criterion refined by an expectation to ensure strict monotonicity w.r.t. Pareto dominance. This behavior can be described by $f(x) =(1-\epsilon)\min\{u(x_i), i \in S\} + \epsilon \sum_{i=1}^{n}p_i u(x_i)$ where the $p_i$ are subjective (positive) probabilities and $\epsilon > 0$ is chosen arbitrarily small. $f$ is an instance of CEU obtained from Equation \ref{mobchoq} with $m(\{i\}) = \epsilon p_i$ for all $i = 1, \ldots, n$, $m(S) = 1- \epsilon $ and $m(B)=0$ for all $B \subset S$ such that $|B|>1$.
\label{minex}
\end{exam}



Example \ref{minex} shows that preferences induced by $f$ can be neither properly described nor approximated by a $k$-additive capacity with $k<n$ (because the most important term, of weight $1-\epsilon$, is dropped). Yet, $f$ can be closely approximated by the min criterion, which admits a very sparse Möbius inverse representation. This calls for a more efficient approach to derive sparse representations of CEU from preference data. This question will be addressed later in the paper.
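To make this explicit, the capacity of Example \ref{minex} can be recovered from its Möbius masses through $v(A) = \sum_{B \subseteq A} m(B)$ (a sketch, assuming as usual that the subjective probabilities satisfy $\sum_{i=1}^n p_i = 1$):
\begin{equation*}
v(A) = \epsilon \sum_{i \in A} p_i ~\mbox{ for } A \subset S, \qquad v(S) = \epsilon + (1 - \epsilon) = 1.
\end{equation*}
Only $n+1$ of the $2^n$ Möbius masses are non-null, whereas setting $m(S) = 0$, as any $k$-additive model with $k<n$ would require, leaves masses summing to $\epsilon$ only.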


In order to illustrate both the descriptive potential of CEU and its ability to admit a sparse representation in terms of Möbius inverse we now
consider a standard urn example due to \citet{ellsberg61}.

\begin{exam} \label{ellsberg}
 An urn contains 90 balls including 30 red, and 60 blue or yellow balls in unknown proportion. We consider four bets, on the one hand $x$ (resp. $y$) yielding 100 if the drawn ball is red (resp. blue), and on the other hand $z$ (resp. $w$) yielding 100 if the drawn ball is not blue (resp. not red). Here $S = \{R, B, Y\}$ for red, blue, yellow, and the acts under consideration are $x = (100, 0, 0)$,  $y = (0, 100, 0)$,  $z= (100, 0, 100)$ and $w= (0, 100, 100)$. Note that the pair $(x, y)$ compares similarly to the pair $(z, w)$ except that the common outcome attached to yellow balls moves from 0 to 100. Despite this similarity, most people prefer $x$ to $y$ but $w$ to $z$. It can easily be checked that such preferences are not representable by EU.
 
 
 However, these preferences can be represented by CEU.  Let us assume that $u(0)=0$ and $u(100)=1$ and $v(\{R\}) = 1/3, v(\{B\}) = v(\{Y\}) = 0$, $v(\{R, B\}) = v(\{R, Y\}) =1/3$, $v(\{B, Y\}) = 2/3$ and $v(\{R, B, Y\})=1$. Note that for all events, $v$ yields the lowest possible probability of the event according to our knowledge of the urn content. We have $f_v^u(x) = 0 \cdot v(\{R, B, Y\}) + (1-0)\, v(\{R\}) = 1/3$. Similarly we obtain $f_v^u(y)=0$, $f_v^u(z) = 1/3$ and $f_v^u(w)= 2/3$. Hence, $f_v^u(x) > f_v^u(y)$ and $f_v^u(w) > f_v^u(z)$, which is consistent with the observed preferences. Moreover, the Möbius inverse of $v$
 is everywhere 0 except that $m(\{R\})=1/3$ and $m(\{B,Y\}) = 2/3$. Hence we get a sparse representation of $f$ that fits the observed preferences: $f_v^u(x_1, x_2, x_3) = u(x_1)/3 + 2 \min\{u(x_2), u(x_3)\}/3$.
 \end{exam}
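
The incompatibility with EU claimed in Example \ref{ellsberg} can be verified in two lines. The following is a sketch in which $p_R, p_B, p_Y$ denote hypothetical subjective probabilities of the three colors (notation introduced here) and $u(100) > u(0)$ is assumed:
\begin{align*}
x \succ y &\iff (p_R - p_B)\,[u(100) - u(0)] > 0 \iff p_R > p_B,\\
w \succ z &\iff (p_B + p_Y) - (p_R + p_Y) > 0 \iff p_B > p_R.
\end{align*}
The two conditions contradict each other, so no probability measure and utility function can represent both preferences under EU.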

We end the section by mentioning a third representation of the capacity based on interaction indices, $I(A)$ for all $A \subseteq S$ \citep{grabisch97}, that will be discussed later in the paper. When $A$ is a singleton $\{k\}$, the interaction index $I(\{k\})$ is nothing else but the so-called Shapley value measuring the average marginal increment $v(B \cup \{k\})- v(B)$ taken on all events $B \subseteq S$ that do not contain $k$. The notion of Shapley interaction index extends to any subset $A$ of $S$. Interaction indices can be uniquely defined from $v$ or $m$ indifferently. Conversely $v$ and $m$ can be obtained from $I$. For more details see \citet{grabisch97}. In the case of Example \ref{ellsberg}, numbering the states $R, B, Y$ as $1, 2, 3$, the interaction indices are given by $I(\emptyset) = 7/18$, $I(\{1\}) = I(\{2\}) =  I(\{3\}) = 1/3$, $I(\{2,3\})= 2/3$, the other coefficients being null.
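These values can be checked by hand. The sketch below relies on the relation $I(A) = \sum_{T \supseteq A} \frac{1}{|T|-|A|+1}\, m(T)$ from \citet{grabisch97}, which reappears later in the paper, together with the two non-null masses $m(\{R\}) = 1/3$ and $m(\{B,Y\}) = 2/3$ of Example \ref{ellsberg}:
\begin{align*}
I(\emptyset) &= \tfrac{1}{2}\, m(\{R\}) + \tfrac{1}{3}\, m(\{B,Y\}) = \tfrac{1}{6} + \tfrac{2}{9} = \tfrac{7}{18},\\
I(\{R\}) &= m(\{R\}) = \tfrac{1}{3},\\
I(\{B\}) &= I(\{Y\}) = \tfrac{1}{2}\, m(\{B,Y\}) = \tfrac{1}{3},\\
I(\{B,Y\}) &= m(\{B,Y\}) = \tfrac{2}{3}.
\end{align*}
Every other subset is contained in no set of positive mass, hence its index is null.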


\section{Learning the CEU model}

\subsection{Related work}

Fitting the parameters of a decision model based on a Choquet integral to observed preferences is a question present both in the literature on decision theory (preference elicitation) and in the literature on machine learning (preference learning).  In the context of decision making under risk, some elicitation protocols proposing a series of preference queries involving pairs of lotteries have been proposed to construct methodically a set of points on the utility curve and then on the probability weighting function defining the capacity in the RDU model and in cumulative prospect theory (CPT) \citep{wakker96,abdellaoui00}. 

Another stream of work developed in the literature on multicriteria decision aid concerns the use of non-linear regression for the identification of the capacity from overall evaluations prescribed by the DM, and the use of ordinal regression methods from preference examples, assuming the utilities are known \citep{grabisch08,grabisch10}. The prior construction of utility in this setting is often based on direct queries on differences of attractiveness between attribute values, see e.g., the Macbeth method \citep{bana97}.

Another approach developed in AI consists in progressively reducing the  uncertainty about the preference parameters.  A first set of methods proceeds by successive reductions of the parameter space using preference queries adaptively selected for their information value (e.g., using the minimax regret criterion). This incremental approach was used for the identification of utilities in \citep{WangBoutilier03}, for the identification of the probability weighting function \citep{Hines10,perny16}, and for the identification of Choquet capacities in \citep{benabbou17}. A second set of methods proposes another adaptive elicitation procedure based on a Bayesian approach used to iteratively revise a probability density on the parameter space, see e.g. \citep{chajewska00,bourdache19,Gu20}. 

 None of the above-mentioned contributions addresses the question of learning sparse representations of the capacity; however, some of them assume that capacities are $k$-additive for a prior reduction of model complexity.


The Choquet integral is also used in machine learning to replace the linear function of variables which is commonly used in standard regression methods~\citep{Gagolewski19,Beliakov20,beliakov20b}. For example, logistic regression was extended to Choquistic regression~\citep{Tehrani11,Tehrani12a,Tehrani13}. The Choquet integral is also used for learning to rank \citep{Tehrani12}, where the data come with labels that are preference degrees on an ordered categorical scale. It was also introduced as a kernel method \citep{Tehrani21}.

 In the machine learning community, statistical regularization is used to find a tradeoff between the model's generalization performance and the model's complexity \citep{Tibshirani96,Hastie15}. In particular, the Lasso method adds an $L_1$ penalty to the objective function to obtain a sparse solution in a high-dimensional setting with a small number of observations. It can be used to obtain compact representations of capacities that include, in the general case, $2^n-2$ free parameters.  Several attempts have been made to reduce the complexity of non-additive integrals via the $L_1$ penalty term. For example, the sparsity-inducing penalty was applied to the capacity \citep{Anderson14,Adeyeba15}; the penalized sum of squared errors with Gini-Simpson index regularization and the $L_0$ norm on the Shapley values were considered in~\citep{Pinar17}. The $L_1$ penalty was also applied to capacities represented by interaction indices in~\citep{deOliveira22}.

%The seminal work of~\cite{Tibshirani96} introduces the Lasso where the $L_1$ penalty term is added to the objective function, resulting in a sparse solution. The sparsity of a model is controlled by a hyper-parameter to be fixed by a grid search. In the past years, a number of regularisation terms based on the $L_1$ norm were proposed (see~\citep{Hastie15} for a general overview). The ability of the $L_1$ penalty to overcome the issue of learning models in high-dimension %(number of features $p \gg$  number of examples $N$) 
%is particularly interesting in our case since the exponential number of parameters of $v$ can quickly overpass the limited size of the preferences datases. 
The specificity of our approach is to learn from pairwise comparisons both a smooth utility function and a sparse M\"obius representation of the capacity with no prior reduction of the class of admissible capacities, in the framework of decision making under uncertainty and CEU theory.
%Our aim in this paper is to propose a preference learning approach with the following specificities. In the framework of decision making under uncertainty and CEU theory, we want to learn both a smooth utility function and a sparse capacity from  preference/indifference examples obtained from the DM. No prior assumption is made on the capacity  (no $k$-additivity, no transformed of a probability measure). We also want to show the advantages of working with the Möbius masses as far as sparsity is concerned. 
\subsection{Approximating preferences with CEU}

It is assumed here that utilities and capacities cannot be directly requested from the decision maker, who may have no idea of the model under consideration. Moreover, overall values $f^u_v(x)$ are not assumed to be observable. We want to derive preference parameters from observed choices or from preference statements obtained from the DM on some pairs of alternatives. Standard preference statements are of type ``$x$ {\em is at least as good as} $y$'' (denoted $x \succsim y$), or ``$x$ {\em and} $y$ {\em are indifferent}'' (denoted $x \sim y$). Within CEU theory
the weak preference $x \succsim y$ (resp. the indifference $ x \sim y$) is interpreted as 
$f_v^u(x) \ge f_v^u(y)$ (resp. $f_v^u(x) = f_v^u(y)$).


We remark that the above inequalities and equalities are linear in $v$ for any fixed $x, y$ and $u$ by definition of $f$ (see Equation \ref{ch2}). We assume here that such preference statements are available, either because a preference database is available or because the DM is able to answer on demand to some preference queries. 
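The same linearity holds in terms of Möbius masses, as the following sketch based on Equation \ref{mobchoq} shows; the coefficients $c_B(x,y)$ are notation introduced here for illustration. For a fixed utility function $u$, the statement $x \succsim y$ reads
\begin{equation*}
\sum_{B \subseteq S} c_B(x,y)\, m(B) \ge 0, \quad c_B(x,y) = \min_{i \in B} u(x_i) - \min_{i \in B} u(y_i),
\end{equation*}
where each $c_B(x,y)$ is a constant computed from the data, so that every preference or indifference statement yields one linear constraint on $m$.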

Given a set of indifference statements $\mathcal{I} = \{(x^i, y^i)\in \mathcal{X}^2: x^i \sim y^i, i = 1, \ldots, q\}$ and/or a set of preference statements $\mathcal{P} = \{(x^i, y^i)\in \mathcal{X}^2: x^i \succsim y^i, i = 1, \ldots, p\}$, we look for a utility function $u$ and a capacity $v$ that match the observed preferences. Since the CEU model may not perfectly match the preferences expressed by the DM, we look for an approximate representation of preference data. %Hence, denoting $f$ the actual value function of the DM, we assume that $f_v^u(x) = CEU(x) + \delta_x$ for all $x \in X$, where $\delta_x$ is the approximation gap attached to act $x$. 
The approximation problem can be formulated as follows:

\begin{eqnarray}
&\textstyle \min \sum_{i=1}^q (\epsilon^+_i + \epsilon^-_i) + \sum_{j=1}^p \epsilon_j \label{approx1}\\[1ex]
&\left\{
\begin{array}{lll}
f_v^u(x^i) - f_v^u(y^i) +\epsilon^+_i - \epsilon^-_i = 0,~i=1,\ldots,q\\
f_v^u(x^j) - f_v^u(y^j) + \epsilon_j \ge 0,~j=1,\ldots,p  \notag\\
v(A)\le v(A \cup \{i\}),~\forall i \in S, \forall A \subseteq S\setminus\{i\}
\end{array}
\right. 
\\
&\epsilon^+_i \ge 0, \epsilon^-_i \ge 0, \epsilon_j \ge 0,~i=1,\ldots,q,~j=1,\ldots,p.  \notag
\end{eqnarray}

The third line in the system of constraints is here to enforce the monotonicity of $v$ with respect to set inclusion. We remark that the above optimization problem is not linear, since Choquet values of type $f_v^u(x)$ appearing in the constraints include products of the variables defining the utilities $u(x_i)$ and the capacity values $v(X_{(i)})$ (see Equation \ref{ch1}).

In many contributions on preference learning methods based on the discrete Choquet integral, the utility function is assumed to be known and the focus is made on fitting the capacity. In this case, all constraints of the approximation problem formulated in Equation \ref{approx1} are linear in $v$ and the capacity can be obtained using standard linear programming solvers. This suggests learning the utility function first and then the capacity. 

On the other hand, some recent contributions propose to learn simultaneously the utility function and the capacity. Finding exact solutions simultaneously both for the utility function and the capacity is a difficult task, since the problem is not linear and the constraints are not convex. Some heuristics to solve this problem were proposed. A stochastic method was introduced  by~\citet{Angilella04}, and \citet{Goujon13} discussed a fixed-point method where the problem is split into two iterative linear tasks. Another heuristic based on a linear approximation of the product of the utility functions with Shapley values and interaction indices was considered by~\citet{Galand17}. 
An approach to find an exact solution for both utilities and capacities (in the context of the Choquistic regression) was proposed by~\citet{Tehrani14} where the utility function was represented as a linear combination of sigmoid functions. More recently, \citet{Bresson20} developed a neural architecture to learn both utilities and the corresponding parameters of hierarchical Choquet integrals.


As mentioned earlier, \citet{wakker96,abdellaoui00} have shown that, by a careful selection of preference queries, the utility can be indirectly observed and acquired, regardless of the capacity. We would like to use this specificity of the CEU model to learn the utility in a first stage. Then, determining the capacity becomes easier and we can focus on learning sparse representations of $v$ in a second stage. In this respect, we now discuss the relative interest of several standard representations of $v$.
%This approach is implemented in series of preferences queries named standard sequences \citep{wakker96,abdellaoui00} aiming at constructing point by point the utility curve, using at every step the answer of the DM to generate a new preference query. Our aim here is to learn 
\subsection{Various representations of capacity}

In Section 2 we have mentioned two alternative representations of a general capacity $v$: the Möbius inverse $m$, and the interaction indices $I$. Let us compare now their ability to provide compact representations.
First of all, we can remark that, if $v(\{i\}) > 0$ for some $i$, then $v(A) > 0$ for any event $A\supset \{i\}$. Moreover, $m$ is at least as compact as $v$ due to the following proposition.
\begin{prop}
 Let $v$ be a capacity and $m$ its M\"obius inverse, we have: $|| v ||_0 \geq || m ||_0 $, 
    where $ ||.||_0 $ denotes the $L_0$ norm, i.e., the number of non-zero coefficients.
\end{prop}
\begin{proof}
Consider a capacity $v$ and its M\"obius representation $m$. If $v(A) = 0$ for some $A\subseteq S$, then $v(B) =0$ for all $B \subseteq A$ by monotonicity. Hence $m(A) = \sum_{B \subseteq A} (-1)^{|A\backslash B|} v(B) = 0$. Then $\{ A : v(A) = 0 \} \subseteq  \{ A: m(A) = 0 \}$ and $ || v||_0  = 2^n - |\{ A: v(A) = 0 \}|  \geq  2^n -|\{ A: m(A) = 0 \}| = || m ||_0 $.
\end{proof}
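
As an illustration of this proposition on Example \ref{ellsberg}, the capacity $v$ is non-null on the five events $\{R\}$, $\{R,B\}$, $\{R,Y\}$, $\{B,Y\}$ and $\{R,B,Y\}$, whereas its Möbius inverse is non-null on $\{R\}$ and $\{B,Y\}$ only, so that
\begin{equation*}
\| v \|_0 = 5 \geq 2 = \| m \|_0 .
\end{equation*}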

Moreover, the following result shows that the representation of $v$ in terms of interaction indices $I$ may lack compactness, e.g., when $v$ is a belief function (i.e., when $m$ is non-negative).

\begin{prop}
Let $v$  be a capacity and $m$ and $I$ its Möbius and interaction representations respectively. If $m$ is non-negative,  then  $ \|I \|_0 \geq 2^{|T^*|}, \textnormal{where}~ T^* = \argmax_{T \subseteq S} \big \{|T|: m(T) > 0\big \}.$
\end{prop}
\begin{proof}
 The interaction index $I$ is linked to $m$ by the following equation: $I(A) = \sum_{T \supseteq A}\frac{1}{t-a+1}m(T)$ for all $A \subseteq S$, where $t = |T|$ and $a = |A|$ \citep{grabisch97}. Since $m$ is non-negative, for any $T$ s.t. $m(T) > 0$, we have $I(A) \geq \frac{1}{t-a+1}m(T) > 0$ for all $A \subseteq T$.  Let $T^*$ be the subset of maximal cardinality among those such that $m(T)>0$; then the $2^{|T^*|}$ subsets of $T^*$ have a strictly positive interaction index. Hence, $\| I \|_0 \geq 2^{|T^*|}$.
\end{proof}

As an illustration, let us consider again the maximin criterion  $f(x) = \min_{i} u(x_i)$ which is an instance of CEU obtained from Equation \ref{mobchoq} with $m(A)=0$ for all $A \subset S$ and $m(S)=1$. Then
the above proposition shows that $I(A)>0$ for all $A \subseteq S$. In this case the $I$ representation is of size $2^n$ whereas the Möbius representation is very sparse (it includes a single non-null coefficient).
Considering the above propositions and the well-known interest of Möbius masses to identify focal elements in beliefs, we focus hereafter on regularizations based on the Möbius representation of $v$ aiming to minimize $||m||_0$.
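
To make the maximin illustration concrete, here is a small worked instance; the choice $n = 3$ is made here only for the sake of the example. With $m(S) = 1$ and $m(A) = 0$ for all $A \subset S$, the relation $I(A) = \sum_{T \supseteq A}\frac{1}{t-a+1}m(T)$ reduces to $I(A) = \frac{1}{4-a}$ with $a = |A|$, so that
\[
I(\emptyset) = \tfrac{1}{4}, \qquad I(\{i\}) = \tfrac{1}{3}, \qquad I(\{i,j\}) = \tfrac{1}{2}, \qquad I(S) = 1 .
\]
All $2^3 = 8$ coefficients of $I$ are non-zero, whereas $\|m\|_0 = 1$. In this instance $\|v\|_0 = 1$ as well, since $v(A) = \sum_{B \subseteq A} m(B)$ vanishes unless $A = S$.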


\section{Learning the utility function}

\subsection{Assessing utilities from indifference statements}

Recall that the DM is not assumed to be able to provide the overall value of an act (this would amount to directly asking for utility values). This is a source of difficulty because constraints of type $f_v^u(x) = \alpha$, frequently used to perform regressions, cannot be obtained by questioning the DM. However, the DM can be asked to compare any act $x$ to a constant act $\bar{y}=(y, \ldots, y)$ for some $y \in X$. Hence $x \succsim \bar{y}$ is equivalent to $f_v^u(x) \ge f_v^u(\bar{y})=u(y)$. Whenever the DM is indifferent between $x$ and $\bar{y}$ we have $f_v^u(x) = u(y)$. In such a case, outcome $y$ is said to be the {\em certainty equivalent} of $x$. Indifference statements giving the certainty equivalent $y$ of binary acts of type $\bar{x} A \bar{z}$ for some $x,z \in X$ such that $x>z$ are often considered for preference elicitation. Indeed, $\bar{x} A \bar{z} \sim \bar{y}$ means that $f_v^u(\bar{x} A \bar{z}) = v(A) u(x)+(1-v(A))u(z) = u(y)$. Hence, if $v(A)$ is known for some $A$, this makes it possible to derive $u(y)$ from $u(x)$ and $u(z)$. Thus, $u$ might be constructed point by point on a given interval $[x_m, x_M]$ from such indifferences, starting with two arbitrarily selected reference values $u(x_m) < u(x_M)$ (e.g., $u(x_m)=0$ and $u(x_M)=1$). This process was used to elicit utilities in the context of risk \citep{Hines10,perny16}. However, in our context we have no simple way to obtain $v(A)$ for some $A$ before knowing the utility function. For this reason, we propose to learn the utility function by regression from indifference statements obtained with the tradeoff method \citep{wakker96,abdellaoui00} adapted to the context of uncertainty.

The tradeoff method initially introduced in the context of risk involves preference queries using gambles. Here, we describe a counterpart of this method in the context of uncertainty, to assess the utilities of outcomes within a given interval $[x_m, x_M]$ using mixtures of constant acts. This method requires that there exists an event $A$ such that $\bar{x}_m \prec \bar{x}_mA\bar{x}_M \prec \bar{x}_M$. Within CEU theory these strict preferences translate into  $f_v^u(\bar{x}_m)  < f_v^u(\bar{x}_mA\bar{x}_M) < f_v^u(\bar{x}_M)$ which is equivalent to $ 0 = u(x_m) = f_v^u(\bar{x}_m) < u(x_m)(1-v(\bar{A})) + v(\bar{A}) < f^u_v( \bar{x}_M ) = u(x_M) = 1$, i.e., $0 < v(\bar{A}) < 1$. 


So, given such an event $A$, let us choose $z, r, R \in X$ such that $z \le x_M < r < R$ and consider the two following preference queries:\\[1ex]
 $Q(y,z)$: what is the outcome $y$ such that: $\bar{y} A \bar{R} \sim \bar{z} A \bar{r}$?\\
 $Q(x,y)$: what is the outcome $x$ such that: $\bar{x} A \bar{R} \sim \bar{y} A \bar{r}$?

With such indifferences, the DM makes a tradeoff between
upgrading $r$ to $R$ and downgrading $z$ to $y$ (or $y$ to $x$). Since $y \leq z$ and $x \leq y$, we have  $f_v^u(\bar{y} A \bar{R}) =  u(y)(1-v(\bar{A}))+u(R)v(\bar{A})$ and $f_v^u(\bar{z} A \bar{r}) =u(z)(1-v(\bar{A}))+u(r)v(\bar{A})$. Hence $f_v^u(\bar{y} A \bar{R}) = f_v^u(\bar{z} A \bar{r})$ implies $u(y)(1-v(\bar{A}))+u(R)v(\bar{A}) =u(z)(1-v(\bar{A}))+u(r)v(\bar{A})$ and therefore $(1-v(\bar{A})) [u(z)-u(y)]= v(\bar{A})[u(R)-u(r)]$. Similarly, $\bar{x} A \bar{R} \sim \bar{y} A \bar{r}$ implies  $(1-v(\bar{A})) [u(y)-u(x)]= v(\bar{A})[u(R)-u(r)]$. Since $v(\bar{A}) > 0$, if $u(R)-u(r) >0$ then $u(z)>u(y)>u(x)$ and therefore $z > y > x$. Finally, we have
$(1-v(\bar{A})) [u(z)-u(y)]=(1-v(\bar{A})) [u(y)-u(x)]$. Since $v(\bar{A}) < 1$ we obtain:
%\vspace{-0.8cm}
\begin{equation}\label{equ}
    u(z)-u(y) = u(y)-u(x)
\end{equation}
Such queries are often involved in a {\em standard sequence} that consists of sequentially asking question $Q(x_{i+1}, x_i)$ for $i=0$ to $N-1$, starting from $x_0 = x_M$ until $x_N \ge x_m$.  From the observed indifferences, Equation \ref{equ} yields $u(x_{i-1})-u(x_{i}) = u(x_i)-u(x_{i+1})$ and therefore $u(x_{i+1}) = 2 u(x_i) - u(x_{i-1})$. Hence, arbitrarily fixing the utilities of $x_0$ and $x_1$ completely determines the utilities $u(x_i)$ for $i>1$. However, if the DM makes errors in assessing $x_i$ in the early steps of the sequence, these errors will propagate and impact the whole sequence \citep{blavatskyy2006error}. In order to reduce error propagation, we propose to perform a regression from a database of indifference statements obtained from queries of type $Q(y, z)$ and $Q(x,y)$ rather than performing a standard sequence.
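
As a short worked derivation of the recursion above (the symbols $\Delta$ and $\eta_k$ are introduced here for illustration only), let $\Delta = u(x_0) - u(x_1)$. Solving $u(x_{i+1}) = 2 u(x_i) - u(x_{i-1})$ gives an arithmetic progression:
\[
u(x_i) = u(x_0) - i\,\Delta, \qquad i = 0, \ldots, N .
\]
Now suppose, as a simplifying assumption, that the answer given at step $k \geq 1$ is off by $\eta_k$ in utility terms, i.e., $u(x_k) - u(x_{k+1}) = \Delta + \eta_k$. Summing the successive differences then gives
\[
u(x_i) = u(x_0) - i\,\Delta - \sum_{k=1}^{i-1} \eta_k ,
\]
so an error made at step $k$ shifts the utility of every later point of the sequence. Under this simple model, this is the propagation effect that a regression over independently collected triplets is meant to limit.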

More precisely, our learning approach proceeds as follows: for various non-null events $A$, various steps $s = R - r$ defined by different pairs $(r, R)$ and various $z$, the two preference queries $Q(y, z)$ and then $Q(x,y)$ are put to the DM. The resulting database of indifference statements yields a set of necessary linear constraints on utility values given by Equation \ref{equ}, for all triplets $(x, y, z)$ obtained from the answers to $Q$ queries. Then a regression by a monotonic spline is performed to identify the utility function that best fits the set of linear constraints on utility values.

\subsection{Monotonic spline regression under utility constraints}

In order to represent the utility function, we use a monotonic spline function, i.e., a piecewise polynomial function of degree $k$ and class $C^{k-1}$.  Spline functions are widely used for data interpolation or approximation due to their ability to smoothly approximate complex shapes. Moreover, they allow for a compact representation of utilities. Indeed, a spline function can be expressed as a linear combination of basis functions and is thus characterized by the coefficients of the combination. Since utility increases with payoffs, we will use a basis  $(I_l)_{l=1}^L$ of monotonically increasing spline functions, known as I-spline functions \citep{Ramsay88}, weighted by non-negative coefficients (adding up to 1 so as to have $u(x_M) = 1$). We use here cubic I-splines ($k = 3$) because they have matching first and second derivatives while preserving a local influence of every component. Formally, $u$ is defined by:
\begin{equation}\label{uspline}
\textstyle \forall x \in \left[x_m,x_M\right], ~~ u_\alpha(x) = \sum_{l = 1}^L \alpha_l I_l(x) 
\end{equation}
where $\alpha = (\alpha_1, \ldots, \alpha_L) \in [0, 1]^L$.
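
The following remarks sketch why this parameterization yields a normalized, non-decreasing utility. They assume the usual construction of \citet{Ramsay88}, in which each $I_l$ is the integral of a non-negative M-spline (denoted $M_l$ here, a notation not used elsewhere in the paper) normalized so that $I_l(x_m) = 0$ and $I_l(x_M) = 1$. Under this assumption,
\[
u_\alpha'(x) = \sum_{l=1}^L \alpha_l M_l(x) \ge 0, \qquad u_\alpha(x_m) = 0, \qquad u_\alpha(x_M) = \sum_{l=1}^L \alpha_l = 1 ,
\]
so monotonicity and normalization of $u_\alpha$ follow directly from $\alpha_l \ge 0$ and $\sum_l \alpha_l = 1$, which are linear constraints on $\alpha$.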

For the sake of illustration, we represent in Figure \ref{fig:Ispline}  the I-spline basis for $L = 10$ (value used in our tests) and an instance of utility function generated from this basis.

\begin{figure}[h]
  \centering
  \includegraphics[width=0.7\linewidth]{picture/output15.png}
  \caption {$u_{\alpha}(x)$ generated from the I-splines}\label{fig:Ispline}
\end{figure}

Our observations have been obtained using the $Q$ queries leading to a database $\mathcal{B}$ of $N$ triplets $(x^i,y^i,z^i)$, as described in the previous subsection. We want to determine the parameters $\alpha_l$ that best fit the associated constraints $2u(y^i) - u(z^i) -u(x^i) = 0, i =1, \ldots, N$ obtained from Equation \ref{equ}. Hence, using Equation \ref{uspline}, the problem can be formalized as a linear program $P(\mathcal{B})$ with relaxed constraints ($N+1$ constraints and $ L +2N$ variables):
\begin{eqnarray*}
&  P(\mathcal{B}) :~ \textstyle  \min z = \sum_{i=1}^{N} (\epsilon^+_i + \epsilon^-_i) \\
&\left\{
\begin{array}{lll}
\displaystyle \sum_{l=1}^L  \alpha_l (2I_l(y^i)-I_l(z^i)- I_l(x^i))  + \epsilon^+_i - \epsilon^-_i= 0, \forall i\\
\displaystyle  \sum_{l=1}^L \alpha_l  =  1 \\
\end{array}
\right.\\
&\epsilon^+_i \ge 0, \epsilon^-_i \ge 0, \alpha_l \ge 0.
\end{eqnarray*}
Hereafter, let $\alpha^* = (\alpha^*_1, \ldots, \alpha^*_L)$ denote an optimal solution and $z^*$ the optimal value.
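
To see what $P(\mathcal{B})$ optimizes, it may help to write the residual of the $i$-th constraint as $r_i(\alpha) = 2u_\alpha(y^i) - u_\alpha(z^i) - u_\alpha(x^i)$ (a notation introduced here for illustration). The first constraint reads $\epsilon^-_i - \epsilon^+_i = r_i(\alpha)$, and for a fixed $\alpha$ the sum $\epsilon^+_i + \epsilon^-_i$ is minimized by $\epsilon^-_i = \max(r_i(\alpha), 0)$ and $\epsilon^+_i = \max(-r_i(\alpha), 0)$. Hence $P(\mathcal{B})$ can be read as a least absolute deviations problem:
\[
z^* = \min_{\alpha \ge 0,\ \sum_l \alpha_l = 1} \ \sum_{i=1}^{N} \big| 2u_\alpha(y^i) - u_\alpha(z^i) - u_\alpha(x^i) \big| .
\]
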
%The optimal solution $\alpha^*_l$ characterize a spline function that best fits the constraints  and $z^*$ is the optimal value of $z$.
%The $N$ constraints induced from the database of
%triplets might not be sufficiently informative to determine a utility function with confidence.
%Indeed, they consist in linear combinations of preference examples and thus allow for more flexibility. In other words,the set of solutions respecting approximately the constraints could contain very diverse solutions, and the optimal solution $(u^* = \sum_l \alpha^*_l I_l, \epsilon^*)$ could consist in an arbitrary choice among them.
Since the number of observations is always limited, we need to assess the level of uncertainty of the utility function. To this end, we investigate a neighborhood of the optimal solution defined by $z^*\le z  \le z^* + \delta$ where $\delta$ is a tolerance threshold. This neighborhood $V_{\delta}(z^*)$ contains all spline functions that satisfy the constraints on utilities with an error at most equal to  $z^* + \delta$. The range of variation of utilities within this set is a good indicator of the level of uncertainty allowed by the constraints.
It can be measured by
the quantity $\rho=   \max_{x \in [x_m,x_M]}\{ \max_{\alpha \in V_{\delta}(z^*)} u_{\alpha}(x) - \min_{\alpha \in V_{\delta}(z^*)} u_{\alpha}(x)\}$ estimated by discretization of $[x_m, x_M]$. When $\rho$ is too large, the constraints are considered too weak to allow for the identifiability of $u$, and the $Q$ queries process should be continued. The elicitation procedure is formalized in Algorithm \ref{algorithm}.
%This principle will be illustrated below.
\begin{algorithm}[H]
 %\KwData{this text}
 %\KwResult{Utility elicitation with Q-queries}
% initialization\;
$i \leftarrow 0$, $\mathcal{B} \leftarrow \emptyset$
\\
 \Repeat{$\rho \leq \epsilon$}{
 % read current\;
 Select $A^i,z^i,R^i,r^i$ \\
 Ask queries $Q(y^i,z^i)$, then $Q(x^i,y^i)$ \\
 $\mathcal{B} \leftarrow \mathcal{B}  \cup \{(x^i,y^i,z^i)\}$\\
  $(\alpha^* , z^*) \leftarrow P(\mathcal{B})$\\
Compute $\rho$   \\
 $i \leftarrow i +1$
 }
 \caption{Utility elicitation with Q-queries}
 \label{algorithm}
\end{algorithm}
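
One possible way to evaluate $\rho$ in Algorithm \ref{algorithm} is sketched below. It relies on assumptions of our own choosing, namely a grid $x^{(1)}, \ldots, x^{(K)}$ of $[x_m, x_M]$, where $K$ is a hypothetical discretization parameter. For each grid point, two linear programs are solved over the feasible set of $P(\mathcal{B})$ augmented with the tolerance constraint:
\[
\overline{u}_k = \max \sum_{l=1}^L \alpha_l I_l(x^{(k)}), \qquad
\underline{u}_k = \min \sum_{l=1}^L \alpha_l I_l(x^{(k)}),
\]
both subject to the constraints of $P(\mathcal{B})$ and to $\sum_{i=1}^{N} (\epsilon^+_i + \epsilon^-_i) \le z^* + \delta$. The estimate is then $\rho \approx \max_{k=1,\ldots,K} (\overline{u}_k - \underline{u}_k)$. Since $x^{(k)}$ is fixed, each objective is linear in $\alpha$, so this scheme requires $2K$ linear programs per iteration of the loop.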
%\subsection{Numerical tests}

Let us illustrate our approach.
We simulate a $Q$ queries process with a DM answering according to a given CEU model $f_v^u$.
Answers to queries of type $Q(y,z)$ for a given
$z$ and a pair $(r, R)$ are simulated by solving the equation $f_v^u(\bar{y}A\bar{R}) = f_v^u(\bar{z}A\bar{r})$ which gives $y = u^{-1}\big(u(z)-[u(R)-u(r)]\,v(\bar{A})/(1-v(\bar{A}))\big)$.
Then $x$ is derived from $y$ using a similar process to simulate the answer to question $Q(x, y)$. Then the resulting triplet $(x, y, z)$ is slightly distorted using a random uniform noise. This process is iterated $N$ times for randomly chosen $z$, $r$, $R$, and $A$. We used the mathematical programming solver Gurobi (version  9.1.2) to perform the optimization task. The result of the learning process is presented in Figure \ref{fig:U-learning 1} where we increase the size of the database in order to reduce $\rho$. In this instance, $u$ is already well estimated with tight bounds for $N = 32$.
\begin{figure}[h]
  \centering
  \includegraphics[width=1\linewidth,height=4.5cm]{picture/output22.png}
  \caption{ Identification of the utility function $u$ for $N = 4$, $N= 16$, $N = 32$ (left to right). }\label{fig:U-learning 1}
\end{figure}

This experiment has been conducted for $1000$ utility functions $u$ randomly generated in the space of spline functions. Below we show the decrease of $\rho$ and of the distance $d(u,u_{\alpha^*})$ between the estimated utility function $u_{\alpha^*}$ and $u$ as the number $N$ of learning examples increases. The distance is computed as the average absolute difference between both functions on a discretization of $[x_m, x_M]$.
%The distance is estimated on a subdivision of $[x_m, x_M]$.

\begin{table}[h]
\centering
\begin{tabular}{cccc}
\toprule
{} & $N = 4$ & $N = 16$ & $N = 32$ \\
\midrule
$\rho$ &  0.687 &  0.124 &  0.072 \\
 $d(u,u_{\alpha^*})$& 0.354 &  0.024 &  0.004 \\
\bottomrule 
\end{tabular}
\caption{ $\rho$ and $d(u,u_{\alpha^*})$ w.r.t. the number of constraints $N$.}
\label{table1}
\end{table}
\section{Learning the capacity}

Given the utility function $u$ obtained as described in Section 4, we want to learn a sparse Möbius representation of the CEU model, based on Equation \ref{mobchoq}. However, penalizing with $\|m\|_1$ is not natural here: since
$\|m\|_1 \ge \sum_{B \subseteq S} m(B) = 1$ by definition, $\|m\|_1$ cannot decrease to zero, and any further reinforcement of the penalization becomes ineffective as soon as $\|m\|_1 =1$.

To overcome this problem, we use the following representation of $m$: $m(B) = 1/n + w_B$ if $|B|=1$ and $m(B) = w_B$ if $|B|>1$, where $w_B$ are real coefficients (positive or negative) such that $\sum_{B\subseteq S}w_B =0$ (hence $\sum_{B\subseteq S} m(B) =1$). Note that when all coefficients $w_B$ are null, CEU boils down to a simple instance of Expected Utility where states are equally weighted. In the general case, $w$ represents the gap to this basic model. In order to obtain a sparse Möbius representation, we penalize $\|w\|_1$ instead of $\|m\|_1$. This leads to solving the following linear program, which is a regularized version of (\ref{approx1}) reformulated with Möbius masses ($m_B$ are variables representing the masses $m(B), B \subseteq S$):

\begin{eqnarray*}
&\textstyle \min \sum_{i=1}^q (\epsilon^+_i + \epsilon^-_i) + \sum_{i=1}^p \epsilon_i + \lambda \sum_{B \subseteq S}(w^+_B +  w^-_B)\\
&\left\{
\begin{array}{lll}
\sum_{B \subseteq S} m_B (U_B^{x^i} -  U_B^{y^i}) +\epsilon^+_i - \epsilon^-_i = 0,~i=1...q\\
\sum_{B \subseteq S} m_B (U_B^{x^i} -  U_B^{y^i})  + \epsilon_i \ge 0,~i=1...p\\
m_B = 1/n + w_B, ~~\forall B \subseteq S: |B|=1\\
m_B = w_B, ~~\forall B \subseteq S: |B|>1\\
w_B  = w^+_B - w^-_B,~~\forall B \subseteq S \\
\sum_{C \subseteq B }m_{C \cup \{i\} } \geq 0, ~~ \forall i \in S, ~~\forall B \subseteq S\backslash \{i\}\\
\end{array}
\right.\\
&\epsilon^+_i \ge 0, \epsilon^-_i \ge 0, \epsilon_j \ge 0, w^+_B \ge 0, w^-_B \ge 0, m_B, w_B \in \mathbb{R}
\end{eqnarray*}%i=1...q, j=1...p$$

where $U_B^{x^i} =  \min_{j \in B}\{u(x^i_j)\}, \forall B \subseteq S, \forall i$.
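
As a small illustration of the reparameterization of $m$ by $w$ (the instance $n = 3$ is chosen here only as an example), consider again the maximin criterion, for which $m(S) = 1$ and all other masses are null. Then
\[
w_{\{i\}} = -\tfrac{1}{3} \ \ (i = 1, 2, 3), \qquad w_{\{i,j\}} = 0, \qquad w_S = 1,
\]
so that $\sum_{B \subseteq S} w_B = 0$ and $\|w\|_1 = 3 \cdot \tfrac{1}{3} + 1 = 2$, whereas the equally weighted Expected Utility model has $w = 0$ and thus $\|w\|_1 = 0$. The penalty term $\lambda \|w\|_1$ therefore measures the departure from the equally weighted model, and it can decrease to zero.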

The numbers of variables and constraints are respectively $2^{n+2} + 2q +p $  and $ q + p + 2^{n+1} +  n2^{n-1}$. In practice, the LP above remains tractable because the number of states $n$ under consideration is generally low (at most a dozen).
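
For instance, plugging two of the problem sizes considered in the experiments reported below into these expressions (simple arithmetic, with $q$ and $p$ left symbolic) gives:
\[
\begin{array}{lll}
n = 7: & 2^{9} + 2q + p = 512 + 2q + p \ \textnormal{variables}, & q + p + 2^{8} + 7 \cdot 2^{6} = q + p + 704 \ \textnormal{constraints},\\
n = 10: & 2^{12} + 2q + p = 4096 + 2q + p \ \textnormal{variables}, & q + p + 2^{11} + 10 \cdot 2^{9} = q + p + 7168 \ \textnormal{constraints}.
\end{array}
\]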


 %\subsection{Numerical experiment}
 
We now report the results of our numerical experiments to illustrate the sparse model learned with the linear program described above. Here again, the results are obtained with the Gurobi optimizer. First, we investigate how the generalization performance evolves with the sparsity of $m$. We generated preference data as follows. A utility function $u$ and a Möbius-sparse capacity $v$ are randomly generated and preferences compatible with $f^u_v$ are generated. Training data take the form of $N$ pairs $(x^i,y^i)$ of acts whose outcomes are randomly drawn from $\left[x_m,x_M\right]$. The preferences are derived from these pairs as follows: let $\Tilde{f}^u_v(x)$ be a perturbation of $f^u_v(x)$ by uniform noise randomly drawn from a given interval $[-\sigma, \sigma]$, for any act $x$. If the difference $|\Tilde{f}^u_v(x^i)-\Tilde{f}^u_v(y^i)|\le \sigma$, then $x^i$ and $y^i$ are considered as indifferent. If the difference is greater than $\sigma$, we conclude that a strict preference holds. Pairs with Pareto dominance are discarded.

For the sake of illustration, we present the results of our approach on a toy dataset with $n = 7$ states and $N = 100$ preference examples. Hence, we have $2^7-2$ parameters to learn.  The learning is performed for various values of  $\lambda$ (the weight of the regularization term) ranging from 0 to 100 in order to obtain a sequence of increasingly sparse representations. For each obtained model, we assess the performance in generalization by measuring the error rate on a test set of 1000 preferences. Figure \ref{fig:learning-v-1} (left) shows various possible tradeoffs between the test error and the compactness of the Möbius representation measured by $\|m\|_0$.
The curve shows that the introduction of the penalization term reduces both the test error and the number of non-null masses up to a point where we get close to the true model. Beyond this point, further enforcing sparsity is counterproductive and increases the test error.
Figure \ref{fig:learning-v-1} (right) represents three Möbius representations of the capacity respectively learned without regularization ($\lambda = 0$, plot (1)), with regularization and optimal tradeoff ($\lambda = 0.5$, plot (2)), and the true one (plot (3)). It shows that the penalty term is needed to recover a model close to the true one.


 \begin{figure}[h]
  \centering
  \includegraphics[scale = 0.27]{picture/output31.png}
  \caption{Test error versus $\|m\|_0$ }
  \label{fig:learning-v-1}
\end{figure}

A second experiment aims at highlighting the particular benefit of the regularization for small preference databases in terms of predictive performance. Figure \ref{fig:learning-v-2} illustrates the advantage of sparse models (obtained by regularization) for settings where the number of preference examples is small. We observed the average error rate on datasets of increasing sizes ranging from $N=50$ to $N = 1000$. The average is taken on 30 random datasets each time. We observe in Figure \ref{fig:learning-v-2} that the smaller the dataset, the bigger the increase of performance obtained by regularized models. %We used randomly generated preference databases of size 100 or more but it is important to note that a more efficient learning is possible, using much less preference examples in contexts where queries can be selected for their informative value. A preference example like $\bar{x}A\bar{z} \succ \bar{y}$ directly provides a lower bound: $v(A) >$.

 \begin{figure}[t!]
  \centering
   \includegraphics[scale = 0.218]{picture/output28.png}
  \caption{Comparative test error for sparse $(\lambda = 0.5)$ and dense  $(\lambda = 0)$ models w.r.t. the training set size $N$.}
  \label{fig:learning-v-2}
\end{figure}
Finally, we provide the result of a comparison between our approach (sparse regression) and a method based on 2-additive models (2-ADD) in Tables \ref{errortest} and \ref{timecomplexity}. We simulated 10 random models $f^u_v$ of dimension $n=5$ and $n=10$ and associated training sets of size $N = 70$ and $N = 400$ (and test sets of size $1000$).
We observe that our approach, which adapts sparsity to preference data, has significantly lower error rates than the method that enforces sparsity with 2-additivity. However, this advantage comes at an additional computational cost due to the larger number of variables.


 \begin{table}[t!]
 \centering
 \begin{tabular}{lcc}
\toprule
              $n$ &   5 &  10  \\
\midrule
     Sparse reg. & $5.98 \pm 4.36 \%$ & $10.28 \pm 3.24 \%$\\ 
%Non regularized ? & $... \pm ...$ & $... \pm ...$ \\
 2-ADD  & $ 12.09  \pm  4.98 \%$ & $15.98 \pm 1.18 \%$  \\
\bottomrule
\end{tabular}
\caption{Comparative average test error w.r.t. $n$. }
\label{errortest}
\end{table}


 \begin{table}[t!]
 \centering
 \begin{tabular}{lcc}
\toprule
              $n$ &   5 &  10  \\
\midrule
     Sparse reg. & $0.0077 \pm 0.0021$ & $23.52\pm 6.21$\\ 
%Non regularized ? & $... \pm ...$ & $... \pm ...$ \\
 2-ADD  & $ 0.0066 \pm  0.0022$ & $0.35 \pm 0.06$  \\
\bottomrule
\end{tabular}
\caption{Comparative average training time (sec) w.r.t. $n$. }
\label{timecomplexity}
\end{table}
\section{Conclusion}

We have presented a new approach for learning the utility function and the capacity in CEU. A spline representation of utilities is obtained via a regression from selected indifference learning examples. Then, a sparse representation of the capacity is obtained based on Möbius masses. Our tests confirm the practical effectiveness of the method. By proposing various tradeoffs between compactness and performance in the test phase, our approach allows the simplification of the general CEU model while maintaining the level of expressiveness required to describe the preference data.

A natural extension of this work is to develop an active learning version of our approach where the elicitation burden is oriented towards the determination of the best choice within a given set of alternatives. 
Also, an active selection of preference queries could reduce the number of examples required to learn a sparse representation of the capacity. Besides, we could extend our approach to the framework of multiattribute decision making. $Q$ queries could be adapted to learn a utility function per attribute using spline regression; then a sparse representation of the capacity could be learned to reveal the non-essential attributes and determine a multiattribute utility model keeping non-additive utilities only when necessary.   


\bibliography{herin_78}

\end{document}
