% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like

\usepackage{hyperref}       % hyperlinks
%\usepackage{xr}
%\externaldocument{suehiro_uai2022_supp}
\usepackage{nameref} 
\usepackage{zref-xr}
\zxrsetup{toltxlabel} 
\zexternaldocument*{suehiro_124-supp}

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
\mathtoolsset{showonlyrefs=true}
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

\usepackage{amssymb}
%% The amsthm package provides extended theorem environments
%% \usepackage{amsthm}

%% The lineno packages adds line numbers. Start line numbering with
%% \begin{linenumbers}, end it with \end{linenumbers}. Or switch it on
%% for the whole article with \linenumbers.
%% \usepackage{lineno}

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors

%\usepackage{jmlr2e}
\usepackage{times}
\usepackage{lscape}
%\usepackage[utf8]{inputenc} % allow utf-8 input
%\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
%\usepackage[pdftex]{hyperref}       % hyperlinks
\usepackage{caption}
\usepackage{enumitem}
%%%% mine %%%%%
\usepackage{amsfonts}
\usepackage{amsmath}
%\usepackage{tabularx}
\usepackage{braket}
\usepackage{boxedminipage}
\usepackage{epsf}
\usepackage{bm}
\usepackage{amsthm}
\usepackage{xspace}
\usepackage{wrapfig}
\usepackage{algorithm,algpseudocode}
\usepackage{multirow}
%\usepackage[numbers]{natbib}
%%%% natbib %%%%
%\newcommand{\citet}[1]
%{\citeauthor{#1}~\shortcite{#1}}
%\newcommand{\citep}{\cite}
%\newcommand{\citealp}[1]
%{\citeauthor{#1}~\citeyear{#1}}
%%%%%%%%%%%
\algnewcommand{\Inputs}[1]{%
  \State \textbf{Inputs:}
  \Statex \hspace*{\algorithmicindent}\parbox[t]{.8\linewidth}{\raggedright #1}
}
\algnewcommand{\Initialize}[1]{%
  \State \textbf{Initialize:}
  \Statex \hspace*{\algorithmicindent}\parbox[t]{.8\linewidth}{\raggedright #1}
}
\def\indot<#1>{\langle #1 \rangle}

% Definitions of handy macros can go here

% local.sty: local settings


\newtheorem{defi}{Definition}
\newtheorem{theo}{Theorem}
\newtheorem{prop}[theo]{Proposition}
\newtheorem{coro}[theo]{Corollary}
\newtheorem{rem}[theo]{Remark}
\newtheorem{lemm}{Lemma}
%\newtheorem{claim}{Claim}
\newtheorem{fact}{Fact}
\newtheorem{ex}{Example}
%\newcommand{\square}{\rlap{$\sqcup$}$\sqcap$}
%\def\beginproof{\noindent {\bf Proof.~}}
%\def\endproof{~~\square\bigskip}
\def\remark{\par\noindent\hangindent0pt{\bf Remark.}~}

% Proofs
%\def\remark{\par\noindent\hangindent0pt{\bf Remark.~}}
%\def\beginsome#1{\paragraphskip\noindent{\bf #1~}}
%\def\endsome{\paragraphskip}
%\def\beginproof{\par\noindent{\bf Proof.~}}
%\def\beginproofarg#1{\par\noindent{\bf #1.~}}
%\def\square{\rlap{$\sqcup$}$\sqcap$}
%\def\endproof{~~\square\paragraphskip}
%\def\endproofarg#1{~~\square~#1\paragraphskip}

% others
\def\OMIT#1{}
\def\newwd#1{{\em #1}}



% local.mac
\newcommand{\mnote}[1]{\marginpar{#1}}
\newcommand{\mynote}[1]{{\bf {#1}}}


%
% symbol.tex
%


\newcounter{nombre}
\renewcommand{\thenombre}{\arabic{nombre}}
\setcounter{nombre}{0}
\newenvironment{OP}[1][]{\refstepcounter{nombre}\par\bigskip \abovedisplayskip=0.5\abovedisplayskip \noindent{\sf OP \thenombre : #1}}{\par}

%\global\long\def\T#1{#1^{\top}}

\newcommand{\shapelets}{shapelets}
\newcommand{\shapelet}{shapelet}
%\newcommand{\LPS}{local pattern set}
%\newcommand{\LPS}{$\mathrm{LPM}$}
\newcommand{\LPS}[0]{LPM\xspace}
\newcommand{\targetH}{{local pattern matching hypothesis}}
\newcommand{\Nh}{N_{\mathrm{high}}}
%shogi
\newcommand{\fsvm}{f_{\mathrm{SVM}}}
\newcommand{\frsvm}{f_{\mathrm{RSVM}}}
\newcommand{\OPTH}{\mathrm{OPT}_\textrm{hard}}
\newcommand{\OPTS}{\mathrm{OPT}_\textrm{soft}}
%\newcommand{\LPS}{\mathrm{LP}_\mathrm{soft}}
\newcommand{\OPTSNU}{\mathrm{OPT}_\mathrm{soft}(\nu^+)}
\newcommand{\OPTG}{\mathrm{OPT}_{\mathcal{G}}}
\newcommand{\MILD}{\mathrm{MIL}_{\mathcal{D}}}
\newcommand{\ProbG}{\mathrm{Prob}_{{G}}}
\newcommand{\optG}{\mathrm{opt}_{\mathcal{G}}}
%\newcommand{\qed}{$\square$}
%\newcommand{\qed}{\hspace{\fill}$\square$}
%\newenvironment{proof}{\begin{trivlist} \item{\bf Proof }}{\end{trivlist}}
\newenvironment{claim}{\begin{trivlist}\item[]\textit{Claim}}{\end{trivlist}}
\newcommand{\AUC}{\mathrm{AUC}}
%\newcommand{\bphi}{\boldsymbol{\phi}}
\newcommand{\bphi}{{\mathbf \phi}}
%\newcommand{\bx}{\boldsymbol{x}}
\newcommand{\bx}{{\mathbf x}}
\newcommand{\by}{{\mathbf y}}
\newcommand{\bzero}{{\mathbf 0}}
%\newcommand{\bw}{\boldsymbol{w}}
\newcommand{\bmu}{\boldsymbol{\mu}}
%\newcommand{\bmu}{{\mathbf \mu}}
\newcommand{\bsigma}{\boldsymbol{\sigma}}
\newcommand{\bomega}{{\boldsymbol{\omega}}}
\newcommand{\blambda}{\boldsymbol{\lambda}}
\newcommand{\kernel}{\boldsymbol{\mathrm{K}}}
%\newcommand{\kernel}{\mathrm{K}}
\newcommand{\Vmat}{\boldsymbol{\mathrm{V}}}
\newcommand{\Xmat}{\boldsymbol{\mathrm{X}}}
\newcommand{\bw}{{\mathbf w}}
\newcommand{\bW}{{\mathbf W}}
\newcommand{\bd}{{\mathbf d}}
\newcommand{\bk}{{\mathbf{k}}}
\newcommand{\bv}{{\mathbf v}}
\newcommand{\bu}{{\mathbf u}}
\newcommand{\bz}{{\mathbf z}}
\newcommand{\allins}{P_S}
\newcommand{\hatallins}{\hat{P}_S}
\newcommand{\multiallins}{P_S}
\newcommand{\op}{\textsf{OP}}
%\newcommand{\bz}{\boldsymbol{z}}
%\newcommand{\bs}{\boldsymbol{s}}
\newcommand{\bs}{{\mathbf s}}
%\newcommand{\bt}{\boldsymbol{t}}
\newcommand{\btau}{{\boldsymbol{\tau}}}
\newcommand{\bt}{{\mathbf t}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\bgamma}{\boldsymbol{\gamma}}
\newcommand{\bxi}{\boldsymbol{\xi}}
\newcommand{\bzeta}{\boldsymbol{\zeta}}
\newcommand{\sbsq}{\mathrm{sub}}
\newcommand{\edge}{\mathrm{edge}}
\newcommand{\Ksub}{K_{\mathrm{sub}}}
\newcommand{\Ourmethod}[0]{our method\xspace}
\newcommand{\Ourshape}[0]{our shape.\xspace}
\newcommand{\hsigma}{\widehat\sigma}
\newcommand{\convhull}{\mathcal{H}}
\newcommand{\err}{\mathrm{err}}
\newcommand{\RW}{\mathrm{RW}}
\newcommand{\RCS}{\mathrm{RCS}}
\newcommand{\RV}{\mathrm{RV}}
\newcommand{\SRS}{\mathrm{SRS}}
\newcommand{\RSG}{\mathrm{RSG}}
\newcommand{\sign}{\mathrm{sign}}
\newcommand{\dom}{\mathcal{X}} %domain of interest
\newcommand{\domp}{\mathcal{X}^{pos}} 
\newcommand{\domn}{\mathcal{X}^{neg}} %
\newcommand{\range}{\mathcal{Y}} %range
\newcommand{\Natural}{\mathbb{N}} % 
\newcommand{\Real}{\mathbb{R}} % Euclidean space
\newcommand{\Hilbert}{\mathbb{H}} % Hilbert space
\newcommand{\Prob}{\mathbb{P}}
\newcommand{\F}{\mathrm{False}}
\newcommand{\T}{\mathrm{True}}
\newcommand{\calL}{\mathcal{L}}
\newcommand{\calS}{\mathcal{S}}
\newcommand{\calB}{\mathcal{B}}
\newcommand{\calF}{\mathcal{F}}
\newcommand{\calG}{\mathcal{G}}
\newcommand{\calT}{\mathcal{T}}
\newcommand{\calY}{\mathcal{Y}}
\newcommand{\Hyp}{\mathcal{H}}
\newcommand{\calW}{\mathcal{W}}
\newcommand{\EX}{\mathrm{EX}}
\newcommand{\filtEX}{\mathrm{FiltEX}}
\newcommand{\HSelect}{\mathrm{HSelect}}
\newcommand{\WL}{\mathrm{WL}}
%\newcommand{\breg}{D}
\newcommand{\vecx}{\boldsymbol{x}}
\newcommand{\vecy}{\mbox{\boldmath $y$}}
\newcommand{\vecw}{\boldsymbol{w}}
\newcommand{\vecz}{\boldsymbol{z}}
\newcommand{\vecg}{\mbox{\boldmath $g$}}
\newcommand{\veca}{\mbox{\boldmath $a$}}
\newcommand{\vecd}{\boldsymbol{d}}
\newcommand{\vecell}{\mbox{\boldmath $\ell$}}
\newcommand{\vecsigma}{\boldsymbol{\sigma}}
\newcommand{\vecpi}{\boldsymbol{\pi}}
%\newcommand{\vecv}{\mbox{\boldmath $v$}}
\newcommand{\vecxi}{\boldsymbol{\xi}}
\newcommand{\vece}{\mbox{\boldmath $e$}}
\newcommand{\vecB}{\mbox{\boldmath $B$}}
\newcommand{\vecD}{\mbox{\boldmath $D$}}
\newcommand{\vecI}{\mbox{\boldmath $I$}}
%\newcommand{\vecpi}{\mbox{\boldmath $\pi$}}
\newcommand{\tr}{\mathrm{tr}}
\newcommand{\vecG}{\mbox{\boldmath $G$}}
\newcommand{\vecF}{\mbox{\boldmath $F$}}
\newcommand{\tvecu}{\tilde{\mbox{\boldmath $u$}}}
\newcommand{\tvecw}{\tilde{\mbox{\boldmath $w$}}}
\newcommand{\tvecx}{\tilde{\mbox{\boldmath $x$}}}
\newcommand{\tw}{\tilde{w}}
\newcommand{\tx}{\tilde{x}}
\newcommand{\haty}{\hat{y}}
\newcommand{\hata}{\hat{a}}
\newcommand{\vecf}{\mbox{\boldmath $f$}}
\newcommand{\vectheta}{\mbox{\boldmath $\theta$}}
\newcommand{\vecalpha}{\boldsymbol{\alpha}}
\newcommand{\vecbeta}{\mbox{\boldmath $\beta$}}
\newcommand{\vectildealpha}{\widetilde{\vecalpha}}
\newcommand{\vectildebeta}{\widetilde{\vecbeta}}
\newcommand{\tildealpha}{\widetilde{\alpha}}
\newcommand{\tildebeta}{\widetilde{\beta}}
\newcommand{\vechatalpha}{\widehat{\vecalpha}}
\newcommand{\vechatbeta}{\widehat{\vecbeta}}
\newcommand{\hatalpha}{\widehat{\alpha}}
\newcommand{\hatbeta}{\widehat{\beta}}
\newcommand{\vectau}{\mbox{\boldmath $\tau$}}
\newcommand{\veclambda}{\bm{\lambda}}
\newcommand{\vecu}{\mbox{\boldmath $u$}}
\newcommand{\vecv}{\mbox{\boldmath $v$}}
\newcommand{\vecp}{\boldsymbol{p}}
\newcommand{\vecq}{\mbox{\boldmath $q$}}
\newcommand{\vecr}{\boldsymbol{r}}
\newcommand{\vecc}{\boldsymbol{c}}
\newcommand{\fp}{\mathrm{fp}}
\newcommand{\fn}{\mathrm{fn}}
\newcommand{\ouralg}{{Our algorithm}~}
\newcommand{\Ouralg}{PUMMA~}%{Modified ROMMA~}

\newcommand{\bn}{\Delta_2} % binary entropy
\newcommand{\psimp}{\mathcal{P}} %
\newcommand{\hatgamma}{\hat{\gamma}}
%\newcommand{\myexample}{\langle x,f(x) \rangle}
\newcommand{\indctr}[1]{I(#1)}
\newcommand{\CLASS}{\mathcal{C}}
\newcommand{\VC}{\mathrm{VC}}

\newcommand{\reg}{\mathcal{R}}
\newcommand{\breg}{D}

\newcommand{\filtex}{\mathrm{GenD_t}}
\newcommand{\gensamp}{\mathrm{GenSample}}

\newcommand{\argmax}{\mathop{\rm arg~max}\limits}
\newcommand{\argmin}{\mathop{\rm arg~min}\limits}
%\newcommand{\Expo}{\mathop{\rm  E}\limits}
\newcommand{\Expo}{\mathop{\mathbb{E}}\limits}
%\newcommand{\Expo}{\mathbb{E}}

\newcommand{\half}{\frac{1}{2}}
\newcommand{\eps}{\varepsilon}

\newcommand{\hp}{\hat{p}}
\newcommand{\hmup}{\hat{\mu}[+]}
\newcommand{\hmun}{\hat{\mu}[-]}
\newcommand{\hgp}{\hat{\gamma}[+]}
\newcommand{\hgn}{\hat{\gamma}[-]}
\newcommand{\gain}{\Delta}
\newcommand{\hgain}{\hat{\Delta}}
\newcommand{\vecdelta}{\boldsymbol{\delta}}

\newcommand{\tildeO}{\Tilde{O}}
\newcommand{\permset}{S}
\newcommand{\base}{\boldsymbol{B}}
\newcommand{\calC}{\mathcal{C}}
\newcommand{\calP}{\mathcal{P}}
\newcommand{\calX}{\mathcal{X}}
\newcommand{\calZ}{\mathcal{Z}}
\newcommand{\calV}{\mathcal{V}}
\newcommand{\calH}{\mathcal{H}}
\newcommand{\calE}{\mathcal{E}}
\newcommand{\calD}{\mathcal{D}}
\newcommand{\Rdm}{\mathfrak{R}}
\newcommand{\GC}{\mathfrak{G}}
\newcommand{\hullC}{\mathrm{conv}(\calC)}
\newcommand{\hullH}{\mathrm{conv}(H)}
\newcommand{\conv}{\mathrm{conv}}

\newcommand{\calHMC}{\mathcal{H}^{\mathrm{MC}}}
\newcommand{\calHLC}{\mathcal{H}^{\mathrm{LC}}}
\newcommand{\calHTR}{\mathcal{H}^{\mathrm{TR}}}
\newcommand{\calHMI}{\mathcal{H}^{\mathrm{MI}}}
\newcommand{\calhatHMC}{\widehat{\mathcal{H}}^{\mathrm{MC}}}
\newcommand{\calhatHLC}{\widehat{\mathcal{H}}^{\mathrm{LC}}}
\newcommand{\calhatHTR}{\widehat{\mathcal{H}}^{\mathrm{TR}}}
\newcommand{\calhatHMI}{\widehat{\mathcal{H}}^{\mathrm{MI}}}
\newcommand{\calhatH}{\widehat{\mathcal{H}}}
\newcommand{\calhatG}{\widehat{\mathcal{G}}}

\newcommand{\ellMI}{\ell^{\mathrm{MI}}}
\newcommand{\ellb}{\ell_{\mathrm{b}}}
\newcommand{\ellsb}{\ell_{\mathrm{sb}}}

\newcommand{\SMI}{S_{\mathrm{MI}}}

\newcommand{\Rmin}{R_{\mathrm{min}}}
\newcommand{\Rmax}{R_{\mathrm{max}}}
\newcommand{\RMC}{R^{\mathrm{MC}}}
\newcommand{\RLC}{R^{\mathrm{LC}}}
\newcommand{\RTR}{R^{\mathrm{TR}}}
\newcommand{\emRmin}{\widehat{R}_{\mathrm{min}}}
\newcommand{\emRmax}{\widehat{R}_{\mathrm{max}}}
\newcommand{\emRMC}{\widehat{R}^{\mathrm{MC}}}
\newcommand{\emRLC}{\widehat{R}^{\mathrm{LC}}}
\newcommand{\emRTR}{\widehat{R}^{\mathrm{TR}}}
\newcommand{\emR}{\widehat{R}}
\newcommand{\MLMI}{\ell^{\mathrm{MI}}}
\newcommand{\MLTR}{\ell^{\mathrm{TR}}}
\newcommand{\MLMC}{\ell^{\mathrm{MC}}}
\newcommand{\RMI}{R^{\mathrm{MI}}}
\newcommand{\emRMI}{\widehat{R}^{\mathrm{MI}}}
\newcommand{\ROMI}{R^{\mathrm{OMI}}}
\newcommand{\emROMI}{\widehat{R}^{\mathrm{OMI}}}
\newcommand{\fv}{f^{(v)}}
\newcommand{\Fv}{F^{(v)}}
\newcommand{\Fvi}{F^{(v_i)}}

\newcommand{\nP}{{n_{\mathrm{P}}}}
\newcommand{\nN}{{n_{\mathrm{N}}}}
\newcommand{\nL}{{n_{\mathrm{L}}}}
\newcommand{\nC}{{n_{\mathrm{C}}}}
%
%
%
\def\ceil#1{%
\left\lceil #1 \right\rceil}

\def\defeq{%
\stackrel{\mathrm{def}}{=}}

\def\floor#1{%
\lfloor #1 \rfloor}

\def\myhang{%
    \par\noindent\hangindent20pt\hskip20pt}
\def\nitem#1{%
    \par\noindent\hangindent40pt
    \hskip40pt\llap{#1~}}


\newcommand{\E}{\boldsymbol{E}}
%\newcommand{\note}{}

\newcommand{\pdiff}{\Phi_{\mathrm{diff}}(\multiallins)}
\newcommand{\calI}{\mathcal{I}}

\newcommand{\GMIL}{MIL\xspace}






%\newcommand{\dataset}{{\cal D}}
%\newcommand{\fracpartial}[2]{\frac{\partial #1}{\partial  #2}}

% Heading arguments are {volume}{year}{pages}{submitted}{published}{author-full-names}

%\jmlrheading{1}{2000}{1-48}{4/00}{10/00}{Marina Meil\u{a} and Michael I. Jordan}

% Short headings should be running head and authors last names

%\ShortHeadings{Learning with Mixtures of Trees}{Meil\u{a} and Jordan}
%\firstpageno{1}





%\title{Reduction Scheme for Empirical Risk Minimization and Its Applications to Multiple-Instance Learning}
\title{Simplified and Unified Analysis of Various Learning Problems \\by Reduction to Multiple-Instance Learning}

\author[1, 2]{Daiki Suehiro}

\affil[1]{Kyushu University, Department of Advanced Information Technology, %Department and Organization
            744 Motooka, 
            Fukuoka, Japan}
\affil[2]{RIKEN, Center for Advanced Intelligence Project, %Department and Organization
            Nihonbashi 1-chome Mitsui Building, 15th floor,1-4-1 Nihonbashi, Chuo-ku, 
            Tokyo, Japan}

\author[3]{Eiji Takimoto}
\affil[3]{Kyushu University, Department of Informatics,%Department and Organization
            744 Motooka, 
            Fukuoka,
            Japan}


\begin{document}
\maketitle

\begin{abstract}
%% Text of abstract
In statistical learning, many problem formulations have been proposed,
such as multi-class learning, complementarily labeled learning, multi-label learning, and multi-task learning, which provide theoretical models for various real-world tasks. 
Although they have been extensively studied, the relationship among them
has not been fully investigated.
In this work, we focus on a particular problem formulation called
Multiple-Instance Learning (MIL), and show that various learning problems,
including all the problems mentioned above as well as some new problems, can be
reduced to MIL with theoretically guaranteed generalization bounds.
The reductions are established under a new reduction scheme
that we provide as a by-product. 
The results imply that the MIL-reduction gives a simplified and
unified framework for designing and analyzing algorithms for various
learning problems.
Moreover, we show that the MIL-reduction framework can be kernelized.
\end{abstract}

\section{Introduction}
\label{sec:intro}
In this study, we explore how a large class of learning problems can be reduced to the Multiple-Instance Learning (MIL) problem.
This is strongly motivated by the results of~\citet{Sabato:2012:MLA} and~\citet{suehiro2020multiple}.
\citet{suehiro2020multiple} showed that some local-feature-based learning
problems can be reduced to a MIL problem, which suggests that MIL
has a high capability of representing various learning problems.
However, the MIL problem obtained by their reduction is rather specific, whereas \citet{Sabato:2012:MLA} proposed
a much more general formulation of MIL;
we therefore believe that a wider class of learning problems can be reduced to MIL.
\par
We provide a MIL-reduction scheme and reveal that various learning problems,
such as multi-class learning, complementarily labeled learning, multi-label learning, and multi-task learning, can be reduced to MIL.
Through the reduction, we immediately obtain learning algorithms as well as generalization bounds, the latter derived from the results of~\citet{Sabato:2012:MLA}. 
That is, our reduction scheme greatly \emph{simplifies} the analyses of generalization bounds compared with those in previous work~\citep[e.g.,][]{Lei19,ishida2017learning,pmlr-v32-yu14,pontil2013excess}.
Some of the obtained generalization bounds are competitive with, or incomparable to, the existing results. 
In particular, for multi-label learning, we derive an improved generalization bound, and for complementarily labeled learning, we derive a novel learning algorithm, which is the first polynomial-time algorithm in a certain setting.
Moreover, we propose three new learning problems, \emph{multi-label learning with perfectionistic loss}, \emph{top-1 ranking learning} and \emph{top-1 ranking learning with negative feedback},
and we demonstrate that they can be reduced to MIL as well.
The results imply that our MIL-reduction gives a \emph{unified scheme} for designing and analyzing algorithms for various learning problems.
\par
To provide the MIL-reduction scheme, we propose a general reduction scheme among learning problems.
Our scheme has two remarkable features as described below.
First, our reduction transforms every instance-label pair $(x,y)$
in the given sample of the original learning problem to an instance-label pair $(x',y')$ to form a sample of the reduced learning problem. 
In contrast, standard reduction schemes employ an instance
transformation and a label transformation separately, to construct
$x'$ from $x$ and $y'$ from $y$, respectively.
Therefore, our scheme enables us to design reduction algorithms
among a wider class of learning problems, e.g., learning-to-rank
to classification, and supervised learning to weakly supervised learning.
Second, our reduction scheme ensures that
the Empirical Risk Minimization (ERM) of the reduced problem implies the ERM of the original one, while the empirical Rademacher complexity
of the hypothesis class (composed with the loss function) is preserved through the reduction.
This means that we can employ an existing ERM algorithm for the
reduced problem to obtain an ERM algorithm for the original problem
with a theoretically guaranteed generalization bound, which is
immediately derived from a known generalization bound for the reduced problem.
We also show that the MIL-reduction scheme can be kernelized. 
\par
The main contributions are summarized as follows:
\begin{itemize}
\item We propose a general reduction scheme based on the ERM, which allows us to
derive a generalization risk bound for the original problem immediately.
\item We demonstrate that several learning problems, ranging from traditional to new ones, can be reduced to MIL. The results imply that our MIL-reduction gives a simplified and unified scheme for the analysis of various learning problems.
\item We obtain novel theoretical results for some learning problems.
\item We show that the MIL-reduction scheme can be kernelized.
\end{itemize}
Several proofs are deferred to the supplementary material.

%\input{2_prelim}
\section{Preliminaries}
\label{sec:prelim}
For an integer $u$, $[u]$ denotes the set $\{1,\ldots, u\}$.
$I(\mathrm{e})$ denotes the indicator function of the event $\mathrm{e}$,
that is, $I(\mathrm{e})=1$ if $\mathrm{e}$ is true and $I(\mathrm{e})=0$ otherwise.

A learning problem is represented by a pair
$(\calH, \ell)$ of a hypothesis class
$\calH \subseteq \{h: \calX \rightarrow \calY\}$
and a loss function 
$\ell: \calX \times \calY \times \calH \rightarrow \Real$
for some input space $\calX$ and output space $\calY$.
A learner receives a sample $S=((x_1, y_1), \ldots, (x_n, y_n))$ where
each input-output pair $(x_i, y_i)$ is drawn i.i.d. according to an unknown distribution $\calD$ over $\calX \times \calY$.
The goal of the learner is to find, with high probability, a hypothesis $h \in \calH$ so that the generalization
risk $R_\calD(h) = \mathbb{E}_{(x,y)\sim \calD} \ell(x, y, h)$ is small. 
For a learning problem $(\calH, \ell)$, we define a class of loss functions as
$\calhatH = \{(x,y) \mapsto \ell(x,y,h) \mid h \in \calH \}$
when the underlying loss function $\ell$ is clear from the context.
We give the definition of the empirical Rademacher complexity,
which is used to bound the generalization risk.
\begin{defi}
[Empirical Rademacher complexity~\citep{Bartlett:2003:RGC}]
Given a sample $S=((x_1,y_1), \dots, (x_n,y_n)) \in (\calX \times \calY)^n$, 
the empirical Rademacher complexity $\Rdm_S(\calhatH)$ of a class
$\calhatH$ w.r.t.~$S$ is defined as
 $\Rdm_S(\calhatH)=\frac{1}{n}\mathbb{E}_{\vecsigma}\left[
\sup_{g \in \calhatH}\sum_{i=1}^n \sigma_i g(x_i,y_i)
 \right]$,
 where $\vecsigma \in \{-1,1\}^n$ and each $\sigma_i$ is an independent uniform random variable taking values in $\{-1,+1\}$.
\end{defi}
\paragraph{Generalization risk bound~\citep{mohri2018foundations}}
Let $(\calH,\ell)$ be a learning problem whose loss $\ell$ takes values in $[0,1]$, and let $S$ be a sample of size $n$
drawn i.i.d. according to a distribution $\calD$.
Then, it holds with probability at least $1-\delta$ that
for all $h \in \calH$,
\begin{align}
    \label{align:genbound}
    R_{\calD}(h) \leq \emR_{S}(h) + 2\Rdm_S(\calhatH) + 3\sqrt{\nicefrac{\log (\nicefrac{2}{\delta})}{2n}},
\end{align}
where $\emR_{S}(h) = \frac{1}{n}\sum_{i=1}^n \ell(x_i,y_i,h)$ denotes the empirical risk of $h$ for sample $S$.

%\input{3_ERM_reduction_scheme}
\section{Reduction scheme for ERM}
We propose a general reduction scheme for empirical risk minimization and provide useful theoretical results.
\par
\begin{defi}[{ERM-reduction}]
\label{def:erm-reducible}
  A learning problem $(\calH, \ell)$ over input-output space $\calX \times \calY$ is \emph{ERM-reducible} to another learning problem $(\calH', \ell')$
  over input-output space $\calX' \times \calY'$
  if there exist polynomial-time computable functions
  $\alpha: \calX \times \calY \rightarrow \calX' \times \calY'$ and
  $\beta: \calH' \rightarrow \calH$
  such that for any $(x, y) \in \calX \times \calY$ and for any $h' \in \calH'$,
  \begin{align}
    \label{align:reduce_condition}
    \ell(x, y, h) = \ell'(x', y', h'),
  \end{align}
  where $(x', y') = \alpha(x, y)$ and $h=\beta(h')$.
\end{defi}

We now show the key relationship between the original problem and the reduced problem.
\begin{prop}%[{Transferable results}]
  \label{prop:main}
  Suppose that $(\calH, \ell)$ is ERM-reducible to $(\calH', \ell')$ with transformations $\alpha$ and $\beta$.
  For any sample $S=((x_1, y_1), \ldots, (x_n, y_n)) \in (\calX \times \calY)^n$,
  the following holds:
  \begin{enumerate}[label=(\roman*)]
  \item (In)equality of the ERMs: 
  \begin{align}
    \label{align:erm_reduction_standard}
    \min_{h \in \calH} \emR_S(h) &\leq \min_{h \in \calH_\beta} \emR_S(h) \\ 
    &= \min_{h' \in \calH'}\emR_{S'}(h'),
  \end{align}
  where $\calH_\beta = \{\beta(h') \mid h' \in \calH'\}$ and
  $S' = ((x'_1,y'_1),\ldots,(x'_n,y'_n))$ with
  $(x_i', y_i') = \alpha(x_i, y_i)$ for $i \in [n]$.
\item Empirical Rademacher complexity preserving:
  \begin{align}
    \Rdm_S(\calhatH_\beta) = \Rdm_{S'}(\calhatH'). 
  \end{align}
  \end{enumerate}
\end{prop}
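\par
The argument behind Proposition~\ref{prop:main} can be sketched as follows; this is only an outline under the notation of Definition~\ref{def:erm-reducible}, not a full proof.
Applying condition~\eqref{align:reduce_condition} to each $(x_i, y_i)$ gives, for every $h' \in \calH'$,
\begin{align*}
\emR_S(\beta(h')) &= \frac{1}{n}\sum_{i=1}^n \ell(x_i, y_i, \beta(h')) \\
&= \frac{1}{n}\sum_{i=1}^n \ell'(x'_i, y'_i, h') = \emR_{S'}(h').
\end{align*}
Minimizing both sides over $h' \in \calH'$ yields the equality in (i), and the inequality follows from $\calH_\beta \subseteq \calH$.
In the same way, for every fixed $\vecsigma \in \{-1,1\}^n$,
\begin{align*}
\sup_{h \in \calH_\beta} \sum_{i=1}^n \sigma_i \ell(x_i, y_i, h)
= \sup_{h' \in \calH'} \sum_{i=1}^n \sigma_i \ell'(x'_i, y'_i, h'),
\end{align*}
and taking the expectation over $\vecsigma$ and dividing by $n$ gives (ii).
\par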
Proposition~\ref{prop:main} suggests a straightforward learning procedure.
Given a sample $S$ of the original problem, 
we construct the sample $S'$ of the reduced problem using $\alpha$ and obtain $h'$ by solving the ERM of the reduced problem. We then obtain the final hypothesis $h = \beta(h')$.
\par

We derive the following generalization risk bound using
Proposition~\ref{prop:main}.
\begin{coro}
\label{coro:risk_bound_reduced}
Let $S = ((x_1,y_1),\ldots, (x_n, y_n))$ be a sample 
drawn i.i.d. according to an unknown distribution $\calD$ for an original problem $(\calH, \ell)$.
If $(\calH, \ell)$ is ERM-reducible to $(\calH', \ell')$ with transformations $\alpha$ and $\beta$,
then for $S'= (\alpha(x_1, y_1), \ldots, \alpha(x_n, y_n))$,
the following generalization risk bound holds
with probability at least $1-\delta$ for all $h' \in \calH'$ and $h = \beta(h')$:
\begin{align}
    R_\calD(h) \leq \emR_{S'}(h') 
     + 2\Rdm_{S'}(\calhatH')
     + 3\sqrt{\nicefrac{\log (\nicefrac{2}{\delta})}{2n}}.
\end{align}
\end{coro}
That is, the generalization bound for the original problem follows directly from that for the reduced problem,
because both the empirical risk and the empirical Rademacher complexity are preserved.
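\par
In more detail, the chain of (in)equalities behind Corollary~\ref{coro:risk_bound_reduced} can be sketched as follows.
Applying the bound~\eqref{align:genbound} to the problem $(\calH_\beta, \ell)$ and then Proposition~\ref{prop:main} gives, for every $h = \beta(h')$ with $h' \in \calH'$,
\begin{align*}
R_\calD(h) &\leq \emR_{S}(h) + 2\Rdm_{S}(\calhatH_\beta)
 + 3\sqrt{\nicefrac{\log (\nicefrac{2}{\delta})}{2n}} \\
&= \emR_{S'}(h') + 2\Rdm_{S'}(\calhatH')
 + 3\sqrt{\nicefrac{\log (\nicefrac{2}{\delta})}{2n}},
\end{align*}
where the equality also uses $\emR_S(\beta(h')) = \emR_{S'}(h')$, which follows from condition~\eqref{align:reduce_condition}.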

%\input{4_mil_formulation}
\section{MIL-Reduction framework}
This section presents the main results of this paper.
We define ERM-reducibility to \GMIL and show the condition for reducibility.
Moreover, we show that some theoretical analyses can be simplified.
We use primed symbols (e.g., $\calX'$)
to indicate that they belong to MIL as the reduced problem.
\subsection{Problem formulation of MIL}
Let $\calZ \subseteq \Real^{d'}$ be the instance space. Let $\calX' \subseteq 2^\calZ$ be the input space, where
a \emph{bag} $x' \in \calX'$ is a finite set of instances chosen from $\calZ$.
Let $\calY' = \{-1, 1\}$ be the output space.
Following the formulation of~\citet{Sabato:2012:MLA}, we define, for the rest of the paper, a MIL problem as a pair $(\calH', \ell')$ of a
hypothesis class $\calH'$ and a loss function $\ell'$ of the form:
\begin{align}
\label{align:mil_H}
\calH'\!&=\!\{h': x'\!\mapsto \Psi_p(\{f_2(g(z))\!\mid z\!\in x'\})\! \mid\! g \in \calG\}, \\
\label{align:mil_l}
\ell' &: (x',y',h') \mapsto f_1(y'h'(x')),
\end{align}
where $\calG \subseteq \{g: \calZ \to \Real\}$,
$f_1: \Real \to [0,1]$ is an $a$-Lipschitz function, 
$f_2: \Real \rightarrow [-1, 1]$ is a $b$-Lipschitz function, and 
$\Psi_p: 2^{[-1,1]} \rightarrow [-1,1]$ is a $p$-norm-like function, which is defined for any $p \in [1,\infty)$ as
\begin{align}
\Psi_p(V) = \left(\frac{1}{m} \sum_{i=1}^m \left(v_i + 1 \right)^p \right)^{1/p}\! -1
\end{align}
for every finite set $V=\{v_1,v_2,\dots,v_m\} \subseteq [-1,1]$.
We define $\Psi_\infty$ as $\lim_{p\rightarrow \infty} \Psi_p$. 
Note that $\Psi_p$ is $1$-Lipschitz for any $p$ \citep[see][]{Sabato:2012:MLA}. 
In MIL tasks, $\Psi_p$ is chosen by the user and aggregates the instance-level outputs of a bag. Typical choices of $\Psi_p$ are the $\max$ operator ($p=\infty$) and the average ($p=1$).
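\par
As a quick check of these two special cases, the following computation follows directly from the definition of $\Psi_p$ (using $v_i + 1 \geq 0$):
\begin{align*}
\Psi_1(V) &= \frac{1}{m}\sum_{i=1}^m (v_i + 1) - 1 = \frac{1}{m}\sum_{i=1}^m v_i, \\
\Psi_\infty(V) &= \max_{i \in [m]} (v_i + 1) - 1 = \max_{i \in [m]} v_i.
\end{align*}
For instance, for the illustrative set $V = \{-1, 0, 1\}$ we get $\Psi_1(V) = 0$, $\Psi_2(V) = \sqrt{\nicefrac{5}{3}} - 1 \approx 0.29$, and $\Psi_\infty(V) = 1$.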

The only difference from the hypothesis class of~\citet{Sabato:2012:MLA}
is $f_2$. Although $f_2$ appears redundant (because $f_2 \circ g$ can be replaced by a single function), it plays an important role in the reductions (examples are shown in Section~\ref{sec:examples}).

We now define MIL-reducibility as a special case of ERM-reducibility.
\begin{defi}[MIL-reducibility]
A learning problem $(\calH, \ell)$ is said to be \GMIL-reducible if there exists a
MIL problem $(\calH', \ell')$ such that $(\calH, \ell)$ is ERM-reducible to $(\calH', \ell')$.
\end{defi}
Hereinafter, the scheme for ERM-reduction to MIL is called the \emph{MIL-reduction scheme}.  
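\par
As a toy illustration of this definition (introduced here only to fix ideas; the setting below is an assumption and not one of the reductions studied later), consider binary classification with $\calX = \calZ$, $\calY = \{-1, 1\}$, $\calH = \{h: x \mapsto f_2(g(x)) \mid g \in \calG\}$, and $\ell(x, y, h) = f_1(y h(x))$.
Take $\alpha(x, y) = (\{x\}, y)$ and $\beta(h'): x \mapsto h'(\{x\})$, assuming that $\calX'$ contains all singleton bags.
Since $\Psi_p(\{v\}) = ((v+1)^p)^{1/p} - 1 = v$ for any $v \in [-1,1]$, we have
\begin{align*}
\ell'(\{x\}, y, h') &= f_1\left(y\, \Psi_p(\{f_2(g(x))\})\right) \\
&= f_1\left(y f_2(g(x))\right) = \ell(x, y, \beta(h')),
\end{align*}
so this problem is \GMIL-reducible with bags of size one.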

%\input{4-1_error_bound}
\subsection{Rademacher complexity bound}
We show an empirical Rademacher complexity bound for 
\GMIL-reducible problems using our reduction scheme.
As mentioned above, the main advantage of our reduction scheme is that it 
allows us to apply the empirical Rademacher complexity bound of 
the reduced problem to the original problems.
In this paper, we utilize the bound provided by~\citet{Sabato:2012:MLA}.
\begin{theo}[An application of Theorem 20 of~\citet{Sabato:2012:MLA}]
\label{theo:sabato}
Let $(\calH', \ell')$ be a MIL problem defined by Eqs.~\eqref{align:mil_H} and~\eqref{align:mil_l}.
Let $S' = ((x'_1, y'_1), \ldots, (x'_n, y'_n))$ be a sample
with average bag size $r_{S'}$.
%i.i.d. drawn according to an unknown distribution $\calD$ over $\calX' \times \calY'$. 
% Let $\calG \subseteq [-Q, Q]^\calZ$ be a function class . Assume that $f_1: \Real \to \Real$ is $a$-Lipschitz and let $f_2: \Real \to [-1,1]$.
Let $\calhatG = \{f_2 \circ g \mid g \in \calG\}$.
% \ell has a full range?
%For any $p \in [1,\infty]$, 
If there exist $C, \rho \geq 0$ such that for all sufficiently large $n$,
\begin{align}
\Rdm_{S'}(\calhatG) \leq \frac{C\ln^\rho(n)}{\sqrt{n}},
\end{align}
then 
%then there exists a number $N \geq 0$ that depends only on
%$C, \rho$, and $L$ such that 
%of the training sample $S'$,
%\footnotesize
\begin{align}
    \Rdm_{S'}(\calhatH') = O \left(
    \frac{
    \log \left(a^2 n^2 r_{S'} \right) 
    \left(\frac{aC}{\rho+1}\ln^{\rho+1}(a^2n) \right)
    }
    {\sqrt{n}}
    \right),
    % \Rdm_{S'}(\calhatH') \leq 
    % \frac{
    % 4+10\log \left(4e a^2 Q^2 n^2 r_{S'} \right) 
    % \left(N+ \frac{aC}{\rho+1}\ln^{\rho+1}(16a^2n) \right)
    % }
    % {\sqrt{n}},
\end{align}
%\normalsize
where $\calhatH' = \{\hat{h}': (x', y') \mapsto f_1(y' h'(x')) \mid h' \in \calH' \}$.
% and $\calH' = \{h': x' \mapsto \Psi_p\left(
%     \{
%     f_2\left(
%     g(z)
%     \right)
%     \mid z \in x'
%     \}
%     \right)\} $.
%where $\calhatH' = \{\hat{h}': x' \mapsto f_1(y' \Psi_p(\{f_2(g(z)) \mid z \in x' \})) \}$.
\end{theo}
%For the 
As mentioned by~\citet{Sabato:2012:MLA}, 
we obtain the following bound when $\calG$ is a set of linear functions.
\begin{coro}
Let $\calG = \{g: z \mapsto \langle {w'}, z \rangle \mid w' \in \Real^{d'}, \|w'\| \leq C_1\}$ and assume that $\|z\| \leq C_2$.
Then, the following bound holds:
%\footnotesize
\begin{align}
\label{coro:risk_bound_linear}
    \Rdm_{S'}(\calhatH') = O \left(
    \frac{
    \log \left(a^2  n^2 r_{S'} \right) 
    \left({abC_1C_2}\ln(a^2n) \right)
    }
    {\sqrt{n}}
    \right).
    % \Rdm_{S'}(\calhatH) \leq 
    % \frac{
    % 4+10\log \left(16e a^2 C_1^2 r_{S'} n^2 \right) 
    % \left(N+ {aC_1C_2}\ln(16a^2n) \right)
    % }
    % {\sqrt{n}}.
\end{align}
%\normalsize
\end{coro}
The above bound is easily derived from 
%$\Rdm_{S'}(\calG) \leq \nicefrac{2\ln(n)}{\sqrt{n}}$,
the result on $\Rdm_{S'}(\calhatH')$ \citep[see the proof of Theorem 20 of][]{Sabato:2012:MLA} and the bound
$\Rdm_{S'}(\calhatG) \leq b\Rdm_{S'}(\calG) \leq \nicefrac{bC_1C_2}{\sqrt{n}} = \nicefrac{bC_1C_2 \ln^0(n)}{\sqrt{n}}$ \citep[see, e.g., Theorems 5.8 and 5.10 of][]{mohri2018foundations}.
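\par
Concretely, the substitution can be sketched as follows.
The last inequality shows that the assumption of Theorem~\ref{theo:sabato} holds with $C = bC_1C_2$ and $\rho = 0$, so the second factor in the numerator of the bound becomes
\begin{align*}
\frac{aC}{\rho+1}\ln^{\rho+1}(a^2 n) = abC_1C_2 \ln(a^2 n),
\end{align*}
which is the factor appearing in the corollary above.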
\par
Using Theorem~\ref{theo:sabato} and Corollary~\ref{coro:risk_bound_reduced}, we obtain a generalization risk bound for \GMIL-reducible problems.

%\input{4-2_complexity_and_opt}
\subsection{Learning algorithm}
\label{subsec:algo}
%We show there are two subclass of \GMIL-reducible problems from the perspective 
%of the learning algorithms.
%One is the class that can be solved by convex optimization, 
%another is the class that can be solved by DC (Difference of Convex) optimization.
% For the reduced MIL problems that satisfy some conditions,
% we can immediately design a learning algorithm according to the condition. 
We show that, under mild conditions, the ERM of MIL becomes a convex or a DC (difference of convex functions) programming problem.
Suppose that 
$\calG$ is a set of linear functions:
\begin{align}
\label{align:class_g_linear}
\calG = \{g: z \mapsto \langle w', z \rangle \mid w' \in \Real^{d'}, \|w'\| \leq C_1\}.    
\end{align}
%and a mapping function $\Phi: \calX \rightarrow \Hilbert$: 
Let $S' = ((x'_1, y'_1), \ldots, (x'_n, y'_n))$.
The ERM of \GMIL is formulated as follows:
\begin{align}
\label{align:erm_optprob}
    \min_{\|w'\| \leq C_1} \! \lambda \|w'\|^2
    \!+\! \sum_{i=1}^n f_1
    \left(
    y'_i
    \Psi_p \left( \left\{
    f_2\left(
    \langle w', z \rangle
    \right)
    \mid z \in x'_{i}
    \right\} \right)
    \right).
\end{align}
For the optimization problem~\eqref{align:erm_optprob}, 
we show that the following propositions hold.
\begin{prop}%[SP-\GMIL-reducible]
\label{prop:poly}
%We assume that the kernel function $K$ is polynomially computable.
If %$(\calH, \ell)$ is \GMIL-reducible to $(\calH', \ell')$ and 
$y_i'=-1$ for all $i \in [n]$
in the sample $S'$,
$f_1$ is convex and nonincreasing\footnote{More precisely, the extended-value extension of $f_1$ must also be nonincreasing \citep[see][for details]{boyd-vandenberghe:book04}.}, 
% and its extended-value extension 
% $\tilde{f_1}$ is nonincreasing~\footnote{Roughly speaking, it means that $f_1$ is still nonincreasing outside the domain, e.g., $f(c)=c^2$ over $\Real_{-}$ is nondecreasing convex over the domain but its extended-value extension is not nondecreasing. See details in~\citep{boyd-vandenberghe:book04}}, 
and $f_2$ is a nondecreasing convex function,
and $\calG$ is given as~Eq.\eqref{align:class_g_linear},
then the ERM of
$(\calH', \ell')$ is a convex programming problem.
%is SP (Solvable in Polynomial time)-\GMIL-reducible.
\end{prop}
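As an illustration of Proposition~\ref{prop:poly}, consider the following special case. The choices here are assumptions made only for this sketch and are not part of the proposition: $p=\infty$, so that $\Psi_\infty$ is the maximum over the bag; $f_2$ is the identity; $f_1(c)=\max\{0, 1-c\}$ is the hinge loss; and every bag is finite. With $y'_i=-1$ for all $i \in [n]$, the ERM~\eqref{align:erm_optprob} then reads
\begin{align}
    \min_{\|w'\| \leq C_1} \lambda \|w'\|^2
    + \sum_{i=1}^n \max\left\{0,\; 1 + \max_{z \in x'_i} \langle w', z \rangle \right\}.
\end{align}
Each summand is the maximum of finitely many affine functions of $w'$ and is therefore convex; the regularizer is convex and the feasible set is a Euclidean ball, so this instance is a convex programming problem.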

\begin{prop}%[SDC-\GMIL-reducible]
\label{prop:DC}
%We assume that the kernel function $K$ is polynomially computable.
If %$(\calH, \ell)$ is \GMIL-reducible to $(\calH', \ell')$ and
$f_1$ is a nonincreasing convex function\footnotemark[1],
% and its extended-value extension 
% $\tilde{f_1}$ is nonincreasing, 
$f_1(c)$ is a homogeneous function of degree $1$
for $c \in [-1, 1]$\footnote{For example, the hinge loss $f(c)= \max\{0, 1-c\}$ satisfies this condition.},
$f_2$ is a nondecreasing convex function,
and $\calG$ is given as~Eq.\eqref{align:class_g_linear},
then the ERM of
$(\calH', \ell')$ is a DC programming problem.
%is SDC (Solvable by DC algorithm)-\GMIL-reducible.
\end{prop}
% \begin{proof}[Proof of Proposition~\ref{prop:poly}]
% First we have that $\hat{f}=f_2 \circ g$ is convex function of $w$ bacause
% $f_2$ is a nondecreasing convex and $\langle w, z \rangle$ is a convex function
% (see, e.g., Eq. (3.11) in~\cite{boyd-vandenberghe:book04}).
% Next, we show that $\Psi_p \circ \hat{f}$ is a convex function.
% Without loss of generality, we can consider $\Psi_p$ as a function $\Real^m \to \Real$ where $m$ is the size of the set $x'$. $\Psi_p$ is nondecreasing function in each argument and $\hat{f}$ is convex and thus $\Psi_p \circ \hat{h}$ is convex.
% Finally, since $f_1$ is nondecreasing convex, $f_1(-\Psi_p(\{f_2(\langle w, z \rangle) \mid z \in x' \})$ is convex.
% \end{proof}
% \begin{proof}[Proof of Proposition~\ref{prop:DC}]
% Since $f_1$ is a homogeneous function of degree $1$, we have
% $f_1(-\Psi_p(\{f_2(\langle w, z \rangle) \mid z \in x' \})) = -f_1(\Psi_p(\{f_2(\langle w, z \rangle) \mid z \in x' \}))$.
% By we proved in Proof of Proposition~\ref{prop:poly}, $f_1(-\Psi_p(\{f_2(\langle w, z \rangle) \mid z \in x' \}))$ is convex. Moreover, we have $f_1(\Psi_p(\{f_2(\langle w, z \rangle) \mid z \in x' \})) = -f_1(-\Psi_p(\{f_2(\langle w, z \rangle) \mid z \in x' \}))$ 
% and thus $f_1(\Psi_p(\{f_2(\langle w, z \rangle) \mid z \in x' \}))$ is concave.
% Therefore, we have that $f_1(\Psi_p(\{f_2(\langle w, z \rangle) \mid z \in x' \})) + f_1(-\Psi_p(\{f_2(\langle w, z \rangle) \mid z \in x' \}))$ is a DC function.
% \end{proof}
Generally, it is difficult to find a global minimum
of a DC programming problem; however, it is known that an $\epsilon$-approximate local optimum can be found~\citep[see, e.g.,][]{le2018dc}.
A standard DC algorithm for solving~\eqref{align:erm_optprob} is given as Algorithm~\ref{alg:DCA} in Sec.~\ref{sec:appendix_DCA}.
\par
The propositions indicate that, if $(\calH, \ell)$ is \GMIL-reducible to 
$(\calH', \ell')$ and satisfies either of the above conditions, 
then the solution $h \in \calH_\beta$
in the original problem can be obtained by a unified learning algorithm.

%\input{5_reducible_problems}
\section{\GMIL-reducible Examples}
\label{sec:examples}
In this section, we demonstrate that various learning problems
can be reduced to MIL by the proposed reduction scheme.
The results imply that our MIL-reduction gives a unified scheme
for designing and analyzing learning algorithms for various learning problems\footnote{The reductions of multi-task learning and top-1 ranking learning with negative feedback are given in Secs.~\ref{sec:multi-task} and~\ref{sec:trl_neg}
owing to space limitations.}.
%First, we show that some existing learning problems can be MIL-reducible.
\subsection{The existing problems}
\subsubsection{Multi-class learning problem}
\label{subsec:mcl}
{\bf Problem setting:}
%\paragraph{Problem setting}
Let $\calX \subseteq \Real^d$ be an instance space, and $\calY=[k]$ be an output space.
The learner receives the set of labeled instances
$S = ((x_1,y_1), \ldots, (x_n, y_n)) \in (\calX \times \calY)^n$,
where each instance is drawn i.i.d. according to some unknown distribution $\calD$.
The learner predicts the label of $x$ using the hypothesis
$h \in \calH = \{x \mapsto \arg\max_{j \in [k]} \langle w_j, x \rangle \mid \forall j\in[k],w_j \in \Real^d \}$. %where $W=\{w_1, \ldots, w_k\}$.
Let $\ell: (x, y, h) \mapsto \Gamma(\langle w_y, x\rangle - \max_{j \in \calY \backslash y} \langle w_j, x \rangle)$ be a loss function, 
where $\Gamma: \Real \rightarrow [0,1]$ is a convex, 
nonincreasing and $a$-Lipschitz function.
%$I(y \neq h_W(x)) = I(\langle w_y, \Phi(x) - \max_{y' \in \calY \backslash y} \langle w_y', \Phi(x) \rangle \leq 0)$.
%The goal of the learner is to find $h \in \calH$ with small expected risk:
The generalization risk and empirical risk of $h$ are defined as:
\begin{align}
  R_\calD(h)\!=\!\Expo_{(x,y)\sim \calD}\! \ell \left(x,y, h \right),
  \emR_S(h)\!=\! \frac{1}{n}\sum_{i=1}^n\! \ell \left(x_i,y_i,\! h \right).
\end{align}
We obtain the following by using the MIL-reduction scheme:
%\paragraph{Reduction to \GMIL}
\begin{theo}
\label{theo:mcl_reduction}
The multi-class learning problem is \GMIL-reducible.
\end{theo}
\begin{proof}
For any $(x,y)$, we define
\begin{align}
  \label{align:dk_z}
\eta_{(x,y)} = (\bzero, \ldots, \bzero, \underbrace{x}_{y\mathrm{-th~block}}, \bzero, \ldots, \bzero),
\end{align}
where $\bzero$ is a $d$-dimensional vector, the elements of which are all $0$.
In the \GMIL-reduction scheme, 
suppose that
$p=\infty$; $f_1(c)=\Gamma(2cC_1C_2)$, $f_2(c)=c/(2C_1C_2)$ (a scaling of the score to $[-1,+1]$); $\alpha(x,y) = (x'_{(x,y)}, y')$ where $x'_{(x,y)}=\{\eta_{(x,j)} - \eta_{(x,y)} \mid \forall j \in \calY \backslash y\}$; $y'=-1$; for any $z \in \Real^{kd}$,
$\calG=\{g: z \mapsto \langle (w'_1, \ldots, w'_k), z \rangle \mid 
w'_j \in \Real^d, \forall j\in [k], \|W'\| \leq C_1 \}$ where 
$W'=(w'_1, \ldots, w'_k)$ and
$\|W'\| = \sqrt{\sum_{j=1}^k\|w'_j\|^2}$; $\beta(h'): x \mapsto \arg\max_{j \in [k]} \langle w'_j, x \rangle$.
Then, for any $(x,y)$ and $h \in \calH$,
\begin{align}
&\ell'(x',y',h')
=
 f_1\left(
 y' \Psi_p\left( 
 \left\{
 f_2 \left(
 g(z) \right) \mid z \in x'_{(x,y)}
 \right\}
 \right)
 \right)
 \\
 &=
\Gamma \left(-2C_1C_2 \Psi_\infty \left(\left\lbrace \frac{g(z)}{2C_1C_2} \mid z \in x_{(x,y)}'\right\rbrace \right)\right)\\
&=  
\Gamma \left(-2C_1C_2 \max \left\lbrace \frac{g(z)}{2C_1C_2} \mid z \in x_{(x,y)}' \right\rbrace \right)\\
&=
\Gamma\left(-\frac{2C_1C_2}{2C_1C_2} \max \left\lbrace g(z) \mid z \in x_{(x,y)}'\right\rbrace\right)\\
&= 
\Gamma\left(
 -\left(\max
 \left\{ \langle w', \eta_{(x,j)} - \eta_{(x,y)}\rangle \mid \forall j \in \calY \backslash y \right\} \right) 
 \right)\\
% \Gamma\left(
% - \Psi_\infty\left( 
% \{
% g(z) \mid z \in x'_{(x,y)}
% \}
% \right)
% \right)
% \\
 &= \Gamma\left(
 -\left(\max_{j \in \calY\backslash y}
 \left(\langle w_j, x \rangle - \langle w_y, x \rangle\right) \right)
 \right)\\
 &= 
 \ell(x, y, h) 
 \end{align}
\end{proof}
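To make the construction concrete, consider a small illustrative case; the values $k=3$ and $y=2$ are chosen here only as an example. Then $\eta_{(x,1)}=(x,\bzero,\bzero)$, $\eta_{(x,2)}=(\bzero,x,\bzero)$, and $\eta_{(x,3)}=(\bzero,\bzero,x)$, so the bag contains $k-1=2$ instances:
\begin{align}
x'_{(x,2)} = \left\{ (x, -x, \bzero),\; (\bzero, -x, x) \right\}.
\end{align}
For $W'=(w'_1, w'_2, w'_3)$, the two instance scores are $\langle w'_1 - w'_2, x \rangle$ and $\langle w'_3 - w'_2, x \rangle$. Their maximum is negative exactly when class $2$ attains the strictly largest score $\langle w'_j, x \rangle$, that is, when $\beta(h')$ predicts the correct label.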
% The key idea of the proof is to construct $\alpha(x,y)=(x',y')$ 
% such that the bag $x'$ corresponding to $(x,y)$
% has the number $k-1$ of $kd$-dimensional instances as
% $x' = \{z_1, \ldots, z_{k-1}\} = \{(x, \bzero, \ldots, -x, \bzero \ldots, \bzero), \ldots, (\bzero, \ldots, -x, \bzero \ldots, x)\}$,
% where $\bzero$ is an all-zero $d$-dimensional vector, 
% $-x$ locates $y$-th block and $x$ is another block in each $z$.
% Moreover, $y'_i = -1$ for all $i \in [n]$ in the reduced problem.
% Although the obtained solution $w'$ is $d'=kd$-dimensional vector,
% we can easily construct $\beta$ which decomposes $w'$ into $k$ of $d$-dimensional vector.
% The details are shown in Sec.\ref{sec:mcl_reduction}.
% \begin{proof}
% For any $(x,y)$, we define
% \begin{align}
%   \label{align:dk_z}
% \eta_{(x,y)} = (\bzero, \ldots, \bzero, \underbrace{x}_{y\mathrm{th~block}}, \bzero, \ldots, \bzero),
% \end{align}
% where $\bzero$ is a vector, the elements of which are all $0$.
% On the \GMIL-reduction framework, 
% suppose that
% $p=\infty$; $f_1=\Gamma$, $f_2(c)=c$ (identity function); $\alpha(x,y) = (x'_{(x,y)}, y')$ where $x'_{(x,y)}=\{\eta_{(x,j)} - \eta_{(x,y)} \mid \forall j \in \calY \backslash y\}$; $y'=-1$; for any $z \in \Real^{kd}$,
% $\calG=\{g_W: z \mapsto \langle (w_1, \ldots, w_k), z \rangle \mid \|W\| \leq 1 \}$ where $\|W\| = \sqrt{\sum_{j=1}^k\|w_j\|^2}$; $\beta(h'_W) = h_W$.
% Then, for any $(x,y)$ and $h_W \in \calH$,
% \begin{align}
% \ell'(x', y', h'_W)
% =&
% f_1\left(
% y' \Psi_p\left( 
% \{
% f_2 \left(
% g(z) \mid z \in x'_{(x,y)}
% \}
% \right)
% \right)
% \right)
% \\
% =&
% \Gamma\left(
% - \Psi_\infty\left( 
% \{
% g(z) \mid z \in x'_{(x,y)}
% \}
% \right)
% \right)
% \\
% =& \Gamma\left(
% -\left(\max_{j \in \calY\backslash y}
% \left(\langle w_j, x \rangle - \langle w_y, x \rangle\right) \right)
% \right)\\
% = &
% \ell(x, y, h_W) 
% \end{align}
% \end{proof}
\par
The empirical Rademacher complexity is immediately derived as follows by
observing the reduction process.
%\paragraph{Generalization bound}
\begin{coro}
\label{coro:mc_bound}
% Let $S'= (x'_1=\alpha(x_1, y_1), \ldots, x'_n=\alpha(x_n, y_n))$. Let $\calH' = \Psi_\infty(\{f_2(g(z))\mid z \in x'\})$ and let $\calH_\beta = \{\beta(h') \mid h' \in \calH'\}$. Let $\calhatH' = \{f_1(y'\Psi_\infty(\{f_2(g(z))\mid z \in x\})) \mid g \in \calG \}$. 
% The following bound holds with a high probability of at least $1-\delta$ for all $h_W \in \calH_\beta$:
% \begin{align}
%     R_{\calD}(h_W) \leq \emR_{S'}(h') + 2\Rdm_{S'}(\calhatH')+ 3\sqrt{\frac{\log \frac{2}{\delta}}{2n}}.
% \end{align}
% where
We assume that $\|x_i\| \leq C_2$ for any $i \in [n]$.
In the reduced MIL problem from multi-class learning problem,
the empirical Rademacher complexity of $\calhatH'$ is given as:
\begin{align}
\Rdm_{S'}(\calhatH') = O \left(
    \frac{
    \log \left(\hata^2 2n^2 (k-1) \right) 
    \left({2\hata}\ln(\hata^2n) \right)
    }
    {\sqrt{n}}
    \right),
    % \frac{
    % 4+10\log \left(16e a^2 n^2 (k-1) \right) 
    % \left(N+ {2a}\ln(16a^2n) \right)
    % }
    % {\sqrt{n}}.
%\quad \mathrm{or} \quad 
\end{align}
where $\hata=2aC_1C_2$ and we assume $\|w'\| \leq C_1$ in the reduced MIL.
\end{coro}
We used the fact that the bag size is $(k-1)$ for all $x_i'$ (i.e., $r_{S'}=k-1$)
and $\Rdm_{S'}(\calhatG) \leq \nicefrac{2}{\sqrt{n}}$ by setting $f_2(c)=\nicefrac{c}{2C_1C_2}$. %and $\|z\| \leq 2C_2$ for any $z \in x_i', \forall i \in [n]$.
Using Corollary~\ref{coro:risk_bound_reduced}, we can obtain the generalization
risk bound for the multi-class learning.
%and thus $|\bigcup_{i=1}^n x'_i| \leq \sum_{i=1}^n|x'_i| = n(k-1)$. 
%Note that 
%we have $\|\tau_{(\bx, y')} - \tau_{(\bx, y)}\| \leq 1$ via kernel normalization without loss of generality.
%Moreover, we used the fact that $\|\tau_{(\bx, y')} - \tau_{(\bx, y)}\| \leq \sqrt{2}\Lambda$.
\par
The learning algorithm is obtained by the following result.
%\paragraph{ERM algorithm}
\begin{coro}
The reduced ERM of the MIL from multi-class learning is a convex programming problem.
\end{coro}
The proof of Theorem~\ref{theo:mcl_reduction} shows that
$f_2$ is nondecreasing convex and $y_i'=-1$ for all $i \in [n]$.
Therefore, by Proposition~\ref{prop:poly}, if $\Gamma$ is a nonincreasing convex function, the ERM of the reduced MIL problem is 
a convex programming problem and can be solved in polynomial time.





\subsubsection{Complementarily labeled learning problem}
\label{subsubsec:complementarily}
  Complementarily labeled learning was proposed by~\citet{ishida2017learning}.
 In this problem, some training instances are complementarily labeled (e.g., instance $x_i$ is NOT $y_i$).
 We essentially follow the problem setting and some assumptions provided by~\citet{ishida2017learning}. 
%\paragraph{Problem setting}
\par
{\bf Problem setting:}
Let $\calX \subseteq \Real^d$ be an instance space and $\calY=[k]$ be an output space.
Let $\calD$ be an unknown distribution over $\calX \times \calY$.
We assume that the learner receives a sample $S$ drawn i.i.d. according to the distribution $\calD'$
which provides the true label with unknown probability $\theta$ and the complementary label with unknown probability $1-\theta$.
Moreover, we assume that the complementary label is chosen with a uniform probability
(i.e., all complementary labels are equally chosen with the probability $1/(k-1)$).~\footnote{This assumption was proposed by~\citet{ishida2017learning} as a reasonable scenario in some practical tasks (e.g., crowdsourcing).}
More formally, we assume that the sample is given as
$S = ((x_1,y_1, \gamma_1), \ldots, (x_n, y_n, \gamma_n))$ 
which is drawn i.i.d. according to the distribution $\calD'$ over $\calX \times \calY \times \{\F, \T\}$,
where $\gamma_i=\T$ means that $y_i$ is the true label
and $\gamma_i=\F$ means that $y_i$ is the complementary label (i.e., it indicates that $x_i$ is NOT $y_i$).
For any $(x, y) \sim \calD$, $\calD'(x, y, \T) = \theta$ and $\calD'(x, \bar{y}, \F) = \frac{1-\theta}{k-1}$ for any $\bar{y} \neq y$ (i.e., the complementary label is chosen with a uniform probability).
The other basic settings are the same as those for the aforementioned multi-class learning.
The learner predicts the label of $x$ using the hypothesis
$h \in \calH = \{x \mapsto \arg\max_{j \in [k]} \langle w_j, x \rangle \mid \forall j\in[k], w_j \in \Real^d \}$. %where $W=\{w_1, \ldots, w_k\}$.
The final goal of the learner is to find $h \in \calH$ with a small multi-class classification risk:
\begin{align}
  \RMC_\calD(h) = \Expo_{(x,y)\sim \calD} I \left(y \neq h(x) \right).
\end{align}
%\par
However, it is difficult to minimize the empirical multi-class classification risk
directly using the complementarily labeled data.
Therefore, we consider the following risk\footnote{\citet{ishida2017learning} used a different surrogate risk. However, they and we have a common goal: to minimize $\RMC_{\calD}(h)$.}.
%it is reasonable to use the following empirical risk
\begin{align}
  \label{align:|cl_origin}
  \RLC_{\calD'}(h)= 
  \Expo_{(x, y, \gamma)\sim \calD'}
  \left[I\left( \gamma = (y \neq h(x))  \right)  \right].
 % \RLC_{\calD}(h)=\Expo_{(\bx, y)\sim \calD}
 % \left[\frac{1}{k-1}\sum_{\bar{y}\neq y}I\left(\max_{y'\neq \bar{y}}h(\bx) \neq \bar{y}  \right)  \right].
\end{align}
Under this risk, when $\gamma = \T$,
the learner incurs no loss if it predicts the true label $y$.
When $\gamma = \F$, the learner incurs no loss if it
predicts any label other than the assigned complementary label $y$.
Thus, the risk measure is defined using the pair $(y, \gamma) \in (\calY \times \{\F, \T\})$.
%The zero-one loss can be formulated as $\ell: ((y, \gamma), \hat{y}) \mapsto I(\gamma =(y \neq \hat{y}))$. 
%Note that again we assume that the complementarily label is chosen with uniform probability.
We can show that achieving a small $\RLC_{\calD'}(h)$ is consistent with 
achieving a small $\RMC_{\calD}(h)$ as follows:
\begin{lemm}
  \label{lemm:comp_gen}
For any $h \in \calH$, $\RMC_{\calD}(h) = \frac{k-1}{\theta(k -2)+1}\RLC_{\calD'}(h)$ holds.
\end{lemm}
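Two boundary cases give a quick sanity check of Lemma~\ref{lemm:comp_gen}; they are obtained simply by substituting $\theta=1$ and $\theta=0$ into the stated factor:
\begin{align}
\theta = 1:\;\; \RMC_{\calD}(h) = \RLC_{\calD'}(h),
\qquad
\theta = 0:\;\; \RMC_{\calD}(h) = (k-1)\,\RLC_{\calD'}(h).
\end{align}
The first case is ordinary multi-class learning, where the two risks coincide. The second case is consistent with the uniform assumption on complementary labels: a wrong prediction coincides with the assigned complementary label with probability $1/(k-1)$.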
% \begin{proof}
%   By the assumption of $\calD'$, the expected risk $\RLC_{\calD'}(h_{W})$ is represented using $\calD$, $k$, and $\theta$
%   as follows:
%   \begin{align}
%     \label{align:rlcd}
%     \RLC_{\calD'}(h_{W}) = \Expo_{(x, y)\sim \calD}
%      \left[\theta I\left((y \neq h_{W}(x))  \right) + 
%       (1- \theta)\sum_{\bar{y}\neq y}\frac{1}{k-1}
%       I\left(\bar{y}=h_{W}(x)  \right).
%       \right]    
%   \end{align}
%   Let  $\rho_1=I \left(y \neq h_W(x) \right)$ in $\RMC_{\calD}(h_{W})$ and
%   let $\rho_2=\theta I\left((y \neq h_{W}(x))  \right) +  (1- \theta)\sum_{\bar{y}\neq y}\frac{1}{k-1}  I\left((\bar{y}=h_{W}(x))  \right)$ in $\RLC_{\calD'}(h_{W})$.
%   We consider two cases of $h_{W}$ for any $h_{W} \in \calH$ as follows:
%   For a fixed $(x, y)$,
%   (i) If $h_{W}(x) = y$: $\rho_1=0$ and $\rho_2=0$, and thus there is no gap. %$\rho=0$.
%   (ii) If $h_{W}(x) \neq y$:  The first term of $\rho_2$
%   is $\theta$ and the second term is equal to $(1-\theta)/(k-1)$,
%   %$\rho \leq 1$
%   because there exists a unique $\hat{y}:\hat{y} \neq y$ that satisfies $\hat{y} = h_{W}(x)$.
%   Therefore, $\rho_2$ is equal to $\theta + \frac{1-\theta}{k-1}$.
%   Moreover, in this case, $\rho_1 = 1$.
%   Thus, we have the bound $\frac{k-1}{\theta(k -2)+1}\RLC_{\calD'}(h_{W}) =  \RMC_{\calD}(h_{W})$.
% \end{proof}
%The proof is in the supplementary materials.
Thus, minimizing $\RLC_{\calD'}(h)$ is a reasonable way to
achieve a high multi-class classification accuracy.
\par
Generally, there is no loss function 
$\ell((x,\gamma),y,h)$ which is a convex upper bound on 
the zero-one loss $I\left( \gamma = (y \neq h(x))  \right)$
over the domain $w$. This is because if $I(\gamma=\T)=1$ then $\max$
is convex w.r.t. $w$; however, if $I(\gamma=\T)=-1$ then $-\max=\min$ is concave w.r.t. $w$.
Therefore, we consider the convex upper bounded loss only on the risk for complementarily labeled data (i.e., the concave risk for the normally labeled data)
using $\Gamma: \Real \rightarrow [0,1]$ as $\Gamma\left(\max_{j \in \calY\backslash y}\langle (w_{j} - w_y), x\rangle\right)$.
We then define the nonconvex risk
%\footnotesize
%\begin{align}
$\ell(x,(\gamma,y),h) = \Gamma\left(I(\gamma=\T) \times \left(\max_{j \in \calY\backslash y}\langle (w_{j} - w_y), x\rangle\right) \right).$
%\end{align}
%\normalsize
%An example of $\Gamma$ is hinge-loss, that is, $\Gamma(c)=\max(0, 1-c)$.
The empirical risk is formulated as:
\begin{align}
  \label{align:er_lcl_origin}
  \emRLC_{S}(h) = \frac{1}{n}\sum_{i=1}^n
  \ell \left(x_i,(\gamma_i, y_i), h \right).  
%  I\left(\gamma_i = (y_i \neq h_\bW(\bx_i))  \right).
\end{align}
%\paragraph{Reduction to \GMIL}
\par
The following is obtained by the MIL-reduction scheme.
\begin{theo}
\label{theo:cll_reduction}
Complementarily labeled learning is \GMIL-reducible.
\end{theo}
% \begin{proof}
% We use $\tau_{(x,y)}$ defined in~(\ref{align:dk_z}).
% On \GMIL-reduction framework, suppose that 
% $p=\infty$; $f_1=\Gamma$; $f_2$ is identity function; $\alpha(x,(\gamma,y)) = (x'_{(x,y)}, y')$ where $x'_{(x,y)}=\{\tau_{(x,\hat{y})} - \tau_{(x,y)} \mid \forall \hat{y}\neq y\}$; $y'=I(\gamma=\T)$; for any $z \in \Hilbert^{k}$,
% $\calG=\{g_W: z \mapsto \langle (w_1, \ldots, w_k), z \rangle \mid \|W\|_\Hilbert \leq 1 \}$ where $\|W\| = \sqrt{\sum_{j=1}^k\|w_j\|^2}$; $\beta(h'_W) = h_W$.
% Then, for any $(x,y)$ and $h_W \in \calH$,
% \begin{align}
% \ell'(x', y', h'_W)
% =&
% f_1\left(
% y' \Psi_p\left( 
% \{
% f_2 \left(
% g(z) \mid z \in x'_{(x,y)}
% \}
% \right)
% \right)
% \right)
% \\
% =&
% \Gamma\left(
% I(\gamma=\T) \times \Psi_\infty\left( 
% \{
% g(z) \mid z \in x'_{(x,y)}
% \}
% \right)
% \right)
% \\
% =& \Gamma\left(
% I(\gamma=\T) \times \left(\max_{j \in \calY\backslash y}
% \left(\langle w_j, x \rangle - \langle w_y, x \rangle\right) \right)
% \right)\\
% =& 
% \ell(x, (\gamma, y), h_W).
% \end{align}
% \end{proof}
The only difference from the reduction of multi-class learning is that $y'$ takes values in $\{-1, 1\}$.
Here, $y'$ acts as a \emph{switch} that selects between the loss for complementarily labeled data and the loss for normally labeled data.
%\paragraph{Generalization bound}
\par
The empirical Rademacher complexity is bounded as:
%The generalization bound for complementarily labeled learning is given as follows:
\begin{coro}
% Let $S'= ((x'_1, y'_1), \ldots, (x'_n, y'_n))$. Let $\calH' = \Psi_\infty(\{f_2(g(z))\mid z \in x'\})$ and let $\calH_\beta = \{\beta(h') \mid h' \in \calH'\}$. Let $\calhatH' = \{f_1(y'\Psi_\infty(\{f_2(g(z))\mid z \in x'\})) \mid g \in \calG \}$. 
% The following bound holds with a high probability of at least $1-\delta$ for all $h_W \in \calH_\beta$:
% \begin{align}
%     R_{\calD}(h_W) \leq \emR_{S'}(h'_W) + 2\Rdm_{S'}(\calhatH')+ 3\sqrt{\frac{\log \frac{2}{\delta}}{2n}}.
% \end{align}
% where
We assume that $\|x_i\| \leq C_2$ for any $i \in [n]$.
In the reduced MIL problem from complementarily labeled learning,
the empirical Rademacher complexity of $\calhatH'$ is given by:
\begin{align}
\Rdm_{S'}(\calhatH') = O \left(
    \frac{
    \log \left(\hata^2 n^2 (k-1) \right) 
    \left({2\hata}\ln(\hata^2n) \right)
    }
    {\sqrt{n}}
    \right),
    % \frac{
    % 4+10\log \left(16e a^2 n^2 (k-1) \right) 
    % \left(N+ {2a}\ln(16a^2n) \right)
    % }
    % {\sqrt{n}}.
%\quad \mathrm{or} \quad 
\end{align}
where $\hata=2aC_1C_2$ and we assume $\|w'\| \leq C_1$ in the reduced MIL problem.
\end{coro}
We use the same argument as in Corollary~\ref{coro:mc_bound}.
Using Corollary~\ref{coro:risk_bound_reduced} and Lemma~\ref{lemm:comp_gen}, we obtain the generalization bound for the complementarily labeled learning.
%We used the fact that the number of instances in a bag is $(k-1)$ for all $x_i'$,
%and thus $|\bigcup_{i=1}^n x'_i| \leq \sum_{i=1}^n|x'_i| = n(k-1)$. Note that 
%we have $\|\tau_{(\bx, y')} - \tau_{(\bx, y)}\| \leq 1$ via kernel normalization without loss of generality.
%Moreover, we used the fact that $\|\tau_{(\bx, y')} - \tau_{(\bx, y)}\| \leq \sqrt{2}$.
%\paragraph{ERM algorithm}
\par
The learning algorithm is derived by the following result:
\begin{coro}
The reduced ERM of the MIL from complementarily labeled learning
is a DC programming problem. If the sample contains only complementarily labeled data, the learning problem is a convex programming problem.
\end{coro}
Generally, $y' \in \{-1, 1\}$ in complementarily labeled learning.
By the proof of Theorem~\ref{theo:cll_reduction} and 
Proposition~\ref{prop:DC}, if $\Gamma(c)$ is a nondecreasing function that is homogeneous of degree 1 for $c \in [-1,1]$, such as the hinge loss, 
we can solve the problem by the DC algorithm shown in Algorithm~\ref{alg:DCA}.
Note that, if the sample contains only complementarily labeled data (i.e., $y'_i=-1$ for all $i\in [n]$), it becomes a convex programming problem.

\subsubsection{Multi-label learning problem}
\label{sec:multi-label}
\paragraph{Problem setting}
Let $\calX \subseteq \Real^d$ be an instance space, $\calY = \{-1, 1\}^k$ be an output space,
and $\calD$ be an unknown distribution over $\calX \times \calY$.
Unlike the standard multi-class learning setting introduced in Section~\ref{subsec:mcl}, each instance may have multiple labels 
(e.g., in text-categorization tasks, some texts have multiple topics such as IT and business).
%The positive labels in $\{-1,+1\}$ indicates the classes of an instance.
Here, $y^j$ denotes the $j$-th element of $y$.
%For some $(x, y)$, $y^j=1$ means that $x$ 
The learner receives a labeled sample $S=((x_1, y_1), \ldots, (x_n,y_n)) \in (\calX \times \calY)^n$ which is drawn i.i.d. according to the distribution $\calD$.
The learner predicts whether $x$ belongs to class $j \in [k]$ or not 
using the hypothesis
$h \in \calH = \{(x,j) \mapsto \sign (\langle w_j, x \rangle) \mid \forall j\in[k], w_j \in \Real^d \}$. %where $W=\{w_1, \ldots, w_k\}$.
Let $\ell: (x, y, h) \mapsto \frac{1}{k} \sum_{j=1}^k\Gamma(-y^j \langle w_j, x \rangle)$ where $\Gamma: \Real \rightarrow [0,1]$ is a convex, nondecreasing and $b$-Lipschitz function~\footnote{Note that we use the negative score $-y^j \langle w_j, x \rangle$ to employ a nondecreasing $\Gamma$.}. 
The generalization and empirical risk of $h$ are defined as:
\begin{align}
  R_\calD(h)\! =\!\Expo_{(x,y)\sim \calD} \left[\ell(x,\! y,\!h)\right],   \emR_S(h)\!=\!\frac{1}{n}\sum_{i=1}^n  \ell(x_i, y_i,\!h).
\end{align}
\paragraph{Reduction to \GMIL}
\begin{theo}
\label{theo:mll_reduction}
Multi-label learning is \GMIL-reducible.
\end{theo}
% The proof idea is that we set
% $\alpha(x,y)=(x', y')$ such that $x'=\{(y^1 x,1), \ldots, (y^k x,k)\}$ 
% and $y_i=-1$ for all $i \in [n]$.
% $\beta$ can be constructed by the same way as multi-class case.
\begin{proof}
In the \GMIL-reduction scheme, suppose that $p=1$; $f_1(c) = -c$ for $c \in \Real$;
$f_2$ is $\Gamma$; $\alpha(x,y)=(x'_{(x,y)}, y')$ where $x'_{(x,y)}=\{(-y^1 x,1), \ldots, (-y^k x,k)\}$; $y'=-1$; $\calG=\{g: (z,j) \mapsto \langle w'_j, z \rangle \mid 
w'_j \in \Real^d, \forall j\in [k], \|W'\| \leq C_1 \}$ where $W' = (w'_1, \ldots, w'_k)$;  $\beta(h'): (x,j) \mapsto \sign(\langle w'_j, x \rangle)$.
For any $(x,y)$ and $h\in \calH$, we have that
\begin{align}
\ell'(x', y', h')
=&
f_1\left(
y' \Psi_p\left( 
\left\{
f_2 \left(
g(z)\right) \mid z \in x'_{(x,y)}
\right\}
\right)
\right)
\\
=&
 \frac{1}{|x'_{(x,y)}|}\sum_{(-y^j x,j) \in x'_{(x,y)}} 
\Gamma \left(\langle w_j, -y^j x \rangle 
\right)\\
=&
\frac{1}{k}\sum_{j=1}^k \Gamma \left(-y^j \langle w_j, x \rangle \right)
=
\ell(x, y, h). 
\end{align}
\end{proof}
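As a small illustration of this reduction, take $k=2$ and $y=(+1,-1)$; these values are chosen here only as an example, and $\Psi_1$ is taken to be the average over the bag, as in the proof above. The bag is then
\begin{align}
x'_{(x,y)} = \left\{ (-x, 1),\; (x, 2) \right\},
\end{align}
and the reduced loss evaluates to
\begin{align}
\ell'(x', y', h') = \frac{1}{2}\left( \Gamma\left(-\langle w'_1, x \rangle\right) + \Gamma\left(\langle w'_2, x \rangle\right) \right).
\end{align}
Since $\Gamma$ is nondecreasing, this is small when $\langle w'_1, x \rangle$ is large and $\langle w'_2, x \rangle$ is small (very negative), which matches the labels $y^1=+1$ and $y^2=-1$.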
%\paragraph{ERM algorithm}
\par
The empirical Rademacher complexity is bounded as:
\begin{coro}
We assume that $\|x_i\| \leq C_2$ for any $i \in [n]$.
In the reduced MIL problem,
the empirical Rademacher complexity of $\calhatH'$ is given as follows:
\begin{align}
\Rdm_{S'}(\calhatH') =O\left(
    \frac{
    \log \left(2n^2 k \right) 
    \left({bC_1C_2}\ln(n) \right)
    }
    {\sqrt{n}}
    \right),
% \Rdm_{S'}(\calhatH') =
%     \frac{
%     4+10\log \left(16e a^2 n^2 k \right) 
%     \left(N+ {2a}\ln(16a^2n) \right)
%     }
%     {\sqrt{n}}.
%\quad \mathrm{or} \quad 
\end{align}
where $\|w'\| \leq C_1$ in the reduced MIL.
\end{coro}
We used the fact that the size of each bag is $k$. 
%and $\|z\| \leq 2C_2$ for any $z \in x_i', \forall i \in [n]$ in the reduced MIL problem.
Using Corollary~\ref{coro:risk_bound_reduced}, we obtain the generalization
risk bound for the multi-label learning.
\par
The learning algorithm is obtained by the following result.
\begin{coro}
The reduced ERM of the MIL from multi-label learning is a convex programming problem.
\end{coro}
The proof of Theorem~\ref{theo:mll_reduction} shows that
$f_1$ is nonincreasing and convex, and $y_i'=-1$ for all $i \in [n]$.
Therefore, by Proposition~\ref{prop:poly}, if $\Gamma$ is nondecreasing and convex, the reduced problem is 
a convex programming problem and can be solved in polynomial time.
%$y_i=-1$ for any $i$ and thus the ERM can be solved in polynomial time.
%\paragraph{Generalization bound}
%The generalization bound for the multi-label learning is given as:


% \subsubsection{Other MIL-reducible example}
% We demonstrate that \emph{multi-label learning and multi-task (classification) learning
% are also MIL-reducible} and derive the generalization bounds and learning algorithms.
% Remarkably, for multi-label learning, we derive a state-of-the-art generalization risk bound.
% We provide the details in~Sec.\ref{sec:multi-label} and \ref{sec:multi-task}
% owing to space limitations.


\subsection{Application to the new problems}

%\subsubsection{Multi-class learning with the information of the target classes}

\subsubsection{Multi-label learning with perfectionistic loss}
%\paragraph{Problem setting}
%Below we propose a new problem formulation of the multi-label learning.
%As mentioned in the previous subsection, 
{\bf Problem setting:}
% Let $\calX \subseteq \Real^d$ be an instance space and $\calY \in \{-1, 1\}^k$ be an output space,
% and $\calD$ be an unknown distribution over $\calX$.
% Unlike the standard multi-class learning setting introduced in Sec.\ref{subsec:mcl}, each instance may have multiple labels 
% (e.g., in text-categorization tasks, some texts have multiple topics such as IT and business).
% %The positive labels in $\{-1,+1\}$ indicates the classes of an instance.
% $y^j$ denotes the $j$-th element of $y_i$.
% %For some $(x, y)$, $y^j=1$ means that $x$ 
% The learner receives a labeled sample $S=(x_1, y_1), \ldots, (x_n,y_n) \in \calX \times \calY$ which is drawn i.i.d. according to the distribution $\calD$.
% The learner predicts that whether $x$ belongs to class $j \in [k]$ or not 
% using the hypothesis
% $h \in \calH = \{(x,j) \mapsto \sign (\langle w_j, x \rangle) \mid \forall w_j \in \Real^d \}$. %where $W=\{w_1, \ldots, w_k\}$.
In standard multi-label learning (see Sec.~\ref{sec:multi-label}),
the loss is the prediction error averaged over the classes.
In contrast, here we consider a \emph{perfectionistic} error for the multi-label learning problem. 
More formally, we consider the following loss in multi-label 
learning:
\begin{align}
    \ell: (x, y, h) \mapsto \max_{j \in [k]}\Gamma(-y^j \langle w_j, x \rangle),
\end{align}
 where $\Gamma: \Real \rightarrow [0,1]$ is a convex, nondecreasing and $b$-Lipschitz function.
This loss means that the learner incurs a loss
unless it predicts all labels correctly.
The generalization and empirical risks of $h$ are given as $R_\calD(h) = \mathbb{E}_{(x,y)\sim \calD} \left[\ell(x, y, h)\right]$, $\emR_S(h) =  \frac{1}{n}\sum_{i=1}^n  \ell(x_i, y_i, h)$, respectively.
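For intuition, consider hypothetical per-class values (chosen here only for illustration) with $k=3$ and $\Gamma(-y^j \langle w_j, x \rangle) = 0.1,\, 0.2,\, 0.9$ for $j=1,2,3$. The averaged loss of standard multi-label learning is $(0.1+0.2+0.9)/3 = 0.4$, whereas the perfectionistic loss is $\max\{0.1, 0.2, 0.9\} = 0.9$. More generally, since $\Gamma$ is nonnegative,
\begin{align}
\frac{1}{k}\sum_{j=1}^k \Gamma(-y^j \langle w_j, x \rangle)
\;\leq\;
\max_{j \in [k]} \Gamma(-y^j \langle w_j, x \rangle)
\;\leq\;
\sum_{j=1}^k \Gamma(-y^j \langle w_j, x \rangle),
\end{align}
so the perfectionistic loss is never smaller than the averaged loss and is dominated by the single worst class.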
% \begin{align}
%   R_\calD(h) = \Expo_{(x,y)\sim \calD} \left[\ell(x, y, h)\right], \emR_S(h) =  \frac{1}{n}\sum_{i=1}^n  \ell(x_i, y_i, h).
% \end{align}
%The other settings are same in the previous section.
%\paragraph{Reduction to \GMIL}
\par
Using the MIL-reduction scheme, we obtain the following:
\begin{theo}
\label{theo:mllp_reduction}
Multi-label learning with perfectionistic loss is \GMIL-reducible.
\end{theo}
This can be shown by the same argument as for multi-label learning, except that $p=\infty$ (see Sec.~\ref{sec:proof_mllp_reduction}).
% \begin{proof}
% On \GMIL-reduction framework, suppose that $p=\infty$; $f_1: f_1(a) = -a$ for $a \in \Real$;
% $f_2$ is $\Gamma$; $\alpha(x,y)=(x'_{(x,y)}, y')$ where $x'_{(x,y)}=\{(y^1 x,1), \ldots, (y^k x,k)\}$; $y'=-1$; $\calG=\{g_W: (z,j) \mapsto \langle w_j, z \rangle \mid \|W\|_\Hilbert \leq 1 \}$ where $W = (w_1, \ldots, w_k)$; $\beta(h'_W) = h_W$.
% For any $(x,y)$ and $h_W\in \calH$, we have that
% \begin{align}
% \ell'(x', y', h'_W)
% =&
% f_1\left(
% y' \Psi_p\left( 
% \left\{
% f_2 \left(
% g(z)\right) \mid z \in x'_{(x,y)}
% \right\}
% \right)
% \right)
% \\
% =&
%  \max_{(y^j x,j) \in x'_{(x,y)}} 
% \Gamma \left(\langle w_j, y^j x \rangle 
% \right)\\
% =&
% \ell(x, y, h_W) 
% \end{align}
% \end{proof}
%\paragraph{Generalization bound}
\par
The empirical Rademacher complexity is bounded as:
%The generalization bound for the multi-label learning is given as:
\begin{coro}
We assume that $\|x_i\| \leq C_2$ for any $i \in [n]$.
In the reduced MIL problem,
the empirical Rademacher complexity of $\calhatH'$ is given as follows:
% Let $S'= ((x'_1, y'_1), \ldots, (x'_n, y'_n))$. Let $\calH' = \Psi_\infty(\{f_2(g(z))\mid z \in x'\})$ and let $\calH_\beta = \{\beta(h') \mid h' \in \calH'\}$. Let $\calhatH' = \{f_1(y'\Psi_\infty(\{f_2(g(z))\mid z \in x'\})) \mid g \in \calG \}$. 
% The following bound holds with a high probability of at least $1-\delta$ for all $h_W \in \calH_\beta$:
% \begin{align}
%     R_{\calD}(h_W) \leq \emR_{S'}(h'_W) + 2\Rdm_{S'}(\calhatH')+ 3\sqrt{\frac{\log \frac{2}{\delta}}{2n}}.
% \end{align}
% where
\begin{align}
\Rdm_{S'}(\calhatH') = O\left(
    \frac{
    \log \left(2n^2 k \right) 
    \left({bC_1C_2}\ln(n) \right)
    }{\sqrt{n}}
    \right),
    % \frac{
    % 4+10\log \left(16e a^2 n^2 k \right) 
    % \left(N+ {2a}\ln(16a^2n) \right)
    % }
    % {\sqrt{n}}.
%\quad \mathrm{or} \quad 
\end{align}
where we assume $\|w'\| \leq C_1$.
\end{coro}
Interestingly, we obtain the same generalization risk bound as for standard
multi-label learning. %(see Sec.~\ref{sec:multi-label}).
%By using Corollary~\ref{coro:risk_bound_reduced}, we can obtain the generalization
%risk bound for the multi-label learning with perfectionistic loss.
%\paragraph{ERM algorithm}
\par
The learning algorithm is derived from the following result.
\begin{coro}
The reduced ERM of the MIL from multi-label learning with perfectionistic loss 
is a convex programming problem.
\end{coro}
This is easily obtained by observing the reduction process shown in Sec.~\ref{sec:proof_mllp_reduction} and using Proposition~\ref{prop:poly}.
\par
A naive approach to multi-label learning with perfectionistic loss
is to reduce it to multi-class learning.
That is, we regard every combination of labels as a separate class
and solve a $2^k$-class learning problem, which incurs a high computational cost.
However, by the above corollary, multi-label learning with perfectionistic loss
can be solved efficiently.
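For a rough sense of scale (an illustrative count only, assuming, as in the reduction of Sec.~\ref{sec:proof_mllp_reduction}, that each reduced bag contains one instance per class), take $k=20$ labels:
\begin{align}
2^{k} = 2^{20} = 1{,}048{,}576 \ \text{classes in the naive reduction,}
\qquad
k = 20 \ \text{instances per bag in the reduced MIL problem.}
\end{align}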
% \subsection{Multi-label learning with missing labels (Learning with multiple complementary labels)}
% Multi-label learning with missing labels is originally proposed by~\cite{wu2014multi}.
% Different from a standard multi-label learning we introduced in Section~\ref{subsec:multi-label}, the $y_i$ in the training sample does not fully contain the label information. 
% Formally, the output space $\calY$ in training sample $S = (x_1, y_1), \ldots, (x_n, y_n)$ is represented as $\{-1,0,+1\}^k$ where $y^j_i=0$ means that the instance $x_i$ has no information on the class $j$.
% Note that if there is no positive labels ($+1$) in the training sample,
% the problem is equivalent to learning with multiple complementary labeles~\cite{}.
%The final goal 

\subsubsection{Top-1 ranking learning}
\label{subsec:trl}
Learning to rank is a fundamental problem with many applications, such as recommendation systems.
%We consider the following scenario.
We consider the following natural scenario in a recommendation problem:
a learner has a set that contains several items, and it wishes to
recommend an item from the set to a target user.
% Assume that we have a sample, that is, some sets of items and the item selected by the user
% in each set.
% The goal is to find a ranking function that scores the highest value
% for the item that the target user chooses.
%\paragraph{Problem setting}
\par
{\bf Problem setting:} 
Let $\calX \subseteq \Real^d$ be an instance space,
and $s: \calX \rightarrow \Real$ be a target scoring function.
An item set $A$ is a finite set of instances selected from $\calX$. 
The learner receives a sequence of item sets paired with the chosen items,
$S=((A_1, x^*_1), \ldots, (A_n, x^*_n))$, where each $x^*_i \in A_i$ is the highest-valued item
determined by the target function $s$. 
Let $k$ denote the average size of the item sets in $S$, that is, $k=\frac{1}{n}\sum_{i=1}^n |A_i|$. 
Each item set is drawn i.i.d. according to an unknown
distribution $\calD$ over $2^{\calX}$.
Assume that the learner predicts the item from the item set using the hypothesis
$h \in \calH= \{A \mapsto \arg\max_{x \in A}\langle w, x \rangle \mid w \in \Real^d\}$.\footnote{We consider an $\arg\max$ with a fixed tie-breaking rule.}
%Let $\calH = \{h \mid w \in \Real^d \}$ be a hypothesis class.
%Let $\ell: (\bx^*, h_\bw(A)) \mapsto I(h(A) \neq \bx^*)$ be a zero-one loss function.
Let $\ell (A, x^*, h)$ be a convex upper bound on the zero-one loss function
$I(h(A) \neq x^*)$. Equivalently, we consider the zero-one loss 
$I(\langle w, x^* \rangle - \max_{x \in A\backslash x^*} \langle w, x \rangle \leq 0)$ and its convex upper bound $\ell: (A,x^*, h) \mapsto \Gamma(\langle w, x^* \rangle - \max_{x \in A\backslash x^*} \langle w, x \rangle)$, where $\Gamma: \Real \rightarrow [0,1]$ is a convex, nonincreasing, and $a$-Lipschitz function.
The goal of the learner is to find $h \in \calH$ with a small misranking risk
w.r.t. the target $s$. Thus, the generalization and empirical risks are formulated as follows:
%\footnotesize
\begin{align}
  R_\calD(h) \!=\!\Expo_{A \sim \calD}\!
  \left[
    \ell \left(A, x^*\!, h \right)
    \right],
  \emR_{S}(h)\!=\! \frac{1}{n} \sum_{i=1}^n \ell \left(A_i, x_i^*\!, h \right),
\end{align}
%\normalsize
where $x^* = \arg\max_{x \in A}s(x)$.
%\paragraph{Reduction to \GMIL}
\par
We obtain the following by using the MIL-reduction scheme:
\begin{theo}
\label{theo:trl_reduction}
 Top-1 ranking learning is \GMIL-reducible.
\end{theo}
The reducible condition is satisfied when we set
$\alpha(A, x^*) = (x', y')$, where $x'= \{x - x^* \mid x \in A\backslash x^*  \}$
and $y_i'=-1$ for all $i \in [n]$.
The details of the reduction process are in Sec.~\ref{sec:proof_trl_reduction}.
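\par
The identity behind this choice can be sketched as follows, under the assumptions (made here for illustration; the precise instantiation is in Sec.~\ref{sec:proof_trl_reduction}) that $p=\infty$, that $\Psi_\infty$ returns the maximum over the bag, that $f_1 = \Gamma$, that $f_2$ is the identity, and that $g(z) = \langle w, z \rangle$:
\begin{align}
\ell(A, x^*, h)
&= \Gamma\Bigl(\langle w, x^* \rangle - \max_{x \in A\backslash x^*} \langle w, x \rangle\Bigr) \\
&= \Gamma\Bigl(-\max_{x \in A\backslash x^*} \langle w, x - x^* \rangle\Bigr) \\
&= \Gamma\Bigl(y' \max_{z \in x'} \langle w, z \rangle\Bigr), \qquad y' = -1.
\end{align}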
% \begin{proof}
% On \GMIL-reduction framework, suppose that $p=\infty$; $f_1$ is 
% $\Gamma$; 
% $\alpha(A, x^*) = (x', y')$ where $x'= \{x - x^* \mid x \in A\backslash x^*  \}$; 
% $y'=-1$; 
% $\calG = \{g_w: z \mapsto \langle w, z \rangle \mid \|w\| \leq 1 \}$;
% $\beta(h'_w) =h_w$.
% For any $(A, x^*)$ and $h_w \in \calH$, the following holds:
% \begin{align}
% \ell'(x', y', h'_W)
% =&
% f_1\left(
% y' \Psi_p\left( 
% \{
% f_2 \left(
% g(z) \mid z \in x'_{(x,y)}
% \}
% \right)
% \right)
% \right)
% \\
% =&
% \Gamma\left(
% - \Psi_\infty\left( 
% \{
% g(z) \mid z \in x'_{(x,y)}
% \}
% \right)
% \right)
% \\
% =& \Gamma\left(
% -\left(\max_{j \in A\backslash x^*}
% \left(\langle w, x \rangle - \langle w, x^* \rangle\right) \right)
% \right)\\
% =& 
% \ell(A, x^*, h_w) 
% \end{align}
% \end{proof}
%The generalization bound for the Top-1 rank learning is given as:
%\paragraph{Generalization bound}
\par
The empirical Rademacher complexity bound is as follows:
\begin{coro}
% Let $S'= ((x'_1, y'_1), \ldots, (x'_n, y'_n))$. 
% Let $\calH' = \Psi_\infty(\{f_2(g(z))\mid z \in x'\})$ and let $\calH_\beta = \{\beta(h') \mid h' \in \calH'\}$. Let $\calhatH' = \{f_1(y'\Psi_\infty(\{f_2(g(z))\mid z \in x'\})) \mid g \in \calG \}$. 
% The following bound holds with a high probability of at least $1-\delta$ for all $h_W \in \calH_\beta$:
% \begin{align}
%     R_{\calD}(h_W) \leq \emR_{S'}(h'_W) + 2\Rdm_{S'}(\calhatH')+ 3\sqrt{\frac{\log \frac{2}{\delta}}{2n}}.
% \end{align}
% where
We assume that $\|x\| \leq C_2$ for any $x \in A_i$ and any $i \in [n]$.
In the reduced MIL problem,
the empirical Rademacher complexity of $\calhatH'$ is given as follows:
\begin{align}
\Rdm_{S'}(\calhatH') = O\left(
    \frac{
    \log \left(\hata^2 n^2 (k-1) \right) 
    \left({\hata}\ln(2\hata^2n) \right)
    }
    {\sqrt{n}}
    \right),
    % \frac{
    % 4+10\log \left(16e a^2 n^2 (k-1) \right) 
    % \left(N+ {2a}\ln(16a^2n) \right)
    % }
    % {\sqrt{n}}.
%\quad \mathrm{or} \quad 
\end{align}
where $\hata=2aC_1C_2$ and we assume $\|w'\| \leq C_1$.
\end{coro}
This bound can be derived by setting
$r_{S'} = k-1$ and using the fact that $\|z\| \leq 2C_2$ for any $z \in x_i'$ and any $i \in [n]$ in the reduced MIL problem.
By using Corollary~\ref{coro:risk_bound_reduced}, we can obtain the generalization
risk bound for top-1 ranking learning.
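\par
Both ingredients follow from elementary facts, sketched here under the assumption that $r_{S'}$ denotes the average bag size in $S'$. Each bag satisfies $|x'_i| = |A_i| - 1$, and every $z = x - x^*_i \in x'_i$ is bounded via the triangle inequality:
\begin{align}
\frac{1}{n}\sum_{i=1}^n |x'_i| = \frac{1}{n}\sum_{i=1}^n \left(|A_i| - 1\right) = k - 1,
\qquad
\|z\| \leq \|x\| + \|x^*_i\| \leq 2C_2.
\end{align}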
%\paragraph{ERM algorithm}
\par
The learning algorithm is derived from the following result:
\begin{coro}
The reduced ERM of MIL from top-1 ranking learning
is a convex programming problem.
\end{coro}
The corollary can be easily derived from the reduction process
detailed in Sec.~\ref{sec:proof_trl_reduction}.
\par
{\bf Extension:}
We also consider \emph{top-1 ranking learning with negative feedback},
which is an extension of top-1 ranking learning.
The details are given in Sec.~\ref{sec:trl_neg}.
Remarkably, the ERM problem of the reduced MIL is a DC programming problem.


%\input{6_kernelize}
\section{Kernelized extension}
Although we have considered a set of linear functions as $\calG$,
in practice a nonlinear kernel is required for various learning tasks.
A straightforward method is to employ 
a kernel-approximation technique~\citep[see, e.g., Sec.~6.6 in][]{mohri2018foundations},
which constructs feature vectors $\Phi(x) \in \Real^D$ with the theoretical guarantee that $\langle \Phi(x_1), \Phi(x_2)\rangle \approx K(x_1, x_2)$ for a user-determined dimension $D$.
However, only a limited class of kernels can be used with such approximation techniques.
Therefore, we show a kernelized version of the reduction.
\subsection{Settings}
%\par
%{\bf Settings:}
We assume that an original problem is defined by $\calH, \ell, \calX, \calY$, and
$\Phi: \calX \to \Hilbert$, where $\Hilbert$ is a reproducing kernel Hilbert space 
associated with $K(x_1,x_2) = \langle \Phi(x_1), \Phi(x_2) \rangle$.
Setting computability aside, we can virtually regard the sample as 
$S = ((\Phi(x_1), y_1), \ldots, (\Phi(x_n), y_n))$.
The ERM-reducible condition is that there exist $(x', y')=\alpha(\Phi(x), y)$, 
$h=\beta(h')$, and $\ell'$ that satisfy  
$\ell(\Phi(x),y, h) = \ell'(x', y', h')$
for any $(x,y) \in \calX \times \calY$.
% In the reduced MIL problem, 
% for the sample $S'$, we consider
% each bag in $S'$ as $x'_i = \{z_1, \ldots (z_|x'_i|)\}$.
% In general, the domain of $\tau(z) \in x'_i$ become another Hilbert space $\Hilbert'$, that is, $\Hilbert \neq \Hilbert'$.

Let $S' = ((x'_1, y'_1), \ldots, (x'_n, y'_n))$ and let
$\calG = \{g: z \mapsto \langle w', z \rangle \mid w' \in \Hilbert'\}$.
%in the reduced MIL problem.
We assume that $(\calH, \ell)$ is MIL-reducible to $(\calH', \ell')$.
The ERM of the reduced MIL is formulated as:
%\scriptsize
\begin{align}
\label{align:erm_optprob_ker}
    \min_{w' \in \Hilbert'} \lambda \|w'\|_{\Hilbert'} 
    + \calL_{w'},
    % + \sum_{i=1}^n f_1
    % \left(
    % y'_i
    % \Psi_p \left( \left\{
    % f_2\left(
    % \langle w', \tau(z) \rangle \mid \tau(z) \in x'_{i}
    % \right)
    % \right\} \right)
    % \right).
\end{align}
where $\calL_{w'}=
    \sum_{i=1}^n f_1
    \left(
    y'_i
    \Psi_p \left( \left\{
    f_2\left(
    \langle w', z \rangle
    \right) \mid z \in x'_{i}
    \right\} \right)
    \right)$.
%\normalsize
\subsection{Computability}
%\par
%{\bf Computability:}
We show that the representer theorem holds for the 
optimization problem~\eqref{align:erm_optprob_ker}.
\begin{theo}[Representer theorem]
\label{theo:representer}
An optimal solution of the ERM problem~(\ref{align:erm_optprob_ker})
has the form $\tilde{w}' = \sum_{z \in P_{S'}} \mu_z z$,
where $P_{S'} = \bigcup_{i=1}^n x'_i$.
\end{theo}
Thus, the ERM problem~\eqref{align:erm_optprob_ker}
is equivalently formulated as:
%\scriptsize
\begin{align}
\label{align:erm_optprob_ker2}
    \min_{\bmu \in \Real^{|P_{S'}|}} 
    \lambda 
    \sum_{z, \hat{z} \in P_{S'}}\mu_{z}\mu_{\hat{z}}\langle z,\hat{z} \rangle
    + \calL_{\bmu},
    % +& \sum_{i=1}^n f_1
    % \left(y_i
    % \Psi_p \left( 
    % \left\{f_2 \left(\sum_{v \in P_{S'}}\mu_{v} K'\left(v, z\right) \right) \mid z \in x'_i \right\}
    % \right)
    % \right).   
\end{align}
where $\calL_{\bmu}\!=\!
    \sum_{i=1}^n f_1
    (y'_i
    \Psi_p ( 
    \{f_2 (\sum_{z \in P_{S'}}\mu_{z} \langle z, \hat{z} \rangle ) \mid \hat{z} \in x'_i \}
    )
    )$.
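\par
To see where the terms of this formulation come from (a short derivation using only the bilinearity of the inner product), substitute $\tilde{w}' = \sum_{z \in P_{S'}} \mu_z z$ from Theorem~\ref{theo:representer}:
\begin{align}
\langle \tilde{w}', \hat{z} \rangle &= \sum_{z \in P_{S'}} \mu_z \langle z, \hat{z} \rangle, \\
\|\tilde{w}'\|_{\Hilbert'}^2 &= \sum_{z, \hat{z} \in P_{S'}} \mu_z \mu_{\hat{z}} \langle z, \hat{z} \rangle.
\end{align}
Thus, the problem depends on the data only through the inner products $\langle z, \hat{z} \rangle$ between bag elements.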
%\normalsize
\par
Therefore, if $\langle z_1,z_2 \rangle$ is polynomial-time computable for any $z_1, z_2 \in x'$ using 
the original kernel function $K$ as an oracle,
the ERM of the MIL can be solved as in the linear case, according to the conditions in Propositions~\ref{prop:poly} and~\ref{prop:DC}
(the DC algorithm for the kernelized version is given in Sec.~\ref{sec:DC_algorithm_ker}).
For all MIL-reducible problems introduced in the paper,
$\langle z_1,z_2 \rangle$ is polynomial-time computable using 
$K$ (see details in Sec.~\ref{sec:example_ker}).
Moreover, we can construct $\beta$ in polynomial time.
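\par
As an illustration for top-1 ranking learning (a sketch assuming that the kernelized bag elements take the form $z = \Phi(x) - \Phi(x^*)$, mirroring the linear reduction; the precise constructions are in Sec.~\ref{sec:example_ker}), the inner product of $z_1 = \Phi(x_1) - \Phi(x_1^*)$ and $z_2 = \Phi(x_2) - \Phi(x_2^*)$ expands into four evaluations of $K$:
\begin{align}
\langle z_1, z_2 \rangle
= K(x_1, x_2) - K(x_1, x_2^*) - K(x_1^*, x_2) + K(x_1^*, x_2^*).
\end{align}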

%\input{7_discussion}
\section{Discussion}
\subsection{Related work}
\label{sec:related}
{\bf Other reduction techniques:}
Several machine-learning reduction schemes exist
\citep[see, e.g.,][]{beygelzimer2015learning},
including general reduction schemes~\citep{pitt1990prediction,beygelzimer2005error}.
A major difference between the proposed scheme and existing approaches is that 
we focus on the reduction of ERM.
There are also various applications of machine-learning reductions, such as
reductions from multi-class learning to binary classification~\citep{james1998error, ramaswamy2014consistency}
% from cost-sensitive multi-class learning to binary classification~\citep{beygelzimer2005weighted, beygelzimer2005error, langford2005sensitive},
and from ranking to binary classification~\citep{balcan2008robust,ailon2010preference,agarwal2014surrogate}.
To the best of our knowledge, the reduction to MIL has not yet been discussed.
%We discovered that MIL can be connected with various learning problems
%via our reduction scheme.
\par
%\paragraph{Multi-Class Learning}
% For the basic result of the generalization performance, 
% the generalization risk can be
% upper bounded by the term that linearly depends
% on the class size $k$~\citep{mohri2018foundations}.
{\bf Multi-Class Learning:}
Recently, \citet{Lei19} achieved a $\log(k)$-dependent generalization
bound.
%which is logarithmically dependent on the class size $k$.
Our generalization bound is competitive with theirs.
Moreover, our derivation is much simpler than
the analysis of~\citet{Lei19} because the reduction allows us to
apply the existing MIL bound of~\citet{Sabato:2012:MLA}.
\par
{\bf Complementarily-labeled learning:}
\citet{ishida2017learning} provided the generalization risk bound in the case in which the training sample
contains only complementarily labeled instances (i.e., $\theta=0$). 
Our generalization bound is incomparable to 
theirs (see details in Sec.~\ref{sec:ishida-comparison}).
% However, we can say that if we achieve a small empirical risk
% close to zero,
% our provided risk bound
% is $k$ times tighter than the existing bound.
%By the definition of their empirical risk $\widehat{R}(h)$,
\citet{ishida2017learning} selected nonconvex loss functions
and optimized the empirical risks using a gradient-based algorithm in practice.
However, there is no guarantee of the optimality of the resulting solution.
We show that the learning problem can be solved by a DC algorithm,
which guarantees a local optimum.
Moreover, in the special case in which the sample contains only complementarily labeled data, the learning problem becomes a convex programming problem and we can obtain a global optimum. 
To the best of our knowledge, the provided learning algorithm is the first 
polynomial-time algorithm for this special case.
%there is no polynomial-time algorithm
%even in the special case in which the training sample contains only complementarily labeled data.
% Ishida et al. also provided another LCL framework~\citep{ishida19a} with
% arbitrary loss functions and arbitrary prediction model.
% They provided another gradient-based optimization algorithm, which
% performed well in practice.
% However, the generalization performance was not discussed.
% They mentioned that their algorithm suffers from overfitting
% and demonstrated a practical approach to avoid the overfitting.
\par
{\bf Multi-label learning:}
%For multi-label learning,
Various approaches and generalization analyses have been provided~\citep{pmlr-v32-yu14,NIPS2015_35051070,xu2016local,xu2016robust}.
However, to the best of our knowledge, this paper is the first to propose a
$\log(k)$-dependent generalization bound for the linear (or nonlinear kernel)
hypothesis class, where $k$ is the number of classes.
\par
{\bf Multi-task learning:}
A similar generalization bound was reported by~\citet{pontil2013excess}.
Their results suggest the advantage of regularizing the weights 
$w_1,\ldots,w_T$ over $T$ tasks.
However, our result is derived by an argument entirely different
from that of~\citet{pontil2013excess}, and the derivation is considerably simpler. \par
% If the learner consider learning $w_1, \ldots, w_T$ separately,
% the goal of the learner is to find a hypothesis $h_w$ from
% $\calH = \{h_W: (x) \mapsto \langle w, \Phi(x) \rangle \mid \|w\| \leq 1\}$. The difference from multi-task learning is the regularization of $w_t$.
% The total hypothesis set over $T$ tasks is 
% $\tilde{\calH} = \{h_W: (x,t) \mapsto \langle w_t, \Phi(x) \rangle \mid \|w_t\| \leq 1\}$ and $\|W\| \leq T$.
% Thus, the generalization risk of $h \in \tilde{\calH}$ can be upper bounded with a probability of at least $1-\delta$ for $\tilde{\calH}$:
% \begin{align}
% \Expo_{t}[R_{\calD_t}(h_W)] \leq \frac{1}{T}\sum_{t=1}^T \emR_{S_t}(h_W) + \frac{\sqrt{T}}{\sqrt{n}} + 3\sqrt{\frac{\log \frac{2}{\delta}}{2n}}.
% \end{align}
% The above bound is simply derived by
% using the existing bound of hyperplanes~\cite{mohri2018foundations}.
% The empirical Rademacher complexity for the size $nT$ of sample 
% with the regularized hyperplane $\|W\| \leq T$ becomes
% $\frac{T}{\sqrt{nT}}$. 
% Our generalization risk bound just logarithmically depends on
% $T$. 
% The difference implies that the regularization among tasks has
% an advantage in multi-task learning.
\par
{\bf Top-1 ranking learning:}
% Many types of problem setting exist for ranking learning tasks.
% Various measures for ranking at the top have been provided~\citep{rudin:colt06,agarwal11-infinite-push,NIPS2014_5222,menon2016bipartite,NIPS2012_4635}.
The top-1 ranking measure was originally discussed by~\citet{hidasi2018recurrent}.
However, their basic problem setting is different from ours.
%\citet{hidasi2018recurrent} 
They assumed that the recommender has i.i.d. positive and negative items as the sample.
%The loss is defined by each positive item with the mini-batch sampled negative set of items.
Moreover, they did not propose a general form of the problem or a theoretical analysis.
%We provided a general form of the problem setting and the theoretical aspects.
\par
{\bf MIL:}
%\subsection{Multiple-Instance Learning}
% Since \citet{Dietterich:1997} first proposed MIL,
% many researchers have introduced various theories and applications 
% of
% MIL~\citep{Gartner02multi-instancekernels,NIPS2002misvm,Sabato:2012:MLA,pmlr-v28-zhang13a,Doran:2014,CARBONNEAU2018329}.
MIL, a form of weakly supervised learning, was originally proposed by~\citet{Dietterich:1997},
and many real-world applications have been
proposed~\citep{Gartner02multi-instancekernels,NIPS2002misvm,pmlr-v28-zhang13a,Doran:2014,CARBONNEAU2018329}.
The generalization bound and learning algorithm have been
analyzed from the theoretical perspective~\citep{Sabato:2012:MLA,doran:thesis,suehiro2020multiple}.
There have been several studies on the relationship between MIL and other learning tasks.
\citet{zhou2007relation} showed that classical MIL can be considered as a specific form of semi-supervised learning.
\citet{Zhang2020RobustML} utilized MIL for extracting causal instances.
However, these works do not imply any type of reduction in the sense of computation theory: if problem A is reduced to B, then we should immediately obtain an algorithm for A from any algorithm for B combined with the reduction (input-output transformations) with a certain performance guarantee.
\citet{suehiro2020multiple} found that a local-feature-based time-series classification
problem can be reduced to a MIL problem with a generalization risk bound. However, the reduced problem is too specific.
Our results are the first to show that various learning problems can be reduced to MIL.

\subsection{Practical implications}
An important contribution of this paper, in both theoretical and practical terms, is a simple and general reduction scheme among various learning problems with theoretical guarantees on generalization bounds. This means that when faced with a new learning problem A, we can search for an existing ERM problem B to which A is reducible. If the search succeeds, we immediately obtain a learning algorithm for A with a generalization bound. Usually, this process is expected to be much easier than designing a learning algorithm from scratch.
\par
In particular, we demonstrate that various learning problems are reducible to one particular problem, MIL. That is, we only have to improve ERM algorithms for MIL, and the improvements carry over to the original learning problems. Moreover, we show in Section~\ref{subsec:algo} that ERM for MIL can be formulated as a DC programming problem. Therefore, we can employ state-of-the-art DC programming packages, which are rapidly evolving~\citep{le2018dc}. For instance, complementarily labeled learning, which has so far been known only through non-convex optimization formulations~\citep{ishida2017learning,ishida19a}, would benefit from a promising DC programming approach.
\par
\paragraph{Experiments:} We demonstrate that our theoretical results are practically useful in the following experiment on complementarily labeled learning tasks.\footnote{The code is available at \url{https://github.com/suehiro93/MIL_reduction}.}
We use three artificial datasets and four benchmark datasets available in the UCI Machine Learning Repository.\footnote{\url{https://archive.ics.uci.edu/ml/}}
The details of the artificial datasets are described in Section~\ref{sec:art_data}.
For all datasets, all training instances are complementarily labeled uniformly at random.
Hence, the ERM problem derived from our MIL-reduction scheme becomes a convex programming problem (a quadratic programming problem). In contrast, the method of \citet{ishida2017learning} solves a nonconvex optimization problem using Adam~\citep{kingma2014adam}. 
The training sample size was fixed to 1000, and the remaining data were used as the test set.
Although we did not tune the optimization hyperparameters of the method of~\citet{ishida2017learning} (the number of epochs was 200 and the learning rate was 0.01), we stopped its training at the epoch at which the test accuracy was maximal. The loss of~\citet{ishida2017learning} was fixed to the PC loss, which was the best-performing loss~\citep[see][]{ishida2017learning}.
The regularization parameters of both our method and the method of \citet{ishida2017learning} were chosen from $\{0.01, 1, 100\}$.
We evaluated the average accuracy over 10 trials.

Table~\ref{tab:acc} shows that our method achieved higher classification accuracy than that of~\citet{ishida2017learning} on five of the seven datasets. This result indicates that our MIL-reduction scenario for complementarily labeled learning, which is derived from the proposed MIL-reduction scheme, is useful in practice.
Moreover, our ERM algorithm does not require any hyperparameters for the optimization because the optimization problem is a convex programming problem (or DC programming problem when the training sample contains both labeled and complementarily labeled instances). On the other hand, the learning algorithm provided by \citet{ishida2017learning} solves a nonconvex optimization problem and usually requires several hyperparameters (e.g., learning rate and the number of epochs) of the nonconvex-optimization solver. 
\begin{table}[t]
\centering
\caption{Average test accuracy over 10 trials.}
\label{tab:acc}
\begin{tabular}{l|cccc}
\hline
Dataset  &Class & Dim.& Ours & Ishida+  \\
\hline \hline
artificial1 & 5 & 50 & \textbf{0.9999}    & 0.9998 \\
artificial2 & 10 & 50 & \textbf{0.808}    & 0.646 \\
artificial3 & 25& 50 & 0.063    & \textbf{0.065} \\
covertype & 7 & 54 & \textbf{0.562}    & 0.549 \\
satimage & 7 & 36 & \textbf{0.804}    & 0.751 \\
waveform & 3& 40& \textbf{0.833}    & 0.832 \\
yeast & 10& 8& 0.348    & \textbf{0.407} \\
\hline
\end{tabular}
\end{table}


\subsection{Conclusion and future work}
We revealed that various learning problems can be reduced to a MIL 
problem by our ERM-based reduction scheme.
%An interesting point is that some supervised learning problem can be reduced to
%MIL (weakly supervised learning).
%The theoretical analyses of the MIL-reducible problems are highly simplified, and
%we can derive the theoretical results in a unified manner.
The results imply that our MIL-reduction gives a simplified and unified scheme
for the analysis of various learning problems.
Moreover, we obtained novel theoretical results for some learning problems.
A practical concern is that the applicable loss functions are limited in the current scheme. For example, loss functions that do not satisfy the conditions of MIL-reducibility (e.g., the square loss) cannot be used.
%As future work, we consider other reduction schemes based on 
%the proposed ERM-reduction.
As future work, we will explore relaxations of the ERM-reducible condition.
An interesting open problem is how the class of MIL-reducible problems is characterized.
Our results imply that MIL is one of the hardest problems in a certain class C of learning problems. In other words, we could say that MIL is a C-complete problem. We would like to investigate how the class C is characterized.

%\input{8_acknowledgment}
%\acks{}
\section*{Acknowledgment}
This work was supported by JSPS KAKENHI (Grant Numbers JP19H04067 and JP20H05967) and JST, ACT-X (Grant Number JPMJAX200G).


% Manual newpage inserted to improve layout of sample file - not
% needed in general before appendices/bibliography.

%\newpage

%\appendix
%\section*{Appendix A.}
%\label{app:theorem}

% Note: in this sample, the section number is hard-coded in. Following
% proper LaTeX conventions, it should properly be coded as a reference:

%In this appendix we prove the following theorem from
%Section~\ref{sec:textree-generalization}:

%\bibliographystyle{elsarticle-num}
\bibliography{suehiro_124}

%\input{9_supplementary}

\end{document}
