% \documentclass{midl} % Include author names
\documentclass{midl} % Anonymized submission

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

% \jmlrvolume{-- Under Review}
\jmlryear{2020}
\jmlrworkshop{Full Paper -- MIDL 2020}
% \editors{Under Review for MIDL 2020}

%\title[Revisiting bounding boxes]{Revisiting bounding boxes: weakly supervised image segmentation with inequality constraints and tightness prior}

\title[Bounding boxes for weakly supervised segmentation]{Bounding boxes for weakly supervised segmentation: \\ Global constraints get close to full supervision}

\usepackage{multirow}
\usepackage{booktabs}  % Midrule, toprule

\newcommand{\ttt}{\boldsymbol{\theta}}
\newcommand{\elb}{\tilde{\psi}_{t}}

\newcommand{\rred}[1]{\textcolor{red}{#1}}
\newcommand{\bblue}[1]{\textcolor{blue}{#1}}
\newcommand{\alert}[1]{\textcolor{orange}{#1}}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
\midlauthor{\Name{Hoel Kervadec} \Email{hoel@kervadec.science} \\
    \addr \'ETS Montréal
\AND
    \Name{Jose Dolz} \\
    \addr \'ETS Montréal
\AND
    \Name{Shanshan Wang} \\
    \addr Shenzhen Institutes of Advanced Technology
\AND
    \Name{Eric Granger} \\
    \addr \'ETS Montréal
\AND
    \Name{Ismail {Ben Ayed}} \\
    \addr \'ETS Montréal
}

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship
% \midlauthor{\Name{Author Name1\midljointauthortext{Contributed equally}\nametag{$^{1,2}$}} \Email{abc@sample.edu}\\
% \addr $^{1}$ Address 1 \\
% \addr $^{2}$ Address 2 \AND
% \Name{Author Name2\midlotherjointauthor\nametag{$^{1}$}} \Email{xyz@sample.edu}\\
% \Name{Author Name3\nametag{$^{2}$}} \Email{alphabeta@example.edu}\\
% \Name{Author Name4\midljointauthortext{Contributed equally}\nametag{$^{3}$}} \Email{uvw@foo.ac.uk}\\
% \addr $^{3}$ Address 3 \AND
% \Name{Author Name5\midlotherjointauthor\nametag{$^{4}$}} \Email{fgh@bar.com}\\
% \addr $^{4}$ Address 4
% }

\begin{document}
    \maketitle

    \begin{abstract}

        We propose a novel weakly supervised segmentation method based on several global constraints derived from box annotations. In particular, we adapt a classical tightness prior to a deep learning setting by imposing a set of constraints on the network outputs. Such a powerful topological prior prevents solutions from excessive shrinking by enforcing any horizontal or vertical line within the bounding box to contain at least one pixel of the foreground region. Furthermore, we integrate our deep tightness prior with a global background emptiness constraint, guiding training with information outside the bounding box. We demonstrate experimentally that such a global constraint is much more powerful than standard cross-entropy for the background class. Our optimization problem is challenging as it takes the form of a large set of inequality constraints on the outputs of deep networks.
        We solve it with a sequence of unconstrained losses based on a recent powerful extension of the log-barrier method, which is well-known in the context of interior-point methods. This accommodates standard stochastic gradient descent (SGD) for training deep networks, while avoiding computationally expensive and unstable Lagrangian dual steps and projections. Extensive experiments over two different public data sets and applications (prostate and brain lesions) demonstrate that the synergy between our global tightness and emptiness priors yields very competitive performance, approaching full supervision and significantly outperforming DeepCut. Furthermore, our approach removes the need for computationally expensive proposal generation. Our code is shared anonymously.



    \end{abstract}

    \begin{keywords}
        CNN, image segmentation, weak supervision, bounding boxes, global constraints, Lagrangian optimization, log-barriers
    \end{keywords}

    \section{Introduction}

        Semantic segmentation is of paramount importance in the understanding and interpretation of medical images, as it plays a crucial role in the diagnosis, treatment and follow-up of many diseases. Even though the problem has been widely studied during the last decades, we have witnessed tremendous progress in recent years with the advent of deep convolutional neural networks (CNNs) \cite{litjens2017survey,ronneberger2015u,rajchl2016deepcut,dolz20173d}. Nevertheless, a main limitation of these models is the need for large annotated datasets, which hampers the performance and limits the scalability of deep CNNs in the medical domain, where pixel-wise annotations are prohibitively time-consuming. Weakly supervised learning has gained popularity as a way to alleviate the need for large amounts of pixel-labeled images. Weak labels can come in the form of image tags \cite{pathak2015constrained}, scribbles \cite{lin2016scribblesup}, points \cite{Bearman2016}, bounding boxes \cite{dai2015boxsup,khoreva2017simple,hsu2019weakly} or global constraints \cite{jia2017constrained,kervadec2019constrained}. A common paradigm in the weakly supervised learning setting is to employ weak annotations to generate {\em pseudo-masks} or {\em proposals}. These proposals are ``fake'' labels, which are generated iteratively to refine the parameters of deep CNNs, thereby mimicking full supervision. Unfortunately, as discussed in several recent works \cite{tang2018regularized,kervadec2019constrained}, proposals contain errors, which might be propagated during training, severely affecting segmentation performance. Furthermore, iterative proposal generation significantly increases the computational load for training. More recently, several studies investigated global loss functions, e.g., in the form of constraints on the target-region size \cite{pathak2015constrained,jia2017constrained,kervadec2019constrained,bateson2019constrained}. This can be done by constraining the softmax outputs of
        deep networks, leveraging unlabeled data with a single loss function and removing the need for iterative proposal generation. Nevertheless, despite the good performance achieved by these works in certain practical scenarios, their applicability might be limited by the assumptions underlying such global constraints, e.g., precise knowledge of the target region size.

        Among different weak supervision approaches, bounding box annotations are an appealing alternative due to their simplicity and low annotation cost. In practice, bounding boxes can be defined with two corner coordinates, allowing fast placement and light storage. Furthermore, they provide localization-awareness, which spatially constrains the problem. This form of supervision has indeed become popular in computer vision to initialize shallow segmentation models, whose outputs are later used to train deep networks, as in full supervision \cite{dai2015boxsup,papandreou2015weakly,khoreva2017simple,pu2018graphnet}. A naive use of bounding boxes amounts to generating pseudo-labels by simply considering each pixel within the bounding box as a positive sample for the respective class \cite{papandreou2015weakly,rajchl2016deepcut}. However, in a realistic scenario, a bounding box also contains background pixels. To account for this, more advanced foreground extraction methods are employed. In particular, the very popular GrabCut \cite{rother2004grabcut} is a standard choice to generate segmentation masks from bounding boxes, even though alternative approaches such as Multiscale Combinatorial Grouping (MCG) \cite{pont2017multiscale} were recently used for the same purpose \cite{dai2015boxsup}.

          \paragraph{Contributions:}We propose a novel weakly supervised learning paradigm based on several global constraints derived from box annotations.
        First, we adapt the classical tightness prior in \cite{lempitsky2009image} to a deep learning setting, and re-formulate the problem by imposing a set of constraints on the network outputs. Such a powerful topological prior prevents solutions from excessive shrinking by enforcing any horizontal or vertical line within the bounding box to contain at least one pixel of the foreground region. Furthermore, we integrate our deep tightness prior with a global background emptiness constraint, guiding training with information outside the bounding box. As we will see in our experiments, such a global constraint is much more powerful than standard cross-entropy for the background class. Our optimization problem is challenging as it takes the form of a large set of inequality constraints, which are difficult to handle in the context of deep networks. We solve it with a sequence of unconstrained losses based on a recent powerful extension of the log-barrier method \cite{kervadec2019log}, which is well-known in the context of interior-point methods. This accommodates standard stochastic gradient descent (SGD) for training deep networks, while avoiding computationally expensive and unstable Lagrangian dual steps and projections. Extensive experiments over two different public data sets and applications (prostate and brain lesions) demonstrate that the synergy between our global tightness and emptiness priors yields very competitive performance, approaching full supervision and significantly outperforming DeepCut \cite{rajchl2016deepcut}. Furthermore, our approach removes the need for computationally expensive proposal generation.

        %Further, from the bounding box annotation we can derive that pixels outside the box do not belong to the object of interest. In other words, the size of the target outside the bounding box is equal to 0. We integrate this information into a joint constrained optimization framework, which we solve with inequality constraints \cite{kervadec2019log}. Here talk a bit about log-barrier....

       % \paragraph{Contributions}
    %    \textcolor{red}{Jose: Hoel, can you list all what is new here????}..
    %    \begin{itemize}
     %       \item Decompose the loss in several parts, to deal with both the uncertainty inside the box, and the background outside of it
     %       \item Include a tightness prior directly into the training process with the use of log-barrier extensions
     %       \item Regularize the predictions using the size of boxes as an upper bound
     %   \end{itemize}

        %We exploit implicit information (no objects outside the bounding box)
        %This motivated the attention that weakly supervised segmentation received in the past few years \cite{khoreva2017simple,vernaza2017learning,kolesnikov2016seed,vezhnevets2010towards,papandreou2015weakly,buhmann2012weakly,xu2015learning,dai2015boxsup,wei2017stc,pinheiro2015image,jia2017constrained,kervadec2019constrained}. Weak supervision is loosely defined, but it usually span across methods that use data without annotation for each pixel. Those labels can take many forms, such as points, scribbles across the image, image-level tags, or bounding boxes (see Figure \ref{fig:weak_labels} for an illustration).

        \begin{figure}[h]
            \centering
            \includegraphics[width=.8\textwidth]{figures/labels.jpg}
            \caption{Example of weak labels on two different tasks: prostate segmentation and stroke lesion segmentation.}
            \label{fig:weak_labels}
        \end{figure}

        %One main interest of bounding boxes, on top of their simplicity and speed to annotate, is that it allows to re-use datasets annotated for object detection, tapping in a much larger set of existing datasets. Despite those advantages, fewer recent works focused on bounding boxes \cite{rajchl2016deepcut,papandreou2015weakly}, which are loosely based on the venerable GrabCut \cite{rother2004grabcut},

        %Bounding boxes: uncertainty to deal with

        %Can reuse annotations made for object detection


    \section{Related works}

        \paragraph{Weakly supervised medical image segmentation.}Despite the increasing interest in weakly supervised segmentation models in the computer vision community, the literature on these models in medical imaging remains scarce. The authors of \cite{qu2019weakly} leverage point annotations in the context of histopathology images. From labeled points, they derived additional information in the form of a Voronoi diagram, so as to generate coarse labels
        for nuclei segmentation. Their objective function integrated the cross-entropy with coarse labels and the conditional random field (CRF) loss in \cite{tang2018regularized}. Similarly to previous works in computer vision, \cite{nguyen2019novel} used class activation maps (CAMs) derived from the networks as pseudo-masks to train a CNN in a fully supervised manner. To constrain the location of the target, they employed an Active Shape Model (ASM) as prior information. Nevertheless, this method presents two limitations. First, as in similar works, inaccuracies of the pseudo-masks may lead to sub-optimal performance. Second, the ASM is tailored to this specific application, as its generation for novel classes is dependent on the segmentation masks. More recently, \cite{wu2019weakly} proposed to refine the generated CAM with attention, with the goal of generating more reliable pseudo-masks. Alternatively, other recent methods investigated how to constrain network predictions with global statistics, for instance, the size of the target region \cite{jia2017constrained,kervadec2019curriculum,kervadec2019constrained,bateson2019constrained}. This type of prior information can be imposed as equality \cite{jia2017constrained} or inequality \cite{kervadec2019constrained,bateson2019constrained} constraints. Although such constrained-CNN predictions
        achieved outstanding performance in a few weakly-supervised learning scenarios, their applicability remains limited by the assumptions they rely on.

      %For instance, Jia et .al [7] employed an L2 penalty to impose equality constraints on the size of the tar- get regions in the context of histopathology image segmentation. However, their formulation requires the exact knowledge of region size, which limits its applica- bility. More recently, Kervadec et al. [8] proposed using inequality constraints, which provide more flexibility, and significantly improves performance compared to cases where learning relies on partial image labels in the form of scribbles. Nevertheless, the values used to bound network predictions in [8] are derived from manual annotations, which is a limiting assumption.


        \paragraph{Bounding box supervision.}Most CNN-based methods under the umbrella of bounding-box supervision fall under the category of
        proposal-based methods. In these approaches, the bounding box annotations are exploited to obtain initial pseudo-masks, or proposals, typically with a shallow segmentation method, e.g., the very popular GrabCut method \cite{rother2004grabcut}. Then, training typically follows an iterative scheme, which involves two steps, one updating the network parameters and the other adjusting the pseudo-labels \cite{dai2015boxsup,papandreou2015weakly,khoreva2017simple}. To further refine the pseudo-labels generated at each iteration, several works \cite{rajchl2016deepcut,song2019box} used the popular DenseCRF \cite{krahenbuhl2011efficient} or other heuristics. While this might be very effective on some datasets, DenseCRF typically assumes that all the training images have consistent and strong contrast between the foreground and background regions.
        Finding the optimal DenseCRF parameters\footnote{Several hyper-parameters control the edge sensitivity of DenseCRF \cite{krahenbuhl2011efficient}, mostly $\theta_\beta$ and $\theta_\gamma$, but also $\omega_1,\omega_2$ and $\theta_\alpha$ to some extent.} is difficult when the contrast of the object edge varies significantly within the same dataset. Moreover, the ensuing training is not end-to-end, as it still relies on a DenseCRF post-processing, even at inference time. Another drawback of those bounding-box based learning approaches -- which is also shared by other proposal-based methods in general -- is that early mistakes reinforce themselves during training. For example, in DeepCut \cite{rajchl2016deepcut}, while the pseudo-labels cannot grow beyond the bounding box, the inner foreground may gradually disappear. More recently, Hsu et al. \cite{hsu2019weakly} employed a Multiple Instance Learning (MIL) framework to impose a tightness prior in the context of instance segmentation of natural images. Focusing on instance segmentation, the method used bounding boxes generated by R-CNN. In such a MIL framework, positive bags are composed of box lines while negative bags correspond to lines outside the box. The MIL loss function is defined so as to push the maximum predicted probability within each positive bag to 1, and the maximum predicted probability within each negative bag to 0. This MIL loss is integrated with a GridCRF loss \cite{Marin2019} to ensure consistency between neighboring pixels. As in many other works, the final predictions are refined with DenseCRF \cite{krahenbuhl2011efficient}.



       % the bounding box prior in the context of weakly supervised instance segmentation.
        %The overall method contains three different parts: (i) a box regression, (ii) a box classification, and (iii) a segmentation of the content of the box.
    %    Authors leverage the concept of tightness prior \cite{lempitsky2009image} and reformulate the segmentation problem as a Multiple Instance Learning framework.
     %    While this is a very interesting use of the tightness prior, this method limits its scope in the inside of bounding boxes generated by a R-CNN: it is not performed on the whole image, which can be an order of magnitude bigger.

        %"escape" the box, the inside of the box can gradually disappear, causing to train the network with the object labeled as background.

        %For example, Dai et \textit{.al} \cite{dai2015boxsup} introduced an iterative procedure to train a deep segmentation CNN with the supervision of image proposals derived from the bounding box annotations. First, authors resort to MCG \cite{pont2017multiscale} to generate the set of candidate segments. Then, they propose an iterative solution that involves two steps. At each iteration, the network parameters, $\theta$, are updated and then semantic labels of the candidate segments of all images are predicted.


        %Box based segmentation goes back a long time, with GrabCut \cite{rother2004grabcut} being one of the most well known algorithm to segment an object, given a user-provided bounding box. In 2009, \cite{lempitsky2009image} introduced the concept of "tightness" prior when using a bounding box. It assume that the user-supplied box is not too loose, and reasonably close to the object. They could incorporate this prior with hard constraints into the energy minimization framework, and proposed an updated GraphCut algorithm called Pinpoint, which could be combined with GrabCut.


        %DeepCut \cite{rajchl2016deepcut} is, as far a we know, one of the few and most well known method in medical for box supervised segmentation. It is also closely related to the work of \cite{papandreou2015weakly,khoreva2017simple}.Those methods fall under the category of proposal based methods: pseudo-labels are generated from the network predictions and iteratively refined, using DenseCRF \cite{krahenbuhl2011efficient} or other heuristics. Those refined proposals are then used in place of the original boxes to train a new network. While this is very effective on some datasets, the DenseCRF needs the dataset to contain clear and consistent contrast between foreground and background. Tuning the CRF parameters proves difficult when there is a lot of disparity in the contrast between, and each new heuristics add complexity to tune the whole. Moreover, the resulting network cannot be considered end-to-end, as it still relies on a DenseCRF post-processing, even at inference time.



        %Probs missing BoxSup, We want to highlight that those methods are heavy on heuristics

        \begin{figure}
            \centering
            \subfigure[]{\includegraphics[height=.15\textheight]{figures/camel.png}}
            \subfigure[]{\includegraphics[height=.15\textheight]{figures/camel_h_1.png}
            \includegraphics[height=.15\textheight]{figures/camel_h_2.png}
            \includegraphics[height=.15\textheight]{figures/camel_h_n.png}
            \includegraphics[height=.15\textheight]{figures/camel_v_n.png}}
            \caption{(a) Illustration of the tightness prior: any vertical (red) or horizontal (blue) line will cross at least one (1) pixel of the camel. (b) This can be generalized to segments of width $w$, which cross at least $w$ pixels of the camel.}
            \label{fig:tightness_prior}
        \end{figure}

    \section{Method}
        \subsection{Preliminary notations}

        Let ${X}: \Omega \subset \mathbb{R}^{2,3} \rightarrow \mathbb{R}$ denote a training image, and $\Omega$ its corresponding spatial domain. In a standard
        fully supervised setting, we can denote the training set as $\mathcal{D} = \left\{ ( X,Y ) \right\}^D$, where $X \in \mathbb{R}^{\Omega}$ are input images and $Y \in \{ 0, 1 \}^{\Omega}$ their corresponding pixel-wise labels. In the context of this work, however, labels $Y$ take the form of bounding boxes (as shown in Figure \ref{fig:weak_labels}, third column). Thus, we use $\Omega_O$ and $\Omega_I$ to define the area outside and inside the bounding box, respectively, with $\Omega_O \cup \Omega_I = \Omega$. Let $s_{\ttt} \in [0, 1]^\Omega$ denote the probabilities predicted by the CNN, where 0 and 1 represent background and foreground, respectively.
        %We have a dataset $\mathcal{D} = \left\{ ( d,y ) \right\}^D$, where $d \in \mathbb{R}^{\Omega}$ are input images and $y \in \{ 0, 1 \}^{\Omega}$ their corresponding label. When the label takes the form of bounding boxes (as shown in Figure \ref{fig:weak_labels}, third column), we can define $\Omega_O$ for the area outside the box (red in the Figure) and $\Omega_I$ for the area inside the boxes (no color in the Figure), such as $\Omega_O \cup \Omega_I = \Omega$. We denote the probabilities predicted by the CNN as $s_{\ttt} \in [0, 1]^\Omega$, where 1 means foreground and 0 means background.
        In a fully supervised setting, one would typically optimize the standard cross-entropy loss:
        \[ \min_{\ttt} \mathcal{L}_{\text{CE}}(\ttt) := -\sum_{p \in \Omega} \left[ y_p \log(s_{\ttt}(p)) + (1 - y_p) \log(1 - s_{\ttt}(p)) \right] . \]

        \subsection{Dealing with box annotations}

        %\subsection{On how to deal with the certainty outside the box}
        \paragraph{Certainty outside the box.}
            As shown in Figure \ref{fig:weak_labels}, we know with certainty that all pixels $p$ outside a given bounding box ($\Omega_O$) belong to the background. A straightforward solution would be to employ a partial cross-entropy, computed only over the pixels outside the bounding box:
            \[ \mathcal{L}_{\text{MCE}} := - \sum_{p \in \Omega_O} \log(1 - s_{\ttt}(p)) . \]

            Alternatively, notice that the size of the predicted foreground\footnote{Here, we define the size as the sum of the softmax probabilities, as it is easy to compute and differentiable. Therefore, it accommodates standard Stochastic Gradient Descent.}, when computed over the background pixels ($\Omega_O$), should be equal to zero. This gives the following global constraint for our optimization problem, which enforces that the background region is empty:
            \begin{equation}
                \label{eq:emptiness_constraints}
                \sum_{p \in \Omega_O} s_{\ttt}(p) \leq 0.
            \end{equation}
            We will refer to this constraint as the \textit{emptiness constraint}, $\mathcal{L}_{\text{EMP}}$. $\mathcal{L}_O$ will denote either $\mathcal{L}_{\text{MCE}}$ or $\mathcal{L}_{\text{EMP}}$.
            %Kervadec et al. investigated how to impose such inequalities during training of a neural network \cite{kervadec2019constrained, kervadec2019log}. We will detail it more in Section \eqref{sec:log-barrier}.

            % From now onward, we will denote $\mathcal{L}_O$ as either $\mathcal{L}_{\text{MCE}}$ or the constraint from Eq. \eqref{eq:emptiness_constraints}.


            %If we re-use the more advanced log-barrier method from \cite{kervadec2019log} to enforce the constraints, this gives the following loss, whom to refer as emptiness constraint:

            % \begin{equation}
            %     \mathcal{L}_{\text{EMP}} := \elb\left( \sum_{p \in \Omega_O} s_{d,\ttt}(p) \right)
            % \end{equation}
            % $\elb$ is described more in details in section \ref{sec:log-barrier}.


        %\subsection{On how to deal with the uncertainty inside the box}
        \paragraph{Uncertainty inside the box.}
            \label{sec:tightness_prior}

            While bounding box annotations provide cues about the spatial location of the target regions, pixel-wise information still remains uncertain.
            However, the bounding box can be further exploited to impose a powerful topological prior, referred to as \textit{tightness prior} \cite{lempitsky2009image}. This global prior assumes that the target region should be sufficiently close to each of the sides of the bounding box. Therefore, we can expect that each horizontal or vertical line within the box will cross at least one pixel of the target region (as illustrated in Figure \ref{fig:tightness_prior}), regardless of the region shape. Furthermore, we can regroup the lines into segments of width $w$, each containing $w$ lines. In this case, we can assume that at least $w$ pixels of the object will be crossed by the segment. Formally, we can write this as a set of inequality constraints:
            \begin{align}
                \label{eq:tightness_constraints}
                \sum_{p \in s_l} y_p &\geq w \ \qquad \forall s_l \in \mathcal{S}_L
            \end{align}
            where $\mathcal{S}_L := \{s_l\}$ is the set of segments parallel to the sides of the bounding boxes. This can be easily translated
            into inequality constraints on the outputs of the CNN, where the sum of the softmax probabilities for each segment should be greater than or equal to its width. The set of segments $\mathcal{S}_L$ can be efficiently pre-computed; only the masked softmax sum is required during training.
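
            To make this translation explicit, the following sketch writes the constraints on the network outputs and counts them for one box; the box dimensions, the width $w$ and the use of non-overlapping segments are illustrative assumptions, not the settings of our experiments. Replacing the binary labels $y_p$ in \eqref{eq:tightness_constraints} by the predicted probabilities gives
            \[ \sum_{p \in s_l} s_{\ttt}(p) \geq w \qquad \forall s_l \in \mathcal{S}_L . \]
            For instance, assuming a single box of $60 \times 40$ pixels and $w = 5$, there are $60 / 5 = 12$ segments in one direction and $40 / 5 = 8$ in the other, i.e., $|\mathcal{S}_L| = 12 + 8 = 20$ inequality constraints for this image.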
            % \textcolor{red}{Jose: do we need to mention this? }

            % \rred{Hoel: Is think it is very relevant yes. That keeps the extra cost to a minimum (the hundreds of log-barrier). Depending on how the data is stored, it can be a bit costly to perform (need first to find all the connected components).
            % It is a bit like the boundary loss, where the only thing that remains by the time you reach the loss function is multiplying two tensors, summing them, and in this case, putting that into a log-barrier.}


            %To write it more formally -- with $\mathcal{S}_H := \{s_h\}$ the set of horizontal segments composing $\Omega_I$, and $\mathcal{S}_V := \{s_v\}$ is the set of vertical segments -- the following inequalities hold:

            %\begin{align}
            %    \label{eq:tightness_constraints}
            %    \sum_{p \in s_h} y_p &\geq w & \forall s_h \in \mathcal{S}_H \\
            %    \sum_{p \in s_v} y_p &\geq w & \forall s_v \in \mathcal{S}_V .\nonumber\\
            %\end{align}


        \subsection{Additional regularization: constraining the global size}
            \label{sec:box_size}

            The first two parts of the loss are biased toward opposite, trivial solutions: the trivial solution for $\mathcal{L}_O$ is to predict the whole image as background, while the easiest way to satisfy the tightness constraints \eqref{eq:tightness_constraints} is to predict everything as foreground. But there is more information that we can exploit from the boxes: their total size gives an upper bound on the object size. We can also assume that at least a small fraction $\epsilon$ of the box belongs to the target region, which yields a lower bound. This takes the form of a region-size constraint similar to \cite{kervadec2019constrained}:
            \begin{align}
                \label{eq:constraint_box_size}
                \min_{\ttt} ~&\mathcal{L}_1(\ttt) + \dots + \mathcal{L}_n(\ttt) \\
                    & \text{s.t. } \epsilon |\Omega_I| \leq \sum_{p \in \Omega} s_{\ttt}(p) \leq |\Omega_I| . \nonumber
            \end{align}
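
            As a numerical illustration with assumed values (chosen only for this example, not taken from our experiments), consider a $256 \times 256$ image containing a single box of $60 \times 40$ pixels, so that $|\Omega_I| = 2400$, and take $\epsilon = 0.01$. The constraint in \eqref{eq:constraint_box_size} then reads
            \[ 24 \leq \sum_{p \in \Omega} s_{\ttt}(p) \leq 2400 . \]
            Predicting the whole image as background gives a sum of $0$, which violates the lower bound, while predicting the whole image as foreground gives a sum of $256 \times 256 = 65536$, which violates the upper bound. Both trivial solutions are therefore excluded.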

        \subsection{Lagrangian optimization with log-barrier extensions}
            \label{sec:log-barrier}

            Optimizing $\mathcal{L}_O$ with the constraints from sections \ref{sec:tightness_prior} and \ref{sec:box_size} gives the following constrained optimization problem:
            \begin{align}
                \label{eq:constrained_problem}
                \min_{\ttt} ~&\mathcal{L}_O(\ttt) \\
                    %& \text{s.t. } \sum_{p \in s_h} s_{\ttt}(p) \geq w &\forall s_h \in \mathcal{S}_H \nonumber\\
                   % & \text{s.t. } \sum_{p \in s_v} s_{\ttt}(p) \geq w &\forall s_v \in \mathcal{S}_V \nonumber\\
                    & \text{s.t. } \sum_{p \in s_l} s_{\ttt}(p) \geq w &\forall s_l \in \mathcal{S}_L \nonumber\\
                    & \text{s.t. } \epsilon |\Omega_I| \leq \sum_{p \in \Omega} s_{\ttt}(p) \leq |\Omega_I| . \nonumber
            \end{align}
            This formulation involves a large number of competing constraints. Recent optimization works on constrained CNNs \cite{kervadec2019log} suggest that, in the case of multiple competing constraints, log-barrier extensions provide approximations of Lagrangian optimization in the form of sequences of unconstrained losses, which completely removes the expensive and unstable primal-dual steps in the context of deep networks and handles the multiple constraints fully within SGD. Therefore, log-barriers can accommodate the interplay between multiple competing constraints, unlike naive penalty-based methods. These desirable properties are consistent with well-established interior-point and log-barrier methods in convex optimization \cite{Boyd2004}.

            For an inequality constraint in the form of $z \leq 0$, the log-barrier extension can be defined as follows:
            \begin{equation}
                \label{eq:log_barrier_extension}
                \tilde{\psi}_{t}(z) =
                \begin{cases}
                    -\frac{1}{t} \log (-z) & \text{if } z \leq -\frac{1}{t^2} \\
                    tz - \frac{1}{t} \log (\frac{1}{t^2}) + \frac{1}{t} & \text{otherwise} ,
                \end{cases}
            \end{equation}
            where $t$ is a parameter that \textit{raises} the barrier over time (i.e., during training). The main difference with a penalty (such as $\max(0, z)^2$, used by \cite{kervadec2019constrained}) is that \eqref{eq:log_barrier_extension} acts as a {\em barrier} even when the constraint is satisfied ($z \leq 0$), with a gradient that becomes more aggressive when approaching the constraint-violation boundary. This makes the training more stable, and prevents already satisfied constraints from being violated during subsequent training epochs. With a penalty, training could oscillate, alternating between zero and high penalty values \cite{kervadec2019log}.
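
            A short check, using only the definition in \eqref{eq:log_barrier_extension}, shows that the two branches join smoothly. At the switching point $z_0 = -\frac{1}{t^2}$, both branches take the same value:
            \begin{align*}
                -\frac{1}{t} \log(-z_0) &= -\frac{1}{t} \log\left(\frac{1}{t^2}\right) , \\
                t z_0 - \frac{1}{t} \log\left(\frac{1}{t^2}\right) + \frac{1}{t} &= -\frac{1}{t} - \frac{1}{t} \log\left(\frac{1}{t^2}\right) + \frac{1}{t} = -\frac{1}{t} \log\left(\frac{1}{t^2}\right) .
            \end{align*}
            They also share the same slope, since the derivative of the first branch at $z_0$ is
            \[ \left. \frac{d}{dz} \left[ -\frac{1}{t} \log(-z) \right] \right|_{z = z_0} = -\frac{1}{t z_0} = t , \]
            which is the constant slope of the linear branch. Hence $\elb$ is continuously differentiable, with a finite gradient equal to $t$ for all $z > -\frac{1}{t^2}$, including violated constraints. As an illustration with an assumed value $t = 5$ (chosen only for this example), the switch occurs at $z_0 = -0.04$, where $\elb(z_0) = -\frac{1}{5} \log(0.04) \approx 0.64$.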

            %Since Equation \eqref{eq:constrained_problem} has many, competing constraints, we decided to use the extended log-barrier.

        \subsection{Final model}
            Using the log-barrier extension, we obtain the final unconstrained optimization problem, which can be optimized with standard SGD:
            \begin{multline}
                \label{eq:final}
                \min_{\ttt} ~ \mathcal{L}_O(\ttt) + \lambda \left[ \sum_{s_l \in \mathcal{S}_L} \elb\left(w - \sum_{p \in s_l} s_{\ttt}(p)\right)   \right] \\
                    + \elb\left(\epsilon |\Omega_I| - \sum_{p \in \Omega} s_{\ttt}(p)\right) + \elb\left(\sum_{p \in \Omega} s_{\ttt}(p) - |\Omega_I| \right)  .
                    %+ \lambda \left[ \sum_{s_h \in \mathcal{S}_H} \elb\left(w - \sum_{p \in s_h} s_{\ttt}(p)\right) + \sum_{s_v \in \mathcal{S}_V} \elb\left(w - \sum_{p \in s_v} s_{\ttt}(p) \right)  \right] \\
            \end{multline}
            $\lambda$ is a real number balancing the tightness prior with respect to the other parts of the loss.
            Notice that all log-barrier extensions $\elb$ use the same $t$, with a common scheduling strategy for $t$. This limits the number of hyper-parameters and simplifies the model.
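
            For reference, each argument of $\elb$ in \eqref{eq:final} follows from rewriting one inequality in the form $z \leq 0$ required by \eqref{eq:log_barrier_extension}. The correspondence below is read off from \eqref{eq:final} itself and is given only as an illustration:
            \begin{align*}
                \sum_{p \in s_l} s_{\ttt}(p) \geq w \quad & \Longleftrightarrow \quad w - \sum_{p \in s_l} s_{\ttt}(p) \leq 0 \qquad \forall s_l \in \mathcal{S}_L , \\
                \sum_{p \in \Omega} s_{\ttt}(p) \geq \epsilon |\Omega_I| \quad & \Longleftrightarrow \quad \epsilon |\Omega_I| - \sum_{p \in \Omega} s_{\ttt}(p) \leq 0 , \\
                \sum_{p \in \Omega} s_{\ttt}(p) \leq |\Omega_I| \quad & \Longleftrightarrow \quad \sum_{p \in \Omega} s_{\ttt}(p) - |\Omega_I| \leq 0 .
            \end{align*}
            Written this way, \eqref{eq:final} contains $|\mathcal{S}_L| + 2$ barrier terms, all of which share the same $t$.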


    \section{Experiments}
        \subsection{Datasets and evaluation}
            We evaluate our method on two different tasks: prostate segmentation in MR-T2 and brain lesion segmentation in MR-T1. Among these tasks, lesion segmentation is particularly challenging, due to the heterogeneity of the lesions and high imbalance in the number of foreground and background pixels.
            %Those are two difficult task, and the brain lesion one has to face with extra difficulty due to the imbalance between foreground and background pixels, on top of the huge variability of the lesions.

            \paragraph{Prostate segmentation on MR-T2.}The first dataset that we use was made available at the MICCAI 2012 prostate MR segmentation challenge\footnote{\url{https://promise12.grand-challenge.org}} \cite{litjens2014evaluation}. It contains the transversal T2-weighted MR images of 50 patients acquired at different centers, with multiple MRI vendors and different scanning protocols. The images include patients with benign diseases, as well as with prostate cancer. Image resolution ranges from $15\times256\times256$ to $54\times512\times512$ voxels, with a spacing ranging from $2\times0.27\times0.27$ to $4\times0.75\times0.75$~mm$^3$. We employed 40 patients for training and 10 for validation.

            \paragraph{Brain lesion segmentation on MR-T1.}We also evaluated the proposed method on the Anatomical Tracings of Lesions After Stroke (ATLAS) dataset \cite{liew2018large}, an open-source dataset of stroke lesions. It contains 229 T1-weighted MR images, coming from different cohorts and different scanners. All the images have a resolution of $197\times233\times189$ voxels, with a spacing of $1\times1\times1$~mm$^3$. The annotations were done by a team of 11 experts, who received standardized training. We retained 26 images for validation, while the rest were used for training.

            \paragraph{Evaluation.}To quantitatively compare the performance of the different methods, we employed the Dice similarity coefficient (DSC), a standard performance metric in medical image segmentation. In addition to the baseline models, we also perform comprehensive comparisons with DeepCut \cite{rajchl2016deepcut}, whose learning setting is also based on bounding box annotations.
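            For completeness, the standard definition of this metric is recalled below; the symbols $S$ (predicted binary segmentation) and $G$ (ground-truth mask) are introduced here for illustration only:
            \begin{equation*}
                \text{DSC}(S, G) = \frac{2 \, |S \cap G|}{|S| + |G|} ,
            \end{equation*}
            which ranges from 0 (no overlap) to 1 (perfect overlap).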
            % The ATLAS dataset is frequently used for research on the segmentation of diverse lesions. This dataset consists of 304 T1-weighted MR images collected from 11 cohorts. Images from cohorts 3-11 were collected on 3T MRI scanners, with a resolution of 1 mm3 (isotropic), while those from cohorts 1 and 2 have a resolution of 0.9 mm × 0.9 mm × 0.9 mm and were collected with a 1.5T scanner. Specifications of the data, including scanner strength, brand, and image resolution, are included in the meta-data of the dataset. From the analysis, the average size of the identified lesions is determined as 2.12 ± 3.898 × 104 mm3 , with minimum and maximum lesion sizes of 10 mm3 and 2.838 × 105mm3 , respectively. On average, the individual patient is more likely to have only one lesion (58%). The probability of identifying lesions in the left and right hemispheres are roughly equal (48.4%, 43.8%, and 7.7%, for the left and right hemispheres, and other locations (e.g. brainstem), respectively). Overall, there is a greater amount of subcortical lesions (70.7%) compared to cortical lesions (21.5%). A total of 11 experts with standardized training in the identification and segmentation of lesions were hired to manually draw lesion masks. For each MR image, lesions were identified manually using MRIcron, an open-source tool for brain imaging visualization [32] that is used to display the brain. For each individual brain, at least one lesion mask was identified, and the lesion properties were subsequently recorded in the metadata.

        %\subsection{Ablation on the CNN architecture}
           % While we use two different networks between each datasets, we also tried the two architectures within the same dataset. This is to ensure that the proposed method is architecture-agnostic.



        \subsection{Implementation details}
            \label{sec:implementation_details}

            To evaluate our method under different settings, we experimented with a different network architecture for each task. We employed a residual version of the well-known UNet \cite{ronneberger2015u} to segment the prostate, whereas ENet \cite{paszke2016enet} was used as the backbone architecture in the stroke lesion segmentation experiments. The models were trained with ADAM \cite{kingma2014adam}, an initial learning rate of $5\times10^{-4}$ and a batch size of 4 for the prostate and 32 for stroke lesions. While we employed offline data augmentation (i.e., mirroring, flipping, rotation) on the PROMISE12 dataset, no augmentation was performed on the ATLAS dataset, since PROMISE12 contains fewer images than ATLAS.


            %On PROMISE12, a modified, fully residual U-Net \cite{ronneberger2015u} is used, with a batch size of 32. Offline data augmentation (mirroring, flipping, rotation) was performed, adding 4 additional images per image. On ATLAS, we used ENet \cite{paszke2016enet} with a batch size of 32, without any form of data augmentation.

            The log-barrier parameters were set following \cite{kervadec2019log}, and were shared across all the log-barrier
            instances.
            We set $\lambda$ (from Eq. \eqref{eq:final}) to $0.0001$ for both datasets. The DenseCRF hyper-parameters are the same as in \cite{rajchl2016deepcut}, and the proposals are updated every 10 epochs for PROMISE12, and every 5 epochs for ATLAS.
            We empirically found that changes in the width $w$ of the segments for the tightness constraints did not have a significant impact on the results. Therefore, $w$ was set to 5 in all the experiments.

            All methods are implemented in PyTorch, with the exception of the DenseCRF \cite{krahenbuhl2011efficient}, which uses the Python wrapper PyDenseCRF\footnote{\url{https://github.com/lucasb-eyer/pydensecrf}}. To speed up the proposal generation of DeepCut, the CRF inference is parallelized using the standard Python multiprocessing module, with a careful use of SharedArrays to avoid unnecessary and costly copies of arrays between the processes. The code is available online\footnote{\url{https://github.com/LIVIAETS/boxes_tightness_prior}}.

            %Though we played a bit with the width $w$ of the segments for the tightness constraints, it didn't affect the results much. We sticked with our initial guess of 5.

        \subsection{Sensitivity study on box-annotation precision}
            While the main experiments are performed on tight boxes (i.e., the gap between the target regions and the bounding-box sides is not significant), we perform additional experiments where a margin $m$ of 10 pixels is added on each side. This enables us to evaluate the robustness of each model to imprecise bounding-box placement. Robustness to placement is of significant importance, since perfect annotation of all bounding boxes might be unrealistic. Furthermore, robustness to imprecision also alleviates the problem of annotator subjectivity.


    \section{Results}
        \subsection{Main experiment}
            The results of the segmentation experiments are reported in Table \ref{table:main_results}. We can observe that the proposed method consistently outperforms DeepCut \cite{rajchl2016deepcut} across the two datasets. The differences in performance range from 1\% on the PROMISE12 dataset to 10\% on ATLAS. Furthermore, the results obtained from the two loss functions designed to deal with the background constraints indicate that the proposed global emptiness constraint is more effective in our setting. We hypothesize this is due to two factors. First, employing the emptiness constraint on background pixels results in all the constraint losses being on the same scale, which is desirable from an optimization perspective. Second, the imbalanced nature of the segmentation task in the ATLAS dataset makes the use of the cross-entropy over all the background pixels a suboptimal alternative, forcing solutions that encourage empty segmentations. Lastly, we can observe that the proposed method achieves performance comparable to full supervision, particularly in the task of stroke lesion segmentation.
            Using only a subset of the losses does not give optimal results, which suggests that they are complementary.

            \begin{table}[h]
                % Results from constrained_cnn-200128-48d2fde-huehuecoyotl-prostate.tar.gz
                % Results from constrained_cnn-200128-48d2fde-huehuecoyotl-atlas.tar.gz
                \centering
                \small \begin{tabular}{l||c||c}
                    \multirow{2}{*}{Method} & PROMISE12 & ATLAS\\
                    % \hline
                    \cline{2-3}
                     & DSC & DSC \\
                    \toprule
                    % Cross-entropy on box & - & - & - \\
                    DeepCut \cite{rajchl2016deepcut} & 0.827 (0.085) & 0.375 (0.246) \\
                    \hline
                    % Box size &  &  \\
                    % % ~~~ w/ masked cross-entropy & NA & - & - \\
                    % ~~~ w/ masked emptiness constraint & NA & NA \\
                    % \hline
                    Tightness prior &  & \\
                    % ~~~ w/ masked cross-entropy & NA & - & - \\
                    ~~~ w/ emptiness constraint & NA & 0.161 (0.145) \\
                    \hline
                    Tightness prior + box size & 0.620 (0.100) & 0.146 (0.134) \\
                    ~~~ w/ masked cross-entropy ($\mathcal{L}_{\text{MCE}}$) & 0.774 (0.045) &0.159 (0.203) \\
                    ~~~ w/ emptiness constraint ($\mathcal{L}_{\text{EMP}}$) & \textbf{0.835 (0.032)} & \textbf{0.474 (0.245)} \\
                    \midrule
                    Full supervision (Cross-entropy) & 0.901 (0.025)& 0.489 (0.294) \\
                \end{tabular}
                \caption{Results on the validation set for the proposed method and the different baselines, on both the PROMISE12 and ATLAS datasets. The best results in the weakly supervised setting are highlighted in bold. NA means that the network did not learn to segment anything meaningful.}
                \label{table:main_results}
            \end{table}


            %Out of the two base losses, $\mathcal{L}_{\text{MCE}}$ and $\mathcal{L}_{\text{EMP}}$, the emptiness constraint proves the be the more effective, especially on the ATLAS dataset. We hypothesize this is due to several factors: all constraints are on the same scale, and  has nicer optimization properties, and high class-imbalance in ATLAS.


            Figure \ref{fig:validation_metrics_plot} depicts the validation results of the different models over the course of training. Even though DeepCut achieves results similar to the proposed approach on the PROMISE12 dataset, we can see that it is very unstable during training, as is generally the case for proposal-based methods. Additionally, its performance degrades over time. This effect is even more noticeable on the ATLAS dataset, where it collapses to empty segmentations after 25 epochs. This behaviour is a clear example of the instability of proposal-based methods rather than of over-fitting, since we observed similar findings on the training images. More details about this issue are provided in Appendix \ref{apx:deepcut}.

            \begin{figure}[h!]
                \centering
                % \subfigure{\includegraphics[width=0.45\textwidth]{figures/prostate/val_3d_dsc.png}}
                % % \subfigure[Val 2D-DSC]{\includegraphics[width=0.45\textwidth]{figures/atlas/val_dice.png}}
                \includegraphics[width=0.45\textwidth]{figures/prostate/val_3d_dsc.png}
                \hspace{.5cm}
                \includegraphics[width=0.45\textwidth]{figures/atlas/val_3d_dsc.png}
                % \subfigure{\includegraphics[width=0.45\textwidth]{figures/atlas/val_3d_dsc.png}}
                \caption{Evolution of the validation DSC values over time for both PROMISE12 and ATLAS, and for different methods.}
                \label{fig:validation_metrics_plot}
            \end{figure}

            Qualitative segmentation results are depicted in Figure \ref{fig:qualitative_results}. We can observe how the proposed method with masked CE achieves satisfactory visual results on the prostate (first two rows), but fails to properly segment stroke lesions (last two rows). In contrast, when background segmentations are optimized with the proposed emptiness constraint, we observe how the segmentation results approach full supervision performance in both datasets. This is in line with the results reported in Table \ref{table:main_results}. On the other hand, DeepCut succeeds in segmenting the prostate, but it is not able to obtain satisfactory segmentations for brain lesions. Looking closer at these segmentations, we can observe that they do not reliably follow the target boundaries. This can be explained by the fact that denseCRF assumes strong contrasts between foreground and background regions, which is not the case in many of these images. Furthermore, the results provided by denseCRF are sensitive to its hyper-parameters $\theta_\beta$, $\theta_\gamma$, $\omega_1$ and $\omega_2$, which control the edge sensitivity. Since the set of hyper-parameters was fixed across all the images in the whole dataset, a set that is optimal for a given image may perform sub-optimally on another image.

            \begin{figure}[h!]
                \centering
                \includegraphics[width=1\textwidth]{figures/cherry.jpg}
                \caption{Predicted segmentation on the validation set for the two tasks.}
                \label{fig:qualitative_results}
            \end{figure}

            %the performances on PROMISE12 seem decent for DeepCut, if we take a look at Figure \ref{fig:validation_metrics_plot} we can see that it is very unstable, with performances decreasing over time, or even collapsing. The best results happen after the first refinement of the proposals, and only go downwards after that.
            %This is not a case of bad over-fitting, but rather a prime example of the instability of proposal methods.




        %\subsection{Several networks}

        %\textcolor{red}{Jose: while interesting, I do not think that this conveys an interesting message. Actually I think it may hurt us. We have enough material in the experiments. I suggest to remove this part (we can keep it for the extension).}
        %Table \ref{table:networks} reports the segmentation results obtained with two different networks.

        %    \begin{table}[h]
        %        \centering
        %        \small \begin{tabular}{l||c||c}
         %           Method & ENet & Residual UNet \\
          %          % \hline
          %          \cline{2-3}
                    %   & 2D DSC & 3D DSC & 2D DSC & 3D DSC \\
          %          \toprule
          %          Tightness prior + box size &  &   \\
           %         ~~~ w/ masked emptiness constraint & 0.474 (0.245) & 0.488 (0.273) \\
            %        \midrule
             %       Full supervision (Cross-entropy) & 0.489 (0.294) & 0.555 (0.302) \\
              %  \end{tabular}
              %  \caption{DSC values for ATLAS, with two different CNN architecture.}
              %  \label{table:networks}
        %    \end{table}

        \subsection{Resilience to box imprecision}
            Results of the sensitivity study on the box precision are reported in Table \ref{table:abl_margin}.
            While all methods were able to reach similar performance when the bounding box annotation is nearly perfect (despite stability issues for some methods), their performance degrades as the margin between the region of interest and the borders of the bounding box increases. Specifically, if a margin $m$ of 10 pixels is added on each side, the performance of the proposed method only drops by about 6\% in terms of DSC (from 0.835 to 0.778), whereas DeepCut's performance decreases by 14\% (from 0.827 to 0.684).
            % Qualitative results are depicted in Figure \ref{fig:abl_margin}.
            % On the other hand, the naive baseline \textcolor{red}{Jose:@Hoel, is this our baseline, i.e. masked CE, right?}, equivalent to DeepCut before its first batch of proposals refinement, shows that it is the most affected but those imprecision. This might explains the drop in performance of DeepCut.

            \begin{table}[h!]
                \centering
                \small \begin{tabular}{l|c|c}
                    Method & Margin=0 & Margin=10 \\
                    \toprule
                    % Cross-entropy on box & 0.826 (0.034) & 0.637 (0.073) \\
                    DeepCut & 0.827 (0.085) & 0.684 (0.069) \\
                    Ours (emptiness constraint) & \textbf{0.835 (0.032)} & \textbf{0.778 (0.047)} \\
                \end{tabular}
                \caption{Sensitivity study with respect to the box margins on the PROMISE12 dataset. Best results highlighted in bold.}
                \label{table:abl_margin}
            \end{table}

            % \begin{figure}[ht]
            %     \begin{minipage}[b]{0.4\linewidth}
            %         \centering
            %         \small \begin{tabular}{l|c|c}
            %             Method & Margin=0 & Margin=10 \\
            %             \toprule
            %             $\mathcal{L}_{CE}$ on box & 0.826 & 0.637 \\
            %             DeepCut & 0.827 & 0.684 \\
            %             Ours ($\mathcal{L}_{EMP}$) & \textbf{0.835} & \textbf{0.778} \\
            %         \end{tabular}
            %         \caption{Ablation study on the box margin on PROMISE12.}
            %         \label{table:abl_margin}
            %     \end{minipage}
            %     \hspace{0.5cm}
            %     \begin{minipage}[b]{0.50\linewidth}
            %         \centering
            %         \includegraphics[width=1\textwidth]{figures/margins.jpg}
            %         \caption{Comparison depending on the margin between the box and the object.}
            %         \label{fig:abl_margin}
            %     \end{minipage}
            % \end{figure}

             %\begin{figure}[h]
            %     \centering
            %     \includegraphics[width=.8\textwidth]{figures/margins.jpg}
            %     \caption{Comparison depending on the margin between the box and the object. }
                 % \textcolor{red}{Jose: Maybe no need to also show this in the conference version.. Hoel: I think it is good to pre-emptively deal with this question, as I am sure it'll come up in the questions. Visually seeing the difference in the box makes it very explicit}}
            %     \label{fig:abl_margin}
            % \end{figure}


            Finally, the computational cost of the different methods is discussed in more detail in Appendix \ref{apx:perf}.

    \section{Conclusion}
        In this paper we proposed a novel weakly-supervised learning paradigm based on several global constraints, which are derived from bounding box annotations.
        First, the classical tightness prior is integrated into a deep learning framework by reformulating the problem as a set of constraints on the outputs of the network. Second, a global background emptiness constraint is employed to enforce empty segmentations outside the bounding box, which is demonstrated to be more powerful than standard cross-entropy for handling the background class. Integration of such a large set of inequality constraints on deep networks represents a challenging optimization problem.

        We solve it with a sequence of unconstrained losses, which are based on a recent extension of the log-barrier method. Since this formulation accommodates standard stochastic gradient descent, it can be easily used to train deep networks. We performed comprehensive experiments on two public benchmarks for the challenging tasks of prostate and brain stroke lesion segmentation, and demonstrated that the proposed approach outperforms state-of-the-art approaches with bounding-box supervision. Furthermore, quantitative and qualitative results indicate that the proposed approach has the potential to close the gap between bounding-box annotations and full supervision in semantic-segmentation tasks.

        The sensitivity study showed that the proposed method is resilient to imprecision in the box tightness. Future work will investigate the use of 3D bounding boxes as annotations, which will make the corresponding 2D boxes looser. Such a workflow could further speed up the annotation process.
        The proposed framework could also be extended to 3D CNNs, by generating segments for the tightness prior along the three axes.
        Furthermore, our approach is also compatible with multi-class segmentation problems, even when bounding boxes of different classes overlap.


        %Main contribution of the paper: we show that by leveraging both the certainty and the uncertainty of bounding box, directly into the training loss, we avoid the caveats of proposal methods, and achieve performances close to full supervision. This is true for different datasets, with different network architecture, and different base losses. It is also efficient, as most of the extra computations can be done in pre-processing, affecting very little the training speed. We also showed that our method is more resilient to imprecision in the drawing of the box, which is important for real-life settings.

        %The proposed method could easily be extended in 3D, to use bounding cubes. Segments generated for the tightness prior would be on the three different axes. It can also be trivially generalized to multi-class segmentation problems.

    % Acknowledgments---Will not appear in anonymized version
    \midlacknowledgments{This work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), via its Discovery Grant program. We also thank NVIDIA for the GPU donation.}

    \bibliography{kervadec20.bib}

    \newpage
    \appendix

    \section{DeepCut training instability}
        \label{apx:deepcut}

        We investigated the pseudo-labels generated by DeepCut (as shown in Figure \ref{fig:proposals}), and the main culprit is the case where a proposal under-segments the object inside the box. This forces the network, at the next training step, to segment part of the object as background. This kind of conflicting feedback to the network (some other proposals label similar-looking patches as foreground) makes the training unstable, and slowly skews the network toward empty predictions. This causes the next batch of proposals to be even smaller, until the network outputs an empty foreground for all the images.

        \begin{figure}[h]
            \centering
            \includegraphics[width=1\textwidth]{figures/deepcut_proposals.jpg}
            \caption{Progression of the pseudo-labels from DeepCut: even a few of those cases can make the training very unstable.}
            \label{fig:proposals}
        \end{figure}

    \section{Implementation and performances}
        \label{apx:perf}

        Performances were measured on a machine equipped with an AMD Ryzen 1700X, 32GB of RAM (frequency did not affect speed) and an NVIDIA Titan RTX. They are reported in Table \ref{table:speed}. The settings and hyper-parameters are the same as described in Section \ref{sec:implementation_details}.

        Most of the extra time introduced by our model comes from the naive log-barrier implementation that we used. Instead of leveraging an \verb|if/else| switch and code vectorization, we used a standard Python \verb|for| loop over all constraints. This could be improved using PyTorch's recently developed JIT compiler. The width parameter of the segments will affect the overhead of our method: wider segments mean fewer of them, which, in turn, results in fewer constraints to handle.
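
        As an illustration of what a vectorized implementation could compute (a sketch only: the stacked constraint values $z_1, \dots, z_K$ and the binary mask $m_k$ are notation introduced here, not taken from our code), the sum over all constraints can be written without a per-constraint branch:
        \begin{align*}
            \sum_{k=1}^{K} \elb(z_k) & = \sum_{k=1}^{K} \left[ m_k \left( -\frac{1}{t} \log(-z_k) \right) + (1 - m_k) \left( t z_k - \frac{1}{t} \log\left(\frac{1}{t^2}\right) + \frac{1}{t} \right) \right] , \\
            m_k & =
            \begin{cases}
                1 & \text{if } z_k \leq -\frac{1}{t^2} \\
                0 & \text{otherwise} .
            \end{cases}
        \end{align*}
        In floating-point arithmetic, the argument of the logarithm would need to be clamped for the entries with $m_k = 0$, since $\log(-z_k)$ is undefined for $z_k \geq 0$ and would propagate invalid values even when multiplied by zero.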

        Notice that implementing the DenseCRF post-processing in a parallel and efficient fashion requires a substantial amount of software engineering that is uncommon in modern learning frameworks. While the DenseCRF implementation itself is highly efficient, it remains a single process that can handle only one image at a time. Performing it in parallel should be easy in theory, but is actually not very efficient with the standard Python multiprocessing tools. In practice, all the arrays (containing either the image or the probabilities) are pickled and copied across processes. Those back-and-forth copies can add up quickly and slow down the processing substantially, on top of filling the computer memory more quickly.
        The solution is to carefully use SharedArray\footnote{Carefully, because they are not concurrency safe.}, which will contain the whole batch in a single object. The sub-processes will read and write only a subset of those SharedArrays, corresponding to their assigned batch item.

        \begin{table}[h]
            \centering
            \small\begin{tabular}{l|c|c|c|c||c|c}
                 & \multicolumn{2}{c|}{Time per epoch (s)} & \multicolumn{2}{c||}{Proposals update (s)} & \multicolumn{2}{c}{\textbf{Total (h)}} \\
                Method & Pr & At & Pr & At & Pr & At \\
                \toprule
                Full supervision & 150 & 235 & - & - & 4.2 & 3.3 \\
                Ours & 170 & 325 & - & - & 4.7 & 4.5 \\
                \midrule
                DeepCut & 150 & 235 & 440 & 3120 & 6.6 & 11.9 \\
            \end{tabular}
            \caption{Comparison in training speed between the different methods on the two datasets, PROMISE12 (Pr) and ATLAS (At).}
            \label{table:speed}
        \end{table}
\end{document}
