%\documentclass{uai2024} % for initial submission
\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors
\usepackage{caption}
\usepackage{subcaption}
\usepackage{graphicx}
\usepackage{wrapfig}
\usepackage{bm}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{algorithm}
\usepackage[noend]{algpseudocode}
\usepackage{multirow}

\renewcommand{\algorithmicrequire}{\textbf{Input:}}
\renewcommand{\algorithmicensure}{\textbf{Output:}}

\newcommand{\blue}[1]{\textcolor{black}{#1}}
\newcommand{\red}[1]{\textcolor{red}{#1}}
\newcommand{\purple}[1]{\textcolor{black}{#1}}

% \newcommand{\blue}[1]{\textcolor{black}{#1}}
% \newcommand{\red}[1]{\textcolor{black}{#1}}

\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator*{\argmax}{arg\,max}
\newtheorem{theorem}{Theorem}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{ILP-FORMER: Solving Integer Linear Programming with Sequence to Multi-Label Learning}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
%\author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2024 paper}{Jane~J.~von~O'L\'opez}{}}
\author[1,2]{Shufeng Kong}
\author[3\thanks{Correspondence to: Caihua Liu <Caihua.Liu@guet.edu.cn>}]{Caihua Liu}
\author[2]{Carla Gomes}
%\author[1]{Further~Coauthor}
%\author[3]{Further~Coauthor}
%\author[3,1]{Further~Coauthor}
% Add affiliations after the authors

\affil[1]{%
    School of Software Engineering\\
    Sun Yat-sen University\\
    Zhuhai, Guangdong, China
}

\affil[2]{%
    Computer Science Dept.\\
    Cornell University\\
    Ithaca, NY, USA
}

\affil[3]{%
    School of Artificial Intelligence\\
    Guilin University of Electronic Technology\\
    Guilin, Guangxi, China
}


  \begin{document}
\maketitle

\begin{abstract}
Integer Linear Programming (ILP) is an essential class of combinatorial optimization problems (COPs). Its inherent NP-hardness has fostered considerable efforts towards the development of heuristic strategies. An emerging approach involves leveraging data-driven methods to automatically learn these heuristics. For example, using deep (reinforcement) learning to recurrently reoptimize an initial solution with Large Neighborhood Search (LNS) has demonstrated exceptional performance across numerous applications. A pivotal challenge within LNS lies in identifying an optimal subset of variables for reoptimization at each stage. Existing methods typically learn a policy to select a subset, either by maintaining a fixed cardinality or by decomposing the subset into independent binary decisions for each variable. However, such strategies overlook the modeling of LNS’s sequential processes and fail to explore the correlations inherent in variable selection. To overcome these shortcomings, we introduce ILP-FORMER, an innovative model that reimagines policy learning as a sequence-to-multi-label classification (MLC) problem. Our approach uniquely integrates a  causal transformer encoder to capture the sequential nature of LNS. Additionally, we employ an MLC decoder with contrastive learning to exploit the correlations in variable selection. Our extensive experiments confirm that ILP-FORMER delivers state-of-the-art anytime performance on several ILP benchmarks. Furthermore, ILP-FORMER exhibits impressive generalization capabilities when dealing with larger problem instances.
\end{abstract}

\section{Introduction}
\label{sec:intro}
Integer Linear Programming (ILP) has found applications in production planning \cite{mula2006models}, scheduling \cite{ku2016mixed}, scientific discovery \cite{chen2021automating}, and telecommunications networks \cite{gollowitzer2011mip}, among many others. It is well known that ILP is NP-complete \cite{karp1972reducibility}, and much effort has been devoted to designing effective heuristics that find near-optimal solutions \cite{taha2014integer}. Historically, such algorithms were designed largely by hand, which requires a careful understanding of the structure underlying specific classes of optimization problems. 

Owing to the recent success of deep learning (DL) and reinforcement learning (RL), there has been increasing interest in automatically learning heuristics for COPs from training data \cite{bengio2021machine}. Existing works often leverage machine learning (ML) to output solutions directly from input instances, to configure hyperparameters of COP algorithms, or to learn a local decision policy for search frameworks such as branch-and-bound (B\&B), local branching (LB), or LNS. Among them, we are particularly interested in learning to iteratively reoptimize an initial solution with LNS \cite{wu2021learning,song2020general,sonnerat2021learning,nair2020neural}. These approaches are attractive because they can use existing state-of-the-art ILP solvers such as Gurobi or SCIP as a generic black-box subroutine and thus benefit from the cutting-edge techniques built into these solvers. 
 
In this paper, we focus on boosting the performance of LNS, though our method can also be applied to other local search algorithms such as LB. A key challenge of LNS is to select a promising variable subset to reoptimize given the current solution. Since the selection is combinatorial, finding an optimal subset is itself computationally hard. Song et al. \cite{song2020general} learn to select fixed, predefined variable subsets with imitation learning and RL. Wu et al. \cite{wu2021learning} learn to select arbitrary variable subsets with RL by factorizing the selection of a subset into elementary selections on each variable separately. Similarly, Sonnerat et al. \cite{sonnerat2021learning} use imitation learning to predict the probability of selecting each variable independently of the others, and Nair et al. \cite{nair2020neural} use RL to learn a policy that selects one variable at a time. Recently, Huang et al. \cite{10.5555/3618408.3618971} adopted contrastive learning to obtain better embeddings of ILPs. Nevertheless, none of these works models the sequential process of LNS or exploits correlations in variable selection. To address these limitations, we propose to cast policy learning as a sequence-to-multi-label classification problem, which jointly models the selection of variables and the sequential process of LNS. 
 
Our contributions are threefold: (1) we offer a new perspective, sequence-to-multi-label classification, for learning an effective local decision policy for LNS; (2) we realize this idea in a novel model that integrates a customized decision transformer encoder, which models the sequential process of LNS, with an MLC decoder trained with contrastive learning, which exploits correlations in variable selection; (3) we conduct extensive experiments on various benchmarks, and the results show that our model significantly outperforms state-of-the-art baselines.


\section{Other Related Work}

\textbf{Learning to Optimize.} Recently, there has been increasing interest in applying ML to learn to solve COPs. Broadly speaking, there are three categories of learning-to-optimize algorithms: (1) Learning to predict solutions from inputs. \citet{larsen2018predicting} train a deep neural network (DNN) to predict the solution of a stochastic load planning problem. \citet{nair2020solving} propose neural diving, which trains a DNN to generate multiple partial assignments for the integer variables; the resulting smaller mixed integer programs (MIPs) over the unassigned variables are solved with an off-the-shelf MIP solver to construct high-quality joint assignments. \citet{joshi2019efficient} train a DNN with supervision to predict the probability that an edge belongs to the traveling salesman problem (TSP) tour. A feasible tour is then generated by beam search. (2) Learning to configure COP algorithms. \citet{liu2022learning} use RL to configure the search neighborhood size of LB in each step. \citet{deng2022deep} integrate belief propagation (BP), gated recurrent units (GRUs),
and graph attention networks (GATs) within the message-passing framework
to reason about dynamic weights and damping factors for composing new BP
messages. (3) Learning alongside COP algorithms. \citet{nair2020neural} learn a DNN to make variable selection decisions in B\&B to bound the objective value gap with a small tree. Deep Bucket Elimination (DBE) \cite{razeghi2021deep} uses DNNs to approximate the large bucket functions.
 \citet{deng2022pretrained} propose a pre-trained cost model which predicts the optimal cost of a given
partially instantiated COP. The predicted cost is then used to construct heuristics for various COP
algorithms such as LNS and B\&B. Our work belongs to the third category.


\iffalse
\textbf{Sequence Learning.}
The recent rapid development of sequence modeling is largely due to the successful applications of DNNs, from LSTMs \cite{hochreiter1997long} and sequence-to-sequence models \cite{sutskever2014sequence} to transformer architectures with self-attention \cite{vaswani2017attention}. Sequence learning aims to capture the temporal dependence of  sequential data (text, speech, video, etc.), and it is widely used in NLP \cite{devlin2018bert,li2022survey}. There is also a recent attempt to apply sequence learning in scientific discovery such as using a causal transformer to model material property with sequential structures \cite{bai2022xtal2dos}. In light of this, it is also tempting to consider how such sequence models can lead to improved performance in LNS, which is also concerned with sequential processes. A causal transformer is an architecture to efficiently model sequences, which is the cornerstone of a decision transformer in RL \cite{chen2021decision}. However, little has been done to model the sequential processes of LNS with transformers; we will address this limitation by adopting a causal transformer (GPT2).

\textbf{Multi-Label Classification.}
Multi-label classification (MLC) is a prediction task where each sample can have more than one labels. Unlike the single-label scenario, label correlations are prevalent in MLC. Early works capture the correlations through classifier chains \cite{read2011classifier}, Bayesian inference \cite{zhang2007ml}, and dimensionality reduction \cite{bhatia2015sparse}. Thanks to the huge capacity of DNNs, one can alleviate the laborious feature mapping and therefore focus on the loss function, feature-label and label-label correlation modeling \cite{yeh2017learning, bai2020disentangled, zhao2021hot}.
It has been shown that contrastive learning can exploit label information effectively in a data-driven manner, and learn meaningful feature and label embeddings that capture the label correlations and enhance the predictive power \cite{bai2022gaussian}. However, current LNS algorithms fail to explore the correlations of variable selection in each step. In this work, we will formulate the policy learning of variable selection as an MLC problem and adopt contrastive learning to model category-level label correlations. To the best of our knowledge, we are the first to use sequence to multi-label learning to improve the performance of LNS, and thus enable us to benefit from modeling long sequences of LNS behaviors and exploiting correlations between variable selection simultaneously.
\fi

\textbf{Primal Heuristics.}
%\section{More Related Works on Primal Heuristics}
Numerous primal heuristic algorithms have been proposed to enhance the efficiency of solving ILPs \cite{berthold2013measuring}. Primal heuristics span from simpler rounding heuristics \cite{achterberg2012rounding} to more computationally demanding diving and large neighborhood search (LNS) heuristics, such as Relaxation Induced Neighborhood Search (RINS) \cite{danna2005exploring}. LNS heuristics are improvement heuristics that solve auxiliary problems using the branch-and-bound technique. In contrast, learning-based LNS approaches can be regarded as primal heuristics automatically learned through machine learning. These approaches showcase significant potential by exploiting data-driven techniques, which ultimately result in improved performance and adaptability across a wide range of problem instances. This work is particularly interested in advancing the capabilities of learning-based LNS approaches.

%RINS, a prominent LNS meta-heuristic, seeks to improve a given feasible MIP solution. By comparing the feasible solution with one obtained from relaxing integer variables, it identifies and eliminates variables with differing values between the two. The resulting sub-MIP is then solved using a MIP solver.

%The solution-polishing heuristic \cite{rothberg2007evolutionary} employs a variable-fixing neighborhood similar to RINS, while also integrating an evolutionary algorithm approach. Unlike RINS, this heuristic uses crossover and mutation operations to combine multiple solutions chosen from a pool of available feasible solutions. The crossover process fixes variables that have identical values across all selected solutions, while mutation is introduced by randomly fixing additional variables to refine already high-quality solutions.

%Adaptive LNS \cite{hendel2022adaptive} capitalizes on an ensemble of LNS algorithms for MIPs and employs a multi-armed bandit to adaptively switch among them during a MIP solve. Although our work does not focus on an ensemble approach, it could be incorporated as another ensemble member to enhance performance.

%In contrast, learning-based LNS approaches can be regarded as primal heuristics automatically learned through machine learning. These approaches showcase significant potential by exploiting data-driven techniques, which ultimately result in improved performance and adaptability across a wide range of problem instances. This work is particularly interested in advancing the capabilities of learning-based LNS approaches..

\section{Preliminaries}
\subsection{Integer Program}

An integer linear program (ILP) is the problem of optimizing a linear function over the integer points of a polyhedral set:
$\arg \min_{x}\{\mu^Tx \mid Wx\le b;\ x \ge 0;\ x\in \mathbb{Z}^n\}$,
%\begin{equation} \label{eq:IP}
%\begin{array}{ll@{}ll}
%\min  & \mu^Tx  &\\
%\text{s.t.}& Wx \le b,        &\\
%& lb\le x \le ub, &\\ 
%& x\in \mathbb{Z}^n,
%\end{array}
%\end{equation}
where $x\in \mathbb{Z}^n $ is a vector of $n$ decision variables;  $\mu\in \mathbb{R}^n$ denotes the vector of objective coefficients; the incidence matrix $W\in \mathbb{R}^{m\times n}$ and vector $b\in \mathbb{R}^m$ together define $m$ linear constraints. 
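
As a small illustration (a toy instance introduced here purely for exposition), take $n=2$, $m=2$, $\mu=(-3,-2)^T$, and
\begin{equation*}
W=\begin{pmatrix} 2 & 1\\ 1 & 2\end{pmatrix}, \qquad b=\begin{pmatrix}4\\4\end{pmatrix}.
\end{equation*}
The feasible integer points are $(0,0)$, $(1,0)$, $(2,0)$, $(0,1)$, $(1,1)$, and $(0,2)$, with objective values $0$, $-3$, $-6$, $-2$, $-5$, and $-4$, respectively, so the optimal solution is $x=(2,0)^T$ with objective value $-6$. The linear relaxation, in contrast, attains $-20/3$ at the fractional point $(4/3,4/3)^T$, which is not feasible for the ILP.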
%It is well-known that ILPs are NP-hard in general \cite{karp1972reducibility}. 


\subsection{LNS and Its Markov Decision Process Formulation}

%Given an initial solution to an IP instance, LNS iteratively improves the current solution by selecting a subset of decision variables  for reoptimization with an off-the-shelf ILP solver such as Gurobi or CPLEX.  
Given an initial assignment of values to the decision variables in an ILP instance, LNS iteratively refines this assignment by selecting a subset of decision variables, relaxing their values, and solving a subproblem that aims to optimize the objective function while respecting the instance's constraints.
LNS aims to explore a complex solution neighborhood and gradually improve its current solution until a termination condition is met \cite{pisinger2010large}. A key challenge of LNS is how to define a good solution neighborhood, that is, which variable subset to reoptimize given the current solution. This decision problem is itself combinatorial, and many works have been devoted to constructing effective heuristics for it \cite{ropke2006adaptive,perron2004propagation,dumez2021large}. In this work, we are particularly interested in the recent trend of learning-based approaches, where data-driven methods are applied to learn the heuristics automatically \cite{song2020general,wu2021learning,sonnerat2021learning}. To this end, the LNS framework can be formulated as a \emph{Markov Decision Process} (MDP) $(\mathcal{S}, \mathcal{A}, P, R)$: 
\begin{itemize}
\item $\mathcal{S}$ is the set of states. A state $s_t \in \mathcal{S}$ describes the status of the LNS process in step $t$, which normally includes the static IP instance information (e.g., variables, constraints, and objective) and the dynamic solving statistics (e.g., the incumbent solution);
\item $\mathcal{A}$ is the set of all candidate variable subsets for reoptimization. A variable subset $a_t \in\mathcal{A}$ is also called the \emph{action} that the agent executes in step $t$;
\item $P(s_t, a_t)$ is the transition function that returns the next state. Let $x_t$ be the solution in state $s_t$. A smaller sub-IP is first generated by fixing the non-selected variables to their values in $x_t$ and reoptimizing the remainder, and the next state $s_{t+1}$ is then obtained by updating $s_t$ with the solution of the sub-IP: $x_{t+1} = \arg \min_{x}\{\mu^Tx|Wx\le b; x \ge 0; x\in \mathbb{Z}^n; x^i=x^i_t, \forall x^i \not\in a_t\}$;
\item $R(s_t, a_t)$ is the reward function that returns the improvement in objective value, defined as $r_t = R(s_t, a_t)=\mu^T(x_t - x_{t+1})$. Let $T$ be the step limit. The \emph{cumulative reward} from step $t$ of an episode is defined as $R_t = \sum_{k=t}^T\gamma^{k-t}r_k$ with a discount factor $\gamma\in [0, 1]$.
\end{itemize}
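
To make the transition and reward concrete, consider a purely illustrative toy instance with $\mu=(-3,-2)^T$ and constraints $2x^1+x^2\le 4$ and $x^1+2x^2\le 4$. Starting from $x_1=(0,0)^T$, the action $a_1=\{x^2\}$ fixes $x^1=0$ and reoptimizes $x^2$, which gives $x_2=(0,2)^T$ and $r_1=\mu^T(x_1-x_2)=0-(-4)=4$. From $x_2$, reoptimizing either variable alone cannot improve the solution: with $x^2=2$ fixed, the second constraint forces $x^1=0$, and with $x^1=0$ fixed, $x^2=2$ is already optimal. Only the joint action $a_2=\{x^1,x^2\}$ reaches $x_3=(2,0)^T$, with $r_2=-4-(-6)=2$. With $\gamma=1$ and $T=2$, the cumulative reward is $R_1=r_1+r_2=6$.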

A \emph{policy} is a (potentially probabilistic) mapping $\pi: \mathcal{S} \rightarrow \mathcal{A}$. The goal of RL-based algorithms for solving ILPs is to find a policy that maximizes the expected cumulative reward $\mathbb{E}[R_1]$ over all episodes, i.e., the expected improvement over the initial solutions. However, existing RL-based algorithms for IP solving train a policy by temporal difference (TD) learning \cite{sutton2018reinforcement}, policy gradient \cite{williams1992simple}, or behavior cloning \cite{torabi2018behavioral}, none of which explicitly models the sequential process of LNS. Furthermore, these algorithms can suffer from several issues: the bootstrapping used to propagate returns in TD learning can cause stability problems, discounting future rewards can induce undesirable short-sighted behavior, policy gradient is known to be sample inefficient, and behavior cloning can suffer from cascading errors \cite{kaelbling1996reinforcement,ross2010efficient,chen2021decision}. To circumvent these disadvantages, we propose to learn a policy with a decision transformer, which models the sequential process of LNS and aims at better generalization.
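
The interaction between a policy and the solver can be summarized by the following sketch. It is a schematic restatement of the MDP above rather than a full description of any particular method; \textsc{SolveSubILP} is a placeholder name, introduced only for this sketch, for a call to an off-the-shelf ILP solver on the sub-IP defined by the transition function.

\begin{algorithm}[t]
\caption{Policy-guided LNS (illustrative sketch)}
\begin{algorithmic}
\Require ILP instance $(\mu, W, b)$, initial solution $x_1$, policy $\pi$, step limit $T$
\Ensure solution $x_{T+1}$
\For{$t = 1, \dots, T$}
    \State build state $s_t$ from the instance and $x_t$
    \State select a variable subset $a_t \sim \pi(\cdot \mid s_t)$
    \State $x_{t+1} \gets \textsc{SolveSubILP}(\mu, W, b, x_t, a_t)$ \Comment{$x^i = x^i_t$ for all $x^i \notin a_t$}
    \State $r_t \gets \mu^T(x_t - x_{t+1})$
\EndFor
\State \textbf{return} $x_{T+1}$
\end{algorithmic}
\end{algorithm}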

\subsection{Decision Transformer}\label{sec:DT}
Decision transformer (DT) \cite{chen2021decision} abstracts the decision-making process in RL as a sequence modeling problem and learns a return-conditioned state-action mapping. Return conditioning means that, given a history of return-state-action tokens in which the first token of each triple represents the desired return from the current state, the DT predicts the action required to achieve this return. In this paper, we follow the convention of the original DT
and define the return $g_t$ as the undiscounted reward-to-go: $g_{t} = \sum_{k=t}^{T} r_{k}$. DT
takes as input a sequence of token triples $(\langle g_{t-K}, s_{t-K}, a_{t-K}\rangle, \cdots, \langle g_t, s_t, a_t \rangle)$, where $K\le T$ is the context length. Each token is encoded into an embedding, to which a positional encoding is added. Let $(\langle z_{g_{t-K}}, z_{s_{t-K}}, z_{a_{t-K}}\rangle, \cdots, \langle z_{g_t}, z_{s_t}, z_{a_t} \rangle)$ be the corresponding sequence of embeddings. This sequence is fed into a causal transformer to produce another sequence of embeddings $(\langle z^h_{g_{t-K}}, z^h_{s_{t-K}}, z^h_{a_{t-K}}\rangle, \cdots, \langle z^h_{g_t}, z^h_{s_t}, z^h_{a_t} \rangle)$. A decoder takes $z^h_{s_t}$ as input and outputs $\hat{a}_t$. During training, a suitable loss function penalizes the difference between the prediction $\hat{a}_t$ and the label $a_t$. During inference, after a target return is specified based on the desired performance and the environment's starting state is given, DT generates actions autoregressively. Each action is executed, the achieved reward is subtracted from the target return to obtain the next return-to-go, and the environment provides the next state. This process is repeated until the episode terminates. 
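
For example, with hypothetical rewards $r_1=4$, $r_2=2$, $r_3=0$ and $T=3$, the rewards-to-go are
\begin{equation*}
g_1 = 4+2+0 = 6,\qquad g_2 = 2+0 = 2,\qquad g_3 = 0.
\end{equation*}
At inference time, if the target return is set to $g_1=6$ and the first action achieves $r_1=4$, the next return-to-go fed to the model is $g_2=g_1-r_1=2$.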


%\subsection{Contrastive Learning}


\section{Solving ILPs with Sequence to Multi-Label Classification}
Instead of learning to select fixed, predefined variable subsets \cite{song2020general}, Wu et al. \cite{wu2021learning} factorize the combinatorial action space $\mathcal{A}$ into elementary actions on each dimension (i.e., each variable), where $a_t^i \in \{0, 1\}$ denotes the elementary action of whether to select $x^i$ for reoptimization in step $t$: $a_t^i$ is $1$ if $x^i$ is selected and $0$ otherwise. Therefore, any action can be expressed as $a_t=\cup_{i=1}^n a_t^i$, and the action selection problem is converted into $n$ separate binary classification problems. The policy for action selection is factorized as
 \vspace{-1.8mm}
\begin{equation}\label{eq:policydecoposition}
\pi(a_t|s_t) = \prod_{i=1}^n \pi^i(a_t^i|s_t),
\end{equation}
which expresses the probability of selecting an action as the product of the probabilities of selecting its elements. However, such an action space factorization limits the class of policies that can be learned, and it fails to exploit the correlations between elementary actions. To address these limitations, \textit{we propose to cast policy learning as a sequence-to-multi-label classification problem, which jointly models the selection of multiple elementary actions and the sequential process of LNS}, i.e.,
\begin{equation}\label{seqmlcproblem}
\pi(a_t|s_t) = p_{\theta}(a_t|h_{\phi}(Q(t, K))),
\end{equation}
where $Q(t, K)$ returns the sequence of the most recent return-state-action tokens from steps $t-K$ to $t$, namely, $(\langle g_{t-K}, s_{t-K}, a_{t-K}\rangle, \cdots, \langle g_t, s_t, \cdot \rangle)$; $h_{\phi}(\cdot)$ denotes the sequence encoder, a neural network (NN) with parameters $\phi$; and the MLC decoder, an NN with parameters $\theta$, takes as input the state embedding $z^h_{s_t}$ produced by $h_{\phi}$ and outputs the action distribution $p_{\theta}(a_t|z^h_{s_t})$.  
Effective implementations of the sequence encoder and MLC decoder are crucial to this work. 
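
To see why the factorization in equation~(\ref{eq:policydecoposition}) is restrictive, consider an illustrative case with $n=2$ in which the desired policy selects both variables or neither, each with probability $1/2$. Writing $p_i=\pi^i(a_t^i=1|s_t)$, a factorized policy would need
\begin{equation*}
p_1p_2=\tfrac12,\quad (1-p_1)(1-p_2)=\tfrac12,\quad p_1(1-p_2)=0,\quad (1-p_1)p_2=0.
\end{equation*}
The third condition forces $p_1=0$ or $p_2=1$. The first condition rules out $p_1=0$, so $p_2=1$, and then the second condition reads $0=1/2$, a contradiction. Hence no product of independent marginals represents this correlated selection, whereas a joint model over $a_t$ can.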

\subsection{A Novel Transformer Model for Solving ILPs}
In this section, we propose a novel model ILP-FORMER for the problem given in equation~(\ref{seqmlcproblem}) based on the causal transformer and contrastive learning.

\begin{figure}[th!]
    \centering
\includegraphics[width=0.47\textwidth]{transformer.jpg}
    %\includegraphics[width=0.9\textwidth]{figs/c-gmvae2.pdf}
    \caption{The architecture of our model ILP-FORMER: it consists of several token encoders to produce latent token embeddings, a causal transformer to capture dependence between token embeddings, and a contrastive MLC decoder to exploit correlations between label categories. Here we use a small IP with 4 variables and 3 constraints to show the full pipeline of our model. The problem is first translated into a factor graph $G$, and $G$ is associated with dynamic factor-node features that describe the states of MDP and are encoded into state embeddings in different steps. Similarly, returns $g_t$ and actions $a_t$ are also encoded into latent embeddings. We use a GCN as state encoder and two simple MLPs as return and action encoders respectively. Each token embedding is further added with its relative positional encoding. The sequence of embeddings is fed into the causal transformer to produce another sequence of embeddings. Finally, the contrastive MLC decoder takes as inputs the state embeddings and outputs action predictions.}
    \label{fig:Seq2MLC}
      \vspace{-1em}
\end{figure}

\subsubsection{Factor Graph Representation}\label{sec:FG}
An IP instance can be represented by a \emph{factor graph} \cite{gasse2019exact}, which is a bipartite graph $\mathcal{G}=(\mathcal{V}, \mathcal{C}, \mathcal{E})$ consisting of variable-nodes $\mathcal{V}=\{v_1,\cdots,v_n\}$ and factor-nodes $\mathcal{C}=\{c_1,\cdots,c_m\}$. Variable-nodes correspond to the variables and factor-nodes correspond to the constraints of the IP. An edge $e_{ij}\in \mathcal{E}$ between $v_i$ and $c_j$ is established only if the $j$-th constraint contains the $i$-th variable. %namely, the entry $W_{ji}$ of the IP's incidence matrix is nonzero. 
The variable-nodes are associated with a feature matrix $V\in  \mathbb{R}^{n\times d_v}$, where $d_v$ is the number of features per variable-node. The features of each variable-node $v_i$ consist of two parts: (1) \emph{static} features: a one-hot vector indicating the node type and the objective coefficient $\mu_i$ of $x^i$; (2) \emph{dynamic} features: the value of $x^i$ in the current solution in step $t$ and its value in the incumbent solution. Note that the dynamic features describe the states of the MDP in different steps.
The factor-nodes are also associated with a feature matrix $C\in  \mathbb{R}^{m\times d_c}$, where $d_c$ is the number of features per factor-node. The features of each factor-node $c_j$ are static only: a one-hot vector indicating the node type and the right-hand-side (RHS) value $b_j$ of the $j$-th constraint.
Finally, the weight matrix of edges is exactly the incidence matrix.
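
As an illustration on a toy instance (introduced here for exposition only) with $\mu=(-3,-2)^T$, $b=(4,4)^T$, and
\begin{equation*}
W=\begin{pmatrix} 2 & 1\\ 1 & 2\end{pmatrix},
\end{equation*}
the factor graph has variable-nodes $\{v_1,v_2\}$, factor-nodes $\{c_1,c_2\}$, and all four edges $e_{11},e_{12},e_{21},e_{22}$, because every entry of $W$ is nonzero. The edge $e_{ij}$ carries the weight $W_{ji}$; e.g., $e_{12}$, which connects $v_1$ and $c_2$, has weight $W_{21}=1$. If the second constraint did not involve the first variable, $W_{21}$ would be zero and $e_{12}$ would be absent.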


\subsubsection{Model Architecture}
Fig.~\ref{fig:Seq2MLC} shows the overall architecture of ILP-FORMER, which consists of a customized DT encoder and a contrastive MLC decoder. The encoder comprises only several customized token encoders and a causal transformer; it omits the linear decoder of DT \cite{chen2021decision}.
%encoders, a causal transformer, and a contrastive MLC decoder.

\textbf{Token Encoders}: Each token is first encoded into an embedding, to which a positional encoding is added. For return and action tokens, two simple multilayer perceptrons (MLPs) serve as the return and action encoders, respectively. Positional encodings are produced by
another simple MLP that takes the scalar step index $t$ as input. Each state token $s_t$ is represented by a factor graph as introduced in Section~\ref{sec:FG}, and we use a graph convolutional network (GCN) \cite{zhang2019graph} as the state encoder. A single graph convolution layer is given by
\begin{equation}\label{eq:GCN}
\begin{aligned}
C^{(k+1)} &= C^{(k)} + \sigma\left(\text{LN}\left(WV^{(k)}H_v^{(k)}\right)\right), \\
V^{(k+1)} &= V^{(k)} + \sigma\left(\text{LN}\left(W^TC^{(k+1)}H_c^{(k)}\right)\right),
\end{aligned}
\end{equation}
where $H_v^{(k)},H_c^{(k)}\in \mathbb{R}^{d_h\times d_h}$ are trainable weight matrices in the $k$-th layer; $V^{(k)}\in \mathbb{R}^{n\times d_h}$ and $C^{(k)}\in \mathbb{R}^{m\times d_h}$ are the embeddings of the variable-nodes and factor-nodes, respectively, in the $k$-th layer; LN and $\sigma(\cdot)$ denote layer normalization and the Tanh activation function, respectively. The initial embeddings $V^{(0)}$ and $C^{(0)}$ are linear projections of the raw feature matrices $V$ and $C$, respectively. In this paper, all MLP encoders have two layers, and the embedding dimension $d_h$ is set to $128$; the GCN encoder consists of two convolution layers followed by a mean pooling layer. 
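
As a quick consistency check of equation~(\ref{eq:GCN}): with $W\in\mathbb{R}^{m\times n}$, $V^{(k)}\in\mathbb{R}^{n\times d_h}$, and $H_v^{(k)}\in\mathbb{R}^{d_h\times d_h}$, the product $WV^{(k)}H_v^{(k)}$ lies in $\mathbb{R}^{m\times d_h}$ and thus matches the shape of $C^{(k)}$; likewise, $W^TC^{(k+1)}H_c^{(k)}\in\mathbb{R}^{n\times d_h}$ matches the shape of $V^{(k)}$. Row $j$ of $WV^{(k)}$ equals $\sum_{i=1}^n W_{ji}V^{(k)}_{i}$, where $V^{(k)}_i$ denotes the $i$-th row of $V^{(k)}$ (notation used only in this remark), so each factor-node aggregates the embeddings of exactly those variables that appear in its constraint, weighted by their coefficients. The variable-node update then uses the already updated factor-node embeddings $C^{(k+1)}$.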

\begin{figure}
    \centering
    \includegraphics[width=0.48\textwidth]{MLC.png}
    %\includegraphics[width=0.9\textwidth]{figs/c-gmvae2.pdf}
    \caption{The architecture of our contrastive MLC decoder. Firstly, a GCN takes as input an IP instance and learns embeddings for label categories respectively, and labels within the same category share the same embedding. Secondly, we use the state embedding in step $t$ as an input feature whose inner products with label embeddings are used to produce prediction $\hat{a}_t$. Lastly, a contrastive loss is designed to pull together the feature embedding and positive label embeddings, while separating the feature embedding from the negative label embeddings.} %Note that a label embedding is positive only if its label is $1$. An example with label $[1,0,0,1]$ is shown.}
    \label{fig:cMLCDecoder}
      \vspace{-1em}
\end{figure}

\textbf{Causal Transformer}:
A causal transformer \cite{vaswani2017attention} is an architecture for efficiently modeling sequences; it consists of stacked self-attention layers with residual connections. In our model, each layer receives a sequence of $L=3K$ token embeddings $\{z_i\}_{i=1}^L$ and outputs $L$ embeddings $\{z_i^h\}_{i=1}^L$, preserving the input dimensions. Specifically, each token embedding $z_i$ is mapped to a key $z^k_i$, a query $z^q_i$, and a value $z^v_i$ via linear functions, and the output $z_i^h$ is given by
\begin{equation}\label{eq:transformer}
z_i^h = \sum_{j=1}^i \alpha_{ij} z^v_j, \quad \alpha_{ij} = \frac{\exp(z^q_i\cdot z_j^k)}{\sum_{j'=1}^i\exp(z^q_i\cdot z_{j'}^k)}. 
\end{equation}  
In this work, we adopt the causal transformer GPT2 \cite{radford2019language} to learn and reason about sequences and we defer the other architecture details to the original paper.
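
To make the causal masking in eq.~(\ref{eq:transformer}) concrete, consider a small illustrative case (the numerical values are made up for exposition only). For $i=1$ the sum contains a single term, so $\alpha_{11}=1$ and $z_1^h=z_1^v$. For $i=2$, suppose $z^q_2\cdot z^k_1=0$ and $z^q_2\cdot z^k_2=\ln 3$; then
\begin{equation*}
\alpha_{21}=\frac{e^{0}}{e^{0}+e^{\ln 3}}=\frac{1}{4},\quad \alpha_{22}=\frac{3}{4},\quad z_2^h=\tfrac{1}{4}z_1^v+\tfrac{3}{4}z_2^v.
\end{equation*}
Tokens with index $j>i$ receive no weight, so each output embedding depends only on the current and earlier tokens.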

\textbf{Contrastive MLC Decoder}:
Recall that an elementary action $a^i_t\in \{0,1\}$ (a.k.a. a label) denotes whether or not to select variable $x^i$ for reoptimization in step $t$. A label vector $a_t=\cup_{i=1}^n a_t^i \in \{0,1\}^n$ denotes the selected action given state $s_t$. Unlike eq.~(\ref{eq:policydecoposition}), which approximates $\pi(a_t|s_t)$ with $n$ separate binary classification problems, we propose to approximate $\pi(a_t|s_t)$ with an MLC decoder that learns a mapping from $z^h_{s_t}$ to $a_t$, where $z^h_{s_t}$ is a state embedding generated by the causal transformer that serves as the input feature for our MLC decoder. \emph{A key aspect of learning a policy with an MLC module is that we can exploit the correlations between elementary actions, which existing ML-boosted IP solvers do not exploit} \cite{wu2021learning,song2020general,sonnerat2021learning}.
%To the best of our knowledge, existing works for IP solving lack of exploring the correlations between elementary actions. Therefore, we propose to use constrastive learning to capture category-level label-label interactions. 

We propose to exploit category-level label correlation with contrastive learning based on the MLC model GMVAE \cite{bai2022gaussian}. GMVAE assumes that the number of labels is fixed and label embeddings are shared across all samples. This is not applicable to our case since different instances may have a different number of decision variables, i.e., actions from different instances may have different cardinalities. Instead, we learn category-level label embeddings for each IP instance with a shared GCN, and the embeddings are only shared across samples within each instance. Fig.~\ref{fig:cMLCDecoder} gives the architecture of our MLC decoder. We denote by $a^i$ the $i$-th category of labels $\{a^i_j\}_{j=t-K}^t$ collected from steps $[t-K, t]$. Our idea is as follows: (1) we learn an embedding $z_{a^i}^l$ for each label category $a^i$ such that labels within the same category share the same embedding. Since the number of label categories is exactly the number of variables in an IP instance, we use the GCN described in eq.~(\ref{eq:GCN}) to take as input an IP instance and output node embeddings, and we use the variable-node embeddings as the label category embeddings; (2) we use the state embedding $z^h_{s_t}$ of each LNS step as an input feature whose inner products with label embeddings correspond to feature-label similarity and can be used for prediction; (3) we use contrastive learning to capture correlations between label categories by pulling similar categories' embeddings together. Specifically, let $I \equiv \{1,\cdots,n\}$ and define the positive label set of $a_t$ as $P(a_t)\equiv \{i\in I| a^i_t=1\}$. Given a sequence of return-state-action tokens $(\langle g_{t-K}, s_{t-K}, a_{t-K}\rangle, \cdots, \langle g_t, s_t, a_t \rangle)$, our decoder is designed to minimize the following contrastive loss function:

\begin{equation}
\mathcal{L}_{CL} =\frac{1}{K}\sum_{j=t-K}^t\frac{1}{|P(a_{j})|}\sum_{i\in P(a_{j})}-\log \frac{z^h_{s_{j}}\cdot z^l_{a^i}}{\sum_{i'\in I} z^h_{s_{j}}\cdot z^l_{a^{i'}}},
\end{equation} 
where state embeddings $z^h_{s_{j}}$ for $j \in [t-K,t]$ are generated by the causal transformer and are computed as in eq.~(\ref{eq:transformer}). 

For example, if in most of the actions $a_t$, labels $a^i_t$ and $a^j_t$ often appear together (i.e., they both equal $1$), contrastive learning will implicitly pull their embeddings together. In other words, if two labels do co-appear often, their label embeddings would become similar. On the other hand, if they never co-occur or only co-appear occasionally, their connections are not significant and our decoder will not optimize for their similarity.
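
As a small numerical illustration of a single step $j$ of $\mathcal{L}_{CL}$ (a sketch with made-up numbers, assuming all inner products are positive so that the logarithm is defined), take $n=3$ labels with $P(a_j)=\{1,2\}$ and inner products $z^h_{s_j}\cdot z^l_{a^1}=3$, $z^h_{s_j}\cdot z^l_{a^2}=2$, and $z^h_{s_j}\cdot z^l_{a^3}=1$. The contribution of this step is
\begin{equation*}
\frac{1}{2}\left[-\log\frac{3}{6}-\log\frac{2}{6}\right]=\frac{1}{2}\log 6\approx 0.896.
\end{equation*}
If the inner product with the negative label $a^3$ shrinks from $1$ to $0.5$, the contribution becomes $\frac{1}{2}\log\frac{5.5^2}{6}\approx 0.809$, which illustrates how the loss rewards separating the state embedding from negative label embeddings.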




\subsection{Training Algorithm}
Our model will be trained with supervised learning. Given a set of training IP instances, we first collect a dataset of sequences of return-state-action tokens that are generated by the MDP with some expert policy: $\mathcal{D}=\{(\langle g_{t-K}, s_{t-K}, a_{t-K}\rangle_j, \cdots, \langle g_t, s_t, a_t \rangle_j)\}_{j=1}^N$, where $K\le T$ is the length of each sequence.
%; $g_{t}=\sum_{t}^Tr_t$;  $s_t$ denotes the state of the MDP in step $t$;  and $a_t=\cup_{i=1}^n a_t^i\in \{0,1\}^n$ expresses the variable subset selected for reoptimization. 
For each sequence $q \in \mathcal{D}$, our model takes $q$ as input and generates a set of action predictions $\{\hat{a}_j\}_{j=t-K}^t$; we also collect the corresponding set of labels $\{a_j\}_{j=t-K}^t$ from $q$. The supervised cross-entropy loss for each sequence is given by 
\begin{equation}
\mathcal{L}_{CE} =-\frac{1}{K}\sum_{j=t-K}^t\frac{1}{n} \sum_{i=1}^n \left[a_j^i \log \hat{a}^{i}_{j} + (1-a_j^i)\log(1-\hat{a}_j^i)\right].
\end{equation}

The final objective function to minimize is given by 
\begin{equation}
\mathcal{L} = \mathcal{L}_{CL}+\beta \mathcal{L}_{CE},
\end{equation}
where $\beta$ is a trade-off weight. The model is trained with Adam \cite{kingma2014adam} and optimized with $\mathcal{L}$. We sample mini-batches of sequence length $K$ from the dataset $\mathcal{D}$ and the model is trained with GPUs in parallel. However, similar to DT, our model will generate action predictions autoregressively during testing. We refer the reader to section~\ref{sec:DT} for more details.  
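
The training procedure described above can be summarized by the following sketch. It is an illustrative outline rather than the exact implementation: the squashing function $\rho$ (e.g., a sigmoid) that maps inner products to $(0,1)$, the number of epochs $E$, and the mini-batch notation $\mathcal{B}$ are assumptions introduced here for exposition.

\begin{algorithm}[t]
\caption{Sketch of ILP-FORMER training (illustrative)}
\begin{algorithmic}[1]
\Require dataset $\mathcal{D}$, trade-off weight $\beta$, number of epochs $E$ (assumed)
\Ensure trained model parameters
\For{epoch $=1,\dots,E$}
  \For{each mini-batch $\mathcal{B}\subset\mathcal{D}$}
    \For{each sequence $q\in\mathcal{B}$}
      \State Encode the return, state, and action tokens of $q$
      \State Compute state embeddings $z^h_{s_j}$ with the causal transformer, eq.~(\ref{eq:transformer})
      \State Compute label category embeddings $z^l_{a^i}$ with the shared GCN
      \State $\hat{a}^i_j \gets \rho(z^h_{s_j}\cdot z^l_{a^i})$ for all $i$ and $j$ \Comment{$\rho$ is an assumed squashing function}
      \State Evaluate $\mathcal{L}_{CL}$ and $\mathcal{L}_{CE}$ on $q$
    \EndFor
    \State Combine the losses into $\mathcal{L}$, averaged over $\mathcal{B}$, and take one Adam step
  \EndFor
\EndFor
\end{algorithmic}
\end{algorithm}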




\section{Experiments}

\begin{table*}[th!]
\centering
\scalebox{0.8}{
\begin{tabular}{c|cc|cc|cc|cc}
\hline
\multirow{2}{*}{Methods} & \multicolumn{2}{c|}{MVC-1000}   & \multicolumn{2}{c|}{MC-500}  & \multicolumn{2}{c|}{SC-1000}  & \multicolumn{2}{c}{CA-2000}         \\ \cline{2-9} 
  &Obj.$\pm$Std.\% &Gap\% &Obj.$\pm$Std.\% &Gap\% &Obj.$\pm$Std.\% &Gap\% &Obj.$\pm$Std.\% &Gap\% \\ \hline
Gurobi     &$482.2\pm 0.8$     &10.83   &$-863.9\pm 3.8$    & 4.94   &$554.9\pm 8.3$    &6.26     &$-111668\pm 2.0$    & 4.18         \\ %\hline
FT-LNS    &$470.0\pm 0.4$     &8.02   &$-866.2\pm 1.7$    & 4.69   &$564.1\pm 8.4$    &8.02     &$-110041\pm 1.6$    & 5.57         \\ %\hline
RL-LNS    &$469.0\pm 0.5$     &7.79   &$-878.0\pm1.6$    & 3.39   &$551.9\pm 8.3$    &5.69     &$-111787\pm 2.6$    & 4.07         \\ %\hline
LB-SRMRL  &$472.4\pm 0.7$     &8.57   &$-859.1\pm 2.3$    & 5.47   &$560.9\pm 7.3$    &7.41     &$-110741\pm 3.1$    & 4.97         \\ %\hline
CL-LNS    &$450.2\pm 0.4$     &3.47   &$-865.3\pm 1.6$    & 4.46   &$540.2\pm 7.4$    &3.45     &$-112956\pm 2.1$    & 3.07         \\ \hline     
ILP-FORMER     &$\mathbf{435.1\pm 0.8}$ &\textbf{0} &$\mathbf{-908.8\pm 1.3}$ &\textbf{0} &$\mathbf{522.2\pm 5.3}$ &\textbf{0} &$\mathbf{-116535\pm 2.1}$ &\textbf{0} \\ \hline
\end{tabular}
}
\caption{A comparison of ILP-FORMER and the state-of-the-art baselines on 4 diverse benchmarks. The time limit is set to 200s. Each result is averaged over 5 runs. The gap is the ratio of objective difference w.r.t. the best result.}\label{tab1}
  \vspace{-1em}
\end{table*}


\begin{table*}[th!]
\centering
\scalebox{0.8}{
\begin{tabular}{c|cc|cc|cc|cc}
\hline
\multirow{2}{*}{Methods} & \multicolumn{2}{c|}{MVC-2000}   & \multicolumn{2}{c|}{MC-1000}  & \multicolumn{2}{c|}{SC-2000}  & \multicolumn{2}{c}{CA-4000}         \\ \cline{2-9} 
  &Obj.$\pm$Std.\% &Gap\% &Obj.$\pm$Std.\% &Gap\% &Obj.$\pm$Std.\% &Gap\% &Obj.$\pm$Std.\% &Gap\% \\ \hline
Gurobi     &$392.5 \pm 1.3$   &10.53   &$-1784.7\pm 1.0$    & 6.74   &$295.7\pm 7.9$    &6.63    &$-212890\pm 1.8$    & 8.97         \\ %\hline
FT-LNS    &$390.5 \pm 1.1$    &9.97   &$-1767.8\pm 1.0$    & 7.62   &$303.3\pm 8.0$    &9.38    &$-211324\pm 2.1$    & 9.64         \\ %\hline
RL-LNS    &$375.8 \pm 2.1$    &5.83   &$-1831.0\pm 0.9$    & 4.32   &$295.4\pm 7.8$    &6.53    &$-216650\pm 1.7$    & 7.36         \\ %\hline
LB-SRMRL  &$395.2 \pm 1.9$    &11.29   &$-1765.6\pm 1.5$    & 7.73   &$301.4 \pm 7.2$   &8.69    &$-209420\pm 2.1$    & 10.45          \\ %\hline
CL-LNS    &$370.2\pm 1.4$     &4.25   &$-1865.3\pm 1.6$    & 2.52   &$290.2\pm 7.4$    &4.65    &$-222956\pm 2.1$    & 4.67         \\ \hline  
ILP-FORMER &$\mathbf{355.1 \pm 1.1}$ &\textbf{0} &$\mathbf{-1913.6 \pm 0.8}$ &\textbf{0} &$\mathbf{277.3 \pm 7.1}$ &\textbf{0} &$\mathbf{-233870 \pm 1.7} $ &\textbf{0} \\ \hline
\end{tabular}
}
\caption{Generalization to larger instances with twice the number of variables. The time limit is set to 500s.}\label{tab2}
  \vspace{-1em}
\end{table*}

\begin{table*}[th!]
\centering
\scalebox{0.8}{
\begin{tabular}{c|cc | cc | cc | cc}
\hline
\multirow{2}{*}{Methods} & \multicolumn{2}{c|}{MVC-4000}   & \multicolumn{2}{c|}{MC-2000}  & \multicolumn{2}{c|}{SC-4000}  & \multicolumn{2}{c}{CA-8000}         \\ \cline{2-9} 
  &Obj.$\pm$Std.\% &Gap\% &Obj.$\pm$Std.\% &Gap\% &Obj.$\pm$Std.\% &Gap\% &Obj.$\pm$Std.\% &Gap\% \\ \hline
Gurobi     &$278.3 \pm 0.9$     &7.78   &$-3574.4\pm 0.8$    & 5.91   &$175.4\pm 7.0$    &7.48     &$-422291 \pm 1.2$    & 4.71         \\ %\hline
FT-LNS    &$279.2 \pm 1.7$     &8.13   &$-3526.2\pm 0.8$    & 7.18   &$175.2\pm 6.6$    &7.35     &$-431234 \pm 0.9$    & 2.69         \\ %\hline
RL-LNS    &$273.6 \pm 2.1$     &5.96   &$-3612.5\pm 0.7$    & 4.91   &$172.4\pm 7.1$    &5.64     &$ -432980\pm 0.7$    & 2.30         \\ %\hline
LB-SRMRL  &$275.6 \pm 2.2$     &6.74   &$-3505.1 \pm 0.9$    & 7.73   &$177.1 \pm 7.2$    &8.52     &$-415631 \pm 0.5$    & 6.21         \\ %\hline
CL-LNS    &$270.2\pm 2.4$     &4.65   &$-3535.3\pm 0.6$    & 6.94   &$173.2\pm 7.1$    &4.92     &$-434211\pm 2.1$    & 2.02         \\ \hline  
ILP-FORMER &$\mathbf{258.2 \pm 1.9}$ &\textbf{0} &$\mathbf{-3798.9 \pm 1.0}$ &\textbf{0} &$163.2 \pm 6.1$ &0.60 &$\mathbf{-439151 \pm 0.5}$ &\textbf{0} \\ \hline
\end{tabular}
}
\caption{Generalization to larger instances with four times the number of variables. The time limit is set to 500s.}\label{tab3}
\vspace{-1em}
\end{table*}


We conduct experiments on four diverse NP-hard COP benchmarks, including minimum vertex cover (MVC), maximum cut (MC), set covering (SC), and combinatorial auction (CA). We follow the experimental settings of \cite{song2020general} and \cite{wu2021learning}.


\subsection{Datasets and Experimental Setup}\label{sec:setup}
\textbf{Datasets.} MVC and MC are graph optimization problems; SC and CA are general IPs. For MVC, we use the Erd\H{o}s-R\'{e}nyi (ER) model \cite{erdHos1960evolution} to generate random graphs with 1000 nodes and edge probability 0.15. For MC, we use the Barab\'{a}si-Albert (BA) model \cite{albert2002statistical} to generate random graphs with 500 nodes and an average degree of 4. For SC, we generate instances with matrices having 5000 rows and 1000 columns following the procedure in \cite{balas1980set}, where each entry $B_{ij}\in \{0,1\}$ represents whether the $i$-th element in the universe belongs to the $j$-th set. For CA, we use the Combinatorial Auction Test Suite (CATS) \cite{leyton2000towards} with arbitrary relationships to generate instances with 2000 items and 4000 bids. For each problem type, we generate 100, 20, and 50 instances for training, validation, and testing, respectively. For each training instance, we run LNS with LB, setting a time limit of half an hour for solving the sub-IP in each step with LB and Gurobi. We use the resulting trajectory for training ILP-FORMER, and we randomly sample 20 sequences of return-state-action tokens from each trajectory. Therefore, our training dataset contains $100\times 20=2000$ sequences of return-state-action tokens.
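
For a sense of scale, the following back-of-the-envelope figures follow from the settings above. They are simple arithmetic on the stated parameters, under two assumptions that are not spelled out in the text: the standard edge-covering IP formulation of MVC (one constraint per edge) and the standard BA construction, in which the number of edges is half the sum of the node degrees.
\begin{align*}
\mathbb{E}\big[|E_{\mathrm{ER}}|\big] &= \binom{1000}{2}\times 0.15 = 499500\times 0.15 = 74925,\\
|E_{\mathrm{BA}}| &\approx \frac{4\times 500}{2} = 1000.
\end{align*}
Under these assumptions, an MVC-1000 instance has 1000 binary variables and roughly 75{,}000 constraints in expectation, while an MC-500 graph has about 1000 edges.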

\textbf{Initialization.} LNS starts with a feasible initial solution. For MVC, MC, SC, and CA, we initialize a feasible solution by including all vertices in the cover set, randomly partitioning all vertices into two complementary sets, including all sets in the set cover, and accepting no bids, respectively. The initialization process does not incur additional computational costs in our experiments.

\textbf{Implementation and Hyperparameters.}
Return, action, and position encoders are all simple MLPs with 2 layers and 128 hidden neurons. The state encoder is a GCN with 2 convolution layers and a mean pooling layer. We use GPT2 as our causal transformer. The dimensions of all hidden embeddings are set to 128. We set the batch size and the number of training epochs to 128 and 100, respectively, for all experiments. Our model was implemented in PyTorch, and the whole model was trained end-to-end using the Adam optimizer \cite{kingma2014adam} with a learning rate of 0.0001 and a weight decay of 0.01. All experiments were carried out on a machine with a 4.2 GHz quad-core Intel i7 CPU, 16 GB RAM, and an Nvidia RTX 3090 24GB GPU card. %Further implementation details can be found in the appendix.

The hyperparameters we employed are as follows: (1) Number of layers: 3; (2) Number of attention heads: 1; (3) Embedding dimension: 128; (4) Nonlinearity function: ReLU; (5) Batch size: 128; (6) Context length K: 25; (7) Dropout ratio: 0.1; (8) Learning rate: 1e-4; (9) Gradient norm clipping: 0.25. We maintained other parameters at their default values. We trained the model from scratch and did not utilize any pre-trained weights. 

Grid search is adopted for tuning. We tune the learning rate from $0.00005$ to $0.002$ in steps of $0.00005$, the dropout ratio over [0.05, 0.1, 0.3, 0.5], the weight decay over [0, 0.0001, 0.01], $\beta$ over [0.1, 0.5, 1, 1.5, 2.0], the token embedding size over [64, 128, 256, 512], the context length $K$ over [15, 20, 25, 30], the batch size over [64, 128, 256], and the gradient norm clipping threshold over [0.15, 0.2, 0.25, 0.3, 0.35]. 

In the data collection process, we run LNS with LB and an adaptive neighborhood size. The neighborhood size is initially set to 10\% of the number of variables in the input problem instance. It is then adapted following the approach of \cite{sonnerat2021learning}. 
During testing, at each step $t$, the model autoregressively generates a predicted score $\hat{a}_t^i$ for each dimension $i$. We apply a threshold of 0.5 to convert these scores into 1 or 0, representing the selection or non-selection of the corresponding variable $x^i$ in step $t$. 
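
The thresholding rule can be written compactly as follows, writing $\hat{a}_t^i$ for the predicted score of variable $x^i$ at step $t$. Treating a score of exactly $0.5$ as a selection is an assumption made here, since the tie case is not specified.
\begin{equation*}
a_t^i = \mathbb{1}\big[\hat{a}_t^i \ge 0.5\big], \quad i=1,\dots,n.
\end{equation*}
For instance, predicted scores $(0.9, 0.2, 0.5, 0.4)$ would yield the action $(1,0,1,0)$ under this convention. Similarly, for an instance with $n=1000$ variables, the initial neighborhood size used during data collection is $0.1\times 1000=100$ variables.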

\textbf{Baselines.}
We compare our method with five baselines: (1) Gurobi (version 9.5) with default settings: a leading state-of-the-art IP solver; (2) FT-LNS: the best-performing LNS version by \cite{song2020general}, which applies imitation learning to mimic the best demonstrations; (3) RL-LNS: a state-of-the-art learning-based LNS method for solving ILPs \cite{wu2021learning}, which uses deep RL to learn an LNS policy via action factorization to represent all potential variable subsets; (4) LB-SRMRL: the best-performing LB version by \cite{liu2022learning}, which uses a regression model and RL to learn a hybrid model that predicts and adapts the neighborhood size for the LB heuristic; and (5) CL-LNS: a state-of-the-art learning-based LNS method for solving ILPs \cite{10.5555/3618408.3618971}, which uses contrastive learning for ILP representation learning. 
We follow the default settings of these learning-based baselines and further fine-tune them on our datasets to get the best hyperparameters. For more details of the settings of these baselines, we refer the reader to their original papers.

\textbf{Evaluation Metrics.} We compare the algorithms using two measures: (1) the objective value of the solution returned by each algorithm within a time limit; (2) the gap between solutions, namely, the ratio of the objective difference w.r.t. the best result.
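
Concretely, the reported gaps are consistent with the following formula, where $\mathrm{obj}^\ast$ denotes the best objective among all compared methods on a benchmark. The absolute values are our reading of ``ratio of objective difference'', chosen so that the same expression covers both minimization and maximization benchmarks.
\begin{equation*}
\mathrm{gap} = \frac{|\mathrm{obj}-\mathrm{obj}^\ast|}{|\mathrm{obj}^\ast|}\times 100\%.
\end{equation*}
For example, for Gurobi on MVC-1000 in Table~\ref{tab1}, $|482.2-435.1|/435.1\times 100\%\approx 10.83\%$, and for RL-LNS on MC-500, $|-878.0-(-908.8)|/908.8\times 100\%\approx 3.39\%$.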


\begin{figure*}[t!]
  \centering
  \begin{subfigure}[b]{0.45\textwidth}
    \includegraphics[width=\textwidth]{MVC_1000_plot.png}
    \caption{MVC-1000}
    \label{fig:plot1}
  \end{subfigure}
  \hfill
  \begin{subfigure}[b]{0.45\textwidth}
    \includegraphics[width=\textwidth]{MC_500_plot.png}
    \caption{MC-500}
    \label{fig:plot2}
  \end{subfigure}
  
  \begin{subfigure}[b]{0.45\textwidth}
    \includegraphics[width=\textwidth]{SC_1000_plot.png}
    \caption{SC-1000}
    \label{fig:plot3}
  \end{subfigure}
  \hfill
  \begin{subfigure}[b]{0.45\textwidth}
    \includegraphics[width=\textwidth]{CA_2000_plot.png}
    \caption{CA-2000}
    \label{fig:plot4}
  \end{subfigure}
  
  \caption{Anytime Performance Comparison of Gurobi, FT-LNS, RL-LNS, LB-SRMRL, CL-LNS, and ILP-FORMER on Four IP benchmarks. Runtimes are up to 30 minutes.}
  \label{fig:plots}
  \vspace{-1em}
\end{figure*}

\begin{table*}[th!]
\centering
\scalebox{0.8}{
\begin{tabular}{c|cc | cc | cc | cc}
\hline
\multirow{2}{*}{Methods} & \multicolumn{2}{c|}{MVC-1000}   & \multicolumn{2}{c|}{MC-500}  & \multicolumn{2}{c|}{SC-1000}  & \multicolumn{2}{c}{CA-2000}         \\ \cline{2-9} 
  &Obj.$\pm$Std.\% &Gap\% &Obj.$\pm$Std.\% &Gap\% &Obj.$\pm$Std.\% &Gap\% &Obj.$\pm$Std.\% &Gap\% \\ \hline
SCIP      &$552.2\pm 0.5$     &21.60   &$-793.9\pm 2.8$    & 11.41   &$604.3\pm 3.2$    &11.41     &$-100514\pm 2.1$    & 12.31         \\ %\hline
FT-LNS    &$490.2\pm 0.6$     &7.95   &$-836.3\pm 1.2$    & 6.68   &$585.2\pm 6.4$    &7.91     &$-107141\pm 1.7$    & 6.53         \\ %\hline
RL-LNS    &$480.0\pm 0.5$     &5.70   &$-847.5\pm1.3$    & 2.42    &$575.6\pm 6.2$    &6.14     &$-108787\pm 2.2$    & 5.09         \\ %\hline
LB-SRMRL  &$492.4\pm 0.9$     &8.43   &$-820.3\pm 1.9$    & 8.47   &$580.8\pm 5.3$    &7.10     &$-107741\pm 3.0$    & 6.01         \\ %\hline
CL-LNS    &$470.4\pm 0.9$     &3.66   &$-860.3\pm 1.9$    & 4.01   &$560.8\pm 5.3$    &3.41     &$-109941\pm 3.0$    & 4.09         \\ \hline
ILP-FORMER     &$\mathbf{454.1\pm 0.7}$ &\textbf{0} &$\mathbf{-896.2\pm 1.2}$ &\textbf{0} &$\mathbf{542.3\pm 5.3}$ &\textbf{0} &$\mathbf{-114626\pm 1.5}$ &\textbf{0} \\ \hline
\end{tabular}
}
\caption{Results with SCIP. The time limit is set to 200s. Each result is averaged over 5 runs. The gap is the ratio of objective difference w.r.t. the best result. The best results are shown in \textbf{bold}.}\label{tab4}
  \vspace{-5mm}
\end{table*}



\subsection{Experimental Results}
A comparison of ILP-FORMER and other state-of-the-art baselines on 4 diverse benchmarks is given in Table~\ref{tab1}. All learning-based algorithms, including our ILP-FORMER, call Gurobi to solve sub-IPs with a time limit of 2s at every step. We observe that LB-SRMRL lags behind both RL-LNS and CL-LNS on all four benchmarks. CL-LNS is the most competitive baseline on MVC, SC, and CA, while RL-LNS is the strongest baseline on MC. ILP-FORMER attains the best objective on all four benchmarks. 
%Our ILP-FORMER consistently outperforms RL-LNS and Gurobi on all benchmarks by averaged ratios of 2.41\% and 3.68\%, respectively. 
Overall, these results suggest that our approach can reliably offer substantial improvements over state-of-the-art solvers. 

We also compare the generalization ability of all algorithms on larger IPs. To this end, we generate two sets of testing instances following the same settings as in section \ref{sec:setup} but double and quadruple the number of variables, respectively. Note that we only generate 50 testing instances for each problem type, without any training or validation instances. We test all (trained) models on these new instances and summarize the results in Tables~\ref{tab2} and \ref{tab3}. We observe that the advantage of our ILP-FORMER over the baselines is preserved on larger problem instances. Specifically, Table~\ref{tab2} shows that ILP-FORMER still consistently outperforms all baselines on the 4 benchmarks when the instance size is doubled.
%ILP-FORMER outperforms RL-LNS and Gurobi by averaged ratios of 2.06\% and 4.56\% respectively. 
On the other hand, Table~\ref{tab3} shows that ILP-FORMER outperforms all baselines on 3 out of 4 benchmarks when the instance size is quadrupled.
%ILP-FORMER outperforms RL-LNS and Gurobi by averaged ratios of 1.12\% and 3.03\% respectively. 
In summary, our ILP-FORMER learned on small instances generalizes well to larger instances, with a persistent advantage over other methods.


\subsection{Anytime Performance}

%We present further results that showcase the anytime performance of various algorithms, including random LNS in this experiment for easier comparison between random LNS and ILP-FORMER, across four benchmarks as depicted in Figure~\ref{fig:plots}. Our observations reveal the following: (1) ILP-FORMER substantially surpasses other baselines with a notable margin; (2) Given extended time limits, the advatange of ILP-FORMER remains; (3) Despite being trained on trajectories derived from random LNS, ILP-FORMER can generate remarkably superior trajectories during testing, resulting in significantly improved performance compared to random LNS. This is similar to the findings in \cite{chen2021decision}, where DT is capable of generating optimal trajectories during test time even when trained on random walk data for the shortest pathfinding problem.
We further showcase the anytime performance of the various algorithms across the four benchmarks in Figure~\ref{fig:plots}; random LNS is also included in this experiment to facilitate a direct comparison with ILP-FORMER. Our observations indicate that:
(1) ILP-FORMER outperforms the other baselines by a noteworthy margin.
(2) ILP-FORMER's advantage persists even with extended time limits.
%(3) Interestingly, even though ILP-FORMER is trained on trajectories derived from random LNS, it can generate remarkably superior trajectories during testing, leading to substantially improved performance compared to random LNS. This mirrors the findings in \cite{chen2021decision}, where the Decision Transformer (DT) can generate optimal trajectories during test time, even when trained on random walk data for the shortest pathfinding problem.

\subsection{Additional Experiments with SCIP}

Our framework can integrate any ILP solver to improve incumbent solutions. We primarily conducted experiments with Gurobi, given its status as a leading ILP solver. We also present results using SCIP (v6.0.1) as an alternative ILP solver. Using the same settings as detailed in Section~\ref{sec:setup} on the four benchmarks, we report the results in Table~\ref{tab4}. These outcomes align with those observed when using Gurobi as the ILP solver, although SCIP performs notably worse than Gurobi.  


\subsection{Ablation Study}
To demonstrate the strength of ILP-FORMER, we compare it with two variants: (1) ILP-FORMER$\ominus$MLC: a modified ILP-FORMER whose MLC decoder is replaced with a linear decoder; (2) ILP-FORMER$\ominus$DT: a modified ILP-FORMER whose causal transformer component is removed. The results are summarized in Table~\ref{tab:ablation}. We observe that ILP-FORMER consistently outperforms the two variants:
averaged over the three benchmarks, the gap grows by 3.13\% when the sequential process of LNS is not modeled (ILP-FORMER$\ominus$DT) and by 4.09\% when correlations in variable selection are not exploited (ILP-FORMER$\ominus$MLC).

\begin{table*}[th!]
	\centering
	\begin{tabular}{c|cc | cc | cc }
		\hline
		\multirow{2}{*}{Methods} & \multicolumn{2}{c|}{MVC-1000}   & \multicolumn{2}{c|}{MC-500}  & \multicolumn{2}{c}{SC-1000}           \\ \cline{2-7} 
		&Obj.$\pm$Std.\% &Gap\% &Obj.$\pm$Std.\% &Gap\% &Obj.$\pm$Std.\% &Gap\%  \\ \hline
		ILP-FORMER$\ominus$MLC    &$460.5\pm 0.6$     &5.86   &$-882.0\pm 1.5$    & 2.95   &$540.2\pm 8.3$    &3.45              \\ %\hline
		ILP-FORMER$\ominus$DT  &$452.6\pm 1.1$     &4.02   &$-889.1\pm 0.8$    & 2.17   &$538.9\pm 6.3$    &3.20             \\ \hline
		ILP-FORMER     &$\mathbf{435.1\pm 0.8}$ &\textbf{0} &$\mathbf{-908.8\pm 1.3}$ &\textbf{0} &$\mathbf{522.2\pm 5.3}$ &\textbf{0}  \\ \hline
		
	\end{tabular}
	\caption{An ablation study on the causal transformer and MLC decoder components of ILP-FORMER. Note that the experimental settings here follow those of Table~\ref{tab1}.}\label{tab:ablation}
\end{table*}


\subsection{Testing on Real-World Instances in MIPLIB}
We follow the experimental settings for real-world instances in MIPLIB as described in \cite{wu2021learning}. We exclude ``easy'' instances with relatively small sizes, as well as instances where Gurobi cannot find any feasible solutions within a 3600-second time limit. Consequently, we choose 35 representative ``hard'' or ``open'' instances containing only integer variables. Within these instances, the number of variables ranges from 150 to 393,800 (averaging 49,563), and the number of constraints varies from 301 to 850,513 (averaging 96,778). We use the datasets in section \ref{sec:setup} to train our model and evaluate our model (with Gurobi as the repair solver) on this realistic dataset, in the style of \emph{active search} \cite{bello2016neural,wu2021learning,khalil2017learning} on each instance. Our findings indicate that, with a 3600-second time limit, ILP-FORMER surpasses both Gurobi and CL-LNS on 20 of the 35 instances and exhibits comparable performance on 10 of the 35 instances. %More details are provided in the appendix.

\begin{table*}[h]
	\centering
	\scalebox{0.99}{
		\begin{tabular}{|c|c|c|c|}
			\hline
			\textbf{Instance} & \textbf{Gurobi} & \textbf{CL-LNS} & \textbf{ILP-FORMER}  \\
			\hline
			a2864-99blp & -72 &  -73 & \textbf{-85}  \\
			\hline
			bab3 &  -655388.1120 & -655022.5305 & \textbf{-655412.3501}  \\
			\hline
			bley-xs1noM &  \textbf{3938322.37} & 3965411.35 & 3958310.21  \\
			\hline
			cdc7-4-3-2 & -257 & -276 & \textbf{-280}  \\
			\hline
			comp12-2idx & 380 & 363 & \textbf{352}  \\
			\hline
			ds & \textbf{177} & 319 & 189  \\
			\hline
			ex1010-pi &  239 & 238 & 238  \\
			\hline
			graph20-80-1rand & -6 & -6 & -6  \\
			\hline
			graph40-20-1rand & \textbf{-15} & -14 & -14  \\
			\hline
			neos-3426085-ticino &  226 & 226 & 226  \\
			\hline
			neos-3594536-henty &  401948 & 402426 & \textbf{401896}  \\
			\hline
			neos-3682128-sandon &  34666770.0 & \textbf{34666765.12338} & 34666770  \\
			\hline
			ns1828997 &  133 & 128 & \textbf{98}  \\
			\hline
			nursesched-medium-hint03 &  115 & 131 & 115  \\
			\hline
			opm2-z12-s8 & -33269 & -53379 & \textbf{-55269}  \\
			\hline
			pb-grow22 & -46217.0 & -46881.0 & \textbf{-56782}  \\
			\hline
			proteindesign121hz512p9 &  1499 & 1489 & \textbf{1481}  \\
			\hline
			queens-30 &  -39 & -39 & -39  \\
			\hline
			ramos3 &  245 & 248 & \textbf{216}  \\
			\hline
			rmine13 & -3493.781904 & -3487.807859 & \textbf{-3493.79821}  \\
			\hline
			rmine15 & -1979.559046 & -5001.279118 & \textbf{-5002.129874}  \\
			\hline
			rococoC12-010001 &  \textbf{34673} & 35440 & 35467  \\
			\hline
			s1234 &  29 & 40 & 29  \\
			\hline
			scpj4scip &  133 & 134 & \textbf{131}  \\
			\hline
			scpk4 &  330 & 329 & \textbf{325}  \\
			\hline
			scpl4 &  279 & 281 & \textbf{269}  \\
			\hline
			sorrell3 &  -16 & -16 & -16  \\
			\hline
			sorrell4 & -23 & -23 & -23  \\
			\hline
			sorrell7 &  -187 & -187 & \textbf{-190}  \\
			\hline
			supportcase2 & 397 & 397 & 397  \\
			\hline
			t1717 & 201342 & 186891 & \textbf{185241}  \\
			\hline
			t1722 & 117171 & 117978 & \textbf{115983}  \\
			\hline
			tokyometro & 8479.5 &  8562.80 & \textbf{8456.7}  \\
			\hline
			v150d30-2hopcds & 41 & 41 & 41  \\
			\hline
			z26 & -1083 & -1172 & \textbf{-1176}  \\
			\hline
		\end{tabular}
	}
	\caption{Results on MIPLIB. The best results are shown in \textbf{bold}. The time limit is set to 3600s.}
	\label{tab:miplib}
\end{table*}



\section{Conclusion}
%Addressing large-scale IP problems presents a formidable challenge. An increasingly prevalent approach for the automated design and tuning of IP solvers leverages data-driven methodologies. 
This paper focuses on improving learning-based LNS approaches, which can conveniently use any existing solver as a subroutine and thus inherit the advantages of carefully engineered heuristic or complete approaches, along with their software implementations. We introduce ILP-FORMER, a novel approach that formulates policy learning as a sequence-to-MLC problem. It combines a customized decision transformer encoder, built around a causal transformer, to model the sequential process of LNS, with an MLC decoder trained with contrastive learning to exploit correlations in variable selection. Comprehensive experiments on diverse benchmarks suggest that ILP-FORMER consistently delivers substantial improvements over state-of-the-art solvers and generalizes well to larger instances.

\section*{Acknowledgment}
This project is partially supported by the Eric and Wendy Schmidt AI in Science Postdoctoral Fellowship, a Schmidt Futures program; the National Science Foundation; the Air Force Office of Scientific Research; the Department of Energy (DOE); and the Toyota Research Institute (TRI). The work of Caihua Liu is supported and funded by the Humanities and Social Sciences Youth Foundation, Ministry of Education of the People's Republic of China (Grant No.~21YJC870009).

% References
\bibliography{ref}

\end{document}

\newpage

\onecolumn

\title{ILP-FORMER: Solving Integer Linear Programming with Sequence to Multi-Label Learning\\(Supplementary Material)}
\maketitle


\appendix

\section{Ablation Study}
To demonstrate the strength of ILP-FORMER, we compare it with two variants: (1) ILP-FORMER$\ominus$MLC, in which the MLC decoder is replaced with a linear decoder; and (2) ILP-FORMER$\ominus$DT, in which the causal transformer component is removed. The results are summarized in Table~\ref{tab4}. ILP-FORMER consistently outperforms both variants:
performance drops noticeably when we do not model the sequential process of LNS (a gap of 3.13\% on average for ILP-FORMER$\ominus$DT) or do not exploit correlations in variable selection (a gap of 4.09\% on average for ILP-FORMER$\ominus$MLC).
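
As a rough check on the averages reported in Table~\ref{tab4}, the following sketch assumes that Gap\% is the relative difference between a variant's objective and that of the full ILP-FORMER. This definition, and the symbols $\mathrm{Obj}_{v}$ (objective of variant $v$) and $\mathrm{Obj}^{\star}$ (objective of the full model), are introduced here only for illustration:
\begin{align*}
\mathrm{Gap}_{v} &= \frac{\lvert \mathrm{Obj}_{v}-\mathrm{Obj}^{\star}\rvert}{\lvert \mathrm{Obj}^{\star}\rvert}\times 100\%, \\
\overline{\mathrm{Gap}}_{\ominus\mathrm{MLC}} &= \tfrac{1}{3}\,(5.86+2.95+3.45) \approx 4.09\%, \\
\overline{\mathrm{Gap}}_{\ominus\mathrm{DT}} &= \tfrac{1}{3}\,(4.02+2.17+3.20) = 3.13\%.
\end{align*}
For instance, under this assumed definition the MC-500 entry for ILP-FORMER$\ominus$MLC is $\lvert -882.0-(-908.8)\rvert / 908.8 \approx 2.95\%$, which agrees with the tabulated value.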

\begin{table*}[th!]
\centering
\begin{tabular}{c|cc | cc | cc }
\hline
\multirow{2}{*}{Methods} & \multicolumn{2}{c|}{MVC-1000}   & \multicolumn{2}{c|}{MC-500}  & \multicolumn{2}{c}{SC-1000}           \\ \cline{2-7} 
  &Obj.$\pm$Std.\% &Gap\% &Obj.$\pm$Std.\% &Gap\% &Obj.$\pm$Std.\% &Gap\%  \\ \hline
ILP-FORMER$\ominus$MLC    &$460.5\pm 0.6$     &5.86   &$-882.0\pm 1.5$    & 2.95   &$540.2\pm 8.3$    &3.45              \\ %\hline
ILP-FORMER$\ominus$DT  &$452.6\pm 1.1$     &4.02   &$-889.1\pm 0.8$    & 2.17   &$538.9\pm 6.3$    &3.20             \\ \hline
ILP-FORMER     &$\mathbf{435.1\pm 0.8}$ &\textbf{0} &$\mathbf{-908.8\pm 1.3}$ &\textbf{0} &$\mathbf{522.2\pm 5.3}$ &\textbf{0}  \\ \hline

\end{tabular}
\caption{An ablation study on the causal transformer and MLC decoder components of ILP-FORMER. The experimental settings follow those of Table~\ref{tab1}.}\label{tab4}
\end{table*}


\section{Testing on Real-World Instances in MIPLIB}
We follow the experimental settings for real-world instances in MIPLIB described in \cite{wu2021learning}. We exclude ``easy'' instances of relatively small size, as well as instances for which Gurobi cannot find any feasible solution within a 3600-second time limit. This leaves 35 representative ``hard'' or ``open'' instances containing only integer variables. In these instances, the number of variables ranges from 150 to 393,800 (averaging 49,563), and the number of constraints ranges from 301 to 850,513 (averaging 96,778). We train our model on the datasets in Section~\ref{sec:setup} and evaluate it (with Gurobi as the repair solver) on this realistic dataset, in the style of \emph{active search} \cite{bello2016neural,wu2021learning,khalil2017learning} on each instance. As shown in Table~\ref{tab2}, with a 3600-second time limit, ILP-FORMER surpasses both baselines (Gurobi and CL-LNS) on 20 of the 35 instances and exhibits comparable performance on 10 of the 35 instances.
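
The tally above can be written out as a short calculation. It assumes that all objectives are minimized and that ``comparable'' means ILP-FORMER matches the best baseline objective; the counts $n_{\text{win}}$, $n_{\text{tie}}$, and $n_{\text{loss}}$ are hypothetical names used only in this sketch:
\begin{align*}
n_{\text{win}} + n_{\text{tie}} + n_{\text{loss}} &= 20 + 10 + 5 = 35, \\
\frac{n_{\text{win}}}{35} \approx 57.1\%, \qquad
\frac{n_{\text{tie}}}{35} &\approx 28.6\%, \qquad
\frac{n_{\text{loss}}}{35} \approx 14.3\%.
\end{align*}
Under these assumptions, ILP-FORMER is at least as good as the best baseline on $30$ of the $35$ instances (about $85.7\%$), and one of the two baselines attains the best objective on the remaining $5$.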

\begin{table*}[h]
\centering
\scalebox{0.99}{
\begin{tabular}{|c|c|c|c|}
\hline
\textbf{Instance} & \textbf{Gurobi} & \textbf{CL-LNS} & \textbf{ILP-FORMER}  \\
\hline
a2864-99blp & -72 &  -73 & \textbf{-85}  \\
\hline
bab3 &  -655388.1120 & -655022.5305 & \textbf{-655412.3501}  \\
\hline
bley-xs1noM &  \textbf{3938322.37} & 3965411.35 & 3958310.21  \\
\hline
cdc7-4-3-2 & -257 & -276 & \textbf{-280}  \\
\hline
comp12-2idx & 380 & 363 & \textbf{352}  \\
\hline
ds & \textbf{177} & 319 & 189  \\
\hline
ex1010-pi &  239 & 238 & 238  \\
\hline
graph20-80-1rand & -6 & -6 & -6  \\
\hline
graph40-20-1rand & \textbf{-15} & -14 & -14  \\
\hline
neos-3426085-ticino &  226 & 226 & 226  \\
\hline
neos-3594536-henty &  401948 & 402426 & \textbf{401896}  \\
\hline
neos-3682128-sandon &  34666770.0 & \textbf{34666765.12338} & 34666770  \\
\hline
ns1828997 &  133 & 128 & \textbf{98}  \\
\hline
nursesched-medium-hint03 &  115 & 131 & 115  \\
\hline
opm2-z12-s8 & -33269 & -53379 & \textbf{-55269}  \\
\hline
pb-grow22 & -46217.0 & -46881.0 & \textbf{-56782}  \\
\hline
proteindesign121hz512p9 &  1499 & 1489 & \textbf{1481}  \\
\hline
queens-30 &  -39 & -39 & -39  \\
\hline
ramos3 &  245 & 248 & \textbf{216}  \\
\hline
rmine13 & -3493.781904 & -3487.807859 & \textbf{-3493.79821}  \\
\hline
rmine15 & -1979.559046 & -5001.279118 & \textbf{-5002.129874}  \\
\hline
rococoC12-010001 &  \textbf{34673} & 35440 & 35467  \\
\hline
s1234 &  29 & 40 & 29  \\
\hline
scpj4scip &  133 & 134 & \textbf{131}  \\
\hline
scpk4 &  330 & 329 & \textbf{325}  \\
\hline
scpl4 &  279 & 281 & \textbf{269}  \\
\hline
sorrell3 &  -16 & -16 & -16  \\
\hline
sorrell4 & -23 & -23 & -23  \\
\hline
sorrell7 &  -187 & -187 & \textbf{-190}  \\
\hline
supportcase2 & 397 & 397 & 397  \\
\hline
t1717 & 201342 & 186891 & \textbf{185241}  \\
\hline
t1722 & 117171 & 117978 & \textbf{115983}  \\
\hline
tokyometro & 8479.5 &  8562.80 & \textbf{8456.7}  \\
\hline
v150d30-2hopcds & 41 & 41 & 41  \\
\hline
z26 & -1083 & -1172 & \textbf{-1176}  \\
\hline
\end{tabular}
}
\caption{Results on MIPLIB. The best results are shown in \textbf{bold}. The time limit is set to 3600s.}
\label{tab2}
\end{table*}

\end{document}
