"index","Title","Authors","pdf_url","project_url","abstract"
"18131","Joint Estimation and Inference for Data Integration Problems based on Multiple Multi-layered Gaussian Graphical Models","Subhabrata Majumdar, George Michailidis","https://jmlr.org//papers/volume23/18-131/18-131.pdf","https://github.com/GeorgeMichailidis/JMMLE_code","The rapid development of high-throughput technologies has enabled the generation of data from biological or disease processes that span multiple layers, like genomic, proteomic or metabolomic data, and further pertain to multiple sources, like disease subtypes or experimental conditions. In this work, we propose a general statistical framework based on Gaussian graphical models for horizontal (i.e. across conditions or subtypes) and vertical (i.e. across different layers containing data on molecular compartments) integration of information in such datasets. We start with decomposing the multi-layer problem into a series of two-layer problems. For each two-layer problem, we model the outcomes at a node in the lower layer as dependent on those of other nodes in that layer, as well as all nodes in the upper layer. We use a combination of neighborhood selection and group-penalized regression to obtain sparse estimates of all model parameters. Following this, we develop a debiasing technique and asymptotic distributions of inter-layer directed edge weights that utilize already computed neighborhood selection coefficients for nodes in the upper layer. Subsequently, we establish global and simultaneous testing procedures for these edge weights. Performance of the proposed methodology is evaluated on synthetic and real data."
"18467","Debiased  Distributed Learning for Sparse Partial Linear Models in  High Dimensions","Shaogao Lv, Heng Lian","https://jmlr.org//papers/volume23/18-467/18-467.pdf","","Although various distributed machine learning schemes have been proposed recently for purely linear models and fully nonparametric models, little attention has been paid to distributed optimization for semi-parametric models with multiple structures (e.g. sparsity, linearity and nonlinearity).  To address these issues, the current paper proposes a new communication-efficient distributed learning algorithm for sparse partially linear models with an increasing number of features. The proposed method is based on the classical divide and conquer strategy for handling big data and the computation on each subsample consists of  a debiased estimation of the doubly regularized least squares approach. With the proposed method, we theoretically prove that our global parametric estimator can achieve the optimal parametric rate in our semi-parametric model given an appropriate partition on the total data. Specifically, the choice of data partition  relies on the underlying smoothness of the nonparametric component, and it is adaptive to the sparsity parameter. Finally, some simulated experiments are carried out to illustrate the empirical performances of our debiased technique under the distributed setting."
"191056","Recovering shared structure from multiple networks with unknown edge distributions","Keith Levin, Asad Lodhia, Elizaveta Levina","https://jmlr.org//papers/volume23/19-1056/19-1056.pdf","","In increasingly many settings, data sets consist of multiple samples from a population of networks, with vertices aligned across networks; for example, brain connectivity networks in neuroscience. We consider the setting where the observed networks have a shared expectation, but may differ in the noise structure on their edges. Our approach exploits the shared mean structure to denoise edge-level measurements of the observed networks and estimate the underlying population-level parameters. We also explore the extent to which edge-level errors influence estimation and downstream inference. In the process, we establish a finite-sample concentration inequality for the low-rank eigenvalue truncation of a random weighted adjacency matrix, which may be of independent interest. The proposed approach is illustrated on synthetic networks and on data from an fMRI study of schizophrenia."
"19267","Exploiting locality in high-dimensional Factorial hidden Markov models","Lorenzo Rimella, Nick Whiteley","https://jmlr.org//papers/volume23/19-267/19-267.pdf","https://github.com/LorenzoRimella/GraphFilter-GraphSmoother","We propose algorithms for approximate filtering and smoothing in high-dimensional Factorial hidden Markov models. The approximation involves discarding, in a principled way, likelihood factors according to a notion of locality in a factor graph associated with the emission distribution. This allows the exponential-in-dimension cost of exact filtering and smoothing to be avoided. We prove that the approximation accuracy, measured in a local total variation norm, is ""dimension-free"" in the sense that as the overall dimension of the model increases the error bounds we derive do not necessarily degrade. A key step in the analysis is to quantify the error introduced by localizing the likelihood function in a Bayes' rule update. The factorial structure of the likelihood function which we exploit arises naturally when data have known spatial or network structure. We demonstrate the new algorithms on synthetic examples and a London Underground passenger flow problem, where the factor graph is effectively given by the train network."
"19450","Empirical Risk Minimization under Random Censorship","Guillaume Ausset, Stephan Clémençon, François Portier","https://jmlr.org//papers/volume23/19-450/19-450.pdf","","We consider the classic supervised learning problem where a continuous non-negative random label $Y$ (e.g. a random duration) is to be predicted based upon observing a random vector $X$ valued in $\mathbb{R}^d$ with $d\geq 1$ by means of a regression rule with minimum least square error. In various applications, ranging from industrial quality control to public health through credit risk analysis for instance, training observations can be right censored, meaning that, rather than on independent copies of $(X,Y)$, statistical learning relies on a collection of $n\geq 1$ independent realizations of the triplet $(X, \; \min\{Y,\; C\},\; \delta)$, where $C$ is a nonnegative random variable with unknown distribution, modelling censoring and $\delta=\mathbb{I}\{Y\leq C\}$ indicates whether the duration is right censored or not. As ignoring censoring in the risk computation may clearly lead to a severe underestimation of the target duration and jeopardize prediction, we consider a plug-in estimate of the true risk based on a Kaplan-Meier estimator of the conditional survival function of the censoring $C$ given $X$, referred to as Beran risk, in order to perform empirical risk minimization. It is established, under mild conditions, that the learning rate of minimizers of this biased/weighted empirical risk functional is of order $O_{\mathbb{P}}(\sqrt{\log(n)/n})$ when ignoring model bias issues inherent to plug-in estimation, as can be attained in absence of censoring. Beyond theoretical results, numerical experiments are presented in order to illustrate the relevance of the approach developed."
"19497","XAI Beyond Classification: Interpretable Neural Clustering","Xi Peng, Yunfan Li, Ivor W. Tsang, Hongyuan Zhu, Jiancheng Lv, Joey Tianyi Zhou","https://jmlr.org//papers/volume23/19-497/19-497.pdf","http://www.pengxi.me","In this paper, we study two challenging problems in explainable AI (XAI) and data clustering. The first is how to directly design a neural network with inherent interpretability, rather than giving post-hoc explanations of a black-box model. The second is implementing discrete $k$-means with a differentiable neural network that embraces the advantages of parallel computing, online clustering, and clustering-favorable representation learning. To address these two challenges, we design a novel neural network, which is a differentiable reformulation of the vanilla $k$-means, called inTerpretable nEuraL cLustering (TELL). Our contributions are threefold. First, to the best of our knowledge, most existing XAI works focus on supervised learning paradigms. This work is one of the few XAI studies on unsupervised learning, in particular, data clustering. Second, TELL is an interpretable, or the so-called intrinsically explainable and transparent model. In contrast, most existing XAI studies resort to various means for understanding a black-box model with post-hoc explanations. Third, from the view of data clustering, TELL possesses many properties highly desired by $k$-means, including but not limited to online clustering, plug-and-play module, parallel computing, and provable convergence. Extensive experiments show that our method achieves superior performance comparing with 14 clustering approaches on three challenging data sets. The source code could be accessed at www.pengxi.me."
"19882","Bayesian Multinomial Logistic Normal Models through Marginally Latent Matrix-T Processes","Justin D. Silverman, Kimberly Roche, Zachary C. Holmes, Lawrence A. David, Sayan Mukherjee","https://jmlr.org//papers/volume23/19-882/19-882.pdf","https://github.com/jsilve24/fido_paper_code","Bayesian multinomial logistic-normal (MLN) models are popular for the analysis of sequence count data (e.g., microbiome or gene expression data) due to their ability to model multivariate count data with complex covariance structure. However, existing implementations of MLN models are limited to small datasets due to the non-conjugacy of the multinomial and logistic-normal distributions. Motivated by the need to develop efficient inference for Bayesian MLN models, we develop two key ideas. First, we develop the class of Marginally Latent Matrix-T Process (Marginally LTP) models. We demonstrate that many popular MLN models, including those with latent linear, non-linear, and dynamic linear structure are special cases of this class. Second, we develop an efficient inference scheme for Marginally LTP models with specific accelerations for the MLN subclass. Through application to MLN models, we demonstrate that our inference scheme are both highly accurate and often 4-5 orders of magnitude faster than MCMC."
"20040","Deep Learning in Target Space","Michael Fairbank, Spyridon Samothrakis, Luca Citi","https://jmlr.org//papers/volume23/20-040/20-040.pdf","https://github.com/mikefairbank/dlts_paper_code","Deep learning uses neural networks which are parameterised by their weights.  The neural networks are usually trained by tuning the weights to directly minimise a given loss function.  In this paper we propose to re-parameterise the weights into targets for the firing strengths of the individual nodes in the network. Given a set of targets, it is possible to calculate the weights which make the firing strengths best meet those targets. It is argued that using targets for training addresses the problem of exploding gradients, by a process which we call cascade untangling, and  makes the loss-function surface smoother to traverse, and so leads to easier, faster training, and also potentially better generalisation, of the neural network.  It also allows for easier learning of deeper and recurrent network structures. The necessary conversion of targets to weights comes at an extra computational expense, which is in many cases manageable.  Learning in target space can be combined with existing neural-network optimisers, for extra gain.  Experimental results show the speed of using target space, and examples of improved generalisation, for fully-connected networks and convolutional networks, and the ability to recall and process long time sequences and perform natural-language processing with recurrent networks."
"201111","Scaling Laws from the Data Manifold Dimension","Utkarsh Sharma, Jared Kaplan","https://jmlr.org//papers/volume23/20-1111/20-1111.pdf","https://github.com/U-Sharma/NeuralScaleID","When data is plentiful, the test loss achieved by well-trained neural networks scales as a power-law $L \propto N^{-\alpha}$ in the number of network parameters $N$. This empirical scaling law holds for a wide variety of data modalities, and may persist over many orders of magnitude. The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$.  This simple theory predicts that the scaling exponents $\alpha \approx 4/d$ for cross-entropy and mean-squared error losses.  We confirm the theory by independently measuring the intrinsic dimension and the scaling exponents in a teacher/student framework, where we can study a  variety of $d$ and $\alpha$ by dialing the properties of  random teacher networks.  We also test the theory with CNN image classifiers on several datasets and with GPT-type language models."
"20112","Interpolating Predictors in High-Dimensional Factor Regression","Florentina Bunea, Seth Strimas-Mackey, Marten Wegkamp","https://jmlr.org//papers/volume23/20-112/20-112.pdf","","This work studies  finite-sample properties of the risk of the minimum-norm interpolating predictor in high-dimensional regression models.   If the effective rank of the covariance matrix $\Sigma$ of the $p$ regression features is much larger than the sample size $n$,  we show that the min-norm interpolating  predictor is not desirable, as its risk approaches the risk of trivially predicting the response by 0. However, our detailed finite-sample analysis reveals, surprisingly, that  this behavior is not present when  the regression response and the features are jointly low-dimensional, following a widely used  factor regression model. Within this popular model class, and when the effective rank of $\Sigma$ is smaller than $n$, while still allowing for $p \gg n$, both the bias and the variance terms of the excess risk can be controlled, and the risk of the minimum-norm interpolating predictor approaches optimal benchmarks. Moreover, through a  detailed analysis of the bias term, we exhibit model classes under   which our upper bound on the excess risk approaches zero, while the corresponding upper bound  in the recent work arXiv:1906.11300 diverges. Furthermore,  we show that the minimum-norm interpolating predictor analyzed under the factor regression model, despite being model-agnostic and devoid of tuning parameters, can have similar risk to predictors based on principal components regression and ridge regression, and  can improve over LASSO based predictors, in the high-dimensional regime."
"201152","Near Optimality of Finite Memory Feedback Policies in Partially Observed Markov Decision Processes","Ali Kara, Serdar Yuksel","https://jmlr.org//papers/volume23/20-1152/20-1152.pdf","","In the theory of Partially Observed Markov Decision Processes (POMDPs), existence of optimal policies have in general been established via converting the original partially observed stochastic control problem to a fully observed one on the belief space, leading to a belief-MDP. However, computing an optimal policy for this fully observed model, and so for the original POMDP, using classical dynamic or linear programming methods is challenging even if the original system has finite state and action spaces, since the state space of the fully observed belief-MDP model is always uncountable. Furthermore, there exist very few rigorous value function approximation and optimal policy approximation results, as regularity conditions needed often require a tedious study involving the spaces of probability measures leading to properties such as Feller continuity. In this paper, we study a planning problem for POMDPs where the system dynamics and measurement channel model are assumed to be known. We construct an approximate belief model by discretizing the belief space using only finite window information variables. We then find optimal policies for the approximate model and we rigorously establish near optimality of the constructed finite window control policies in POMDPs under mild non-linear filter stability conditions and the assumption that the measurement and action sets are finite (and the state space is real vector valued). We also establish a rate of convergence result which relates the finite window memory size and the approximation error bound, where the rate of convergence is exponential under explicit and testable exponential filter stability conditions. 
While there exist many experimental results and few rigorous asymptotic convergence results, an explicit rate of convergence result is new in the literature, to our knowledge."
"201165","Approximate Information State for Approximate Planning and Reinforcement Learning in Partially Observed Systems","Jayakumar Subramanian, Amit Sinha, Raihan Seraj, Aditya Mahajan","https://jmlr.org//papers/volume23/20-1165/20-1165.pdf","https://github.com/info-structures/ais","We propose a theoretical framework for approximate planning and learning in partially observed systems. Our framework is based on the fundamental notion of information state. We provide two definitions of information state---i) a function of history which is sufficient to compute the expected reward and predict its next value; ii) a function of the history which can be recursively updated and is sufficient to compute the expected reward and predict the next observation. An information state always leads to a dynamic programming decomposition. Our key result is to show that if a function of the history (called AIS) approximately satisfies the properties of the information state, then there is a corresponding approximate dynamic program. We show that the policy computed using this is approximately optimal with bounded loss of optimality. We show that several approximations in state, observation and action spaces in literature can be viewed as instances of AIS. In some of these cases, we obtain tighter bounds. A salient feature of AIS is that it can be learnt from data. We present AIS based  multi-time scale policy gradient algorithms and detailed numerical experiments with low, moderate and high dimensional environments."
"201188","Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality","Dimitris Bertsimas, Ryan Cory-Wright, Jean Pauphilet","https://jmlr.org//papers/volume23/20-1188/20-1188.pdf","https://github.com/ryancorywright/ScalableSPCA.jl","Sparse principal component analysis (PCA) is a popular dimensionality reduction technique for obtaining principal components which are linear combinations of a small subset of the original features. Existing approaches  cannot supply certifiably optimal principal components with more than $p=100s$ of variables. By reformulating sparse PCA as a convex mixed-integer semidefinite optimization problem, we design a cutting-plane method which solves the problem to certifiable optimality at the scale of selecting $k=5$ covariates from $p=300$ variables, and provides small bound gaps at a larger scale. We also propose a convex relaxation and greedy rounding scheme that provides bound gaps of $1-2\%$ in practice within minutes for $p=100$s or hours for $p=1,000$s and is therefore a viable alternative to the exact method at scale. Using real-world financial and medical data sets, we illustrate our approach's ability to derive interpretable principal components tractably at scale."
"201219","On Generalizations of Some Distance Based Classifiers for HDLSS Data","Sarbojit Roy, Soham Sarkar, Subhajit Dutta, Anil K. Ghosh","https://jmlr.org//papers/volume23/20-1219/20-1219.pdf","","In high dimension, low sample size (HDLSS) settings, classifiers based on Euclidean distances like the nearest neighbor classifier and the average distance classifier perform quite poorly if differences between locations of the underlying populations get masked by scale differences. To rectify this problem, several modifications of these classifiers have been proposed in the literature. However, existing methods are confined to location and scale differences only, and they often fail to discriminate among populations differing outside of the first two moments. In this article, we propose some simple transformations of these classifiers resulting in improved performance even when the underlying populations have the same location and scale. We further propose a generalization of these classifiers based on the idea of grouping of variables. High-dimensional behavior of the proposed classifiers is studied theoretically. Numerical experiments with a variety of simulated examples as well as an extensive analysis of benchmark data sets from three different databases exhibit advantages of the proposed methods."
"201248","A Stochastic Bundle Method for Interpolation","Alasdair Paren, Leonard Berrada, Rudra P. K. Poudel, M. Pawan Kumar","https://jmlr.org//papers/volume23/20-1248/20-1248.pdf","https://github.com/oval-group/borat","We propose a novel method for training deep neural networks that are capable of interpolation, that is, driving the empirical loss to zero. At each iteration, our method constructs a stochastic approximation of the learning objective. The approximation, known as a bundle, is a pointwise maximum of linear functions. Our bundle contains a constant function that lower bounds the empirical loss. This enables us to compute an automatic adaptive learning rate, thereby providing an accurate solution. In addition, our bundle includes linear approximations computed at the current iterate and other linear estimates of the DNN parameters. The use of these additional approximations makes our method significantly more robust to its hyperparameters. Based on its desirable empirical properties, we term our method Bundle Optimisation for Robust and Accurate Training (BORAT). In order to operationalise BORAT, we design a novel algorithm for optimising the bundle approximation efficiently at each iteration. We establish the theoretical convergence of BORAT in both convex and non-convex settings. Using standard publicly available data sets, we provide a thorough comparison of BORAT to other single hyperparameter optimisation algorithms. Our experiments demonstrate BORAT matches the state-of-the-art generalisation performance for these methods and is the most robust."
"201297","TFPnP: Tuning-free Plug-and-Play Proximal Algorithms with Applications to Inverse Imaging Problems","Kaixuan Wei, Angelica Aviles-Rivero, Jingwei Liang, Ying Fu, Hua Huang, Carola-Bibiane Schönlieb","https://jmlr.org//papers/volume23/20-1297/20-1297.pdf","https://github.com/Vandermode/TFPnP","Plug-and-Play (PnP) is a non-convex optimization framework that combines proximal algorithms, for example, the alternating direction method of multipliers (ADMM), with advanced denoising priors. Over the past few years, great empirical success has been obtained by PnP algorithms, especially for the ones that integrate deep learning-based denoisers. However, a key problem of PnP approaches is the need for manual parameter tweaking which is essential to obtain high-quality results across the high discrepancy in imaging conditions and varying scene content. In this work, we present a class of tuning-free PnP proximal algorithms that can determine parameters such as denoising strength, termination time, and other optimization-specific parameters automatically. A core part of our approach is a policy network for automated parameter search which can be effectively learned via a mixture of model-free and model-based deep reinforcement learning strategies. We demonstrate, through rigorous numerical and visual experiments, that the learned policy can customize parameters to different settings, and is often more efficient and effective than existing handcrafted criteria. Moreover, we discuss several practical considerations of  PnP denoisers, which together with our learned policy yield state-of-the-art results. This advanced performance is prevalent on both linear and nonlinear exemplar inverse imaging problems, and in particular shows promising results on compressed sensing MRI, sparse-view CT, single-photon imaging, and phase retrieval."
"201361","Spatial Multivariate Trees for Big Data Bayesian Regression","Michele Peruzzi, David B. Dunson","https://jmlr.org//papers/volume23/20-1361/20-1361.pdf","https://github.com/mkln/spamtree","High resolution geospatial data are challenging because standard geostatistical models based on Gaussian processes are known to not scale to large data sizes. While progress has been made towards methods that can be computed more efficiently, considerably less attention has been devoted to methods for large scale data that allow the description of complex relationships between several outcomes recorded at high resolutions by different sensors. Our Bayesian multivariate regression models based on spatial multivariate trees (SpamTrees) achieve scalability via conditional independence assumptions on latent random effects following a treed directed acyclic graph. Information-theoretic arguments and considerations on computational efficiency guide the construction of the tree and the related efficient sampling algorithms in imbalanced multivariate settings. In addition to simulated data examples, we illustrate SpamTrees using a large climate data set which combines satellite data with land-based station data. Software and source code are available on CRAN at https://CRAN.R-project.org/package=spamtree."
"201402","Decimated Framelet System on Graphs and Fast G-Framelet Transforms","Xuebin Zheng, Bingxin Zhou, Yu Guang Wang, Xiaosheng Zhuang","https://jmlr.org//papers/volume23/20-1402/20-1402.pdf","https://github.com/YuGuangWang/FGT","Graph representation learning has many real-world applications, from self-driving LiDAR, 3D computer vision to drug repurposing, protein classification, social networks analysis. An adequate representation of graph data is vital to the learning performance of a statistical or machine learning model for graph-structured data. This paper proposes a novel multiscale representation system for graph data, called decimated framelets, which form a localized tight frame on the graph. The decimated framelet system allows storage of the graph data representation on a coarse-grained chain and processes the graph data at multi scales where at each scale, the data is stored on a subgraph. Based on this, we establish decimated G-framelet transforms for the decomposition and reconstruction of the graph data at multi resolutions via a constructive data-driven filter bank. The graph framelets are built on a chain-based orthonormal basis that supports fast graph Fourier transforms. From this, we give a fast algorithm for the decimated G-framelet transforms, or FGT, that has linear computational complexity O(N) for a graph of size N. The effectiveness for constructing the decimated framelet system and the FGT is demonstrated by a simulated example of random graphs and real-world applications, including multiresolution analysis for traffic network and representation learning of graph neural networks for graph classification tasks."
"201433","Universal Approximation in Dropout Neural Networks","Oxana A. Manita, Mark A. Peletier, Jacobus W. Portegies, Jaron Sanders, Albert Senen-Cerda","https://jmlr.org//papers/volume23/20-1433/20-1433.pdf","","We prove two universal approximation theorems for a range of dropout neural networks. These are feed-forward neural networks in which each edge is given a random $\{0,1\}$-valued filter, that have two modes of operation: in the first each edge output is multiplied by its random filter, resulting in a random output, while in the second each edge output is multiplied by the expectation of its filter, leading to a deterministic output. It is common to use the random mode during training and the deterministic mode during testing and prediction. Both theorems are of the following form: Given a function to approximate and a threshold $\varepsilon>0$, there exists a dropout network that is $\varepsilon$-close in probability and in $L^q$. The first theorem applies to dropout networks in the random mode. It assumes little on the activation function, applies to a wide class of networks, and can even be applied to approximation schemes other than neural networks. The core is an algebraic property that shows that deterministic networks can be exactly matched in expectation by random networks. The second theorem makes stronger assumptions and gives a stronger result. Given a function to approximate, it provides existence of a network that approximates in both modes simultaneously. Proof components are a recursive replacement of edges by independent copies, and a special first-layer replacement that couples the resulting larger network to the input. The functions to be approximated are assumed to be elements of general normed spaces, and the approximations are measured in the corresponding norms. The networks are constructed explicitly. 
Because of the different methods of proof, the two results give independent insight into the approximation properties of random dropout networks. With this, we establish that dropout neural networks broadly satisfy a universal-approximation property."
"20188","Supervised Dimensionality Reduction and Visualization using Centroid-Encoder","Tomojit Ghosh, Michael Kirby","https://jmlr.org//papers/volume23/20-188/20-188.pdf","https://github.com/Tomojit1/Centroid-encoder/tree/master/GPU","We propose a new tool for visualizing complex, and potentially large and high-dimensional, data sets called Centroid-Encoder (CE).  The architecture of the Centroid-Encoder is similar to the autoencoder neural network but it has a modified target, i.e., the class centroid in the ambient space.  As such, CE incorporates label information and performs a supervised data visualization.  The training of CE is done in the usual way with a training set whose parameters are tuned using a validation set.  The evaluation of the resulting CE visualization is performed on a sequestered test set where the generalization of the model is assessed both visually and quantitatively. We present a detailed comparative analysis of the method using a wide variety of data sets and techniques, both supervised and unsupervised, including NCA, non-linear NCA, t-distributed NCA, t-distributed MCML, supervised UMAP, supervised PCA, Colored Maximum Variance Unfolding, supervised Isomap, Parametric Embedding, supervised Neighbor Retrieval Visualizer, and Multiple Relational Embedding. An analysis of variance using PCA demonstrates that a non-linear preprocessing by the CE transformation of the data captures more variance than PCA by dimension."
"20233","Evolutionary Variational Optimization of Generative Models","Jakob Drefs, Enrico Guiraud, Jörg Lücke","https://jmlr.org//papers/volume23/20-233/20-233.pdf","https://github.com/tvlearn","We combine two popular optimization approaches to derive learning algorithms for generative models: variational optimization and evolutionary algorithms. The combination is realized for generative models with discrete latents by using truncated posteriors as the family of variational distributions. The variational parameters of truncated posteriors are sets of latent states. By interpreting these states as genomes of individuals and by using the variational lower bound to define a fitness, we can apply evolutionary algorithms to realize the variational loop. The used variational distributions are very flexible and we show that evolutionary algorithms can effectively and efficiently optimize the variational bound. Furthermore, the variational loop is generally applicable (“black box”) with no analytical derivations required. To show general applicability, we apply the approach to three generative models (we use Noisy-OR Bayes Nets, Binary Sparse Coding, and Spike-and-Slab Sparse Coding). To demonstrate effectiveness and efficiency of the novel variational approach, we use the standard competitive benchmarks of image denoising and inpainting. The benchmarks allow quantitative comparisons to a wide range of methods including probabilistic approaches, deep deterministic and generative networks, and non-local image processing methods. In the category of “zero-shot” learning (when only the corrupted image is used for training), we observed the evolutionary variational algorithm to significantly improve the state-of-the-art in many benchmark settings. For one well-known inpainting benchmark, we also observed state-of-the-art performance across all categories of algorithms although we only train on the corrupted image. 
In general, our investigations highlight the importance of research on optimization methods for generative models to achieve performance improvements."
"20247","LSAR: Efficient Leverage Score Sampling Algorithm for the Analysis of Big Time Series Data","Ali Eshragh, Fred Roosta, Asef Nazari, Michael W. Mahoney","https://jmlr.org//papers/volume23/20-247/20-247.pdf","","We apply methods from randomized numerical linear algebra (RandNLA) to develop improved algorithms for the analysis of large-scale time series data. We first develop a new fast algorithm to estimate the leverage scores of an autoregressive (AR) model in big data regimes. We show that the accuracy of approximations lies within $(1+\mathcal{O}({\varepsilon}))$ of the true leverage scores with high probability. These theoretical results are subsequently exploited to develop an efficient algorithm, called LSAR, for fitting an appropriate AR model to big time series data. Our proposed algorithm is guaranteed, with high probability, to find the maximum likelihood estimates of the parameters of the underlying true AR model and has a worst case running time that significantly improves those of the state-of-the-art alternatives in big data regimes. Empirical results on large-scale synthetic as well as real data highly support the theoretical results and reveal the efficacy of this new approach."
"20315","Fast and Robust Rank Aggregation against Model Misspecification","Yuangang Pan, Ivor W. Tsang, Weijie Chen, Gang Niu, Masashi Sugiyama","https://jmlr.org//papers/volume23/20-315/20-315.pdf","","In rank aggregation (RA), a collection of preferences from different users are summarized into a total order under the assumption of homogeneity of users. Model misspecification in RA arises since the homogeneity assumption fails to be satisfied in the complex real-world situation. Existing robust RAs usually resort to an augmentation of the ranking model to account for additional noises, where the collected preferences can be treated as a noisy perturbation of idealized preferences. Since the majority of robust RAs rely on certain perturbation assumptions,  they cannot generalize well to agnostic noise-corrupted preferences in the real world. In this paper, we propose CoarsenRank, which possesses robustness against model misspecification. Specifically, the properties of our CoarsenRank are summarized as follows: (1) CoarsenRank is designed for mild model misspecification, which assumes there exist the ideal preferences (consistent with model assumption) that locate in a neighborhood of the actual preferences. (2) CoarsenRank then performs regular RAs over a neighborhood of the preferences instead of the original data set directly. Therefore, CoarsenRank enjoys robustness against model misspecification within a neighborhood. (3) The neighborhood of the data set is defined via their empirical data distributions. Further, we put an exponential prior on the unknown size of the neighborhood and derive a much-simplified posterior formula for CoarsenRank under particular divergence measures. (4) CoarsenRank is further instantiated to Coarsened Thurstone, Coarsened Bradly-Terry, and Coarsened Plackett-Luce with three popular probability ranking models. Meanwhile, tractable optimization strategies are introduced with regards to each instantiation respectively. 
In the end, we apply CoarsenRank on four real-world data sets. Experiments show that CoarsenRank is fast and robust, achieving consistent improvements over baseline methods."
"20316","On Biased Stochastic Gradient Estimation","Derek Driggs, Jingwei Liang, Carola-Bibiane Schönlieb","https://jmlr.org//papers/volume23/20-316/20-316.pdf","","We present a uniform analysis of biased stochastic gradient methods for minimizing convex, strongly convex, and non-convex composite objectives, and identify settings where bias is useful in stochastic gradient estimation. The framework we present allows us to extend proximal support to biased algorithms, including SAG and SARAH, for the first time in the convex setting. We also use our framework to develop a new algorithm, Stochastic Average Recursive GradiEnt (SARGE), that achieves the oracle complexity lower-bound for non-convex, finite-sum objectives and requires strictly fewer calls to a stochastic gradient oracle per iteration than SVRG and SARAH. We support our theoretical results with numerical experiments that demonstrate the benefits of certain biased gradient estimators."
"20357","Efficient MCMC Sampling with Dimension-Free Convergence Rate using ADMM-type Splitting","Maxime Vono, Daniel Paulin, Arnaud Doucet","https://jmlr.org//papers/volume23/20-357/20-357.pdf","","Performing exact Bayesian inference for complex models is computationally intractable. Markov chain Monte Carlo (MCMC) algorithms can provide reliable approximations of the posterior distribution but are expensive for large data sets and high-dimensional models. A standard approach to mitigate this complexity consists in using subsampling techniques or distributing the data across a cluster. However, these approaches are typically unreliable in high-dimensional scenarios. We focus here on a recent alternative class of MCMC schemes exploiting a splitting strategy akin to the one used by the celebrated alternating direction method of multipliers (ADMM) optimization algorithm. These methods appear to provide empirically state-of-the-art performance but their theoretical behavior in high dimension is currently unknown. In this paper, we propose a detailed theoretical study of one of these algorithms known as the split Gibbs sampler. Under regularity conditions, we establish explicit convergence rates for this scheme using Ricci curvature and coupling ideas. We support our theory with numerical illustrations."
"20520","MurTree: Optimal Decision Trees via Dynamic Programming and Search","Emir Demirović, Anna Lukina, Emmanuel Hebrard, Jeffrey Chan, James Bailey, Christopher Leckie, Kotagiri Ramamohanarao, Peter J. Stuckey","https://jmlr.org//papers/volume23/20-520/20-520.pdf","https://bitbucket.org/EmirD/murtree","Decision tree learning is a widely used approach in machine learning, favoured in applications that require concise and interpretable models. Heuristic methods are traditionally used to quickly produce models with reasonably high accuracy. A commonly criticised point, however, is that the resulting trees may not necessarily be the best representation of the data in terms of accuracy and size. In recent years, this motivated the development of optimal classification tree algorithms that globally optimise the decision tree in contrast to heuristic methods that perform a sequence of locally optimal decisions. We follow this line of work and provide a novel algorithm for learning optimal classification trees based on dynamic programming and search. Our algorithm supports constraints on the depth of the tree and number of nodes. The success of our approach is attributed to a series of specialised techniques that exploit properties unique to classification trees. Whereas algorithms for optimal classification trees have traditionally been plagued by high runtimes and limited scalability, we show in a detailed experimental study that our approach uses only a fraction of the time required by the state-of-the-art and can handle datasets with tens of thousands of instances, providing several orders of magnitude improvements and notably contributing towards the practical use of optimal decision trees."
"20644","Data-Derived Weak Universal Consistency","Narayana Santhanam, Venkatachalam Anantharam, Wojciech Szpankowski","https://jmlr.org//papers/volume23/20-644/20-644.pdf","","Many current applications in data science need rich model classes to adequately represent the statistics that may be driving the observations. Such rich model classes may be too complex to admit uniformly consistent estimators. In such cases, it is conventional to settle for estimators with guarantees on convergence rate where the performance can be bounded in a model-dependent way, i.e. pointwise consistent estimators. But this viewpoint has the practical drawback that estimator performance is a function of the unknown model within the model class that is being estimated. Even if an estimator is consistent, how well it is doing at any given time may not be clear, no matter what the sample size of the observations. In these cases, a line of analysis favors sample dependent guarantees. We explore this framework by studying rich model classes that may only admit pointwise consistency guarantees, yet enough information about the unknown model driving the observations needed to gauge estimator accuracy can be inferred from the sample at hand. In this paper we obtain a novel characterization of lossless compression problems over a countable alphabet in the data-derived framework in terms of what we term deceptive distributions. We also show that the ability to estimate the redundancy of compressing memoryless sources is equivalent to learning the underlying single-letter marginal in a data-derived fashion. We expect that the methodology underlying such characterizations in a data-derived estimation framework will be broadly applicable to a wide range of estimation problems, enabling a more systematic approach to data-derived guarantees."
"20707","Novel Min-Max Reformulations of Linear Inverse Problems","Mohammed Rayyan Sheriff, Debasish Chatterjee","https://jmlr.org//papers/volume23/20-707/20-707.pdf","","In this article, we dwell into the class of so-called ill-posed Linear Inverse Problems (LIP) which simply refer to the task of recovering the entire signal from its relatively few random linear measurements. Such problems arise in a variety of settings with applications ranging from medical image processing, recommender systems, etc. We propose a slightly generalized version of the error constrained linear inverse problem and obtain a novel and equivalent convex-concave min-max reformulation by providing an exposition to its convex geometry. Saddle points of the min-max problem are completely characterized in terms of a solution to the LIP, and vice versa. Applying simple saddle point seeking ascend-descent type algorithms to solve the min-max problems provides novel and simple algorithms to find a solution to the LIP. Moreover, the reformulation of an LIP as the min-max problem provided in this article is crucial in developing methods to solve the dictionary learning problem with almost sure recovery constraints."
"20720","Theoretical Convergence of Multi-Step Model-Agnostic Meta-Learning","Kaiyi Ji, Junjie Yang, Yingbin Liang","https://jmlr.org//papers/volume23/20-720/20-720.pdf","","As a popular meta-learning approach, the model-agnostic meta-learning (MAML) algorithm has been widely used due to its  simplicity and effectiveness. However, the convergence of the general multi-step MAML still remains unexplored. In this paper, we develop a new theoretical framework to provide such convergence guarantee for two types of objective functions that are of interest in practice: (a) resampling case (e.g., reinforcement learning), where loss functions take the form in expectation and new data are sampled as the algorithm runs; and (b) finite-sum case (e.g., supervised learning), where loss functions take the finite-sum form with given samples. For both cases, we characterize the convergence rate and the computational complexity to attain an $\epsilon$-accurate solution for multi-step MAML in the general nonconvex setting. In particular, our results suggest that an inner-stage stepsize needs to be chosen inversely proportional to the number $N$ of inner-stage steps in order for $N$-step MAML to have guaranteed convergence. From the technical perspective, we develop novel techniques to deal with the nested structure of the meta gradient for multi-step MAML, which can be of independent interest."
"20735","A Class of Conjugate Priors for Multinomial Probit Models which Includes the Multivariate Normal One","Augusto Fasano, Daniele Durante","https://jmlr.org//papers/volume23/20-735/20-735.pdf","","Multinomial probit models are routinely-implemented representations for learning how the class probabilities of categorical response data change with $p$ observed predictors. Although several frequentist methods have been developed for estimation, inference and classification within such a class of models, Bayesian inference is still lagging behind. This is due to the apparent absence of a tractable class of conjugate priors, that may facilitate posterior inference on the multinomial probit coefficients. Such an issue has motivated increasing efforts toward the development of effective Markov chain Monte Carlo methods, but state-of-the-art solutions still face severe computational bottlenecks, especially in high dimensions. In this article, we show that the entire class of unified skew-normal (SUN) distributions is conjugate to several multinomial probit models. Leveraging this result and the SUN properties, we improve upon state-of-the-art solutions for posterior inference and classification both in terms of closed-form results for several functionals of interest, and also by developing novel computational methods relying either on independent and identically distributed samples from the exact posterior or on scalable and accurate variational approximations based on blocked partially-factorized representations. As illustrated in simulations and in a gastrointestinal lesions application, the magnitude of the improvements relative to current methods is particularly evident, in practice, when the focus is on high-dimensional studies."
"20782","An Improper Estimator with Optimal Excess Risk in Misspecified Density Estimation and Logistic Regression","Jaouad Mourtada, Stéphane Gaïffas","https://jmlr.org//papers/volume23/20-782/20-782.pdf","","We introduce a procedure for conditional density estimation under logarithmic loss, which we call SMP (Sample Minmax Predictor). This estimator minimizes a new general excess risk bound for statistical learning. On standard examples, this bound scales as $d/n$ with $d$ the model dimension and $n$ the sample size, and critically remains valid under model misspecification. Being an improper (out-of-model) procedure, SMP improves over within-model estimators such as the maximum likelihood estimator, whose excess risk degrades under misspecification. Compared to approaches reducing to the sequential problem, our bounds remove suboptimal $\log n$ factors and can handle unbounded classes. For the Gaussian linear model, the predictions and risk bound of SMP are governed by leverage scores of covariates, nearly matching the optimal risk in the well-specified case without conditions on the noise variance or approximation error of the linear model. For logistic regression, SMP provides a non-Bayesian approach to calibration of probabilistic predictions relying on virtual samples, and can be computed by solving two logistic regressions. It achieves a non-asymptotic excess risk of $O((d + B^2R^2)/n)$, where $R$ bounds the norm of features and $B$ that of the comparison parameter; by contrast, no within-model estimator can achieve better rate than $\min({B R}/{\sqrt{n}}, {d e^{BR}}/{n} )$ in general. This provides a more practical alternative to Bayesian approaches, which require approximate posterior sampling, thereby partly addressing a question raised by Foster et al. (2018)."
"20807","Active Learning for Nonlinear System Identification with Guarantees","Horia Mania, Michael I. Jordan, Benjamin Recht","https://jmlr.org//papers/volume23/20-807/20-807.pdf","","While the identification of nonlinear dynamical systems is a fundamental building block of model-based reinforcement learning and feedback control, its sample complexity is only understood for systems that either have discrete states and actions or for systems that can be identified from data generated by i.i.d. random inputs. Nonetheless, many interesting dynamical systems have continuous states and actions and can only be identified through a judicious choice of inputs. Motivated by practical settings, we study a class of nonlinear dynamical systems whose state transitions depend linearly on a known feature embedding of state-action pairs. To estimate such systems in finite time identification methods must explore all directions in feature space. We propose an active learning approach that achieves this by repeating three steps: trajectory planning, trajectory tracking, and re-estimation of the system from all available data. We show that our method estimates nonlinear dynamical systems at a parametric rate, similar to the statistical rate of standard linear regression."
"20874","Model Averaging Is Asymptotically Better Than Model Selection For Prediction","Tri M. Le, Bertrand S. Clarke","https://jmlr.org//papers/volume23/20-874/20-874.pdf","","We compare the performance of six model average predictors---Mallows' model averaging, stacking, Bayes model averaging,  bagging, random forests, and boosting---to the components used to form them.In all six cases we identify conditions under which the model average predictor is consistent for its intended limit and performs as well or better than any of its components asymptotically.   This is well known empirically, especially for complex problems, although theoretical results do not seem to have been formally established. We have focused our attention on the regression context since that is wheremodel averaging techniques differ most often from current practice."
"20900","SODEN: A Scalable Continuous-Time Survival Model through Ordinary Differential Equation Networks","Weijing Tang, Jiaqi Ma, Qiaozhu Mei, Ji Zhu","https://jmlr.org//papers/volume23/20-900/20-900.pdf","https://github.com/jiaqima/SODEN","In this paper, we propose a flexible model for survival analysis using neural networks along with scalable optimization algorithms. One key technical challenge for directly applying maximum likelihood estimation (MLE) to censored data is that evaluating the objective function and its gradients with respect to model parameters requires the calculation of integrals. To address this challenge, we recognize from a novel perspective that the MLE for censored data can be viewed as a differential-equation constrained optimization problem. Following this connection, we model the distribution of event time through an ordinary differential equation and utilize efficient ODE solvers and adjoint sensitivity analysis to numerically evaluate the likelihood and the gradients. Using this approach, we are able to 1) provide a broad family of continuous-time survival distributions without strong structural assumptions, 2) obtain powerful feature representations using neural networks, and 3) allow efficient estimation of the model in large-scale applications using stochastic gradient descent. Through both simulation studies and real-world data examples, we demonstrate the effectiveness of the proposed method in comparison to existing state-of-the-art deep learning survival analysis models. The implementation of the proposed SODEN approach has been made publicly available at https://github.com/jiaqima/SODEN."
"20918","Optimality and Stability in Non-Convex Smooth Games","Guojun Zhang, Pascal Poupart, Yaoliang Yu","https://jmlr.org//papers/volume23/20-918/20-918.pdf","","Convergence to a saddle point for convex-concave functions has been studied for decades, while recent years has seen a surge of interest in non-convex (zero-sum) smooth games, motivated by their recent wide applications. It remains an intriguing research challenge how local optimal points are defined and which algorithm can converge to such points. An interesting concept is known as the local minimax point, which strongly correlates with the widely-known gradient descent ascent algorithm. This paper aims to provide a comprehensive analysis of local minimax points, such as their relation with other solution concepts and their optimality conditions. We find that local saddle points can be regarded as a special type of local minimax points, called uniformly local minimax points, under mild continuity assumptions. In (non-convex) quadratic games, we show that local minimax points are (in some sense) equivalent to global minimax points. Finally, we study the stability of gradient algorithms near local minimax points. Although gradient algorithms can converge to local/global minimax points in the non-degenerate case, they would often fail in general cases. This implies the necessity of either novel algorithms or concepts beyond saddle points and minimax points in non-convex smooth games."
"20924","Accelerated Zeroth-Order and First-Order Momentum Methods from Mini to Minimax Optimization","Feihu Huang, Shangqian Gao, Jian Pei, Heng Huang","https://jmlr.org//papers/volume23/20-924/20-924.pdf","","In the paper, we propose a class of accelerated zeroth-order and first-order momentum methods for both  nonconvex mini-optimization and minimax-optimization. Specifically, we propose a new accelerated zeroth-order momentum (Acc-ZOM) method for black-box mini-optimization where only function values can be obtained. Moreover, we prove that our Acc-ZOM method achieves a lower query complexity of $\tilde{O}(d^{3/4}\epsilon^{-3})$ for finding an $\epsilon$-stationary point, which improves the best known result by a factor of $O(d^{1/4})$ where $d$ denotes the variable dimension. In particular, our  Acc-ZOM does not need large batches required in the existing zeroth-order stochastic algorithms. Meanwhile, we propose an accelerated zeroth-order momentum descent ascent (Acc-ZOMDA) method for black-box minimax  optimization, where only function values can be obtained. Our Acc-ZOMDA obtains a low query complexity of $\tilde{O}((d_1+d_2)^{3/4}\kappa_y^{4.5}\epsilon^{-3})$ without requiring large batches for finding an $\epsilon$-stationary point, where $d_1$ and $d_2$ denote variable dimensions and $\kappa_y$ is condition number. Moreover, we propose an accelerated first-order momentum descent ascent (Acc-MDA) method for minimax optimization,  whose explicit gradients are accessible. Our Acc-MDA achieves a low  gradient complexity of $\tilde{O}(\kappa_y^{4.5}\epsilon^{-3})$ without requiring large batches for finding an $\epsilon$-stationary point. In particular, our Acc-MDA can obtain a lower gradient complexity of $\tilde{O}(\kappa_y^{2.5}\epsilon^{-3})$ with a batch size $O(\kappa_y^4)$, which improves the best known result by a factor of $O(\kappa_y^{1/2})$. 
Extensive experimental results on black-box adversarial attacks on deep neural networks and poisoning attacks on logistic regression demonstrate the efficiency of our algorithms."
"210059","Projected Statistical Methods for Distributional Data on the Real Line with the Wasserstein Metric","Matteo Pegoraro, Mario Beraha","https://jmlr.org//papers/volume23/21-0059/21-0059.pdf","https://github.com/mberaha/ProjectedWasserstein","We present a novel class of projected methods to perform statistical analysis on a data set of probability distributions on  the real line, with the 2-Wasserstein metric. We focus in particular on Principal Component Analysis (PCA) and regression. To define these models, we exploit a representation of the Wasserstein space closely related to its weak Riemannian structure by mapping the data to a suitable linear space and using a metric projection operator to constrain the results in the Wasserstein space. By carefully choosing the tangent point, we are able to derive fast empirical methods, exploiting a constrained B-spline approximation.  As a byproduct of our approach, we are also able to derive faster routines for previous work on PCA for distributions. By means of simulation studies, we compare our approaches to previously proposed methods, showing that our projected PCA has similar performance for a fraction of the computational cost and that the projected regression is extremely flexible even under misspecification. Several theoretical properties of the models are investigated, and asymptotic consistency is proven. Two real world applications to Covid-19 mortality in the US and wind speed forecasting are discussed."
"210061","Score Matched Neural Exponential Families for Likelihood-Free Inference","Lorenzo Pacchiardi, Ritabrata Dutta","https://jmlr.org//papers/volume23/21-0061/21-0061.pdf","https://github.com/LoryPack/SM-ExpFam-LFI","Bayesian Likelihood-Free Inference (LFI) approaches allow to obtain posterior distributions for stochastic models with intractable likelihood, by relying on model simulations. In Approximate Bayesian Computation (ABC), a popular LFI method, summary statistics are used to reduce data dimensionality. ABC algorithms adaptively tailor simulations to the observation in order to sample from an approximate posterior, whose form depends on the chosen statistics. In this work, we introduce a new way to learn ABC statistics: we first generate parameter-simulation pairs from the model independently on the observation; then, we use Score Matching to train a neural conditional exponential family to approximate the likelihood. The exponential family is the largest class of distributions with fixed-size sufficient statistics; thus, we use them in ABC, which is intuitively appealing and has state-of-the-art performance. In parallel, we insert our likelihood approximation in an MCMC for doubly intractable distributions to draw posterior samples. We can repeat that for any number of observations with no additional model simulations, with performance comparable to related approaches. We validate our methods on toy models with known likelihood and a large-dimensional time-series model."
"210100","(f,Gamma)-Divergences: Interpolating between f-Divergences and Integral Probability Metrics","Jeremiah Birrell, Paul Dupuis, Markos A. Katsoulakis, Yannis Pantazis, Luc Rey-Bellet","https://jmlr.org//papers/volume23/21-0100/21-0100.pdf","","We develop a rigorous and general framework for  constructing  information-theoretic divergences that subsume both $f$-divergences and integral probability metrics (IPMs),  such as  the  $1$-Wasserstein distance. We prove under which assumptions these divergences, hereafter referred to as $(f,\Gamma)$-divergences,  provide a notion of `distance' between probability measures and show that they can be expressed as a two-stage mass-redistribution/mass-transport  process. The  $(f,\Gamma)$-divergences inherit features  from IPMs,   such as   the ability  to compare distributions which are not absolutely continuous, as well as   from $f$-divergences, namely   the strict concavity of their variational representations and the ability to control heavy-tailed distributions  for particular choices of $f$. When combined, these features  establish a divergence with improved properties for estimation, statistical learning, and uncertainty quantification applications. Using statistical learning as an example, we demonstrate their advantage in training generative adversarial networks (GANs) for heavy-tailed, not-absolutely continuous sample distributions. We also show improved performance and stability over gradient-penalized Wasserstein GAN in image generation."
"210338","Structure-adaptive Manifold Estimation","Nikita Puchkin, Vladimir Spokoiny","https://jmlr.org//papers/volume23/21-0338/21-0338.pdf","","We consider a problem of manifold estimation from noisy observations. Many manifold learning procedures locally approximate a manifold by a weighted average over a small neighborhood. However, in the presence of large noise, the assigned weights become so corrupted that the averaged estimate shows very poor performance. We suggest a structure-adaptive procedure, which simultaneously reconstructs a smooth manifold and estimates projections of the point cloud onto this manifold. The proposed approach iteratively refines the weights on each step, using the structural information obtained at previous steps. After several iterations, we obtain nearly “oracle” weights, so that the final estimates are nearly efficient even in the presence of relatively large noise. In our theoretical study, we establish tight lower and upper bounds proving  asymptotic optimality of the method for manifold estimation under the Hausdorff loss, provided that the noise degrades to zero fast enough."
"210345","The Correlation-assisted Missing Data Estimator","Timothy I. Cannings, Yingying Fan","https://jmlr.org//papers/volume23/21-0345/21-0345.pdf","","We introduce a novel approach to estimation problems in settings with missing data. Our proposal -- the Correlation-Assisted Missing data (CAM) estimator -- works by exploiting the relationship between the observations with missing features and those without missing features in order to obtain improved prediction accuracy.  In particular, our theoretical results elucidate general conditions under which the proposed CAM estimator has lower mean squared error than the widely used complete-case approach in a range of estimation problems.  We showcase in detail how the CAM estimator can be applied to $U$-Statistics to obtain an unbiased, asymptotically Gaussian estimator that has lower variance than the complete-case $U$-Statistic.  Further, in nonparametric density estimation and regression problems, we construct our CAM estimator using kernel functions, and show it has lower asymptotic mean squared error than the corresponding complete-case kernel estimator.  We also include practical demonstrations throughout the paper using simulated data and the Terneuzen birth cohort and Brandsma datasets available from CRAN."
"210368","Approximation and Optimization Theory for Linear Continuous-Time Recurrent Neural Networks","Zhong Li, Jiequn Han, Weinan E, Qianxiao Li","https://jmlr.org//papers/volume23/21-0368/21-0368.pdf","","We perform a systematic study of the approximation properties and optimization dynamics of recurrent neural networks (RNNs) when applied to learn input-output relationships in temporal data. We consider the simple but representative setting of using continuous-time linear RNNs to learn from data generated by linear relationships. On the approximation side, we prove a direct and an inverse approximation theorem of linear functionals using RNNs, which reveal the intricate connections between memory structures in the target and the corresponding approximation efficiency. In particular, we show that temporal relationships can be effectively approximated by RNNs if and only if the former possesses sufficient memory decay. On the optimization front, we perform detailed analysis of the optimization dynamics, including a precise understanding of the difficulty that may arise in learning relationships with long-term memory. The term “curse of memory” is coined to describe the uncovered phenomena, akin to the “curse of dimension” that plagues high-dimensional function approximation. These results form a relatively complete picture of the interaction of memory and recurrent structures in the linear dynamical setting."
"210439","Sampling Permutations for Shapley Value Estimation","Rory Mitchell, Joshua Cooper, Eibe Frank, Geoffrey Holmes","https://jmlr.org//papers/volume23/21-0439/21-0439.pdf","","Game-theoretic attribution techniques based on Shapley values are used to interpret black-box machine learning models, but their exact calculation is generally NP-hard, requiring approximation methods for non-trivial models. As the computation of Shapley values can be expressed as a summation over a set of permutations, a common approach is to sample a subset of these permutations for approximation. Unfortunately, standard Monte Carlo sampling methods can exhibit slow convergence, and more sophisticated quasi-Monte Carlo methods have not yet been applied to the space of permutations. To address this, we investigate new approaches based on two classes of approximation methods and compare them empirically. First, we demonstrate quadrature techniques in a RKHS containing functions of permutations, using the Mallows kernel in combination with kernel herding and sequential Bayesian quadrature. The RKHS perspective also leads to quasi-Monte Carlo type error bounds, with a tractable discrepancy measure defined on permutations. Second, we exploit connections between the hypersphere $\mathbb{S}^{d-2}$ and permutations to create practical algorithms for generating permutation samples with good properties. Experiments show the above techniques provide significant improvements for Shapley value estimates over existing methods, converging to a smaller RMSE in the same number of model evaluations."
"210451","PAC Guarantees and Effective Algorithms for Detecting Novel Categories","Si Liu, Risheek Garrepalli, Dan Hendrycks, Alan Fern, Debashis Mondal, Thomas G. Dietterich","https://jmlr.org//papers/volume23/21-0451/21-0451.pdf","https://github.com/liusi2019/ocd-journal","Open category detection is the problem of detecting “alien"" test instances that belong to categories or classes that were not present in the training data. In many applications, reliably detecting such aliens is central to ensuring the safety and accuracy of test set predictions. Unfortunately, there are no algorithms that provide theoretical guarantees on their ability to detect aliens under general assumptions. Further, while there are algorithms for open category detection, there are few empirical results that directly report alien detection rates. Thus, there are significant theoretical and empirical gaps in our understanding of open category detection. In this paper, we take a step toward addressing this gap by studying a simple, but practically-relevant variant of open category detection. In our setting, we are provided with a “clean"" training set that contains only the target categories of interest and an unlabeled “contaminated” training set that contains a fraction $\alpha$ of alien examples. Under the assumption that we know an upper bound on $\alpha$, we develop an algorithm that gives PAC-style guarantees on the alien detection rate, while aiming to minimize false alarms. Given an overall budget on the amount of training data, we also derive the optimal allocation of samples between the mixture and the clean data sets. Experiments on synthetic and standard benchmark datasets evaluate the regimes in which the algorithm can be effective and provide a baseline for further advancements. 
In addition, for the situation when an upper bound for $\alpha$ is not available, we employ nine different anomaly proportion estimators, and run experiments on both synthetic and standard benchmark data sets to compare their performance."
"210519","Optimal Transport for Stationary Markov Chains via Policy Iteration","Kevin O'Connor, Kevin McGoff, Andrew B. Nobel","https://jmlr.org//papers/volume23/21-0519/21-0519.pdf","https://github.com/oconnor-kevin/OTC","We study the optimal transport problem for pairs of stationary finite-state Markov chains, with an emphasis on the computation of optimal transition couplings. Transition couplings are a constrained family of transport plans that capture the dynamics of Markov chains. Solutions of the optimal transition coupling (OTC) problem correspond to alignments of the two chains that minimize long-term average cost. We establish a connection between the OTC problem and Markov decision processes, and show that solutions of the OTC problem can be obtained via an adaptation of policy iteration. For settings with large state spaces, we develop a fast approximate algorithm based on an entropy-regularized version of the OTC problem, and provide bounds on its per-iteration complexity. We establish a stability result for both the regularized and unregularized algorithms, from which a statistical consistency result follows as a corollary. We validate our theoretical results empirically through a simulation study, demonstrating that the approximate algorithm exhibits faster overall runtime with low error. Finally, we extend the setting and application of our methods to hidden Markov models, and illustrate the potential use of the proposed algorithms in practice with an application to computer-generated music."
"210560","Beyond Sub-Gaussian Noises: Sharp Concentration Analysis for Stochastic Gradient Descent","Wanrong Zhu, Zhipeng Lou, Wei Biao Wu","https://jmlr.org//papers/volume23/21-0560/21-0560.pdf","","In this paper, we study the concentration property of stochastic gradient descent (SGD) solutions. In existing concentration analyses, researchers impose restrictive requirements on the gradient noise, such as boundedness or sub-Gaussianity. We consider a  much richer class of noise where only finitely-many moments are required, thus allowing heavy-tailed noises. In particular, we obtain Nagaev type high-probability upper bounds for the estimation errors of averaged stochastic gradient descent (ASGD) in a linear model. Specifically, we prove that, after $T$ steps of SGD, the ASGD estimate achieves an $O(\sqrt{\log(1/\delta)/T} + (\delta T^{q-1})^{-1/q})$ error rate with probability at least $1-\delta$, where $q>2$ controls the tail of the gradient noise. In comparison, one has the $O(\sqrt{\log(1/\delta)/T})$ error rate for sub-Gaussian noises. We also show that the Nagaev type upper bound is almost tight through an example, where the exact asymptotic form of the tail probability can be derived.  Our concentration analysis indicates that, in the case of heavy-tailed noises, the polynomial dependence on the failure probability $\delta$ is generally unavoidable for the error rate of SGD."
"210635","Cascaded Diffusion Models for High Fidelity Image Generation","Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, Tim Salimans","https://jmlr.org//papers/volume23/21-0635/21-0635.pdf","https://cascaded-diffusion.github.io/","We show that cascaded diffusion models are capable of generating high fidelity images on the class-conditional ImageNet generation benchmark, without any assistance from auxiliary image classifiers to boost sample quality. A cascaded diffusion model comprises a pipeline of multiple diffusion models that generate images of increasing resolution, beginning with a standard diffusion model at the lowest resolution, followed by one or more super-resolution diffusion models that successively upsample the image and add higher resolution details. We find that the sample quality of a cascading pipeline relies crucially on conditioning augmentation, our proposed method of data augmentation of the lower resolution conditioning inputs to the super-resolution models. Our experiments show that conditioning augmentation prevents compounding error during sampling in a cascaded model, helping us to train cascading pipelines achieving FID scores of 1.48 at 64x64, 3.52 at 128x128 and 4.88 at 256x256 resolutions, outperforming BigGAN-deep, and classification accuracy scores of 63.02% (top-1) and 84.06% (top-5) at 256x256, outperforming VQ-VAE-2."
"210669","Overparameterization of Deep ResNet: Zero Loss and Mean-field Analysis","Zhiyan Ding, Shi Chen, Qin Li, Stephen J. Wright","https://jmlr.org//papers/volume23/21-0669/21-0669.pdf","","Finding parameters in a deep neural network (NN) that fit training data is a nonconvex optimization problem, but a basic first-order optimization method (gradient descent) finds a global optimizer with perfect fit (zero-loss) in many practical situations. We examine this phenomenon for the case of Residual Neural Networks (ResNet) with smooth activation functions in a limiting regime in which both the number of layers (depth) and the number of weights in each layer (width) go to infinity. First, we use a mean-field-limit argument to prove that the gradient descent for parameter training becomes a gradient flow for a probability distribution that is characterized by a partial differential equation (PDE) in the large-NN limit. Next, we show that under certain assumptions, the solution to the PDE converges in the training time to a zero-loss solution. Together, these results suggest that the training of the ResNet gives a near-zero loss if the ResNet is large enough. We give estimates of the depth and width needed to reduce the loss below a given threshold, with high probability."
"210735","Innovations Autoencoder and its Application in One-class Anomalous Sequence  Detection","Xinyi Wang, Lang Tong","https://jmlr.org//papers/volume23/21-0735/21-0735.pdf","","An innovations sequence of a time series is a sequence of independent and identically distributed random variables with which the original time series has a causal representation.  The innovation at a time is statistically independent of the  history of the time series.  As such, it represents the new information contained at present but not in the past.  Because of its simple probability structure, the innovations sequence is the most efficient signature of the original. Unlike the principle or independent component representations, an innovations sequence preserves not only the complete statistical properties but also the temporal order of the original time series. An long-standing open problem is to find a computationally tractable way to extract an innovations sequence of non-Gaussian processes.  This paper presents a deep learning approach, referred to as Innovations Autoencoder (IAE), that extracts innovations sequences using a causal convolutional neural network. An application of IAE to the one-class anomalous sequence detection problem with unknown anomaly and anomaly-free models is also presented."
"210758","Analytically Tractable Hidden-States Inference in Bayesian Neural Networks","Luong-Ha Nguyen, James-A. Goulet","https://jmlr.org//papers/volume23/21-0758/21-0758.pdf","","With few exceptions, neural networks have been relying on backpropagation and gradient descent as the inference engine in order to learn the model parameters, because closed-form Bayesian inference for neural networks has been considered to be intractable. In this paper, we show how we can leverage the tractable approximate Gaussian inference's (TAGI) capabilities to infer hidden states, rather than only using it for inferring the network's parameters. One novel aspect is that it allows inferring hidden states through the imposition of constraints designed to achieve specific objectives, as illustrated through three examples: (1) the generation of adversarial-attack examples, (2) the usage of a neural network as a black-box optimization method, and (3) the application of inference on continuous-action reinforcement learning. In these three examples, the constrains are in (1), a target label chosen to fool a neural network, and in (2 and 3) the derivative of the network with respect to its input that is set to zero in order to infer the optimal input values that are either maximizing or minimizing it. These applications showcase how tasks that were previously reserved to gradient-based optimization approaches can now be approached with analytically tractable inference."
"210791","Toolbox for Multimodal Learn (scikit-multimodallearn)","Dominique Benielli, Baptiste Bauvin, Sokol Koço, Riikka Huusari, Cécile Capponi, Hachem Kadri, François Laviolette","https://jmlr.org//papers/volume23/21-0791/21-0791.pdf","https://github.com/dbenielli/scikit-multimodallearn","scikit-multimodallearn is a Python library for multimodal supervised learning, licensed under Free BSD, and compatible with the well-known scikit-learn toolbox (Fabian Pedregosa, 2011). This paper details the content of the library, including a specific multimodal data formatting and classification and regression algorithms. Use cases and examples are also provided."
"210840","LinCDE: Conditional Density Estimation via Lindsey's Method","Zijun Gao, Trevor Hastie","https://jmlr.org//papers/volume23/21-0840/21-0840.pdf","","Conditional density estimation is a fundamental problem in statistics, with scientific and practical applications in biology, economics, finance and environmental studies, to name a few.  In this paper, we propose a conditional density estimator based on gradient boosting and Lindsey's method (LinCDE). LinCDE admits flexible modeling of the density family and can capture distributional characteristics like modality and shape. In particular, when suitably parametrized, LinCDE will produce smooth and non-negative density estimates. Furthermore, like boosted regression trees, LinCDE does automatic feature selection. We demonstrate LinCDE's efficacy through extensive simulations and three real data examples."
"210862","DoubleML - An Object-Oriented Implementation of Double Machine Learning in Python","Philipp Bach, Victor Chernozhukov, Malte S. Kurz, Martin Spindler","https://jmlr.org//papers/volume23/21-0862/21-0862.pdf","https://github.com/DoubleML/doubleml-for-py","DoubleML is an open-source Python library implementing the double machine learning framework of Chernozhukov et al. (2018) for a variety of causal models. It contains functionalities for valid statistical inference on causal parameters when the estimation of nuisance parameters is based on machine learning methods. The object-oriented implementation of DoubleML provides a high flexibility in terms of model specifications and makes it easily extendable. The package is distributed under the MIT license and relies on core libraries from the scientific Python ecosystem: scikit-learn, numpy, pandas, scipy, statsmodels and joblib. Source code, documentation and an extensive user guide can be found at https://github.com/DoubleML/doubleml-for-py and https://docs.doubleml.org."
"210888","SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization","Marius Lindauer, Katharina Eggensperger, Matthias Feurer, André Biedenkapp, Difan Deng, Carolin Benjamins, Tim Ruhkopf, René Sass, Frank Hutter","https://jmlr.org//papers/volume23/21-0888/21-0888.pdf","https://github.com/automl/SMAC3","Algorithm parameters, in particular hyperparameters of machine learning algorithms, can substantially impact their performance. To support users in determining well-performing hyperparameter configurations for their algorithms, datasets and applications at hand, SMAC3 offers a robust and flexible framework for Bayesian Optimization, which can improve performance within a few evaluations. It offers several facades and pre-sets for typical use cases, such as optimizing hyperparameters, solving low dimensional continuous (artificial) global optimization problems and configuring algorithms to perform well across multiple problem instances. The SMAC3 package is available under a permissive BSD-license at https://github.com/automl/SMAC3."
"210936","Bayesian Pseudo Posterior Mechanism under Asymptotic Differential Privacy","Terrance D. Savitsky, Matthew R.Williams, Jingchen Hu","https://jmlr.org//papers/volume23/21-0936/21-0936.pdf","","We propose a Bayesian pseudo posterior mechanism to generate record-level synthetic databases equipped with an $(\epsilon,\pi)-$ probabilistic differential privacy (pDP) guarantee, where $\pi$ denotes the probability that any observed database exceeds $\epsilon$.  The pseudo posterior mechanism employs a data record-indexed, risk-based weight vector with weight values $\in [0, 1]$ that surgically downweight the likelihood contributions for high-risk records for model estimation and the generation of record-level synthetic data for public release. The pseudo posterior synthesizer constructs a weight for each datum record by using the Lipschitz bound for that record under a log-pseudo likelihood utility function that generalizes the exponential mechanism (EM) used to construct a formally private data generating mechanism.  By selecting weights to remove likelihood contributions with non-finite log-likelihood values, we guarantee a finite local privacy guarantee for our pseudo posterior mechanism at every sample size.  Our results may be applied to any synthesizing model envisioned by the data disseminator in a computationally tractable way that only involves estimation of a pseudo posterior distribution for parameters, $\theta$, unlike recent approaches that use naturally-bounded utility functions implemented through the EM.  We specify conditions that guarantee the asymptotic contraction of $\pi$ to $0$ over the space of databases, such that the form of the guarantee provided by our method is asymptotic. We illustrate our pseudo posterior mechanism on the sensitive family income variable from the Consumer Expenditure Surveys database published by the U.S. Bureau of Labor Statistics. 
We show that utility is better preserved in the synthetic data for our pseudo posterior mechanism as compared to the EM, both estimated using the same non-private synthesizer, due to our use of targeted downweighting."
"211155","solo-learn: A Library of Self-supervised Methods for Visual Representation Learning","Victor Guilherme Turrisi da Costa, Enrico Fini, Moin Nabi, Nicu Sebe, Elisa Ricci","https://jmlr.org//papers/volume23/21-1155/21-1155.pdf","https://github.com/vturrisi/solo-learn","This paper presents solo-learn, a library of self-supervised methods for visual representation learning. Implemented in Python, using Pytorch and Pytorch lightning, the library fits both research and industry needs by featuring distributed training pipelines with mixed-precision, faster data loading via Nvidia DALI, online linear evaluation for better prototyping, and many additional training tricks.  Our goal is to provide an easy-to-use library comprising a large amount of Self-supervised Learning (SSL) methods, that can be easily extended and fine-tuned by the community. solo-learn opens up avenues for exploiting large-budget SSL solutions on inexpensive smaller infrastructures and seeks to democratize SSL by making it accessible to all. The source code is available at https://github.com/vturrisi/solo-learn."
"211427","Inherent Tradeoffs in Learning Fair Representations","Han Zhao, Geoffrey J. Gordon","https://jmlr.org//papers/volume23/21-1427/21-1427.pdf","","Real-world applications of machine learning tools in high-stakes domains are often regulated to be fair, in the sense that the predicted target should satisfy some quantitative notion of parity with respect to a protected attribute. However, the exact tradeoff between fairness and accuracy is not entirely clear, even for the basic paradigm of classification problems. In this paper, we characterize an inherent tradeoff between statistical parity and accuracy in the classification setting by providing a lower bound on the sum of group-wise errors of any fair classifiers. Our impossibility theorem could be interpreted as a certain uncertainty principle in fairness: if the base rates differ among groups, then any fair classifier satisfying statistical parity has to incur a large error on at least one of the groups. We further extend this result to give a lower bound on the joint error of any (approximately) fair classifiers, from the perspective of learning fair representations. To show that our lower bound is tight, assuming oracle access to Bayes (potentially unfair) classifiers, we also construct an algorithm that returns a randomized classifier which is both optimal (in terms of accuracy) and fair. Interestingly, when the protected attribute can take more than two values, an extension of this lower bound does not admit an analytic solution. Nevertheless, in this case, we show that the lower bound can be efficiently computed by solving a linear program, which we term as the TV-Barycenter problem, a barycenter problem under the TV-distance. On the upside, we prove that if the group-wise Bayes optimal classifiers are close, then learning fair representations leads to an alternative notion of fairness, known as the accuracy parity, which states that the error rates are close between groups. 
Finally, we also conduct experiments on real-world datasets to confirm our theoretical findings."
"19297","A Statistical Approach for Optimal Topic Model Identification","Craig M. Lewis, Francesco Grossetti","https://jmlr.org//papers/volume23/19-297/19-297.pdf","","Latent Dirichlet Allocation is a popular machine-learning technique that identifies latent structures in a corpus of documents. This paper addresses the ongoing concern that formal procedures for determining the optimal LDA configuration do not exist by introducing a set of parametric tests that rely on the assumed multinomial distribution specification underlying the original LDA model. Our methodology defines a set of rigorous statistical procedures that identify and evaluate the optimal topic model. The U.S. Presidential Inaugural Address Corpus is used as a case study to show the numerical results. We find that 92 topics best describe the corpus. We further validate the method through a simulation study confirming the superiority of our approach compared to other standard heuristic metrics like the perplexity index."
"19480","Causal Classification: Treatment Effect Estimation vs. Outcome Prediction","Carlos Fernández-Loría, Foster Provost","https://jmlr.org//papers/volume23/19-480/19-480.pdf","","The goal of causal classification is to identify individuals whose outcome would be positively changed by a treatment. Examples include targeting advertisements and targeting retention incentives to reduce churn. Causal classification is challenging because we observe individuals under only one condition (treated or untreated), so we do not know who was influenced by the treatment, but we may estimate the potential outcomes under each condition to decide whom to treat by estimating treatment effects. Curiously, we often see practitioners using simple outcome prediction instead, for example, predicting if someone will purchase if shown the ad. Rather than disregarding this as naive behavior, we present a theoretical analysis comparing treatment effect estimation and outcome prediction when addressing causal classification. We focus on the key question: ""When (if ever) is simple outcome prediction preferable to treatment effect estimation for causal classification?"" The analysis reveals a causal bias--variance tradeoff. First, when the treatment effect estimation depends on two outcome predictions, larger sampling variance may lead to more errors than the (biased) outcome prediction approach. Second, a stronger signal-to-noise ratio in outcome prediction implies that the bias can help with intervention decisions when outcomes are informative of effects. The theoretical results, as well as simulations, illustrate settings where outcome prediction should actually be better, including cases where (1) the bias may be partially corrected by choosing a different threshold, (2) outcomes and treatment effects are correlated, and (3) data to estimate counterfactuals are limited. 
A major practical implication is that, for some applications, it might be feasible to make good intervention decisions without any data on how individuals actually behave when intervened.  Finally, we show that for a real online advertising application, outcome prediction models indeed excel at causal classification."
"19513","A Unifying Framework for Variance-Reduced Algorithms for Findings Zeroes of Monotone operators","Xun Zhang, William B. Haskell, Zhisheng Ye","https://jmlr.org//papers/volume23/19-513/19-513.pdf","","It is common to encounter large-scale monotone inclusion problems where the objective has a finite sum structure.  We develop a general framework for variance-reduced forward-backward splitting algorithms for this problem.  This framework includes a number of existing deterministic and variance-reduced algorithms for function minimization as special cases, and it is also applicable to more general problems such as saddle-point problems and variational inequalities.  With a carefully constructed Lyapunov function, we show that the algorithms covered by our framework enjoy a linear convergence rate in expectation under mild assumptions. We further consider Catalyst acceleration and asynchronous implementation to reduce the algorithmic complexity and computation time. We apply our proposed framework to a policy evaluation problem and a  strongly monotone two-player game, both of which fall outside the realm of function minimization."
"19597","Sparse Additive Gaussian Process Regression","Hengrui Luo, Giovanni Nattino, Matthew T. Pratola","https://jmlr.org//papers/volume23/19-597/19-597.pdf","","In this paper we introduce a novel model for Gaussian process (GP) regression in the fully Bayesian setting. Motivated by the ideas of sparsification, localization and Bayesian additive modeling, our model is built around a recursive partitioning (RP) scheme. Within each RP partition, a sparse GP (SGP) regression model is fitted. A Bayesian additive framework then combines multiple layers of partitioned SGPs, capturing both global trends and local refinements with efficient computations. The model addresses both the problem of efficiency in fitting a full Gaussian process regression model and the problem of prediction performance associated with a single SGP. Our approach mitigates the issue of pseudo-input selection and avoids the need for complex inter-block correlations in existing methods.  The crucial trade-off becomes choosing between many simpler local model components or fewer complex global model components, which the practitioner can sensibly tune. Implementation is via a Metropolis-Hasting Markov chain Monte-Carlo algorithm with Bayesian back-fitting. We compare our model against popular alternatives on simulated and real datasets, and find the performance is competitive, while the fully Bayesian procedure enables the quantification of model uncertainties."
"19599","The AIM and EM Algorithms for Learning from Coarse Data","Manfred Jaeger","https://jmlr.org//papers/volume23/19-599/19-599.pdf","https://github.com/manfred-jaeger-aalborg/aim_for_gauss","Statistical learning from incomplete data is typically performed under an assumption of ignorability for the mechanism that causes missing values. Notably, the expectation maximization (EM) algorithm is based on the assumption that values are missing at random. Most approaches that tackle non-ignorable mechanisms are based on specific modeling assumptions for these mechanisms. The adaptive imputation and maximization (AIM) algorithm has  been introduced in earlier  work as a general paradigm for learning from incomplete data without any assumptions on the process that causes observations to be incomplete.   In this paper we give a thorough analysis of the theoretical properties of the AIM algorithm, and its  relationship with EM. We identify conditions under which EM and AIM are in fact equivalent, and show that when these conditions are not met, then AIM can produce consistent estimates in non-ignorable incomplete data scenarios where EM becomes inconsistent. Convergence results for AIM are obtained that closely mirror the available convergence  guarantees for EM. We develop the general theory of the AIM algorithm for discrete data settings, and then develop a general discretization approach that allows to apply the method also to incomplete continuous data.  We demonstrate the practical usability of the AIM algorithm by prototype implementations for  parameter learning from continuous Gaussian data, and from discrete Bayesian network data. Extensive experiments  show that the theoretical differences between AIM and EM can be observed in practice, and that a combination of the two methods leads to robust performance for both ignorable and non-ignorable mechanisms."
"19697","Additive Nonlinear Quantile Regression in Ultra-high Dimension","Ben Sherwood, Adam Maidman","https://jmlr.org//papers/volume23/19-697/19-697.pdf","","We propose a method for simultaneous estimation and variable selection of an additive quantile regression model that can be used with high dimensional data. Quantile regression is an appealing method for analyzing high dimensional data because it can correctly model heteroscedastic relationships, is robust to outliers in the response, sparsity levels can change with quantiles, and it provides a thorough analysis of the conditional distribution of the response. An additive nonlinear model can capture more complex relationships, while avoiding the curse of dimensionality. The additive nonlinear model is fit using B-splines and a nonconvex group penalty is used for simultaneous estimation and variable selection. We derive the asymptotic properties of the estimator, including an oracle property, under general conditions that allow for the number of covariates, $p_n$, and the number of true covariates, $q_n$, to increase with the sample size, $n$. In addition, we propose a coordinate descent algorithm that reduces the computational cost compared to the linear programming approach typically used for solving quantile regression problems. The performance of the method is tested using Monte Carlo simulations, an analysis of fat content of meat conditional on a 100 channel spectrum of absorbances and predicting TRIM32 expression using gene expression data from the eyes of rats."
"19750","Stochastic Zeroth-Order Optimization under Nonstationarity and Nonconvexity","Abhishek Roy, Krishnakumar Balasubramanian, Saeed Ghadimi, Prasant Mohapatra","https://jmlr.org//papers/volume23/19-750/19-750.pdf","","Stochastic zeroth-order optimization algorithms have been predominantly analyzed under the assumption that the objective function being optimized is time-invariant. Motivated by dynamic matrix sensing and completion problems, and online reinforcement learning problems, in this work, we propose and analyze stochastic zeroth-order optimization algorithms when the objective being optimized changes with time. Considering general nonconvex functions, we propose nonstationary versions of regret measures based on first-order and second-order optimal solutions, and provide the corresponding regret bounds.  For the case of first-order optimal solution based regret measures, we provide regret bounds in both the low- and high-dimensional settings. For the case of second-order optimal solution based regret, we propose zeroth-order versions of the stochastic cubic-regularized Newton's method based on estimating the Hessian matrices in the bandit setting via second-order Gaussian Stein's identity. Our nonstationary regret bounds in terms of second-order optimal solutions have interesting consequences for avoiding saddle points in the nonstationary setting."
"19843","On the Complexity of Approximating Multimarginal Optimal Transport","Tianyi Lin, Nhat Ho, Marco Cuturi, Michael I. Jordan","https://jmlr.org//papers/volume23/19-843/19-843.pdf","","We study the complexity of approximating the multimarginal optimal transport (MOT) distance, a generalization of the classical optimal transport distance, considered here between $m$ discrete probability distributions supported each on $n$ support points. First, we show that the standard linear programming (LP) representation of the MOT problem is not a minimum-cost flow problem when $m \geq 3$. This negative result implies that some combinatorial algorithms, e.g., network simplex method, are not suitable for approximating the MOT problem, while the worst-case complexity bound for the deterministic interior-point algorithm remains a quantity of $\tilde{\mathcal{O}}(n^{3m})$. We then propose two simple and deterministic algorithms for approximating the MOT problem. The first algorithm, which we refer to as multimarginal Sinkhorn algorithm, is a provably efficient multimarginal generalization of the Sinkhorn algorithm. We show that it achieves a complexity bound of $\tilde{\mathcal{O}}(m^3n^m\varepsilon^{-2})$ for a tolerance $\varepsilon \in (0, 1)$. This provides a first near-linear time complexity bound guarantee for approximating the MOT problem and matches the best known complexity bound for the Sinkhorn algorithm in the classical OT setting when $m = 2$. The second algorithm, which we refer to as accelerated multimarginal Sinkhorn algorithm, achieves the acceleration by incorporating an estimate sequence and the complexity bound is $\tilde{\mathcal{O}}(m^3n^{m+1/3}\varepsilon^{-4/3})$. This bound is better than that of the first algorithm in terms of $1/\varepsilon$, and accelerated alternating minimization algorithm (Tupitsa et al., 2020)  in terms of $n$. Finally, we compare our new algorithms with the commercial LP solver Gurobi. 
Preliminary results on synthetic data and real images demonstrate the effectiveness and efficiency of our algorithms."
"20064","New Insights for the Multivariate Square-Root Lasso","Aaron J. Molstad","https://jmlr.org//papers/volume23/20-064/20-064.pdf","https://github.com/ajmolstad/MSRL","We study the multivariate square-root lasso, a method for fitting the multivariate response linear regression model with dependent errors. This estimator minimizes the nuclear norm of the residual matrix plus a convex penalty. Unlike existing methods that require explicit estimates of the error precision (inverse covariance) matrix, the multivariate square-root lasso implicitly accounts for error dependence and is the solution to a convex optimization problem. We establish error bounds which reveal that like the univariate square-root lasso, the multivariate square-root lasso is pivotal with respect to the unknown error covariance matrix. In addition, we propose a variation of the alternating direction method of multipliers algorithm to compute the estimator and discuss an accelerated first order algorithm that can be applied in certain cases. In both simulation studies and a genomic data application, we show that the multivariate square-root lasso can outperform more computationally intensive methods that require explicit estimation of the error precision matrix."
"20069","Are All Layers Created Equal?","Chiyuan Zhang, Samy Bengio, Yoram Singer","https://jmlr.org//papers/volume23/20-069/20-069.pdf","","Understanding deep neural networks is a major research objective with notable experimental and theoretical attention in recent years. The practical success of excessively large networks underscores the need for better theoretical analyses and justifications. In this paper we focus on layer-wise functional structure and behavior in overparameterized deep models. To do so, we study empirically the layers' robustness to post-training re-initialization and re-randomization of the parameters. We provide experimental results which give evidence for the heterogeneity of layers. Morally, layers of large deep neural networks can be categorized as either ""robust"" or ""critical"". Resetting the robust layers to their initial values does not result in adverse decline in performance. In many cases, robust layers hardly change throughout training. In contrast, re-initializing critical layers vastly degrades the performance of the network with test error essentially dropping to random guesses. Our study provides further evidence that mere parameter counting or norm calculations are too coarse in studying generalization of deep models, and ""flatness"" and robustness analysis of trained models need to be examined while taking into account the respective network architectures."
"20099","Scaling-Translation-Equivariant Networks with Decomposed Convolutional Filters","Wei Zhu, Qiang Qiu, Robert Calderbank, Guillermo Sapiro, Xiuyuan Cheng","https://jmlr.org//papers/volume23/20-099/20-099.pdf","","Encoding the scale information explicitly into the representation learned by a convolutional neural network (CNN) is beneficial for many computer vision tasks especially when dealing with multiscale inputs. We study, in this paper, a scaling-translation-equivariant ($\mathcal{ST}$-equivariant) CNN with joint convolutions across the space and  the scaling group, which is shown to be both sufficient and necessary to achieve equivariance for the regular representation of the scaling-translation group $\mathcal{ST}$. To reduce the model complexity and computational burden,  we decompose the convolutional filters under two pre-fixed separable bases and truncate the expansion to  low-frequency components. A further benefit of the truncated filter expansion is the improved deformation robustness of the equivariant representation, a property which is theoretically analyzed and empirically verified. Numerical experiments demonstrate that the proposed scaling-translation-equivariant network with decomposed convolutional filters (ScDCFNet) achieves significantly improved performance in multiscale image classification and better interpretability than regular CNNs at a reduced model size."
"201027","Asymptotic Network Independence and Step-Size for a Distributed Subgradient Method","Alex Olshevsky","https://jmlr.org//papers/volume23/20-1027/20-1027.pdf","https://github.com/alexolshevsky/NetworkIndependenceSubgradient/blob/main/Step_size_inversion.ipynb","We consider whether distributed subgradient methods can achieve a linear speedup over a centralized subgradient method. While it might be hoped that distributed network of $n$ nodes that can compute $n$ times more subgradients in parallel compared to a single node might, as a result, be $n$ times faster,  existing bounds for distributed optimization methods are often consistent with a slowdown rather than speedup compared to a single node.  We show that a  distributed subgradient method has this “linear speedup” property when using a class of square-summable-but-not-summable step-sizes which include $1/t^{\beta}$ when $\beta \in (1/2,1)$; for such step-sizes, we show that after a  transient period whose size depends on the spectral gap of the network, the method achieves a performance guarantee that does not depend on the network or the number of nodes. We also show that the same method can fail to have this “asymptotic network independence” property under the optimally decaying step-size $1/\sqrt{t}$ and, as a consequence, can fail to provide a linear speedup compared to a single node with $1/\sqrt{t}$ step-size."
"20108","Generalized Sparse Additive Models","Asad Haris, Noah Simon, Ali Shojaie","https://jmlr.org//papers/volume23/20-108/20-108.pdf","https://github.com/asadharis/GSAM","We present a unified framework for estimation and analysis of generalized additive models in high dimensions. The framework defines a large class of penalized regression estimators, encompassing many existing methods. An efficient computational algorithm for this class is presented that easily scales to thousands of observations and features. We prove minimax optimal convergence bounds for this class under a weak compatibility condition. In addition, we characterize the rate of convergence when this compatibility condition is not met. Finally, we also show that the optimal penalty parameters for structure and sparsity penalties in our framework are linked, allowing cross-validation to be conducted over only a single tuning parameter. We complement our theoretical results with empirical studies comparing some existing methods within this framework."
"201103","Multiple-Splitting Projection Test for High-Dimensional Mean Vectors","Wanjun Liu, Xiufan Yu, Runze Li","https://jmlr.org//papers/volume23/20-1103/20-1103.pdf","","We propose a multiple-splitting projection test (MPT) for one-sample mean vectors in high-dimensional settings. The idea of projection test is to project high-dimensional samples to a 1-dimensional space using an optimal projection direction such that traditional tests can be carried out with projected samples. However, estimation of the optimal projection direction has not been systematically studied in the literature. In this work, we bridge the gap by proposing a consistent estimation via regularized quadratic optimization. To retain type I error rate, we adopt a data-splitting strategy when constructing test statistics. To mitigate the power loss due to data-splitting, we further propose a test via multiple splits to enhance the testing power. We show that the $p$-values resulted from multiple splits are exchangeable.  Unlike existing methods which tend to conservatively combine dependent $p$-values, we develop an exact level $\alpha$ test that explicitly utilizes the exchangeability structure to achieve better power. Numerical studies show that the proposed test well retains the type I error rate and is more powerful than state-of-the-art tests."
"201135","Batch Normalization Preconditioning for Neural Network Training","Susanna Lange, Kyle Helfrich, Qiang Ye","https://jmlr.org//papers/volume23/20-1135/20-1135.pdf","","Batch normalization (BN) is a popular and ubiquitous method in deep learning that has been shown to decrease training time and improve generalization performance of neural networks. Despite its success, BN is not theoretically well understood. It is not suitable for use with very small mini-batch sizes or online learning. In this paper, we propose a new method called Batch Normalization Preconditioning (BNP). Instead of applying normalization explicitly through a batch normalization layer as is done in BN, BNP applies normalization by conditioning the parameter gradients directly during training. This is designed to improve the Hessian matrix of the loss function and hence convergence during training. One benefit is that BNP is not constrained on the mini-batch size and works in the online learning setting. Furthermore, its connection to BN provides theoretical insights on how BN improves training and how BN is applied to special architectures such as convolutional neural networks. For a theoretical foundation, we also present a novel Hessian condition number based convergence theory for a locally convex but not strong-convex loss, which is applicable to networks with a scale-invariant property."
"201180","A Kernel Two-Sample Test for Functional Data","George Wynne, Andrew B. Duncan","https://jmlr.org//papers/volume23/20-1180/20-1180.pdf","","We propose a nonparametric two-sample test procedure based on Maximum Mean Discrepancy (MMD) for testing the hypothesis that two samples of functions have the same underlying distribution, using kernels defined on function spaces. This construction is motivated by a scaling analysis of the efficiency of MMD-based tests for datasets of increasing dimension. Theoretical properties of kernels on function spaces and their associated MMD  are established and employed to ascertain the efficacy of the newly proposed test, as well as to assess the effects of using functional reconstructions based on discretised function samples.  The theoretical results are demonstrated over a range of synthetic and real world datasets."
"201340","All You Need is a Good Functional Prior for Bayesian Deep Learning","Ba-Hien Tran, Simone Rossi, Dimitrios Milios, Maurizio Filippone","https://jmlr.org//papers/volume23/20-1340/20-1340.pdf","https://github.com/tranbahien/you-need-a-good-prior","The Bayesian treatment of neural networks dictates that a prior distribution is specified over their weight and bias parameters. This poses a challenge because modern neural networks are characterized by a large number of parameters, and the choice of these priors has an uncontrolled effect on the induced functional prior, which is the distribution of the functions obtained by sampling the parameters from their prior distribution. We argue that this is a hugely limiting aspect of Bayesian deep learning, and this work tackles this limitation in a practical and effective way. Our proposal is to reason in terms of functional priors, which are easier to elicit, and to “tune” the priors of neural network parameters in a way that they reflect such functional priors. Gaussian processes offer a rigorous framework to define prior distributions over functions, and we propose a novel and robust framework to match their prior with the functional prior of neural networks based on the minimization of their Wasserstein distance. We provide vast experimental evidence that coupling these priors with scalable Markov chain Monte Carlo sampling offers systematically large performance improvements over alternative choices of priors and state-of-the-art approximate Bayesian deep learning approaches. We consider this work a considerable step in the direction of making the long-standing challenge of carrying out a fully Bayesian treatment of neural networks, including convolutional neural networks, a concrete possibility."
"201358","Mutual Information Constraints for Monte-Carlo Objectives to Prevent Posterior Collapse Especially in Language Modelling","Gábor Melis, András György, Phil Blunsom","https://jmlr.org//papers/volume23/20-1358/20-1358.pdf","","Posterior collapse is a common failure mode of density models trained as variational autoencoders, wherein they model the data without relying on their latent variables, rendering these variables useless. We focus on two factors contributing to posterior collapse, that have been studied separately in the literature. First, the underspecification of the model, which in an extreme but common case allows posterior collapse to be the theoretical optimium. Second, the looseness of the variational lower bound and the related underestimation of the utility of the latents. We weave these two strands of research together, specifically the tighter bounds of multi-sample Monte-Carlo objectives and constraints on the mutual information between the observable and the latent variables. The main obstacle is that the usual method of estimating the mutual information as the average Kullback-Leibler divergence between the easily available variational posterior q(z|x) and the prior does not work with Monte-Carlo objectives because their q(z|x) is not a direct approximation to the model's true posterior p(z|x). Hence, we construct estimators of the Kullback-Leibler divergence of the true posterior from the prior by recycling samples used in the objective, with which we train models of continuous and discrete latents at much improved rate-distortion and no posterior collapse. While alleviated, the tradeoff between modelling the data and using the latents still remains, and we urge for evaluating inference methods across a range of mutual information values."
"201375","Joint Inference of Multiple Graphs from Matrix Polynomials","Madeline Navarro, Yuhao Wang, Antonio G. Marques, Caroline Uhler, Santiago Segarra","https://jmlr.org//papers/volume23/20-1375/20-1375.pdf","","Inferring graph structure from observations on the nodes is an important and popular network science task. Departing from the more common inference of a single graph, we study the problem of jointly inferring multiple graphs from the observation of signals at their nodes (graph signals), which are assumed to be stationary in the sought graphs. Graph stationarity implies that the mapping between the covariance of the signals and the sparse matrix representing the underlying graph is given by a matrix polynomial. A prominent example is that of Markov random fields, where the inverse of the covariance yields the sparse matrix of interest. From a modeling perspective, stationary graph signals can be used to model linear network processes evolving on a set of (not necessarily known) networks. Leveraging that matrix polynomials commute, a convex optimization method along with sufficient conditions that guarantee the recovery of the true graphs are provided when perfect covariance information is available. Particularly important from an empirical viewpoint, we provide high-probability bounds on the recovery error as a function of the number of signals observed and other key problem parameters. Numerical experiments demonstrate the effectiveness of the proposed method with perfect covariance information as well as its robustness in the noisy regime."
"201384","Efficient Change-Point Detection for Tackling Piecewise-Stationary Bandits","Lilian Besson, Emilie Kaufmann, Odalric-Ambrym Maillard, Julien Seznec","https://jmlr.org//papers/volume23/20-1384/20-1384.pdf","https://github.com/EmilieKaufmann/PiecewiseStationaryBandits","We introduce GLRklUCB, a novel algorithm for the piecewise iid non-stationary bandit problem with bounded rewards. This algorithm combines an efficient bandit algorithm, klUCB, with an efficient, parameter-free, change-point detector, the Bernoulli Generalized Likelihood Ratio Test, for which we provide new theoretical guarantees of independent interest. Unlike previous non-stationary bandit algorithms using a change-point detector, GLRklUCB does not need to be calibrated based on prior knowledge on the arms' means. We prove that this algorithm can attain a $O(\sqrt{TA\Upsilon_T\log(T)})$ regret in $T$ rounds on some “easy” instances in which there is sufficient delay between two change-points, where $A$ is the number of arms and $\Upsilon_T$ the number of change-points, without prior knowledge of $\Upsilon_T$. In contrast with recently proposed algorithms that are agnostic to $\Upsilon_T$, we perform a numerical study showing that GLRklUCB is also very efficient in practice, beyond easy instances."
"201393","Multi-Agent Online Optimization with Delays: Asynchronicity, Adaptivity, and Optimism","Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, Panayotis Mertikopoulos","https://jmlr.org//papers/volume23/20-1393/20-1393.pdf","","In this paper, we provide a general framework for studying multi-agent online learning problems in the presence of delays and asynchronicities. Specifically, we propose and analyze a class of adaptive dual averaging schemes in which agents only need to accumulate gradient feedback received from the whole system, without requiring any between-agent coordination. In the single-agent case, the adaptivity of the proposed method allows us to extend a range of existing results to problems with potentially unbounded delays between playing an action and receiving the corresponding feedback. In the multi-agent case, the situation is significantly more complicated because agents may not have access to a global clock to use as a reference point; to overcome this, we focus on the information that is available for producing each prediction rather than the actual delay associated with each feedback. This allows us to derive adaptive learning strategies with optimal regret bounds, even in a fully decentralized, asynchronous environment. Finally, we also analyze an “optimistic” variant of the proposed algorithm which is capable of exploiting the predictability of problems with a slower variation and leads to improved regret bounds."
"201426","Stacking for Non-mixing Bayesian Computations: The Curse and Blessing of Multimodal Posteriors","Yuling Yao, Aki Vehtari, Andrew Gelman","https://jmlr.org//papers/volume23/20-1426/20-1426.pdf","","When working with multimodal Bayesian posterior distributions, Markov chain Monte Carlo (MCMC) algorithms have difficulty moving between modes, and default variational or mode-based approximate inferences will understate posterior uncertainty. And, even if the most important modes can be found, it is difficult to evaluate their relative weights in the posterior. Here we propose an approach using parallel runs of MCMC, variational, or mode-based inference to hit as many modes or separated regions as possible and then combine these using Bayesian stacking, a scalable method for constructing a weighted average of distributions. The result from stacking efficiently samples from multimodal posterior distribution, minimizes cross validation prediction error, and represents the posterior uncertainty better than variational inference, but it is not necessarily equivalent, even asymptotically, to fully Bayesian inference. We present theoretical consistency with an example where the stacked inference approximates the true data generating process from the misspecified model and a non-mixing sampler, from which the predictive performance is better than full Bayesian inference, hence the multimodality can be considered a blessing rather than a curse under model misspecification. We demonstrate practical implementation in several model families: latent Dirichlet allocation, Gaussian process regression, hierarchical regression, horseshoe variable selection, and neural networks."
"201474","Posterior Asymptotics for Boosted Hierarchical Dirichlet Process Mixtures","Marta Catalano, Pierpaolo De Blasi, Antonio Lijoi, Igor Pruenster","https://jmlr.org//papers/volume23/20-1474/20-1474.pdf","","Bayesian hierarchical models are powerful tools for learning common latent features across multiple data sources. The Hierarchical Dirichlet Process (HDP) is invoked when the number of latent components is a priori unknown. While there is a rich literature on finite sample properties and performance of hierarchical processes, the analysis of their frequentist posterior asymptotic properties is still at an early stage. Here we establish theoretical guarantees for recovering the true data generating process when the data are modeled as mixtures over the HDP or a generalization of the HDP, which we term boosted because of the faster growth in the number of discovered latent features. By extending Schwartz's theory to partially exchangeable sequences we show that posterior contraction rates are crucially affected by the relationship between the sample sizes corresponding to the different groups. The effect varies according to the smoothness level of the true data distributions. In the supersmooth case,  when the generating densities are Gaussian mixtures, we recover the parametric rate up to a logarithmic factor, provided that the sample sizes are related in a polynomial fashion. Under ordinary smoothness assumptions more caution is needed as a polynomial deviation in the sample sizes could drastically deteriorate the convergence to the truth."
"20204","Dependent randomized rounding for clustering and partition systems with knapsack constraints","David G. Harris, Thomas Pensyl, Aravind Srinivasan, Khoa Trinh","https://jmlr.org//papers/volume23/20-204/20-204.pdf","","Clustering problems are fundamental to unsupervised learning. There is an increased emphasis on fairness in machine learning and AI; one representative notion of fairness is that no single group should be over-represented among the cluster-centers. This, and much more general clustering problems, can be formulated with “knapsack"" and “partition"" constraints. We develop new randomized algorithms targeting such problems, and study two in particular: multi-knapsack median and multi-knapsack center. Our rounding algorithms give new approximation and pseudo-approximation algorithms for these problems. One key technical tool, which may be of independent interest, is a new tail bound analogous to Feige (2006) for sums of random variables with unbounded variances. Such bounds can be useful in inferring properties of large networks using few samples."
"20231","FuDGE: A Method to Estimate a Functional Differential Graph in a High-Dimensional Setting","Boxin Zhao, Y. Samuel Wang, Mladen Kolar","https://jmlr.org//papers/volume23/20-231/20-231.pdf","https://github.com/boxinz17/FuDGE","We consider the problem of estimating the difference between two undirected functional graphical models with shared structures. In many applications, data are naturally regarded as a vector of random functions rather than as a vector of scalars. For example, electroencephalography (EEG) data are treated more appropriately as functions of time. In such a problem, not only can the number of functions measured per sample be large, but each function is itself an infinite dimensional object, making estimation of model parameters challenging. This is further complicated by the fact that curves are usually observed only at discrete time points. We first define a functional differential graph that captures the differences between two functional graphical models and formally characterize when the functional differential graph is well defined. We then propose a method, FuDGE, that directly estimates the functional differential graph without first estimating each individual graph. This is particularly beneficial in settings where the individual graphs are dense but the differential graph is sparse. We show that FuDGE consistently estimates the functional differential graph even in a high-dimensional setting for both fully observed and discretely observed function paths. We illustrate the finite sample properties of our method through simulation studies. We also propose a competing method, the Joint Functional Graphical Lasso, which generalizes the Joint Graphical Lasso to the functional setting. Finally, we apply our method to EEG data to uncover differences in functional brain connectivity between a group of individuals with alcohol use disorder and a control group."
"20290","Prior Adaptive Semi-supervised Learning with Application to EHR Phenotyping","Yichi Zhang, Molei Liu, Matey Neykov, Tianxi Cai","https://jmlr.org//papers/volume23/20-290/20-290.pdf","https://github.com/moleibobliu/PASS","Electronic Health Record (EHR) data, a rich source for biomedical research, have been successfully used to gain novel insight into a wide range of diseases. Despite its potential, EHR is currently underutilized for discovery research due to its major limitation in the lack of precise phenotype information. To overcome such difficulties, recent efforts have been devoted to developing supervised algorithms to accurately predict phenotypes based on relatively small training datasets with gold-standard labels extracted via chart review. However, supervised methods typically require a sizable training set to yield generalizable algorithms, especially when the number of candidate features is large. In this paper, we propose a semi-supervised (SS) EHR phenotyping method that borrows information from both a small, labeled dataset (where both the label Y and the feature set X are observed) and a much larger, weakly-labeled dataset in which the feature set X is accompanied only by a surrogate label S that is available to all patients. Under a working prior assumption that S is related to X only through Y and allowing it to hold approximately, we propose a prior adaptive semi-supervised (PASS) estimator that incorporates the prior knowledge by shrinking the estimator towards a direction derived under the prior. We derive asymptotic theory for the proposed estimator and justify its efficiency and robustness to prior information of poor quality. We also demonstrate its superiority over existing estimators under various scenarios via simulation studies and on three real-world EHR phenotyping studies at a large tertiary hospital."
"20543","Distributed Bayesian Varying Coefficient Modeling Using a Gaussian Process Prior","Rajarshi Guhaniyogi, Cheng Li, Terrance D. Savitsky, Sanvesh Srivastava","https://jmlr.org//papers/volume23/20-543/20-543.pdf","","Varying coefficient models (VCMs) are widely used for estimating nonlinear regression functions for functional data. Their Bayesian variants using Gaussian process priors on the functional coefficients, however, have received limited attention in massive data applications, mainly due to the prohibitively slow posterior computations using Markov chain Monte Carlo (MCMC) algorithms. We address this problem using a divide-and-conquer Bayesian approach. We first create a large number of data subsamples with much smaller sizes. Then, we formulate the VCM as a linear mixed-effects model and develop a data augmentation algorithm for obtaining MCMC draws on all the subsets in parallel. Finally, we aggregate the MCMC-based estimates of subset posteriors into a single Aggregated Monte Carlo (AMC) posterior, which is used as a computationally efficient alternative to the true posterior distribution. Theoretically, we derive minimax optimal posterior convergence rates for the AMC posteriors of both the varying coefficients and the mean regression function. We provide quantification on the orders of subset sample sizes and the number of subsets. The empirical results show that the combination schemes that satisfy our theoretical assumptions, including the AMC posterior, have better estimation performance than their main competitors across diverse simulations and in a real data analysis."
"20682","A Distribution Free Conditional Independence Test with Applications to Causal Discovery","Zhanrui Cai, Runze Li, Yaowu Zhang","https://jmlr.org//papers/volume23/20-682/20-682.pdf","","This paper is concerned with test of the conditional independence. We first establish an equivalence between the conditional independence and the mutual independence. Based on the equivalence, we propose an index to measure the conditional dependence by quantifying the mutual dependence among the transformed variables. The proposed index has several appealing properties. (a) It is distribution free since the limiting null distribution of the proposed index does not depend on the population distributions of the data. Hence the critical values can be tabulated by simulations. (b) The proposed index ranges from zero to one, and equals zero if and only if the conditional independence holds. Thus, it has nontrivial power under the alternative hypothesis. (c) It is robust to outliers and heavy-tailed data since it is invariant to conditional strictly monotone transformations. (d) It has low computational cost since it incorporates a simple closed-form expression and can be implemented in quadratic time. (e) It is insensitive to tuning parameters involved in the calculation of the proposed index. (f) The new index is applicable for multivariate random vectors as well as for discrete data. All these properties enable us to use the new index as statistical inference tools for various data. The effectiveness of the method is illustrated through extensive simulations and a real application on causal discovery."
"20786","Robust and scalable manifold learning via landmark diffusion for long-term medical signal processing","Chao Shen, Yu-Ting Lin, Hau-Tieng Wu","https://jmlr.org//papers/volume23/20-786/20-786.pdf","","Motivated by analyzing long-term physiological time series, we design a robust and scalable spectral embedding algorithm that we refer to as RObust and Scalable Embedding via LANdmark Diffusion (Roseland).  The key is designing a diffusion process on the dataset where the diffusion is done via a small subset called the landmark set. Roseland is theoretically justified under the manifold model, and its computational complexity is comparable with commonly applied subsampling scheme such as the Nystr\""om extension. Specifically, when there are $n$ data points in $\mathbb{R}^q$ and $n^\beta$ points in the landmark set, where $\beta\in (0,1)$, the computational complexity of Roseland is $O(n^{1+2\beta}+qn^{1+\beta})$, while that of Nystrom is $O(n^{2.81\beta}+qn^{1+2\beta})$. To demonstrate the potential of Roseland, we apply it to { three} datasets and compare it with several other existing algorithms. First, we apply Roseland to the task of spectral clustering using the MNIST dataset (70,000 images), achieving 85\% accuracy when the dataset is clean and 78\% accuracy when the dataset is noisy. Compared with other subsampling schemes, overall Roseland achieves a better performance. Second, we apply Roseland to the task of image segmentation using images from COCO. Finally, we demonstrate how to apply Roseland to explore long-term arterial blood pressure waveform dynamics during a liver transplant operation lasting for 12 hours. In conclusion, Roseland is scalable and robust, and it has a potential for analyzing large datasets."
"20797","CD-split and HPD-split: Efficient Conformal Regions in High Dimensions","Rafael Izbicki, Gilson Shimizu, Rafael B. Stern","https://jmlr.org//papers/volume23/20-797/20-797.pdf","https://github.com/rizbicki/predictionBands","Conformal methods create prediction bands that control average coverage assuming solely i.i.d. data. Although the literature has mostly focused on  prediction intervals, more general regions can often better represent uncertainty. For instance, a bimodal target is better represented by the union of two intervals. Such prediction regions are obtained by CD-split, which combines the split method and a data-driven partition of the feature space which scales to high dimensions. CD-split however contains many tuning parameters, and their role is not clear. In this paper, we provide new insights on CD-split by exploring its theoretical properties. In particular, we show that CD-split converges asymptotically to the oracle highest predictive density set and satisfies local and asymptotic conditional validity. We also present simulations that show how to tune CD-split. Finally, we introduce HPD-split, a variation of CD-split that requires less tuning, and show that it shares the same theoretical guarantees as CD-split. In a wide variety of our simulations, CD-split and HPD-split have better conditional coverage and yield smaller prediction regions than other methods."
"20843","Generalized Ambiguity Decomposition for Ranking Ensemble Learning","Hongzhi Liu, Yingpeng Du, Zhonghai Wu","https://jmlr.org//papers/volume23/20-843/20-843.pdf","","Error decomposition analysis is a key problem for ensemble learning, which indicates that proper combination of multiple models can achieve better performance than any individual one. Existing theoretical research of ensemble learning focuses on regression or classification tasks. There is limited theoretical research for ranking ensemble. In this paper, we first generalize the ambiguity decomposition theory from regression ensemble to ranking ensemble, which proves the effectiveness of ranking ensemble with consideration of list-wise ranking information. According to the generalized theory, we propose an explicit diversity measure for ranking ensemble, which can be used to enhance the diversity of ensemble and improve the performance of ensemble model. Furthermore, we adopt an adaptive learning scheme to learn query-dependent ensemble weights, which can fit into the generalized theory and help to further improve the performance of ensemble model. Extensive experiments on recommendation and information retrieval tasks demonstrate the effectiveness and theoretical advantages of the proposed method compared with several state-of-the-art methods."
"20852","Machine Learning on Graphs: A Model and Comprehensive Taxonomy","Ines Chami, Sami Abu-El-Haija, Bryan Perozzi, Christopher Ré, Kevin Murphy","https://jmlr.org//papers/volume23/20-852/20-852.pdf","","There has been a surge of recent interest in graph representation learning (GRL). GRL methods have generally fallen into three main categories, based on the availability of labeled data. The first, network embedding, focuses on learning unsupervised representations of relational structure. The second, graph regularized neural networks, leverages graphs to augment neural network losses with a regularization objective for semi-supervised learning. The third, graph neural networks, aims to learn differentiable functions over discrete topologies with arbitrary structure. However, despite the popularity of these areas there has been surprisingly little work on unifying the three paradigms. Here, we aim to bridge the gap between network embedding, graph regularization and graph neural networks. We propose a comprehensive taxonomy of GRL methods, aiming to unify several disparate bodies of work. Specifically, we propose the GraphEDM framework, which generalizes popular algorithms for semi-supervised learning (e.g. GraphSage, GCN, GAT), and unsupervised learning (e.g. DeepWalk, node2vec) of graph representations into a single consistent approach. To illustrate the generality of GraphEDM, we fit over thirty existing methods into this framework. We believe that this unifying view both provides a solid foundation for understanding the intuition behind these methods, and enables future research in the area."
"20910","Accelerating Adaptive Cubic Regularization of Newton's Method via Random Sampling","Xi Chen, Bo Jiang, Tianyi Lin, Shuzhong Zhang","https://jmlr.org//papers/volume23/20-910/20-910.pdf","","In this paper, we consider an unconstrained optimization model where the objective is a sum of a large number of possibly nonconvex functions, though overall the objective is assumed to be smooth and convex. Our bid to solving such model uses the framework of cubic regularization of Newton's method. As well known, the crux in cubic regularization is its utilization of the Hessian information, which may be computationally expensive for large-scale problems. To tackle this, we resort to approximating the Hessian matrix via sub-sampling. In particular, we propose to compute an approximated Hessian matrix by either uniformly or non-uniformly sub-sampling the components of the objective. Based upon such sampling strategy, we develop accelerated adaptive cubic regularization approaches and provide theoretical guarantees on global iteration complexity of $\O(\epsilon^{-1/3})$ with high probability, which matches that of the original accelerated cubic regularization methods Jiang et al. (2020) using the full Hessian information. Interestingly, we also show that in the worst case scenario our algorithm still achieves an $O(\epsilon^{-5/6}\log(\epsilon^{-1}))$ iteration complexity bound. The proof techniques are new to our knowledge and can be of independent interets. Experimental results on the regularized logistic regression problems demonstrate a clear effect of acceleration on several real data sets."
"20940","When Hardness of Approximation Meets Hardness of Learning","Eran Malach, Shai Shalev-Shwartz","https://jmlr.org//papers/volume23/20-940/20-940.pdf","","A supervised learning algorithm has access to a distribution of labeled examples, and needs to return a function (hypothesis) that correctly labels the examples. The hypothesis of the learner is taken from some fixed class of functions (e.g., linear classifiers, neural networks etc.). A failure of the learning algorithm can occur due to two possible reasons: wrong choice of hypothesis class (hardness of approximation), or failure to find the best function within the hypothesis class (hardness of learning). Although both approximation and learnability are important for the success of the algorithm, they are typically studied separately. In this work, we show a single hardness property that implies both hardness of approximation using linear classes and shallow networks, and hardness of learning using correlation queries and gradient-descent. This allows us to obtain new results on hardness of approximation and learnability of parity functions, DNF formulas and $AC^0$ circuits."
"210030","Gauss-Legendre Features for Gaussian Process Regression","Paz Fink Shustin, Haim Avron","https://jmlr.org//papers/volume23/21-0030/21-0030.pdf","","Gaussian processes provide a powerful probabilistic kernel learning framework, which allows learning high quality nonparametric regression models via methods such as Gaussian process regression. Nevertheless, the learning phase of Gaussian process regression requires massive computations which are not realistic for large datasets. In this paper, we present a Gauss-Legendre quadrature based approach for scaling up Gaussian process regression via a low rank approximation of the kernel matrix. We utilize the structure of the low rank approximation to achieve effective hyperparameter learning, training and prediction. Our method is very much inspired by the well-known random Fourier features approach, which also builds low-rank approximations via numerical integration. However, our method is capable of generating high quality approximation to the kernel using an amount of features which is poly-logarithmic in the number of training points, while similar guarantees will require an amount that is at the very least linear in the number of training points when using random Fourier features. Furthermore, the structure of the low-rank approximation that our method builds is subtly different from the one generated by random Fourier features, and this enables much more efficient hyperparameter learning. The utility of our method for learning with low-dimensional datasets is demonstrated using numerical experiments."
"210052","Regularized K-means Through Hard-Thresholding","Jakob Raymaekers, Ruben H. Zamar","https://jmlr.org//papers/volume23/21-0052/21-0052.pdf","https://cran.microsoft.com/web/packages/clusterHD/index.html","We study a framework for performing regularized K-means, based on direct  penalization of the size of the cluster centers. Different penalization strategies are considered and compared in a theoretical analysis and an extensive Monte Carlo simulation study. Based on the results, we propose a new method called hard-threshold K-means (HTK-means), which uses an ℓ0 penalty to induce sparsity. HTK-means is a fast and competitive sparse clustering method which is easily interpretable, as is illustrated on several real data examples. In this context, new graphical displays are presented and used to gain further insight into the data sets."
"210054","Multiple Testing in Nonparametric Hidden Markov Models: An Empirical Bayes Approach","Kweku Abraham, Ismaël Castillo, Elisabeth Gassiat","https://jmlr.org//papers/volume23/21-0054/21-0054.pdf","","Given a nonparametric Hidden Markov Model (HMM) with two states, the question of constructing efficient multiple testing procedures is considered, treating the states as unknown null and alternative hypotheses. A procedure is introduced, based on nonparametric empirical Bayes ideas, that controls the False Discovery Rate (FDR) at a user-specified level. Guarantees on power are also provided, in the form of a control of the true positive rate. One of the key steps in the construction requires supremum-norm convergence of preliminary estimators of the emission densities of the HMM. We provide the existence of such estimators, with convergence at the optimal minimax rate, for the case of a HMM with $J\ge 2$ states, which is of independent interest."
"210055","Attraction-Repulsion Spectrum in Neighbor Embeddings","Jan Niklas Böhm, Philipp Berens, Dmitry Kobak","https://jmlr.org//papers/volume23/21-0055/21-0055.pdf","https://github.com/berenslab/ne-spectrum/","Neighbor embeddings are a family of methods for visualizing complex high-dimensional data sets using kNN graphs. To find the low-dimensional embedding, these algorithms combine an attractive force between neighboring pairs of points with a repulsive force between all points. One of the most popular examples of such algorithms is t-SNE. Here we empirically show that changing the balance between the attractive and the repulsive forces in t-SNE using the exaggeration parameter yields a spectrum of embeddings, which is characterized by a simple trade-off: stronger attraction can better represent continuous manifold structures, while stronger repulsion can better represent discrete cluster structures and yields higher kNN recall. We find that UMAP embeddings correspond to t-SNE with increased attraction; mathematical analysis shows that this is because the negative sampling optimization strategy employed by UMAP strongly lowers the effective repulsion. Likewise, ForceAtlas2, commonly used for visualizing developmental single-cell transcriptomic data, yields embeddings corresponding to t-SNE with the attraction increased even more. At the extreme of this spectrum lie Laplacian eigenmaps. Our results demonstrate that many prominent neighbor embedding algorithms can be placed onto the attraction-repulsion spectrum, and highlight the inherent trade-offs between them."
"210082","Rethinking Nonlinear Instrumental Variable Models through Prediction Validity","Chunxiao Li, Cynthia Rudin, Tyler H. McCormick","https://jmlr.org//papers/volume23/21-0082/21-0082.pdf","","Instrumental variables (IV) are widely used in the social and health sciences in situations where a researcher would like to measure a causal effect but cannot perform an experiment. For valid causal inference in an IV model, there must be external (exogenous) variation that (i) has a sufficiently large impact on the variable of interest (called the relevance assumption) and where (ii) the only pathway through which the external variation impacts the outcome is via the variable of interest (called the exclusion restriction).  For statistical inference, researchers must also make assumptions about the functional form of the relationship between the three variables. Current practice assumes (i) and (ii) are met, then postulates a functional form with limited input from the data. In this paper, we describe a framework that leverages machine learning to validate these typically unchecked but consequential assumptions in the IV framework, providing the researcher empirical evidence about the quality of the instrument given the data at hand. Central to the proposed approach is the idea of prediction validity. Prediction validity checks that error terms -- which should be independent from the instrument -- cannot be modeled with machine learning any better than a model that is identically zero. We use prediction validity to develop both one-stage and two-stage approaches for IV, and demonstrate their performance on an example relevant to climate change policy."
"210084","Unlabeled Data Help in Graph-Based Semi-Supervised Learning: A Bayesian Nonparametrics Perspective","Daniel Sanz-Alonso, Ruiyi Yang","https://jmlr.org//papers/volume23/21-0084/21-0084.pdf","","In this paper we analyze the graph-based approach to semi-supervised learning under a manifold assumption. We adopt a Bayesian perspective and demonstrate that, for a suitable choice of prior constructed with sufficiently many unlabeled data, the posterior contracts around the truth at a rate that is minimax optimal up to a logarithmic factor. Our theory covers both regression and classification."
"210085","PECOS: Prediction for Enormous and Correlated Output Spaces","Hsiang-Fu Yu, Kai Zhong, Jiong Zhang, Wei-Cheng Chang, Inderjit S. Dhillon","https://jmlr.org//papers/volume23/21-0085/21-0085.pdf","https://libpecos.org/","Many large-scale applications amount to finding relevant results from an enormous output space of potential candidates. For example, finding the best matching product from a large catalog or suggesting related search phrases on a search engine. The size of the output space for these problems can range from millions to billions, and can even be infinite in some applications. Moreover, training data is often limited for the “long-tail” items in the output space. Fortunately, items in the output space are often correlated thereby presenting an opportunity to alleviate the data sparsity issue. In this paper, we propose the Prediction for Enormous and Correlated Output Spaces (PECOS) framework, a versatile and modular machine learning framework for solving prediction problems for very large output spaces, and apply it to the eXtreme Multilabel Ranking (XMR) problem: given an input instance, find and rank the most relevant items from an enormous but fixed and finite output space. We propose a three phase framework for PECOS: (i) in the first phase, PECOS organizes the output space using a semantic indexing scheme, (ii) in the second phase, PECOS uses the indexing to narrow down the output space by orders of magnitude using a machine learned matching scheme, and (iii) in the third phase, PECOS ranks the matched items using a final ranking scheme. The versatility and modularity of PECOS allows for easy plug-and-play of various choices for the indexing, matching, and ranking phases. The indexing and matching phases alleviate the data sparsity issue by leveraging correlations across different items in the output space. For the critical matching phase, we develop a recursive machine learned matching strategy with both linear and neural matchers. 
When applied to eXtreme Multilabel Ranking where the input instances are in textual form, we find that the recursive Transformer matcher gives state-of-the-art accuracy results, at the cost of two orders of magnitude increased training time compared to the recursive linear matcher. For example, on a dataset where the output space is of size 2.8 million, the recursive Transformer matcher results in a 6% increase in precision@1 (from 48.6% to 54.2%) over the recursive linear matcher but takes 100x more time to train. Thus it is up to the practitioner to evaluate the trade-offs and decide whether the increased training time and infrastructure cost is warranted for their application; indeed, the flexibility of the PECOS framework seamlessly allows different strategies to be used. We also develop very fast inference procedures which allow us to perform XMR predictions in real time; for example, inference takes less than 1 millisecond per input on the dataset with 2.8 million labels. The PECOS software is available at https://libpecos.org."
"210093","Distributed Learning of Finite Gaussian Mixtures","Qiong Zhang, Jiahua Chen","https://jmlr.org//papers/volume23/21-0093/21-0093.pdf","https://github.com/SarahQiong/SCGMM","Advances in information technology have led to extremely large datasets that are often kept in different storage centers. Existing statistical methods must be adapted to overcome the resulting computational obstacles while retaining statistical validity and efficiency. In this situation, the split-and-conquer strategy is among the most effective solutions to many statistical problems, including quantile processes, regression analysis, principal eigenspaces, and exponential families. This paper applies this strategy to develop a distributed learning procedure of finite Gaussian mixtures. We recommend a reduction strategy and invent an effective majorization-minimization algorithm. The new estimator is consistent and retains root-n consistency under some general conditions. Experiments based on simulated and real-world datasets show that the proposed estimator has comparable statistical performance with the global estimator based on the full dataset, if the latter is feasible. It can even outperform the global estimator for the purpose of clustering if the model assumption does not fully match the real-world data. It also has better statistical and computational performance than some existing split-and-conquer approaches."
"210129","Total Stability of SVMs and Localized SVMs","Hannes Köhler, Andreas Christmann","https://jmlr.org//papers/volume23/21-0129/21-0129.pdf","","Regularized kernel-based methods such as support vector machines (SVMs) typically depend on the underlying probability measure $\mathrm{P}$ (respectively an empirical measure $\mathrm{D}_n$ in applications) as well as on the regularization parameter $\lambda$ and the kernel $k$. Whereas classical statistical robustness only considers the effect of small perturbations in $\mathrm{P}$, the present paper investigates the influence of simultaneous slight variations in the whole triple $(\mathrm{P},\lambda,k)$, respectively $(\mathrm{D}_n,\lambda_n,k)$, on the resulting predictor. Existing results from the literature are considerably generalized and improved. In order to also make them applicable to big data, where regular SVMs suffer from their super-linear computational requirements, we show how our results can be transferred to the context of localized learning. Here, the effect of slight variations in the applied regionalization, which might for example stem from changes in $\mathrm{P}$ respectively $\mathrm{D}_n$, is considered as well."
"210133","Towards An Efficient Approach for the Nonconvex lp Ball Projection: Algorithm and Analysis","Xiangyu Yang, Jiashan Wang, Hao Wang","https://jmlr.org//papers/volume23/21-0133/21-0133.pdf","","This paper primarily focuses on computing the Euclidean projection of a vector onto the lp ball in which p ∈ (0,1). Such a problem emerges as the core building block in statistical machine learning and signal processing tasks because of its ability to promote the sparsity of the desired solution. However, efficient numerical algorithms for finding the projections are still not available, particularly in large-scale optimization. To meet this challenge, we first derive the first-order necessary optimality conditions of this problem. Based on this characterization, we develop a novel numerical approach for computing the stationary point by solving a sequence of projections onto the reweighted l1-balls. This method is practically simple to implement and computationally efficient. Moreover, the proposed algorithm is shown to converge uniquely under mild conditions and has a worst-case O(1/\sqrt{k}) convergence rate. Numerical experiments demonstrate the efficiency of our proposed algorithm."
"210175","Sufficient reductions in regression with mixed predictors","Efstathia Bura, Liliana Forzani, Rodrigo Garcia Arancibia, Pamela Llop, Diego Tomassi","https://jmlr.org//papers/volume23/21-0175/21-0175.pdf","https://github.com/lforzani/SDR_mixed_predictions","Most data sets comprise of measurements on continuous and categorical variables. Yet, modeling high-dimensional mixed predictors has received limited attention in the regression and classification statistical literature. We study the general regression problem of inferring on a variable of interest based on high dimensional mixed continuous and binary predictors. The aim is to find a lower dimensional function of the mixed predictor vector that contains all the modeling information in the mixed predictors for the response, which can be either continuous or categorical. The approach we propose identifies sufficient reductions by reversing the regression and modeling the mixed predictors conditional on the response. We derive the maximum likelihood estimator of the sufficient reductions, asymptotic tests for dimension, and a regularized estimator, which simultaneously achieves variable (feature) selection and dimension reduction (feature extraction). We study the performance of the proposed method and compare it with other approaches through simulations and  real data examples."
"210186","The EM Algorithm is Adaptively-Optimal for Unbalanced Symmetric Gaussian Mixtures","Nir Weinberger, Guy Bresler","https://jmlr.org//papers/volume23/21-0186/21-0186.pdf","","This paper studies the problem of estimating the means $\pm\theta_{*}\in\mathbb{R}^{d}$ of a symmetric two-component Gaussian mixture $\delta_{*}\cdot N(\theta_{*},I)+(1-\delta_{*})\cdot N(-\theta_{*},I)$, where the weights $\delta_{*}$ and $1-\delta_{*}$ are unequal. Assuming that $\delta_{*}$ is known, we show that the population version of the EM algorithm globally converges if the initial estimate has non-negative inner product with the mean of the larger weight component. This can be achieved by the trivial initialization $\theta_{0}=0$. For the empirical iteration based on $n$ samples, we show that when initialized at $\theta_{0}=0$, the EM algorithm adaptively achieves the minimax error rate $\tilde{O}\Big(\min\Big\{\frac{1}{(1-2\delta_{*})}\sqrt{\frac{d}{n}},\frac{1}{\|\theta_{*}\|}\sqrt{\frac{d}{n}},\left(\frac{d}{n}\right)^{1/4}\Big\}\Big)$ in no more than $O\Big(\frac{1}{\|\theta_{*}\|(1-2\delta_{*})}\Big)$ iterations (with high probability). We also consider the EM iteration for estimating the weight $\delta_{*}$, assuming a fixed mean $\theta$ (which is possibly mismatched to $\theta_{*}$). For the empirical iteration of $n$ samples, we show that the minimax error rate $\tilde{O}\Big(\frac{1}{\|\theta_{*}\|}\sqrt{\frac{d}{n}}\Big)$ is achieved in no more than $O\Big(\frac{1}{\|\theta_{*}\|^{2}}\Big)$ iterations. These results robustify and complement recent results of Wu and Zhou (2019) obtained for the equal weights case $\delta_{*}=1/2$."
"21023","Efficient Least Squares for Estimating Total Effects under Linearity and Causal Sufficiency","F. Richard Guo, Emilija Perković","https://jmlr.org//papers/volume23/21-023/21-023.pdf","https://cran.r-project.org/package=eff2","Recursive linear structural equation models are widely used to postulate causal mechanisms underlying observational data. In these models, each variable equals a linear combination of a subset of the remaining variables plus an error term. When there is no unobserved confounding or selection bias, the error terms are assumed to be independent. We consider estimating a total causal effect in this setting. The causal structure is assumed to be known only up to a maximally oriented partially directed acyclic graph (MPDAG), a general class of graphs that can represent a Markov equivalence class of directed acyclic graphs (DAGs) with added background knowledge. We propose a simple estimator based on recursive least squares, which can consistently estimate any identified total causal effect, under point or joint intervention. We show that this estimator is the most efficient among all regular estimators that are based on the sample covariance, which includes covariate adjustment and the estimators employed by the joint-IDA algorithm. Notably, our result holds without assuming Gaussian errors."
"210282","Globally Injective ReLU Networks","Michael Puthawala, Konik Kothari, Matti Lassas, Ivan Dokmanić, Maarten de Hoop","https://jmlr.org//papers/volume23/21-0282/21-0282.pdf","","Injectivity plays an important role in generative models where it enables inference; in inverse problems and compressed sensing with generative priors it is a precursor to well posedness. We establish sharp characterizations of injectivity of fully-connected and convolutional ReLU layers and networks. First, through a layerwise analysis, we show that an expansivity factor of two is necessary and sufficient for injectivity by constructing appropriate weight matrices. We show that global injectivity with iid Gaussian matrices, a commonly used tractable model, requires larger expansivity between 3.4 and 10.5. We also characterize the stability of inverting an injective network via worst-case Lipschitz constants of the inverse. We then use arguments from differential topology to study injectivity of deep networks and prove that any Lipschitz map can be approximated by an injective ReLU network. Finally, using an argument based on random projections, we show that an end-to-end---rather than layerwise---doubling of the dimension suffices for injectivity. Our results establish a theoretical basis for the study of nonlinear inverse and inference problems using neural networks."
"210314","Riemannian Stochastic Proximal Gradient Methods for Nonsmooth Optimization over the Stiefel Manifold","Bokun Wang, Shiqian Ma, Lingzhou Xue","https://jmlr.org//papers/volume23/21-0314/21-0314.pdf","","Riemannian optimization has drawn a lot of attention due to its wide applications in practice. Riemannian stochastic first-order algorithms have been studied in the literature to solve large-scale machine learning problems over Riemannian manifolds. However, most of the existing Riemannian stochastic algorithms require the objective function to be differentiable, and they do not apply to the case where the objective function is nonsmooth. In this paper, we present two Riemannian stochastic proximal gradient methods for minimizing nonsmooth function over the Stiefel manifold. The two methods, named R-ProxSGD and R-ProxSPB, are generalizations of proximal SGD and proximal SpiderBoost in Euclidean setting to the Riemannian setting. Analysis on the incremental first-order oracle (IFO) complexity of the proposed algorithms is provided. Specifically, the R-ProxSPB algorithm finds an $\epsilon$-stationary point with $O(\epsilon^{-3})$ IFOs in the online case, and $O(n+\sqrt{n}\epsilon^{-2})$ IFOs in the finite-sum case with $n$ being the number of summands in the objective. Experimental results on online sparse PCA and robust low-rank matrix completion show that our proposed methods significantly outperform the existing methods that use Riemannian subgradient information."
"210387","IALE: Imitating Active Learner Ensembles","Christoffer Löffler, Christopher Mutschler","https://jmlr.org//papers/volume23/21-0387/21-0387.pdf","https://github.com/crispchris/IALE","Active learning prioritizes the labeling of the most informative data samples. However, the performance of active learning heuristics depends on both the structure of the underlying model architecture and the data. We propose IALE, an imitation learning scheme that imitates the selection of the best-performing expert heuristic at each stage of the learning cycle in a batch-mode pool-based setting. We use Dagger to train a transferable policy on a dataset and later apply it to different datasets and deep classifier architectures. The policy reflects on the best choices from multiple expert heuristics given the current state of the active learning process, and learns to select samples in a complementary way that unifies the expert strategies. Our experiments on well-known image datasets show that we outperform state of the art imitation learners and heuristics."
"210403","Bayesian subset selection and variable importance for interpretable prediction and classification","Daniel R. Kowal","https://jmlr.org//papers/volume23/21-0403/21-0403.pdf","https://github.com/drkowal/BayesSubsets","Subset selection is a valuable tool for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often avoided due to selection instability, lack of regularization, and difficulties with post-selection inference. We address these challenges from a Bayesian perspective. Given any Bayesian predictive model M, we extract a family of near-optimal subsets of variables for linear prediction or classification. This strategy deemphasizes the role of a single “best” subset and instead advances the broader perspective that often many subsets are highly competitive. The acceptable family of subsets offers a new pathway for model interpretation and is neatly summarized by key members such as the smallest acceptable subset, along with new (co-) variable importance metrics based on whether variables (co-) appear in all, some, or no acceptable subsets. More broadly, we apply Bayesian decision analysis to derive the optimal linear coefficients for any subset of variables. These coefficients inherit both regularization and predictive uncertainty quantification via M. For both simulated and real data, the proposed approach exhibits better prediction, interval estimation, and variable selection than competing Bayesian and frequentist selection methods. These tools are applied to a large education dataset with highly correlated covariates. Our analysis provides unique insights into the combination of environmental, socioeconomic, and demographic factors that predict educational outcomes, and identifies over 200 distinct subsets of variables that offer near-optimal out-of-sample predictive accuracy."
"210425","Conditions and Assumptions for Constraint-based Causal Structure Learning","Kayvan Sadeghi, Terry Soo","https://jmlr.org//papers/volume23/21-0425/21-0425.pdf","","We formalize constraint-based structure learning of the ""true"" causal graph from observed data when unobserved variables are also existent. We provide conditions for a ""natural"" family of constraint-based structure-learning algorithms that output graphs that are Markov equivalent to the causal graph. Under the faithfulness assumption, this natural family contains all exact structure-learning algorithms. We also provide a set of assumptions, under which any natural structure-learning algorithm outputs Markov equivalent graphs to the causal graph. These assumptions can be thought of as a relaxation of faithfulness, and most of them can be directly tested from (the underlying distribution) of the data, particularly when one focuses on structural causal models. We specialize the definitions and results for structural causal models."
"210511","EiGLasso for Scalable Sparse Kronecker-Sum Inverse Covariance Estimation","Jun Ho Yoon, Seyoung Kim","https://jmlr.org//papers/volume23/21-0511/21-0511.pdf","https://github.com/SeyoungKimLab/EiGLasso","In many real-world data, complex dependencies are present both among samples and among features. The Kronecker sum or the Cartesian product of two graphs, each modeling dependencies across features and across samples, has been used as an inverse covariance matrix for a matrix-variate Gaussian distribution as an alternative to Kronecker-product inverse covariance matrix due to its more intuitive sparse structure. However, the existing methods for sparse Kronecker-sum inverse covariance estimation are limited in that they do not scale to more than a few hundred features and samples and that unidentifiable parameters pose challenges in estimation. In this paper, we introduce EiGLasso, a highly scalable method for sparse Kronecker-sum inverse covariance estimation, based on Newton's method combined with eigendecomposition of the sample and feature graphs to exploit the Kronecker-sum structure. EiGLasso further reduces computation time by approximating the Hessian matrix, based on the eigendecomposition of the two graphs. EiGLasso achieves quadratic convergence with the exact Hessian and linear convergence with the approximate Hessian. We describe a simple new approach to estimating the unidentifiable parameters that generalizes the existing methods. On simulated and real-world data, we demonstrate that EiGLasso achieves two to three orders-of-magnitude speed-up, compared to the existing methods."
"210542","Advantage of Deep Neural Networks for Estimating Functions with Singularity on Hypersurfaces","Masaaki Imaizumi, Kenji Fukumizu","https://jmlr.org//papers/volume23/21-0542/21-0542.pdf","","We develop a minimax rate analysis to describe the reason that deep neural networks (DNNs) perform better than other standard methods. For nonparametric regression problems, it is well known that many standard methods attain the minimax optimal rate of estimation errors for smooth functions, and thus, it is not straightforward to identify the theoretical advantages of DNNs. This study tries to fill this gap by considering the estimation for a class of non-smooth functions that have singularities on hypersurfaces. Our findings are as follows: (i) We derive the generalization error of a DNN estimator and prove that its convergence rate is almost optimal. (ii) We elucidate a phase diagram of estimation problems, which describes the situations where the DNNs outperform a general class of estimators, including kernel methods, Gaussian process methods, and others. We additionally show that DNNs outperform harmonic analysis based estimators. This advantage of DNNs comes from the fact that a shape of singularity can be successfully handled by their multi-layered structure."
"210622","Sum of Ranked Range Loss for Supervised Learning","Shu Hu, Yiming Ying, Xin Wang, Siwei Lyu","https://jmlr.org//papers/volume23/21-0622/21-0622.pdf","https://github.com/discovershu/SoRR","In forming learning objectives, one oftentimes needs to aggregate a set of individual values to a single output. Such cases occur in the aggregate loss, which  combines individual losses of a learning model over each training sample, and in the individual loss for multi-label learning, which combines prediction scores over all class labels. In this work, we introduce the sum of ranked range (SoRR) as a general approach to form learning objectives. A ranked range is a consecutive sequence of sorted values of a set of real numbers. The minimization of SoRR is solved with the difference of convex algorithm (DCA). We explore two applications in machine learning of the minimization of the SoRR framework, namely the AoRR aggregate loss for binary/multi-class classification at the sample level and the TKML individual loss for multi-label/multi-class classification at the label level. A combination loss of AoRR and TKML is proposed as a new learning objective for improving the robustness of multi-label learning in the face of outliers in sample and labels alike. Our empirical results highlight the effectiveness of the proposed optimization frameworks and demonstrate the applicability of proposed losses using synthetic and real data sets."
"210630","The Two-Sided Game of Googol","José Correa, Andrés Cristi, Boris Epstein, José Soto","https://jmlr.org//papers/volume23/21-0630/21-0630.pdf","","The secretary problem or game of Googol are classic models for online selection problems. In this paper we consider a variant of the problem and explore its connections to data-driven online selection. Specifically, we are given $n$ cards with arbitrary non-negative numbers written on both sides. The cards are randomly placed on $n$ consecutive positions on a table, and for each card, the visible side is also selected at random. The player sees the visible side of all cards and wants to select the card with the maximum hidden value. To this end, the player flips the first card, sees its hidden value and decides whether to pick it or drop it and continue with the next card. We study algorithms for two natural objectives: maximizing the probability of selecting the maximum hidden value, and maximizing the expectation of the selected hidden value. For the former objective we obtain a simple $0.45292$-competitive algorithm. For the latter, we obtain a $0.63518$-competitive algorithm. Our main contribution is to set up a model allowing to transform probabilistic optimal stopping problems into purely combinatorial ones. For instance, we can apply our results to obtain lower bounds for the single sample prophet secretary problem."
"210631","ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction","Kwan Ho Ryan Chan, Yaodong Yu, Chong You, Haozhi Qi, John Wright, Yi Ma","https://jmlr.org//papers/volume23/21-0631/21-0631.pdf","https://github.com/Ma-Lab-Berkeley","This work attempts to provide a plausible theoretical framework that aims to interpret modern deep (convolutional) networks from the principles of data compression and discriminative representation. We argue that for high-dimensional multi-class data, the optimal linear discriminative representation maximizes the coding rate difference between the whole dataset and the average of all the subsets. We show that the basic iterative gradient ascent scheme for optimizing the rate reduction objective naturally leads to a multi-layer deep network, named ReduNet, which shares common characteristics of modern deep networks. The deep layered architectures, linear and nonlinear operators, and even parameters of the network are all explicitly constructed layer-by-layer via forward propagation, although they are amenable to fine-tuning via back propagation. All components of so-obtained “white-box” network have precise optimization, statistical, and geometric interpretation. Moreover, all linear operators of the so-derived network naturally become multi-channel convolutions when we enforce classification to be rigorously shift-invariant. The derivation in the invariant setting suggests a trade-off between sparsity and invariance, and also indicates that such a deep convolution network is significantly more efficient to construct and learn in the spectral domain. Our preliminary simulations and experiments clearly verify the effectiveness of both the rate reduction objective and the associated ReduNet. All code and data are available at https://github.com/Ma-Lab-Berkeley."
"210681","Cauchy–Schwarz Regularized Autoencoder","Linh Tran, Maja Pantic, Marc Peter Deisenroth","https://jmlr.org//papers/volume23/21-0681/21-0681.pdf","","Recent work in unsupervised learning has focused on efficient inference and learning in latent variables models. Training these models by maximizing the evidence (marginal likelihood) is typically intractable. Thus, a common approximation is to maximize the Evidence Lower BOund (ELBO) instead. Variational autoencoders (VAE) are a powerful and widely-used class of generative models that optimize the ELBO efficiently for large datasets. However, the VAE's default Gaussian choice for the prior imposes a strong constraint on its ability to represent the true posterior, thereby degrading overall performance. A Gaussian mixture model (GMM) would be a richer prior but cannot be handled efficiently within the VAE framework because of the intractability of the Kullback-Leibler divergence for GMMs. We deviate from the common VAE framework in favor of one with an analytical solution for Gaussian mixture prior. To perform efficient inference for GMM priors, we introduce a new constrained objective based on the Cauchy-Schwarz divergence, which can be computed analytically for GMMs. This new objective allows us to incorporate richer, multi-modal priors into the autoencoding framework. We provide empirical studies on a range of datasets and show that our objective improves upon variational auto-encoding models in density estimation, unsupervised clustering, semi-supervised learning, and face analysis."
"210732","An Error Analysis of Generative Adversarial Networks for Learning Distributions","Jian Huang, Yuling Jiao, Zhen Li, Shiao Liu, Yang Wang, Yunfei Yang","https://jmlr.org//papers/volume23/21-0732/21-0732.pdf","","This paper studies how well generative adversarial networks (GANs) learn probability distributions from finite samples. Our main results establish the convergence rates of GANs under a collection of integral probability metrics defined through H\""{o}lder classes, including the Wasserstein distance as a special case. We also show that GANs are able to adaptively learn data distributions with low-dimensional structures or have H\""{o}lder densities, when the network architectures are chosen properly. In particular, for distributions concentrated around a low-dimensional set, we show that the learning rates of GANs do not depend on the high ambient dimension, but on the lower intrinsic dimension. Our analysis is based on a new oracle inequality decomposing the estimation error into the generator and discriminator approximation error and the statistical error, which may be of independent interest."
"210847","OVERT: An Algorithm for Safety Verification of Neural Network Control Policies for Nonlinear Systems","Chelsea Sidrane, Amir Maleki, Ahmed Irfan, Mykel J. Kochenderfer","https://jmlr.org//papers/volume23/21-0847/21-0847.pdf","https://github.com/sisl/OVERTVerify.jl","Deep learning methods can be used to produce control policies, but certifying their safety is challenging. The resulting networks are nonlinear and often very large. In response to this challenge, we present OVERT: a sound algorithm for safety verification of nonlinear discrete-time closed loop dynamical systems with neural network control policies. The novelty of OVERT lies in combining ideas from the classical formal methods literature with ideas from the newer neural network verification literature.  The central concept of OVERT is to abstract nonlinear functions with a set of optimally tight piecewise linear bounds. Such piecewise linear bounds are designed for seamless integration into ReLU neural network verification tools.  OVERT can be used to prove bounded-time safety properties by either computing reachable sets or solving feasibility queries directly.  We demonstrate various examples of safety verification for several classical benchmark examples.  OVERT compares favorably to existing methods both in computation time and in tightness of the reachable set."
"210904","Under-bagging Nearest Neighbors for Imbalanced Classification","Hanyuan Hang, Yuchao Cai, Hanfang Yang, Zhouchen Lin","https://jmlr.org//papers/volume23/21-0904/21-0904.pdf","","In this paper, we propose an ensemble learning algorithm called under-bagging $k$-nearest neighbors (under-bagging $k$-NN) for imbalanced classification problems. On the theoretical side, by developing a new learning theory analysis, we show that with properly chosen parameters, i.e., the number of nearest neighbors $k$, the expected sub-sample size $s$, and the bagging rounds $B$, optimal convergence rates for under-bagging $k$-NN can be achieved under mild assumptions w.r.t. the arithmetic mean (AM) of recalls. Moreover, we show that with a relatively small $B$, the expected sub-sample size $s$ can be much smaller than the number of training data $n$ at each bagging round, and the number of nearest neighbors $k$ can be reduced simultaneously, especially when the data are highly imbalanced, which leads to substantially lower time complexity and roughly the same space complexity. On the practical side, we conduct numerical experiments to verify the theoretical results on the benefits of the under-bagging technique by the promising AM performance and efficiency of our proposed algorithm."
"21092","A spectral-based analysis of the separation between two-layer neural networks and linear methods","Lei Wu, Jihao Long","https://jmlr.org//papers/volume23/21-092/21-092.pdf","","We propose a spectral-based approach to analyze how two-layer neural networks separate from linear methods in terms of approximating high-dimensional functions. We show that quantifying this separation can be reduced to estimating the Kolmogorov width of two-layer neural networks, and the latter can be further characterized by using the spectrum of an associated kernel. Different from previous work, our approach allows obtaining upper bounds, lower bounds, and identifying explicit hard functions in a united manner. We provide a systematic study of how the choice of activation functions affects the separation, in particular the  dependence on the input dimension. Specifically, for nonsmooth activation functions, we extend known results to more activation functions with sharper bounds. As concrete examples, we prove that any single neuron can  instantiate the separation between neural networks and random feature models. For smooth activation functions, one surprising finding is that the separation is negligible unless the norms of inner-layer weights are polynomially large  with respect to the input dimension. By contrast, the separation for nonsmooth activation functions is independent of the norms of inner-layer weights."
"210998","Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity","William Fedus, Barret Zoph, Noam Shazeer","https://jmlr.org//papers/volume23/21-0998/21-0998.pdf","https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/moe.py","In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) models defy this and instead select different parameters for each incoming example. The result is a sparsely-activated model---with an outrageous number of parameters---but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by  complexity, communication costs, and training instability. We address these with the introduction of the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques mitigate the instabilities, and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the ""Colossal Clean Crawled Corpus"", and achieve a 4x speedup over the T5-XXL model."
"211027","Online Mirror Descent and Dual Averaging: Keeping Pace in the Dynamic Case","Huang Fang, Nicholas J. A. Harvey, Victor S. Portella, Michael P. Friedlander","https://jmlr.org//papers/volume23/21-1027/21-1027.pdf","","Online mirror descent (OMD) and dual averaging (DA)---two fundamental algorithms for online convex optimization---are known to have very similar (and sometimes identical) performance guarantees when used with a fixed learning rate. Under dynamic learning rates, however, OMD is provably inferior to DA and suffers linear regret, even in common settings such as prediction with expert advice. We modify the OMD algorithm through a simple technique that we call stabilization. We give essentially the same abstract regret bound for OMD with stabilization and for DA by modifying the classical OMD convergence analysis in a careful and modular way that allows for straightforward and flexible proofs. Simple corollaries of these bounds show that OMD with stabilization and DA enjoy the same performance guarantees in many applications---even under dynamic learning rates. We also shed light on the similarities between OMD and DA and show simple conditions under which stabilized-OMD and DA generate the same iterates. Finally, we show how to effectively use dual-stabilization with composite cost functions with simple adaptations to both the algorithm and its analysis."
"211109","Depth separation beyond radial functions","Luca Venturi, Samy Jelassi, Tristan Ozuch, Joan Bruna","https://jmlr.org//papers/volume23/21-1109/21-1109.pdf","","High-dimensional depth separation results for neural networks show that certain functions can be efficiently approximated by two-hidden-layer networks but not by one-hidden-layer ones in high-dimensions. Existing results of this type mainly focus on functions with an underlying radial or one-dimensional structure, which are usually not encountered in practice. The first contribution of this paper is to extend such results to a more general class of functions, namely functions with piece-wise oscillatory structure, by building on the proof strategy of (Eldan and Shamir, 2016). We complement these results by showing that, if the domain radius and the rate of oscillation of the objective function are constant, then approximation by one-hidden-layer networks holds at a $\mathrm{poly}(d)$ rate for any fixed error threshold. The mentioned results show that one-hidden-layer networks fail to approximate high-energy functions whose Fourier representation is spread in the frequency domain, while they succeed at approximating functions having a sparse Fourier representation. However, the choice of the domain represents a source of gaps between these positive and negative approximation results. We conclude the paper focusing on a compact approximation domain, namely the sphere $\S$ in dimension $d$, where we provide a characterization of both functions which are efficiently approximable by one-hidden-layer networks and of functions which are provably not, in terms of their Fourier expansion."
"211138","Provable Tensor-Train Format Tensor Completion by Riemannian Optimization","Jian-Feng Cai, Jingyang Li, Dong Xia","https://jmlr.org//papers/volume23/21-1138/21-1138.pdf","","The tensor train (TT) format enjoys appealing advantages in handling structural high-order tensors. The recent decade has witnessed the wide applications of TT-format tensors from diverse disciplines, among which tensor completion has drawn considerable attention. Numerous fast algorithms, including the Riemannian gradient descent (RGrad),  have been proposed for the TT-format tensor completion. However, the theoretical guarantees of these algorithms are largely missing or sub-optimal, partly due to the complicated and recursive algebraic operations in TT-format decomposition. Moreover, existing results established for the tensors of other formats, for example, Tucker and CP, are inapplicable because the algorithms treating TT-format tensors are substantially different and more involved. In this paper, we provide, to our best knowledge, the first theoretical guarantees of the convergence of RGrad algorithm for TT-format tensor completion, under a nearly optimal sample size condition. The RGrad algorithm converges linearly with a constant contraction rate that is free of tensor condition number without the necessity of re-conditioning. We also propose a novel approach, referred to as the  sequential second-order moment method, to attain a warm initialization under a similar sample size requirement. As a byproduct, our result even significantly refines the prior investigation of RGrad algorithm for matrix completion.  Lastly,  statistically (near) optimal rate is derived for RGrad algorithm if the observed entries consist of random sub-Gaussian noise.  Numerical experiments confirm our theoretical discovery and showcase the computational speedup gained by the TT-format decomposition."
"211177","Darts: User-Friendly Modern Machine Learning for Time Series","Julien Herzen, Francesco Lässig, Samuele Giuliano Piazzetta, Thomas Neuer, Léo Tafti, Guillaume Raille, Tomas Van Pottelbergh, Marek Pasieka, Andrzej Skrodzki, Nicolas Huguenin, Maxime Dumonal, Jan Kościsz, Dennis Bader, Frédérick Gusset, Mounir Benheddi, Camila Williamson, Michal Kosinski, Matej Petrik, Gaël Grosch","https://jmlr.org//papers/volume23/21-1177/21-1177.pdf","https://github.com/unit8co/darts","We present Darts, a Python machine learning library for time series, with a focus on forecasting. Darts offers a variety of models, from classics such as ARIMA to state-of-the-art deep neural networks. The emphasis of the library is on offering modern machine learning functionalities, such as supporting multidimensional series, fitting models on multiple series, training on large datasets, incorporating external data, ensembling models, and providing a rich support for probabilistic forecasting. At the same time, great care goes into the API design to make it user-friendly and easy to use. For instance, all models can be used using fit()/predict(), similar to scikit-learn."
"211199","Foolish Crowds Support Benign Overfitting","Niladri S. Chatterji, Philip M. Long","https://jmlr.org//papers/volume23/21-1199/21-1199.pdf","","We prove a lower bound on the excess risk of sparse interpolating procedures for linear regression with Gaussian data in the overparameterized regime. We apply this result to obtain a lower bound for basis pursuit (the minimum $\ell_1$-norm interpolant) that implies that its excess risk can converge at an exponentially slower rate than OLS (the minimum $\ell_2$-norm interpolant), even when the ground truth is sparse.  Our analysis exposes the benefit of an effect analogous to the “wisdom of the crowd”, except here the harm arising from fitting the noise is ameliorated by spreading it among many directions---the variance reduction arises from a foolish crowd."
"211212","Neural Estimation of Statistical Divergences","Sreejith Sreekumar, Ziv Goldfeld","https://jmlr.org//papers/volume23/21-1212/21-1212.pdf","","Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating those divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. We establish non-asymptotic absolute error bounds for a neural estimator realized by a shallow NN, focusing on four popular $\mathsf{f}$-divergences---Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory to bound the two sources of error involved: function approximation and empirical estimation. The bounds characterize the effective error in terms of NN size and the number of samples, and reveal scaling rates that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above with appropriate NN growth-rate are minimax rate-optimal, achieving the parametric convergence rate."
"211232","Kernel Packet: An Exact and Scalable Algorithm for Gaussian Process Regression with Matérn Correlations","Haoyuan Chen, Liang Ding, Rui Tuo","https://jmlr.org//papers/volume23/21-1232/21-1232.pdf","","We develop an exact and scalable algorithm for one-dimensional Gaussian process regression with Matérn correlations whose smoothness parameter $\nu$ is a half-integer. The proposed algorithm only requires $\mathcal{O}(\nu^3 n)$ operations and $\mathcal{O}(\nu n)$ storage. This leads to a linear-cost solver since $\nu$ is chosen to be fixed and usually very small in most applications. The proposed method can be applied to multi-dimensional problems if a full grid or a sparse grid design is used. The proposed method is based on a novel theory for Matérn correlation functions. We find that a suitable rearrangement of these correlation functions can produce a compactly supported function, called a ""kernel packet"". Using a set of kernel packets as basis functions leads to a sparse representation of the covariance matrix that results in the proposed algorithm. Simulation studies show that the proposed algorithm, when applicable, is significantly superior to the existing alternatives in both the computational time and predictive accuracy."
"211290","Power Iteration for Tensor PCA","Jiaoyang Huang, Daniel Z. Huang, Qing Yang, Guang Cheng","https://jmlr.org//papers/volume23/21-1290/21-1290.pdf","","In this paper, we study the power iteration algorithm for the asymmetric spiked tensor model, as introduced in  Richard and Montanari (2014). We give necessary and sufficient conditions for the convergence of the power iteration algorithm. When the power iteration algorithm converges, for the rank one spiked tensor model, we show the estimators for the spike strength and linear functionals of the signal are asymptotically Gaussian; for the multi-rank spiked tensor model, we show the estimators are asymptotically mixtures of Gaussian. This new phenomenon is different from the spiked matrix model. Using these asymptotic results of our estimators, we construct valid and efficient confidence intervals for spike strengths and linear functionals of the signals."
"211312","On the Approximation of Cooperative Heterogeneous Multi-Agent Reinforcement Learning (MARL) using Mean Field Control (MFC)","Washim Uddin Mondal, Mridul Agarwal, Vaneet Aggarwal, Satish V. Ukkusuri","https://jmlr.org//papers/volume23/21-1312/21-1312.pdf","","Mean field control (MFC) is an effective way to mitigate the curse of dimensionality of cooperative multi-agent reinforcement learning (MARL) problems. This work considers a collection of $N_{\mathrm{pop}}$  heterogeneous agents that can be segregated into $K$ classes such that the $k$-th class contains $N_k$ homogeneous agents. We  aim to prove approximation guarantees of the MARL problem for this heterogeneous system by its corresponding MFC problem.  We consider three  scenarios where the reward and transition dynamics of all agents are respectively taken to be functions of $(1)$ joint state and action distributions across all classes, $(2)$ individual distributions of each class, and $(3)$ marginal distributions of the entire population. We show that, in these cases, the $K$-class MARL problem can be approximated by MFC with errors given  as $e_1=\mathcal{O}(\frac{\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}}{N_{\mathrm{pop}}}\sum_{k}\sqrt{N_k})$, $e_2=\mathcal{O}(\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]\sum_{k}\frac{1}{\sqrt{N_k}})$ and $e_3=\mathcal{O}\left(\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]\left[\frac{A}{N_{\mathrm{pop}}}\sum_{k\in[K]}\sqrt{N_k}+\frac{B}{\sqrt{N_{\mathrm{pop}}}}\right]\right)$, respectively,  where $A, B$ are some constants and $|\mathcal{X}|,|\mathcal{U}|$ are the sizes of state and action spaces of each agent. Finally, we design a Natural Policy Gradient (NPG) based algorithm that, in the three cases stated above, can converge to an optimal MARL policy within $\mathcal{O}(e_j)$ error with a sample complexity of $\mathcal{O}(e_j^{-3})$, $j\in\{1,2,3\}$, respectively."
"211365","Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks","Alexander Shevchenko, Vyacheslav Kungurtsev, Marco Mondelli","https://jmlr.org//papers/volume23/21-1365/21-1365.pdf","","Understanding the properties of neural networks trained via stochastic gradient descent (SGD) is at the heart of the theory of deep learning. In this work, we take a mean-field view, and consider a two-layer ReLU network trained via noisy-SGD for a univariate regularized regression problem. Our main result is that SGD with vanishingly small noise injected in the gradients is biased towards a simple solution: at convergence, the ReLU network implements a piecewise linear map of the inputs, and the number of “knot” points -- i.e., points where the tangent of the ReLU network estimator changes -- between two consecutive training inputs is at most three. In particular, as the number of neurons of the network grows, the SGD dynamics is captured by the solution of a gradient flow and, at convergence, the distribution of the weights approaches the unique minimizer of a related free energy, which has a Gibbs form. Our key technical contribution consists in the analysis of the estimator resulting from this minimizer: we show that its second derivative vanishes everywhere, except at some specific locations which represent  the “knot” points. We also provide empirical evidence that knots at locations distinct from the data points might occur, as predicted by our theory."
"18045","Let's Make Block Coordinate Descent Converge Faster: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence","Julie Nutini, Issam Laradji, Mark Schmidt","https://jmlr.org//papers/volume23/18-045/18-045.pdf","https://github.com/IssamLaradji/BlockCoordinateDescent","Block coordinate descent (BCD) methods are widely used for large-scale numerical optimization because of their cheap iteration costs, low memory requirements, amenability to parallelization, and ability to exploit problem structure. Three main algorithmic choices influence the performance of BCD methods: the block partitioning strategy, the block selection rule, and the block update rule. In this paper we explore all three of these  building blocks and propose variations for each that can significantly improve the progress made by each BCD iteration. We (i) propose new greedy block-selection strategies that guarantee more progress per iteration than the Gauss-Southwell rule; (ii) explore practical issues like how to implement the new rules when using ""variable"" blocks; (iii)  explore the use of message-passing to  compute matrix or Newton updates efficiently on huge blocks for problems with sparse dependencies between variables; and (iv) consider optimal active manifold identification, which leads to bounds on the ""active-set complexity"" of BCD methods and leads to superlinear convergence for certain problems with sparse solutions (and in some cases finite termination at an optimal solution). We support all of our findings with numerical results for the classic machine learning problems of least squares, logistic regression, multi-class logistic regression, label propagation, and L1-regularization."
"191047","An Optimization-centric View on Bayes' Rule: Reviewing and Generalizing Variational Inference","Jeremias Knoblauch, Jack Jewson, Theodoros Damoulas","https://jmlr.org//papers/volume23/19-1047/19-1047.pdf","https://github.com/JeremiasKnoblauch/GVIPublic","We advocate an optimization-centric view of Bayesian inference. Our inspiration is the representation of Bayes' rule as infinite-dimensional optimization (Csiszar, 1975; Donsker and Varadhan, 1975; Zellner, 1988). Equipped with this perspective, we study Bayesian inference when one does not have access to (1) well-specified priors, (2) well-specified likelihoods, (3) infinite computing power. While these three assumptions underlie the standard Bayesian paradigm, they are typically inappropriate for modern Machine Learning applications. We propose addressing this through an optimization-centric generalization of Bayesian posteriors that we call the Rule of Three (RoT). The RoT can be justified axiomatically and recovers Bayesian, PAC-Bayesian and VI posteriors as special cases. While the RoT is primarily a conceptual and theoretical device, it also encompasses a novel sub-class of tractable posteriors which we call Generalized Variational Inference (GVI) posteriors. Just as the RoT, GVI posteriors are specified by three arguments: a loss, a divergence and a variational family. They also possess a number of desirable properties, including modularity, Frequentist consistency and an interpretation as approximate ELBO. We explore applications of GVI posteriors, and show that they can be used to improve robustness and posterior marginals on Bayesian Neural Networks and Deep Gaussian Processes."
"19644","Manifold Coordinates with Physical Meaning","Samson J. Koelle, Hanyu Zhang, Marina Meila, Yu-Chia Chen","https://jmlr.org//papers/volume23/19-644/19-644.pdf","https://github.com/sjkoelle/montlake/","Manifold embedding algorithms map high-dimensional data down to coordinates in a much lower-dimensional space. One of the aims of dimension reduction is to find intrinsic coordinates that describe the data manifold. The coordinates returned by the embedding algorithm are abstract, and finding their physical or domain-related meaning is not formalized and often left to domain experts. This paper studies the problem of recovering the meaning of the new low-dimensional representation in an  automatic, principled fashion.  We propose a method to explain embedding coordinates of a manifold as non-linear compositions of functions from a user-defined dictionary. We show that this problem can be set up as a sparse linear Group Lasso recovery problem, find sufficient recovery conditions, and demonstrate its effectiveness on data."
"19816","Transfer Learning in Information Criteria-based Feature Selection","Shaohan Chen, Nikolaos V. Sahinidis, Chuanhou Gao","https://jmlr.org//papers/volume23/19-816/19-816.pdf","https://github.com/Shaohan-Chen/Transfer-learning-in-Mallows-Cp","This paper investigates the effectiveness of transfer learning based on information criteria. We propose a procedure that combines transfer learning with Mallows' Cp (TLCp) and prove that it outperforms the conventional Mallows' Cp criterion in terms of accuracy and stability. Our theoretical results indicate that, for any sample size in the target domain, the proposed TLCp estimator performs better than the Cp estimator by the mean squared error (MSE) metric {in the case of orthogonal predictors}, provided that i) the dissimilarity between the tasks from source domain and target domain is small, and ii) the procedure parameters (complexity penalties) are tuned according to certain explicit rules. Moreover, we show that our transfer learning framework can be extended to other feature selection criteria, such as the Bayesian information criterion. By analyzing the solution of the orthogonalized Cp, we identify an estimator that asymptotically approximates the solution of the Cp criterion in the case of non-orthogonal predictors. Similar results are obtained for the non-orthogonal TLCp. Finally, simulation studies and applications with real data demonstrate the usefulness of the TLCp scheme."
"201360","Recovery and Generalization in Over-Realized Dictionary Learning","Jeremias Sulam, Chong You, Zhihui Zhu","https://jmlr.org//papers/volume23/20-1360/20-1360.pdf","","In over two decades of research, the field of dictionary learning has gathered a large collection of successful applications, and theoretical guarantees for model recovery are known only whenever optimization is carried out in the same model class as that of the underlying dictionary. This work characterizes the surprising phenomenon that dictionary recovery can be facilitated by searching over the space of larger over-realized models. This observation is general and independent of the specific dictionary learning algorithm used. We thoroughly demonstrate this observation in practice and provide an analysis of this phenomenon by tying recovery measures to generalization bounds. In particular, we show that model recovery can be upper-bounded by the empirical risk, a model-dependent quantity and the generalization gap, reflecting our empirical findings. We further show that an efficient and provably correct distillation approach can be employed to recover the correct atoms from the over-realized model. As a result, our meta-algorithm provides dictionary estimates with consistently better recovery of the ground-truth model."
"201368","Low-rank Tensor Learning with Nonconvex Overlapped Nuclear Norm Regularization","Quanming Yao, Yaqing Wang, Bo Han, James T. Kwok","https://jmlr.org//papers/volume23/20-1368/20-1368.pdf","https://github.com/quanmingyao/FasTer","Nonconvex regularization has been popularly used in low-rank matrix learning. However, extending  it for low-rank tensor learning is still computationally expensive. To address this problem, we develop an efficient solver for use with a nonconvex extension of the overlapped nuclear norm regularizer. Based on the proximal average algorithm, the proposed algorithm can avoid expensive tensor folding/unfolding operations. A special “sparse plus low-rank"" structure is maintained throughout the iterations, and allows fast computation of the individual proximal steps. Empirical convergence is further improved with the use of adaptive momentum. We  provide convergence guarantees to critical points on smooth losses and also on objectives satisfying the Kurdyka-Lojasiewicz condition. While the optimization problem is nonconvex and nonsmooth, we show that its critical points still have good statistical performance on the tensor completion problem. Experiments on various synthetic and real-world data sets show that the proposed algorithm is efficient in both time and space and more accurate than the existing state-of-the-art."
"20277","On the Efficiency of Entropic Regularized Algorithms for Optimal Transport","Tianyi Lin, Nhat Ho, Michael I. Jordan","https://jmlr.org//papers/volume23/20-277/20-277.pdf","","We present several new complexity results for the entropic regularized algorithms that approximately solve the optimal transport (OT) problem between two discrete probability measures with at most $n$ atoms. First, we improve the complexity bound of a greedy variant of Sinkhorn, known as Greenkhorn, from $\tilde{O}(n^2\varepsilon^{-3})$ to $\tilde{O}(n^2\varepsilon^{-2})$. Notably, our result can match the best known complexity bound of Sinkhorn and help clarify why Greenkhorn significantly outperforms Sinkhorn in practice in terms of row/column updates as observed by Altschuler et al. (2017). Second, we propose a new algorithm, which we refer to as APDAMD and which generalizes an adaptive primal-dual accelerated gradient descent (APDAGD) algorithm (Dvurechensky et al., 2018) with a prespecified mirror mapping $\phi$. We prove that APDAMD achieves the complexity bound of $\tilde{O}(n^2\sqrt{\delta}\varepsilon^{-1})$ in which $\delta>0$ stands for the regularity of $\phi$. In addition, we show by a counterexample that the complexity bound of $\tilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ proved for APDAGD before is invalid and give a refined complexity bound of $\tilde{O}(n^{5/2}\varepsilon^{-1})$. Further, we develop a deterministic accelerated variant of Sinkhorn via appeal to estimated sequence and prove the complexity bound of $\tilde{O}(n^{7/3}\varepsilon^{-4/3})$. As such, we see that accelerated variant of Sinkhorn outperforms Sinkhorn and Greenkhorn in terms of $1/\varepsilon$ and APDAGD and accelerated alternating minimization (AAM) (Guminov et al., 2021) in terms of $n$. Finally, we conduct the experiments on synthetic and real data and the numerical results show the efficiency of Greenkhorn, APDAMD and accelerated Sinkhorn in practice."
"20321","Exact simulation of diffusion first exit times: algorithm acceleration","Samuel Herrmann, Cristina Zucca","https://jmlr.org//papers/volume23/20-321/20-321.pdf","https://github.com/SamHerr/Diff-FirstExitTime-Acceleration","In order to describe or estimate different quantities related to a specific random variable, it is of prime interest to numerically generate such a variate. In specific situations, the exact generation of random variables might be either momentarily unavailable or too expensive in terms of computation time. It therefore needs to be replaced by an approximation procedure. As was previously the case, the ambitious exact simulation of first exit times for diffusion processes was unreachable though it concerns many applications in different fields like mathematical finance, neuroscience or reliability. The usual way to describe first exit times was to use discretization schemes, that are of course approximation procedures. Recently, Herrmann and Zucca proposed a new algorithm, the so-called GDET-algorithm (General Diffusion Exit Time), which permits to simulate exactly the first exit time for one-dimensional diffusions. The only drawback of exact simulation methods using an acceptance-rejection sampling is their time consumption. In this paper the authors highlight an acceleration procedure for the GDET-algorithm based on a multi-armed bandit model. The efficiency of this acceleration is pointed out through numerical examples."
"20411","No Weighted-Regret Learning in Adversarial Bandits with Delays","Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, Jose Blanchet","https://jmlr.org//papers/volume23/20-411/20-411.pdf","","Consider a scenario where a player chooses an action in each round $t$ out of $T$ rounds and observes the incurred cost after a delay of $d_{t}$ rounds. The cost functions and the delay sequence are chosen by an adversary. We show that in a non-cooperative game, the expected weighted ergodic distribution of play converges to the set of coarse correlated equilibria if players use algorithms that have “no weighted-regret” in the above scenario, even if they have linear regret due to too large delays. For a two-player zero-sum game, we show that no weighted-regret is sufficient for the weighted ergodic average of play to converge to the set of Nash equilibria. We prove that the FKM algorithm with $n$ dimensions achieves an expected regret of $O\left(nT^{\frac{3}{4}}+\sqrt{n}T^{\frac{1}{3}}D^{\frac{1}{3}}\right)$ and the EXP3 algorithm with $K$ arms achieves an expected regret of $O\left(\sqrt{\log  K\left(KT+D\right)}\right)$ even when $D=\sum_{t=1}^{T}d_{t}$ and $T$ are unknown. These bounds use a novel doubling trick that, under mild assumptions, provably retains the regret bound for when $D$ and $T$ are known. Using these bounds, we show that FKM and EXP3 have no weighted-regret even for $d_{t}=O\left(t\log t\right)$. Therefore, algorithms with no weighted-regret can be used to approximate a CCE of a finite or convex unknown game that can only be simulated with bandit feedback, even if the simulation involves significant delays."
"20617","Non-asymptotic and Accurate Learning of Nonlinear Dynamical Systems","Yahya Sattar, Samet Oymak","https://jmlr.org//papers/volume23/20-617/20-617.pdf","","We consider the problem of learning a nonlinear dynamical system governed by a nonlinear state equation $h_{t+1}=\phi(h_t,u_t;\theta)+w_t$. Here $\theta$ is the unknown system dynamics, $h_t$ is the state, $u_t$ is the input and $w_t$ is the additive noise vector. We study gradient based algorithms to learn the system dynamics $\theta$ from samples obtained from a single finite trajectory. If the system is run by a stabilizing input policy, then using a mixing-time argument we show that temporally-dependent samples can be approximated by i.i.d. samples. We then develop new guarantees for the uniform convergence of the gradient of the empirical loss induced by these i.i.d. samples. Unlike existing works, our bounds are noise sensitive which allows for learning the ground-truth dynamics with high accuracy and small sample complexity. When combined, our results facilitate efficient learning of a broader class of nonlinear dynamical systems as compared to the prior works. We specialize our guarantees to  entrywise nonlinear activations and verify our theory in various numerical experiments."
"20944","The Importance of Being Correlated: Implications of Dependence in Joint Spectral Inference across Multiple Networks","Konstantinos Pantazis, Avanti Athreya, Jesus Arroyo, William N Frost, Evan S Hill, Vince Lyzinski","https://jmlr.org//papers/volume23/20-944/20-944.pdf","","Spectral inference on multiple networks is a rapidly-developing subfield of graph statistics. Recent work has demonstrated that joint, or simultaneous, spectral embedding of multiple independent networks can deliver more accurate estimation than individual spectral decompositions of those same networks. Such inference procedures typically rely heavily on independence assumptions across the multiple network realizations, and even in this case, little attention has been paid to the induced network correlation that can be a consequence of such joint embeddings. In this paper, we present a generalized omnibus embedding methodology and we provide a detailed analysis of this embedding across both independent and correlated networks, the latter of which significantly extends the reach of such procedures, and we describe how this omnibus embedding can itself induce correlation. This leads us to distinguish betwee inherent correlation---that is, the correlation that arises naturally in multisample network data---and induced correlation, which is an artifice of the joint embedding methodology. We show that the generalized omnibus embedding procedure is flexible and robust, and we prove both consistency and a central limit theorem for the embedded points. We examine how induced and inherent correlation can impact inference for network time series data, and we provide network analogues of classical questions such as the effective sample size for more generally correlated data. 
Further, we show how an appropriately calibrated generalized omnibus embedding can detect changes in real biological networks that previous embedding procedures could not discern, confirming that the effect of inherent and induced correlation can be subtle and transformative. By allowing for and deconstructing both forms of correlation, our methodology widens the scope of spectral techniques for network inference, with import in theory and practice."
"20984","A Perturbation-Based Kernel Approximation Framework","Roy Mitz, Yoel Shkolnisky","https://jmlr.org//papers/volume23/20-984/20-984.pdf","https://github.com/roymitz/perturbation_kernel_approximation","Kernel methods are powerful tools in various data analysis tasks. Yet, in many cases, their time and space complexity render them impractical for large datasets. Various kernel approximation methods were proposed to overcome this issue, with the most prominent method being the Nystr{\""o}m method. In this paper, we derive a perturbation-based kernel approximation framework building upon results from classical perturbation theory. We provide an  error analysis for this framework, and prove that in fact, it generalizes the Nystr{\""o}m method and several of its variants. Furthermore, we show that our framework gives rise to new kernel approximation schemes, that can be tuned to take advantage of the structure of the approximated kernel matrix. We support our theoretical results numerically and demonstrate the advantages of our approximation framework on both synthetic and real-world data."
"210225","Reverse-mode differentiation in arbitrary tensor network format: with application to supervised learning","Alex A. Gorodetsky, Cosmin Safta, John D. Jakeman","https://jmlr.org//papers/volume23/21-0225/21-0225.pdf","","This paper describes an efficient reverse-mode differentiation algorithm for contraction operations for arbitrary and unconventional tensor network topologies. The approach leverages the tensor contraction tree of Evenbly and Pfeifer (2014), which provides an instruction set for the contraction sequence of a network. We show that this tree can be efficiently leveraged for differentiation of a full tensor network contraction using a recursive scheme that exploits (1) the bilinear property of contraction and (2) the property that trees have a single path from root to leaves. While differentiation of tensor-tensor contraction is already possible in most automatic differentiation packages, we show that exploiting these two additional properties in the specific context of contraction sequences can improve efficiency.  Following a description of the algorithm and computational complexity analysis, we investigate its utility for gradient-based supervised learning for low-rank function recovery and for fitting real-world unstructured datasets. We demonstrate improved performance over alternating least-squares optimization approaches and the capability to handle heterogeneous and arbitrary tensor network formats. When compared to alternating minimization algorithms, we find that the gradient-based approach requires a smaller oversampling ratio (number of samples compared to number model parameters) for recovery. This increased efficiency extends to fitting unstructured data of varying dimensionality and when employing a variety of tensor network formats. Here, we show improved learning using the hierarchical Tucker method over the tensor-train in high-dimensional settings on a number of benchmark problems."
"210226","A Momentumized, Adaptive, Dual Averaged Gradient Method","Aaron Defazio, Samy Jelassi","https://jmlr.org//papers/volume23/21-0226/21-0226.pdf","https://github.com/facebookresearch/madgrad","We introduce MADGRAD, a novel optimization method in the family of AdaGrad adaptive gradient methods. MADGRAD shows excellent performance on deep learning optimization problems from multiple fields, including classification and image-to-image tasks in vision, and recurrent and bidirectionally-masked models in natural language processing. For each of these tasks, MADGRAD matches or outperforms both SGD and ADAM in test set performance, even on problems for which adaptive methods normally perform poorly."
"21037","A Generalized Projected Bellman Error for Off-policy Value Estimation in Reinforcement Learning","Andrew Patterson, Adam White, Martha White","https://jmlr.org//papers/volume23/21-037/21-037.pdf","https://github.com/rlai-lab/Generalized-Projected-Bellman-Errors","Many reinforcement learning algorithms rely on value estimation, however, the most widely used algorithms---namely temporal difference algorithms---can diverge under both off-policy sampling and nonlinear function approximation. Many algorithms have been developed for off-policy value estimation based on the linear mean squared projected Bellman error (MSPBE) and are sound under linear function approximation. Extending these methods to the nonlinear case has been largely unsuccessful. Recently, several methods have been introduced that approximate a different objective---the mean-squared Bellman error (MSBE)---which naturally facilitate nonlinear approximation. In this work, we build on these insights and introduce a new generalized MSPBE that extends the linear MSPBE to the nonlinear setting. We show how this generalized objective unifies previous work and obtain new bounds for the value error of the solutions of the generalized objective. We derive an easy-to-use, but sound, algorithm to minimize the generalized objective, and show that it is more stable across runs, is less sensitive to hyperparameters, and performs favorably across four control domains with neural network function approximation."
"210382","Adversarial Robustness Guarantees for Gaussian Processes","Andrea Patane, Arno Blaas, Luca Laurenti, Luca Cardelli, Stephen Roberts, Marta Kwiatkowska","https://jmlr.org//papers/volume23/21-0382/21-0382.pdf","https://github.com/andreapatane/check-GPclass","Gaussian processes (GPs) enable principled computation of model uncertainty, making them attractive for safety-critical applications. Such scenarios demand that GP decisions are not only accurate, but also robust to perturbations. In this paper we present a framework to analyse adversarial robustness of GPs, defined as invariance of the model's decision to bounded perturbations. Given a compact subset of the input space $T\subseteq \mathbb{R}^d$, a point $x^*$ and a GP, we provide provable guarantees of adversarial robustness of the GP by computing lower and upper bounds on its prediction range in $T$. We develop a branch-and-bound scheme to refine the bounds and show, for any $\epsilon > 0$, that our algorithm is guaranteed to converge to values $\epsilon$-close to the actual values in finitely many iterations. The algorithm is anytime and can handle both regression and classification tasks, with analytical formulation for most kernels used in practice. We evaluate our methods on a collection of synthetic and standard benchmark data sets, including SPAM, MNIST and FashionMNIST. We study the effect of approximate inference techniques on robustness and demonstrate how our method can be used for interpretability. Our empirical results suggest that the adversarial robustness of GPs increases with accurate posterior estimation."
"210386","On the Robustness to Misspecification of α-posteriors and Their Variational Approximations","Marco Avella Medina, José Luis Montiel Olea, Cynthia Rush, Amilcar Velez","https://jmlr.org//papers/volume23/21-0386/21-0386.pdf","","$\alpha$-posteriors and their variational approximations distort standard posterior inference by downweighting the likelihood and introducing variational approximation errors. We show that such distortions, if tuned appropriately, reduce the Kullback--Leibler (KL) divergence from the true, but perhaps infeasible, posterior distribution when there is potential parametric model misspecification. To make this point, we derive a Bernstein--von Mises theorem showing convergence in total variation distance of $\alpha$-posteriors and their variational approximations to limiting Gaussian distributions. We use these limiting distributions to evaluate the KL divergence between true and reported posteriors. We show that the KL divergence is minimized by choosing $\alpha$ strictly smaller than one, assuming there is a vanishingly small probability of model misspecification. The optimized value of $\alpha$ becomes smaller as the misspecification becomes more severe. The optimized KL divergence increases logarithmically in the magnitude of misspecification and not linearly as with the usual posterior. Moreover, the optimized variational approximations of $\alpha$-posteriors can induce additional robustness to model misspecification beyond that obtained by optimally downweighting the likelihood."
"210419","Online Nonnegative CP-dictionary Learning for Markovian Data","Hanbaek Lyu, Christopher Strohmeier, Deanna Needell","https://jmlr.org//papers/volume23/21-0419/21-0419.pdf","https://github.com/HanbaekLyu/OnlineCPDL","Online Tensor Factorization (OTF) is a  fundamental tool in learning low-dimensional interpretable features from streaming multi-modal data. While various algorithmic and theoretical aspects of OTF have been investigated recently, a general convergence guarantee to stationary points of the objective function without any incoherence or sparsity assumptions is still lacking even for the i.i.d. case. In this work, we introduce a novel algorithm that learns a CANDECOMP/PARAFAC (CP) basis from a given stream of tensor-valued data under general constraints, including nonnegativity constraints that induce interpretability of the learned CP basis. We prove that our algorithm converges almost surely to the set of stationary points of the objective function under the hypothesis that the sequence of data tensors is generated by an underlying Markov chain. Our setting covers the classical i.i.d. case as well as a wide range of application contexts including data streams generated by independent or MCMC sampling. Our result closes a gap between OTF and Online Matrix Factorization in global convergence analysis for CP-decompositions. Experimentally, we show that our algorithm converges much faster than standard algorithms for nonnegative tensor factorization tasks on both synthetic and real-world data. Also, we demonstrate the utility of our algorithm on a diverse set of examples from image, video, and time-series data, illustrating how one may learn qualitatively different CP-dictionaries from the same tensor data by exploiting the tensor structure in multiple ways."
"210486","Implicit Differentiation for Fast Hyperparameter Selection in Non-Smooth Convex Learning","Quentin Bertrand, Quentin Klopfenstein, Mathurin Massias, Mathieu Blondel, Samuel Vaiter, Alexandre Gramfort, Joseph Salmon","https://jmlr.org//papers/volume23/21-0486/21-0486.pdf","https://github.com/qb3/sparse-ho","Finding the optimal hyperparameters of a model can be cast as a bilevel optimization problem, typically solved using zero-order techniques. In this work we study first-order methods when the inner optimization problem is convex but non-smooth. We show that the forward-mode differentiation of proximal gradient descent and proximal coordinate descent yield sequences of Jacobians converging toward the exact Jacobian. Using implicit differentiation, we show it is possible to leverage the non-smoothness of the inner problem to speed up the computation. Finally, we provide a bound on the error made on the hypergradient when the inner optimization problem is solved approximately. Results on regression and classification problems reveal computational benefits for hyperparameter optimization, especially when multiple hyperparameters are required."
"210663","EV-GAN: Simulation of extreme events with ReLU neural networks","Michaël Allouche, Stéphane Girard, Emmanuel Gobet","https://jmlr.org//papers/volume23/21-0663/21-0663.pdf","","Feedforward neural networks based on Rectified linear units (ReLU) cannot efficiently approximate quantile functions which are not bounded, especially in the case of heavy-tailed distributions. We thus propose a new parametrization for the generator of a Generative adversarial network (GAN) adapted to this framework, basing on extreme-value theory. An analysis of the uniform error between the extreme quantile and its GAN approximation is provided: We establish that the rate of convergence of the error is mainly driven by the second-order parameter of the data distribution. The above results are illustrated on simulated data and real financial data. It appears that our approach outperforms the classical GAN in a wide range of situations including high-dimensional and dependent data."
"210730","Universal Approximation of Functions on Sets","Edward Wagstaff, Fabian B. Fuchs, Martin Engelcke, Michael A. Osborne, Ingmar Posner","https://jmlr.org//papers/volume23/21-0730/21-0730.pdf","","Modelling functions of sets, or equivalently, permutation-invariant functions, is a long-standing challenge in machine learning. Deep Sets is a popular method which is known to be a universal approximator for continuous set functions. We provide a theoretical analysis of Deep Sets which shows that this universal approximation property is only guaranteed if the model's latent space is sufficiently high-dimensional. If the latent space is even one dimension lower than necessary, there exist piecewise-affine functions for which Deep Sets performs no better than a naïve constant baseline, as judged by worst-case error. Deep Sets may be viewed as the most efficient incarnation of the Janossy pooling paradigm. We identify this paradigm as encompassing most currently popular set-learning methods. Based on this connection, we discuss the implications of our results for set learning more broadly, and identify some open questions on the universality of Janossy pooling in general."
"210808","Intrinsically Motivated Goal Exploration Processes with Automatic Curriculum Learning","Sébastien Forestier, Rémy Portelas, Yoan Mollard, Pierre-Yves Oudeyer","https://jmlr.org//papers/volume23/21-0808/21-0808.pdf","","Intrinsically motivated spontaneous exploration is a key enabler of autonomous developmental learning in human children. It enables the discovery of skill repertoires through autotelic learning, i.e. the self-generation, self-selection, self-ordering and self-experimentation of learning goals. We present an algorithmic approach called Intrinsically Motivated Goal Exploration Processes (IMGEP) to enable similar properties of autonomous learning in machines. The IMGEP architecture relies on several principles: 1) self-generation of goals, generalized as parameterized fitness functions; 2) selection of goals based on intrinsic rewards; 3) exploration with incremental goal-parameterized policy search and exploitation with a batch learning algorithm; 4) systematic reuse of information acquired when targeting a goal for improving towards other goals. We present a particularly efficient form of IMGEP, called AMB, that uses a population-based policy and an object-centered spatio-temporal modularity. We provide several implementations of this architecture and demonstrate their ability to automatically generate a learning curriculum within several experimental setups. One of these experiments includes a real humanoid robot exploring multiple spaces of goals with several hundred continuous dimensions and with distractors. While no particular target goal is provided to these autotelic agents, this curriculum allows the discovery of diverse skills that act as stepping stones for learning more complex skills, e.g. nested tool use."
"210934","Truncated Emphatic Temporal Difference Methods for Prediction and Control","Shangtong Zhang, Shimon Whiteson","https://jmlr.org//papers/volume23/21-0934/21-0934.pdf","https://github.com/ShangtongZhang/DeepRL","Emphatic Temporal Difference (TD) methods are a class of off-policy Reinforcement Learning (RL) methods involving the use of followon traces.  Despite the theoretical success of emphatic TD methods in addressing the notorious deadly triad of off-policy RL, there are still two open problems. First, followon traces typically suffer from large variance, making them hard to use in practice.  Second, though Yu (2015) confirms the asymptotic convergence of some emphatic TD methods for prediction problems, there is still no finite sample analysis for any emphatic TD method for prediction, much less control. In this paper,  we address those two open problems simultaneously via using truncated followon traces in emphatic TD methods. Unlike the original followon traces, which depend on all previous history, truncated followon traces depend on only finite history, reducing variance and enabling the finite sample analysis of our proposed emphatic TD methods for both prediction and control."
"210947","Policy Evaluation and Temporal-Difference Learning in Continuous Time and Space: A Martingale Approach","Yanwei Jia, Xun Yu Zhou","https://jmlr.org//papers/volume23/21-0947/21-0947.pdf","https://www.dropbox.com/sh/5vyaw0yognhcabf/AACsArMcNmEuSwpXxcRq-qT1a?dl=0","We propose a unified framework to study policy evaluation (PE) and the associated temporal difference (TD) methods for reinforcement learning in continuous time and space. We show that PE is equivalent to maintaining the martingale condition of a process. From this perspective, we find that the mean-square TD error approximates the quadratic variation of the martingale and thus is not a suitable objective for PE. We present two methods to use the martingale characterization for designing PE algorithms. The first one minimizes a “martingale loss function"", whose solution is proved to be the best approximation of the true value function in the mean--square sense. This method interprets the classical gradient Monte-Carlo algorithm. The second method is based on a system of equations called the “martingale orthogonality conditions"" with test functions. Solving these equations in different ways recovers various classical TD algorithms, such as TD($\lambda$), LSTD, and GTD. Different choices of test functions determine in what sense the resulting solutions approximate the true value function. Moreover, we prove that any convergent time-discretized algorithm converges to its continuous-time counterpart as the mesh size goes to zero, and we provide the convergence rate. We demonstrate the theoretical results and corresponding algorithms with numerical experiments and applications."
"210991","Principal Components Bias in Over-parameterized Linear Models, and its Manifestation in Deep Neural Networks","Guy Hacohen, Daphna Weinshall","https://jmlr.org//papers/volume23/21-0991/21-0991.pdf","","Recent work suggests that convolutional neural networks of different architectures learn to classify images in the same order. To understand this phenomenon, we revisit the over-parametrized deep linear network model. Our analysis reveals that, when the hidden layers are wide enough, the convergence rate of this model's parameters is exponentially faster along the directions of the larger principal components of the data, at a rate governed by the corresponding singular values. We term this convergence pattern the Principal Components bias (PC-bias). Empirically, we show how the PC-bias streamlines the order of learning of both linear and non-linear networks, more prominently at earlier stages of learning. We then compare our results to the simplicity bias, showing that both biases can be seen independently, and affect the order of learning in different ways. Finally, we discuss how the PC-bias may explain some benefits of early stopping and its connection to PCA, and why deep networks converge more slowly with random labels."
"210999","Statistical Rates of Convergence for Functional Partially Linear Support Vector Machines for Classification","Yingying Zhang, Yan-Yong Zhao, Heng Lian","https://jmlr.org//papers/volume23/21-0999/21-0999.pdf","","In this paper, we consider the learning rate of support vector machines with both a functional predictor and a high-dimensional multivariate vectorial predictor. Similar to the literature on learning in reproducing kernel Hilbert spaces, a source condition and a capacity condition are used to characterize the convergence rate of the estimator. It is highly non-trivial to establish the possibly faster rate of the linear part. Using a key basic inequality comparing losses at two carefully constructed points, we establish the learning rate of the linear part which is the same as if the functional part is known. The proof relies on empirical processes and the Rademacher complexity bound in the semi-nonparametric setting as analytic tools, Young's inequality for operators, as well as a novel  “approximate convexity"" assumption."
"211043","A universally consistent learning rule with a universally monotone error","Vladimir Pestov","https://jmlr.org//papers/volume23/21-1043/21-1043.pdf","","We present a universally consistent learning rule whose expected error is monotone non-increasing with the sample size under every data distribution. The question of existence of such rules was brought up in 1996 by Devroye, Györfi and Lugosi (who called them “smart”). Our rule is fully deterministic, a data-dependent partitioning rule constructed in an arbitrary domain (a standard Borel space) using a cyclic order. The central idea is to only partition at each step those cyclic intervals that exhibit a sufficient empirical diversity of labels, thus avoiding a region where the error function is convex."
"211124","ktrain: A Low-Code Library for Augmented Machine Learning","Arun S. Maiya","https://jmlr.org//papers/volume23/21-1124/21-1124.pdf","https://github.com/amaiya/ktrain","We present ktrain, a low-code Python library that makes machine learning more accessible and easier to apply. As a wrapper to TensorFlow and many other libraries (e.g., transformers, scikit-learn, stellargraph), it is designed to make sophisticated, state-of-the-art machine learning models simple to build, train, inspect, and apply by both beginners and experienced practitioners. Featuring modules that support text data (e.g., text classification, sequence tagging, open-domain question-answering), vision data (e.g., image classification), graph data (e.g., node classification, link prediction), and tabular data, ktrain presents a simple unified interface enabling one to quickly solve a wide range of tasks in as little as three or four ""commands"" or lines of code."
"211159","Structure Learning for Directed Trees","Martin E. Jakobsen, Rajen D. Shah, Peter Bühlmann, Jonas Peters","https://jmlr.org//papers/volume23/21-1159/21-1159.pdf","https://github.com/MartinEmilJakobsen/CAT","Knowing the causal structure of a system is of fundamental interest in many areas of science and can aid the design of prediction algorithms that work well under manipulations to the system. The causal structure becomes identifiable from the observational distribution under certain restrictions. To learn the structure from data, score-based methods evaluate different graphs according to the quality of their fits. However, for large, continuous, and nonlinear models, these rely on heuristic optimization approaches with no general guarantees of recovering the true causal structure. In this paper, we consider structure learning of directed trees. We propose a fast and scalable method based on Chu–Liu–Edmonds’ algorithm we call causal additive trees (CAT). For the case of Gaussian errors, we prove consistency in an asymptotic regime with a vanishing identifiability gap. We also introduce two methods for testing substructure hypotheses with asymptotic family-wise error rate control that is valid post-selection and in unidentified settings. Furthermore, we study the identifiability gap, which quantifies how much better the true causal model fits the observational distribution, and prove that it is lower bounded by local properties of the causal model. Simulation studies demonstrate the favorable performance of CAT compared to competing structure learning methods."
"211189","Fairness-Aware PAC Learning from Corrupted Data","Nikola Konstantinov, Christoph H. Lampert","https://jmlr.org//papers/volume23/21-1189/21-1189.pdf","","Addressing fairness concerns about machine learning models is a crucial step towards their long-term adoption in real-world automated systems. While many approaches have been developed for training fair models from data, little is known about the robustness of these methods to data corruption. In this work we consider fairness-aware learning under worst-case data manipulations. We show that an adversary can in some situations force any learner to return an overly biased classifier, regardless of the sample size and with or without degrading accuracy, and that the strength of the excess bias increases for learning problems with underrepresented protected groups in the data. We also prove that our hardness results are tight up to constant factors. To this end, we study two natural learning algorithms that optimize for both accuracy and fairness and show that these algorithms enjoy guarantees that are order-optimal in terms of the corruption ratio and the protected groups frequencies in the large data limit."
"211270","Topologically penalized regression on manifolds","Olympio Hacquard, Krishnakumar Balasubramanian, Gilles Blanchard, Clément Levrard, Wolfgang Polonik","https://jmlr.org//papers/volume23/21-1270/21-1270.pdf","https://github.com/OlympioH/Lap_reg_topo_pen","We study a regression problem on a compact manifold M. In order to take advantage of the underlying geometry and topology of the data, the regression task is performed on the basis of the first several eigenfunctions of the Laplace-Beltrami operator of the manifold, that are regularized with topological penalties. The proposed penalties are based on the topology of the sub-level sets of either the eigenfunctions or the estimated function. The overall approach is shown to yield promising and competitive performance on various applications to both synthetic and real data sets. We also provide theoretical guarantees on the regression function estimates, on both its prediction error and its smoothness (in a topological sense). Taken together, these results support the relevance of our approach in the case where the targeted function is “topologically smooth”."
"211282","Explicit Convergence Rates of Greedy and Random Quasi-Newton Methods","Dachao Lin, Haishan Ye, Zhihua Zhang","https://jmlr.org//papers/volume23/21-1282/21-1282.pdf","","Optimization is important in machine learning problems, and quasi-Newton methods have a reputation as the most efficient numerical methods for smooth unconstrained optimization. In this paper, we study the explicit superlinear convergence rates of quasi-Newton methods and address two open problems mentioned by Rodomanov and Nesterov (2021b). First, we extend Rodomanov and Nesterov (2021b)’s results to random quasi-Newton methods, which include common DFP, BFGS, SR1 methods. Such random methods employ a random direction for updating the approximate Hessian matrix in each iteration. Second, we focus on the specific quasi-Newton methods: SR1 and BFGS methods. We provide improved versions of greedy and random methods with provable better explicit (local) superlinear convergence rates. Our analysis is closely related to the approximation of a given Hessian matrix, unconstrained quadratic objective, as well as the general strongly convex, smooth, and strongly self-concordant functions."
"211390","Scaling and Scalability: Provable Nonconvex Low-Rank Tensor Estimation from Incomplete Measurements","Tian Tong, Cong Ma, Ashley Prater-Bennette, Erin Tripp, Yuejie Chi","https://jmlr.org//papers/volume23/21-1390/21-1390.pdf","https://github.com/Titan-Tong/ScaledGD","Tensors, which provide a powerful and flexible model for representing multi-attribute data and multi-way interactions, play an indispensable role in modern data science  across various fields in science and engineering. A fundamental task is to faithfully recover the tensor from highly incomplete measurements in a statistically and computationally efficient manner. Harnessing the low-rank structure of tensors in the Tucker decomposition, this paper develops a scaled gradient descent (ScaledGD) algorithm to directly recover the tensor factors with tailored spectral initializations, and shows that it provably converges at a linear rate independent of the condition number of the ground truth tensor for two canonical problems --- tensor completion and tensor regression --- as soon as the sample size is above the order of $n^{3/2}$ ignoring other parameter dependencies, where $n$ is the dimension of the tensor. This leads to an extremely scalable approach to low-rank tensor estimation compared with prior art, which suffers from at least one of the following drawbacks: extreme sensitivity to ill-conditioning, high per-iteration costs in terms of memory and computation, or poor sample complexity guarantees. To the best of our knowledge, ScaledGD is the first algorithm that achieves near-optimal statistical and computational complexities simultaneously for low-rank tensor completion with the Tucker decomposition. Our algorithm highlights the power of appropriate preconditioning in accelerating nonconvex statistical estimation, where the iteration-varying preconditioners promote desirable invariance properties of the trajectory with respect to the underlying symmetry in low-rank tensor factorization."
"19104","Solving L1-regularized SVMs and Related Linear Programs: Revisiting the Effectiveness of Column and Constraint Generation","Antoine Dedieu, Rahul Mazumder, Haoyue Wang","https://jmlr.org//papers/volume23/19-104/19-104.pdf","","The linear Support Vector Machine (SVM) is a classic classification technique in machine learning. Motivated by applications in high dimensional statistics, we consider penalized SVM problems involving the minimization of a hinge-loss function with a convex sparsity-inducing regularizer such as: the L1-norm on the coefficients, its grouped generalization and the sorted L1-penalty (aka Slope). Each problem can be expressed as a Linear Program (LP) and is computationally challenging when the number of features and/or samples is large---the current state of algorithms for these problems is rather nascent when compared to the usual L2-regularized linear SVM. To this end, we propose new computational algorithms for these LPs by bringing together techniques from (a) classical column (and constraint) generation methods and (b) first-order methods for non-smooth convex optimization---techniques that appear to be rarely used together for solving large scale LPs. These components have their respective strengths; and while they are found to be useful as separate entities, they appear to be more powerful in practice when used together in the context of solving large-scale LPs such as the ones studied herein. Our approach complements the strengths of (a) and (b)---leading to a scheme that seems to significantly outperform commercial solvers as well as specialized implementations for these problems. We present numerical results on a series of real and synthetic data sets demonstrating the surprising effectiveness of classic column/constraint generation methods in the context of challenging LP-based machine learning tasks."
"19350","Improved Classification Rates for Localized SVMs","Ingrid Blaschzyk, Ingo Steinwart","https://jmlr.org//papers/volume23/19-350/19-350.pdf","","Localized support vector machines solve SVMs on many spatially defined small chunks and besides their computational benefit compared to global SVMs one of their main characteristics is the freedom of choosing arbitrary kernel and regularization parameter on each cell. We take advantage of this observation to derive global learning rates for localized SVMs with Gaussian kernels and hinge loss. It turns out that our rates outperform under suitable sets of assumptions known classification rates for localized SVMs, for global SVMs, and other learning algorithms based on e.g., plug-in rules or trees. The localized SVM rates are achieved under a set of margin conditions, which describe the behavior of the data-generating distribution, and no assumption on the existence of a density is made. Moreover, we show that our rates are obtained adaptively, that is without knowing the margin parameters in advance. The statistical analysis of the excess risk relies on a simple partitioning based technique, which splits the input space into a subset that is close to the decision boundary and into a subset that is sufficiently far away. A crucial condition to derive then improved global rates is a margin condition that relates the distance to the decision boundary to the amount of noise."
"19511","Generalization Bounds and Representation Learning for Estimation of  Potential Outcomes and Causal Effects","Fredrik D. Johansson, Uri Shalit, Nathan Kallus, David Sontag","https://jmlr.org//papers/volume23/19-511/19-511.pdf","","Practitioners in diverse fields such as healthcare, economics and education are eager to apply machine learning to improve decision making. The cost and impracticality of performing experiments and a recent monumental increase in electronic record keeping has brought attention to the problem of evaluating decisions based on non-experimental observational data. This is the setting of this work. In particular, we study estimation of individual-level potential outcomes and causal effects---such as a single patient's response to alternative medication---from recorded contexts, decisions and outcomes. We give generalization bounds on the error in estimated outcomes based on distributional distance measures between re-weighted samples of groups receiving different treatments. We provide conditions under which our bounds are tight and show how they relate to results for unsupervised domain adaptation. Led by our theoretical results, we devise algorithms which learn representations and weighting functions that minimize our bounds by regularizing the representation's induced treatment group distance, and encourage sharing of information between treatment groups. Finally, an experimental evaluation on real and synthetic data shows the value of our proposed representation architecture and regularization scheme."
"19571","Unbiased estimators for random design regression","Michał Dereziński, Manfred K. Warmuth, Daniel Hsu","https://jmlr.org//papers/volume23/19-571/19-571.pdf","","In linear regression we wish to estimate the optimum linear least squares predictor for a distribution over $d$-dimensional input points and real-valued responses, based on a small sample. Under standard random design analysis, where the sample is drawn i.i.d. from the input distribution, the least squares solution for that sample can be viewed as the natural estimator of the optimum. Unfortunately, this estimator almost always incurs an undesirable bias coming from the randomness of the input points, which is a significant bottleneck in model averaging. In this paper we show that it is possible to draw a non-i.i.d. sample of input points such that, regardless of the response model, the least squares solution is an unbiased estimator of the optimum. Moreover, this sample can be produced efficiently by augmenting a previously drawn i.i.d. sample with an additional set of $d$ points, drawn jointly according to a certain determinantal point process constructed from the input distribution rescaled by the squared volume spanned by the points. Motivated by this, we develop a theoretical framework for studying volume-rescaled sampling, and in the process prove a number of new matrix expectation identities. We use them to show that for any input distribution and $\epsilon>0$ there is a random design consisting of $O(d\log d+ d/\epsilon)$ points from which an unbiased estimator can be constructed whose expected square loss over the entire distribution is bounded by $1+\epsilon$ times the loss of the optimum. We provide efficient algorithms for constructing such unbiased estimators in a number of practical settings.  In one such setting, we let the input distribution be uniform over a large dataset of $n\gg d$ points. 
Here, we obtain the first unbiased least squares estimator that can be constructed in time nearly-linear in the data size, resulting in strong guarantees for model averaging. We achieve these computational gains by introducing a new algorithmic technique, called distortion-free intermediate sampling, which is the first method to enable sampling from determinantal point processes in time polynomial in the sample size."
"19918","A Worst Case Analysis of Calibrated Label Ranking Multi-label Classification Method","Lucas Henrique Sousa Mello, Flávio Miguel Varejão, Alexandre Loureiros Rodrigues","https://jmlr.org//papers/volume23/19-918/19-918.pdf","","Most multi-label classification methods are evaluated on real datasets, which is a good practice for comparing the performance among methods on the average scenario. Due to the large amount of factors to consider, this empirical approach does not explain, nor does show the factors impacting the performance. A reasonable way to understand some of the performance’s factors of multi-label methods independently of the context is to find a mathematical proof about them. In this paper, mathematical proofs are given for the multi-label method ranking by pairwise comparison and its extension for classification named by calibrated label ranking, showing their performance on a worst case scenario for five multi-label metrics. The pairwise approach adopted by ranking by pairwise comparison enables the algorithm to achieve the optimal performance on Spearman rank correlation. However, the findings presented in this paper clearly show that the same pairwise approach adopted by the algorithm is also a crucial factor contributing to a very poor performance on other multi-label metrics."
"20021","D-GCCA: Decomposition-based Generalized Canonical Correlation Analysis for Multi-view High-dimensional Data","Hai Shu, Zhe Qu, Hongtu Zhu","https://jmlr.org//papers/volume23/20-021/20-021.pdf","https://github.com/shu-hai/D-GCCA","Modern biomedical studies often collect multi-view data, that is, multiple types of data measured on the same set of objects. A popular model in high-dimensional multi-view data analysis is to decompose each view’s data matrix into a low-rank common-source matrix generated by latent factors common across all data views, a low-rank distinctive-source matrix corresponding to each view, and an additive noise matrix. We propose a novel decomposition method for this model, called decomposition-based generalized canonical correlation analysis (D-GCCA). The D-GCCA rigorously defines the decomposition on the L2 space of random variables in contrast to the Euclidean dot product space used by most existing methods, thereby being able to provide the estimation consistency for the low-rank matrix recovery. Moreover, to well calibrate common latent factors, we impose a desirable orthogonality constraint on distinctive latent factors. Existing methods, however, inadequately consider such orthogonality and may thus suffer from substantial loss of undetected common-source variation. Our D-GCCA takes one step further than generalized canonical correlation analysis by separating common and distinctive components among canonical variables, while enjoying an appealing interpretation from the perspective of principal component analysis. Furthermore, we propose to use the variable-level proportion of signal variance explained by common or distinctive latent factors for selecting the variables most influenced. Consistent estimators of our D-GCCA method are established with good finite-sample numerical performance, and have closed-form expressions leading to efficient computation especially for large-scale data. 
The superiority of D-GCCA over state-of-the-art methods is also corroborated in simulations and real-world data examples."
"201060","Scalable and Efficient Hypothesis Testing with Random Forests","Tim Coleman, Wei Peng, Lucas Mentch","https://jmlr.org//papers/volume23/20-1060/20-1060.pdf","","Throughout the last decade, random forests have established themselves as among the most accurate and popular supervised learning methods. While their black-box nature has made their mathematical analysis difficult, recent work has established important statistical properties like consistency and asymptotic normality by considering subsampling in lieu of bootstrapping. Though such results open the door to traditional inference procedures, all formal methods suggested thus far place severe restrictions on the testing framework and their computational overhead often precludes their practical scientific use. Here we propose a hypothesis test to formally assess feature significance, which uses permutation tests to circumvent computationally infeasible estimates of nuisance parameters. This test is intended to be analogous to the F-test for linear regression. We establish asymptotic validity of the test via exchangeability arguments and show that the test maintains high power with orders of magnitude fewer computations. Importantly, the procedure scales easily to big data settings where large training and testing sets may be employed, conducting statistically valid inference without the need to construct additional models. Simulations and applications to ecological data, where random forests have recently shown promise, are provided."
"201121","Interlocking Backpropagation: Improving depthwise model-parallelism","Aidan N. Gomez, Oscar Key, Kuba Perlin, Stephen Gou, Nick Frosst, Jeff Dean, Yarin Gal","https://jmlr.org//papers/volume23/20-1121/20-1121.pdf","","The number of parameters in state of the art neural networks has drastically increased in recent years. This surge of interest in large scale neural networks has motivated the development of new distributed training strategies enabling such models. One such strategy is model-parallel distributed training. Unfortunately, model-parallelism can suffer from poor resource utilisation, which leads to wasted resources. In this work, we improve upon recent developments in an idealised model-parallel optimisation setting: local learning. Motivated by poor resource utilisation in the global setting and poor task performance in the local setting, we introduce a class of intermediary strategies between local and global learning referred to as interlocking backpropagation. These strategies preserve many of the compute-efficiency advantages of local optimisation, while recovering much of the task performance achieved by global optimisation. We assess our strategies on both image classification ResNets and Transformer language models, finding that our strategy consistently out-performs local learning in terms of task performance, and out-performs global learning in training efficiency."
"201239","Projection-free Distributed Online Learning with Sublinear Communication Complexity","Yuanyu Wan, Guanghui Wang, Wei-Wei Tu, Lijun Zhang","https://jmlr.org//papers/volume23/20-1239/20-1239.pdf","","To deal with complicated constraints via locally light computations in distributed online learning, a recent study has presented a projection-free algorithm called distributed online conditional gradient (D-OCG), and achieved an $O(T^{3/4})$ regret bound for convex losses, where $T$ is the number of total rounds. However, it requires $T$ communication rounds, and cannot utilize the strong convexity of losses. In this paper, we propose an improved variant of D-OCG, namely D-BOCG, which can attain the same $O(T^{3/4})$ regret bound with only $O(\sqrt{T})$ communication rounds for convex losses, and a better regret bound of $O(T^{2/3}(\log T)^{1/3})$ with fewer $O(T^{1/3}(\log T)^{2/3})$ communication rounds for strongly convex losses. The key idea is to adopt a delayed update mechanism that reduces the communication complexity, and redefine the surrogate loss function in D-OCG for exploiting the strong convexity. Furthermore, we provide lower bounds to demonstrate that the $O(\sqrt{T})$ communication rounds required by D-BOCG are optimal (in terms of $T$) for achieving the $O(T^{3/4})$ regret with convex losses, and the $O(T^{1/3}(\log T)^{2/3})$ communication rounds required by D-BOCG are near-optimal (in terms of $T$) for achieving the $O(T^{2/3}(\log T)^{1/3})$ regret with strongly convex losses up to polylogarithmic factors. Finally, to handle the more challenging bandit setting, in which only the loss value is available, we incorporate the classical one-point gradient estimator into D-BOCG, and obtain similar theoretical guarantees."
"201258","Learning Rates as a Function of Batch Size: A Random Matrix Theory Approach to Neural Network Training","Diego Granziol, Stefan Zohren, Stephen Roberts","https://jmlr.org//papers/volume23/20-1258/20-1258.pdf","","We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory.  We demonstrate that the magnitude of the extremal values of the batch Hessian are larger than those of the empirical Hessian. We also derive similar results for the Generalised Gauss-Newton matrix approximation of the Hessian. As a consequence of our theorems we derive an analytical expressions for the maximal learning rates as a function of batch size, informing practical training regimens for both stochastic gradient descent (linear scaling) and adaptive algorithms, such as Adam (square root scaling), for smooth, non-convex deep neural networks. Whilst the linear scaling for stochastic gradient descent has been derived under more restrictive conditions, which we generalise, the square root scaling rule for adaptive optimisers is, to our knowledge, completely novel. We validate our claims on the VGG/WideResNet architectures on the CIFAR-100 and ImageNet data sets. Based on our investigations of the sub-sampled Hessian we develop a stochastic Lanczos quadrature based on the fly learning rate and momentum learner, which avoids the need for expensive multiple evaluations for these key hyper-parameters and shows good preliminary results on the Pre-Residual Architecture for CIFAR-100. We further investigate the similarity between the Hessian spectrum of a multi-layer perceptron, trained on Gaussian mixture data, compared to that of deep neural networks trained on natural images. We find striking similarities, with both exhibiting rank degeneracy, a bulk spectrum and outliers to that spectrum. 
Furthermore, we show that ZCA whitening can remove such outliers early on in training before class separation occurs, but that outliers persist in later training."
"201265","Training and Evaluation of Deep Policies Using Reinforcement Learning and Generative Models","Ali Ghadirzadeh, Petra Poklukar, Karol Arndt, Chelsea Finn, Ville Kyrki, Danica Kragic, Mårten Björkman","https://jmlr.org//papers/volume23/20-1265/20-1265.pdf","","We present a data-efficient framework for solving sequential decision-making problems which exploits the combination of reinforcement learning (RL) and latent variable generative models. The framework, called GenRL, trains deep  policies by introducing an action latent variable such that the feed-forward policy search can be divided into two parts: (i) training a sub-policy that outputs a distribution over the action latent variable given a state of the system, and (ii) unsupervised training of a generative model that outputs a sequence of motor actions conditioned on the latent action variable. GenRL enables safe exploration and alleviates the data-inefficiency problem as it exploits prior knowledge about valid sequences of motor actions. Moreover, we provide a set of measures for evaluation of generative models such that we are able to predict the performance of the RL policy training prior to the actual training on a physical robot. We experimentally determine the characteristics of generative models that have most influence on the performance of the final policy training on two robotics tasks: shooting a hockey puck and throwing a basketball. Furthermore, we empirically demonstrate that GenRL is the only method which can safely and efficiently solve the robotics tasks compared to two state-of-the-art RL methods."
"201353","Improved Generalization Bounds for Adversarially Robust Learning","Idan Attias, Aryeh Kontorovich, Yishay Mansour","https://jmlr.org//papers/volume23/20-1353/20-1353.pdf","","We consider a model of robust learning in an adversarial environment. The learner gets uncorrupted training data with access to possible corruptions that may be affected by the adversary during testing. The learner's goal is to build a robust classifier, which will be tested on future adversarial examples.  The adversary is limited to $k$ possible corruptions for each input. We model the learner-adversary interaction as a zero-sum game. This model is closely related to the adversarial examples model of Schmidt et al. (2018); Madry et al. (2017).  Our main results consist of generalization bounds for the binary and multiclass classification, as well as the real-valued case (regression). For the binary classification setting, we both tighten the generalization bound of Feige et al. (2015), and are also able to handle infinite hypothesis classes. The sample complexity is improved from $O(\frac{1}{\epsilon^4}\log(\frac{|H|}{\delta}))$ to $O\big(\frac{1}{\epsilon^2}(kVC(H)\log^{\frac{3}{2}+\alpha}(kVC(H))+\log(\frac{1}{\delta})\big)$ for any $\alpha > 0$. Additionally, we extend the algorithm and generalization bound from the binary to the multiclass and real-valued cases. Along the way, we obtain results on fat-shattering dimension and Rademacher complexity of $k$-fold maxima over function classes; these may be of independent interest.  For binary classification, the algorithm of Feige et al. (2015)  uses a regret minimization algorithm and an ERM oracle as a black box; we adapt it for the multiclass and regression settings. The algorithm provides us with near-optimal policies for the players on a given training sample."
"201466","Signature Moments to Characterize Laws of Stochastic Processes","Ilya Chevyrev, Harald Oberhauser","https://jmlr.org//papers/volume23/20-1466/20-1466.pdf","","The sequence of moments of a vector-valued random variable can characterize its law. We study the analogous problem for path-valued random variables, that is stochastic processes, by using so-called robust signature moments. This allows us to derive a metric of maximum mean discrepancy type for laws of stochastic processes and study the topology it induces on the space of laws of stochastic processes. This metric can be kernelized using the signature kernel which allows to efficiently compute it. As an application, we provide a non-parametric two-sample hypothesis test for laws of stochastic processes."
"20219","Asymptotic Analysis of Sampling Estimators for Randomized Numerical Linear Algebra Algorithms","Ping Ma, Yongkai Chen, Xinlian Zhang, Xin Xing, Jingyi Ma, Michael W. Mahoney","https://jmlr.org//papers/volume23/20-219/20-219.pdf","","The statistical analysis of Randomized Numerical Linear Algebra (RandNLA) algorithms within the past few years has mostly focused on their performance as point estimators. However, this is insufficient for conducting statistical inference, e.g., constructing confidence intervals and hypothesis testing, since the distribution of the estimator is lacking. In this article, we develop an asymptotic analysis to derive the distribution of RandNLA sampling estimators for the least-squares problem. In particular, we derive the asymptotic distribution of a general sampling estimator with arbitrary sampling probabilities in a fixed design setting. The analysis is conducted in two complementary settings, i.e., when the objective of interest is to approximate the full sample estimator, and when it is to infer the underlying ground truth model parameters. For each setting, we show that the sampling estimator is asymptotically normally distributed under mild regularity conditions. Moreover, the sampling estimator is asymptotically unbiased in both settings. Based on our asymptotic analysis, we use two criteria, the Asymptotic Mean Squared Error (AMSE) and the Expected Asymptotic Mean Squared Error (EAMSE), to identify optimal sampling probabilities. Several of these optimal sampling probability distributions are new to the literature, e.g., the root leverage sampling estimator and the predictor length sampling estimator. Our theoretical results clarify the role of leverage in the sampling process, and our empirical results demonstrate improvements over existing methods."
"20664","Logarithmic Regret for Episodic Continuous-Time Linear-Quadratic Reinforcement Learning over a Finite-Time Horizon","Matteo Basei, Xin Guo, Anran Hu, Yufei Zhang","https://jmlr.org//papers/volume23/20-664/20-664.pdf","","We study finite-time horizon continuous-time linear-quadratic reinforcement learning problems in an episodic setting, where both  the state and control coefficients are unknown to the controller. We first propose a least-squares algorithm based on continuous-time observations and controls, and establish a logarithmic regret bound of magnitude $\mathcal{O}((\ln M)(\ln\ln M) )$, with $M$ being the number of learning episodes. The analysis consists of two components:  perturbation analysis, which exploits the regularity and robustness of the associated Riccati differential equation; and parameter estimation error, which relies on sub-exponential properties of continuous-time least-squares estimators. We further propose a practically implementable least-squares  algorithm based on discrete-time observations and piecewise constant controls, which achieves similar logarithmic regret with an additional term depending explicitly on the time stepsizes used in the algorithm."
"20717","KL-UCB-Switch: Optimal Regret Bounds for Stochastic Bandits from Both a Distribution-Dependent and a Distribution-Free Viewpoints","Aurélien Garivier, Hédi Hadiji, Pierre Ménard, Gilles Stoltz","https://jmlr.org//papers/volume23/20-717/20-717.pdf","","We consider $K$-armed stochastic bandits and consider cumulative regret bounds up to time $T$. We are interested in strategies achieving simultaneously a distribution-free regret bound of optimal order $\sqrt{KT}$ and a distribution-dependent regret that is asymptotically optimal, that is, matching the $\kappa \ln T$ lower bound by Lai and Robbins (1985) and Burnetas and Katehakis (1996), where $\kappa$ is the optimal problem-dependent constant. This constant $\kappa$ depends on the model $\mathcal{D}$ considered (the family of possible distributions over the arms). Ménard and Garivier (2017) provided strategies achieving such a bi-optimality in the parametric case of models given by one-dimensional exponential families, while Lattimore (2016, 2018) did so for the family of (sub)Gaussian distributions with variance less than $1$. We extend this result to the non-parametric case of all distributions over $[0,1]$. We do so by combining the MOSS strategy by Audibert and Bubeck (2009), which enjoys a distribution-free regret bound of optimal order $\sqrt{KT}$, and the KL-UCB strategy by Cappé et al. (2013), for which we provide in passing the first analysis of an optimal distribution-dependent $\kappa\ln T$ regret bound in the model of all distributions over $[0,1]$. We were able to obtain this non-parametric bi-optimality result while working hard to streamline the proofs (of previously known regret bounds and thus of the new analyses carried out); a second merit of the present contribution is therefore to provide a review of proofs of classical regret bounds for index-based strategies for $K$-armed stochastic bandits."
"20748","Matrix Completion with Covariate Information and  Informative Missingness","Huaqing Jin, Yanyuan Ma, Fei Jiang","https://jmlr.org//papers/volume23/20-748/20-748.pdf","https://github.com/JINhuaqing/MNAR","We study the problem of matrix completion when the missingness of the matrix entries is dependent on the unobserved response values themselves and hence the missingness itself is informative. Furthermore, we allow to take into account the covariate information to establish its relation with the response and hence enable prediction. We devise a novel procedure to simultaneously complete the partially observed matrix and assess the covariate effect. Allowing the matrix dimensions as well as the number of covariates to grow ultra-high, under the classic low-rank matrix and sparse covariate effect assumptions, we rigorously establish the statistical guarantee of our procedure and the algorithmic convergence. The method is demonstrated via simulation studies and is used to analyze a Yelp data set and a MovieLens data set."
"20830","Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent","David Holzmüller, Ingo Steinwart","https://jmlr.org//papers/volume23/20-830/20-830.pdf","https://github.com/dholzmueller/nn_inconsistency","We prove that two-layer (Leaky)ReLU networks initialized by e.g. the widely used method proposed by He et al. (2015) and trained using gradient descent on a least-squares loss are not universally consistent. Specifically, we describe a large class of one-dimensional data-generating distributions for which, with high probability, gradient descent only finds a bad local minimum of the optimization landscape, since it is unable to move the biases far away from their initialization at zero. It turns out that in these cases, the found network essentially performs linear regression even if the target function is non-linear. We further provide numerical evidence that this happens in practical situations, for some multi-dimensional distributions and that stochastic gradient descent exhibits similar behavior. We also provide empirical results on how the choice of initialization and optimizer can influence this behavior."
"20964","Extensions to the Proximal Distance Method of Constrained Optimization","Alfonso Landeros, Oscar Hernan Madrid Padilla, Hua Zhou, Kenneth Lange","https://jmlr.org//papers/volume23/20-964/20-964.pdf","https://github.com/alanderos91/ProximalDistanceAlgorithms.jl","The current paper studies the problem of minimizing a loss $f(\boldsymbol{x})$ subject to constraints of the form $\boldsymbol{D}\boldsymbol{x} \in S$, where $S$ is a closed set, convex or not, and $\boldsymbol{D}$ is a matrix that fuses parameters. Fusion constraints can capture smoothness, sparsity, or more general constraint patterns. To tackle this generic class of problems, we combine the Beltrami-Courant penalty method of optimization with the proximal distance principle. The latter is driven by minimization of penalized objectives $f(\boldsymbol{x})+\frac{\rho}{2}\text{dist}(\boldsymbol{D}\boldsymbol{x},S)^2$ involving large tuning constants $\rho$ and the squared Euclidean distance of $\boldsymbol{D}\boldsymbol{x}$ from $S$. The next iterate $\boldsymbol{x}_{n+1}$ of the corresponding proximal distance algorithm is constructed from the current iterate $\boldsymbol{x}_n$ by minimizing the majorizing surrogate function $f(\boldsymbol{x})+\frac{\rho}{2}\|\boldsymbol{D}\boldsymbol{x}-\mathcal{P}_{S}(\boldsymbol{D}\boldsymbol{x}_n)\|^2$. For fixed $\rho$ and a subanalytic loss $f(\boldsymbol{x})$ and a subanalytic constraint set $S$, we prove convergence to a stationary point. Under stronger assumptions, we provide convergence rates and demonstrate linear local convergence.  We also construct a steepest descent variant to avoid costly linear system solves. To benchmark our algorithms, we compare their results to those delivered by the alternating direction method of multipliers. Our extensive numerical tests include problems on metric projection, convex regression, convex clustering, total variation image denoising, and projection of a matrix to a good condition number. 
These experiments demonstrate the superior speed and acceptable accuracy of our steepest variant on high-dimensional problems."
"210078","Boulevard: Regularized Stochastic Gradient Boosted Trees and Their Limiting Distribution","Yichen Zhou, Giles Hooker","https://jmlr.org//papers/volume23/21-0078/21-0078.pdf","https://github.com/siriuz42/boulevard","This paper examines a novel gradient boosting framework for regression. We regularize gradient boosted trees by introducing subsampling and employ a modified shrinkage algorithm so that at every boosting stage the estimate is given by an average of trees. The resulting algorithm, titled ""Boulevard'"", is shown to converge as the number of trees grows. This construction allows us to demonstrate a central limit theorem for this limit, providing a characterization of uncertainty for predictions. A simulation study and real world examples provide support for both the predictive accuracy of the model and its limiting behavior."
"210190","Statistical Optimality and Stability of Tangent Transform Algorithms in Logit Models","Indrajit Ghosh, Anirban Bhattacharya, Debdeep Pati","https://jmlr.org//papers/volume23/21-0190/21-0190.pdf","","A systematic approach to finding variational approximation in an otherwise intractable non-conjugate model is to exploit the general principle of convex duality by minorizing the marginal likelihood that renders the problem tractable. While such approaches are popular in the context of variational inference in non-conjugate Bayesian models, theoretical guarantees on statistical optimality and algorithmic convergence are lacking. Focusing on logistic regression models, we provide mild conditions on the data generating process to derive non-asymptotic upper bounds to the risk incurred by the variational optima. We demonstrate that these assumptions can be completely relaxed if one considers a slight variation of the algorithm by raising the likelihood to a fractional power. Next, we utilize the theory of dynamical systems to provide convergence guarantees for such algorithms in logistic and multinomial logit regression. In particular, we establish local asymptotic stability of the algorithm without any assumptions on the data-generating process. We explore a special case involving a semi-orthogonal design under which a global convergence is obtained. The theory is further illustrated using several numerical studies."
"210211","A Primer for Neural Arithmetic Logic Modules","Bhumika Mistry, Katayoun Farrahi, Jonathon Hare","https://jmlr.org//papers/volume23/21-0211/21-0211.pdf","https://github.com/bmistry4/nalm-benchmark","Neural Arithmetic Logic Modules have become a growing area of interest, though remain a niche field. These modules are neural networks which aim to achieve systematic generalisation in learning arithmetic and/or logic operations such as $\{+, -, \times, \div, \leq, \textrm{AND}\}$ while also being interpretable. This paper is the first in discussing the current state of progress of this field, explaining key works, starting with the Neural Arithmetic Logic Unit (NALU). Focusing on the shortcomings of the NALU, we provide an in-depth analysis to reason about design choices of recent modules. A cross-comparison between modules is made on experiment setups and findings, where we highlight inconsistencies in a fundamental experiment causing the inability to directly compare across papers. To alleviate the existing inconsistencies, we create a benchmark which compares all existing arithmetic NALMs. We finish by providing a novel discussion of existing applications for NALU and research directions requiring further exploration."
"210218","Estimating Density Models with Truncation Boundaries using Score Matching","Song Liu, Takafumi Kanamori, Daniel J. Williams","https://jmlr.org//papers/volume23/21-0218/21-0218.pdf","https://github.com/anewgithubname/Truncated-Score-Matching","Truncated densities are probability density functions defined on truncated domains. They share the same parametric form with their non-truncated counterparts up to a normalizing constant. Since the computation of their normalizing constants is usually infeasible, Maximum Likelihood Estimation cannot be easily applied to estimate truncated density models. Score Matching (SM) is a powerful tool for fitting parameters using only unnormalized models. However, it cannot be directly applied here as boundary conditions that derive a tractable SM objective are not satisfied by truncated densities. This paper studies parameter estimation for truncated probability densities using SM. The estimator minimizes a weighted Fisher divergence. The weight function is simply the shortest distance from a data point to the domain's boundary. We show this choice of weight function naturally arises from minimizing the Stein discrepancy and upper bounding the finite-sample estimation error. We demonstrate the usefulness of our method via numerical experiments and a study on the Chicago crime data set. We also show that the proposed density estimation can correct the outlier-trimming bias caused by aggressive outlier detection methods."
"210222","Adversarial Classification: Necessary Conditions and Geometric Flows","Nicolás García Trillos, Ryan Murray","https://jmlr.org//papers/volume23/21-0222/21-0222.pdf","","We study a version of adversarial classification where an adversary is empowered to corrupt data inputs up to some distance $\varepsilon$, using tools from variational analysis. In particular, we describe necessary conditions associated with the optimal classifier subject to such an adversary. Using the necessary conditions, we derive a geometric evolution equation which can be used to track the change in classification boundaries as $\varepsilon$ varies. This evolution equation may be described as an uncoupled system of differential equations in one dimension, or as a mean curvature type equation in higher dimension. In one dimension, and under mild assumptions on the data distribution, we rigorously prove that one can use the initial value problem starting from $\varepsilon=0$, which is simply the Bayes classifier, in order to solve for the global minimizer of the adversarial problem for small values of $\varepsilon$. In higher dimensions we provide a similar result, albeit conditional to the existence of regular solutions of the initial value problem. In the process of proving our main results we obtain a result of independent interest connecting the original adversarial problem with an optimal transport problem under no assumptions on whether classes are balanced or not. Numerical examples illustrating these ideas are also presented."
"210296","Active Structure Learning of Bayesian Networks in an Observational Setting","Noa Ben-David, Sivan Sabato","https://jmlr.org//papers/volume23/21-0296/21-0296.pdf","https://github.com/noabdavid/activeBNSL","We study active structure learning of Bayesian networks in an observational setting, in which there are external limitations on the number of variable values that can be observed from the same sample. Random samples are drawn from the joint distribution of the network variables, and the algorithm iteratively selects which variables to observe in the next   sample. We propose a new active learning algorithm for this setting, that finds with a high probability a structure with a score that is $\epsilon$-close to the optimal score. We show that for a class of distributions that we term stable, a sample complexity reduction of up to a factor of $\widetilde{\Omega}(d^3)$ can be obtained, where $d$ is the number of network variables. We further show that in the worst case, the sample complexity of the active algorithm is guaranteed to be almost the same as that of a naive baseline algorithm. To supplement the theoretical results, we report experiments that compare the performance of the new active algorithm to the naive baseline and demonstrate the sample complexity improvements. Code for the algorithm and for the experiments is provided at https://github.com/noabdavid/activeBNSL."
"210308","Learning to Optimize: A Primer and A Benchmark","Tianlong Chen, Xiaohan Chen, Wuyang Chen, Howard Heaton, Jialin Liu, Zhangyang Wang, Wotao Yin","https://jmlr.org//papers/volume23/21-0308/21-0308.pdf","https://github.com/VITA-Group/Open-L2O","Learning to optimize (L2O) is an emerging approach that leverages machine learning to develop optimization methods, aiming at reducing the laborious iterations of hand engineering. It automates the design of an optimization method based on its performance on a set of training problems. This data-driven procedure generates methods that can efficiently solve problems similar to those in training. In sharp contrast, the typical and traditional designs of optimization methods are theory-driven, so they obtain performance guarantees over the classes of problems specified by the theory. The difference makes L2O suitable for repeatedly solving a particular optimization problem over a specific distribution of data, while it typically fails on out-of-distribution problems. The practicality of L2O depends on the type of target optimization, the chosen architecture of the method to learn, and the training procedure. This new paradigm has motivated a community of researchers to explore L2O and report their findings. This article is poised to be the first comprehensive survey and benchmark of L2O for continuous optimization. We set up taxonomies, categorize existing works and research directions, present insights, and identify open challenges. We benchmarked many existing L2O approaches on a few representative optimization problems. For reproducible research and fair benchmarking purposes, we released our software implementation and data in the package Open-L2O at https://github.com/VITA-Group/Open-L2O."
"210402","Clustering with Semidefinite Programming and Fixed Point Iteration","Pedro Felzenszwalb, Caroline Klivans, Alice Paul","https://jmlr.org//papers/volume23/21-0402/21-0402.pdf","","We introduce a novel method for clustering using a semidefinite programming (SDP) relaxation of the Max k-Cut problem.  The approach is based on a new methodology for rounding the solution of an SDP relaxation using iterated linear optimization.  We show the vertices of the Max k-Cut relaxation correspond to partitions of the data into at most k sets.  We also show the vertices are attractive fixed points of iterated linear optimization.  Each step of this iterative process solves a relaxation of the closest vertex problem and leads to a new clustering problem where the underlying clusters are more clearly defined.  Our experiments show that using fixed point iteration for rounding the Max k-Cut SDP relaxation leads to significantly better results when compared to randomized rounding."
"210431","Deep Limits and a Cut-Off Phenomenon for Neural Networks","Benny Avelin, Anders Karlsson","https://jmlr.org//papers/volume23/21-0431/21-0431.pdf","","We consider dynamical and geometrical aspects of deep learning. For many standard choices of layer maps we display semi-invariant metrics which quantify differences between data or decision functions. This allows us, when considering random layer maps and using non-commutative ergodic theorems, to deduce that certain limits exist when letting the number of layers tend to infinity. We also examine the random initialization of standard networks where we observe a surprising cut-off phenomenon in terms of the number of layers, the depth of the network. This could be a relevant parameter when choosing an appropriate number of layers for a given learning task, or for selecting a good initialization procedure. More generally, we hope that the notions and results in this paper can provide a framework, in particular a geometric one, for a part of the theoretical understanding of deep neural networks."
"210545","A Bregman Learning Framework for Sparse Neural Networks","Leon Bungert, Tim Roith, Daniel Tenbrinck, Martin Burger","https://jmlr.org//papers/volume23/21-0545/21-0545.pdf","https://github.com/TimRoith/BregmanLearning","We propose a learning framework based on stochastic Bregman iterations, also known as mirror descent, to train sparse neural networks with an inverse scale space approach. We derive a baseline algorithm called LinBreg, an accelerated version using momentum, and AdaBreg, which is a Bregmanized generalization of the Adam algorithm. In contrast to established methods for sparse training the proposed family of algorithms constitutes a regrowth strategy for neural networks that is solely optimization-based without additional heuristics.  Our Bregman learning framework starts the training with very few initial parameters, successively adding only significant ones to obtain a sparse and expressive network. The proposed approach is extremely easy and efficient, yet supported by the rich mathematical theory of inverse scale space methods. We derive a statistically profound sparse parameter initialization strategy and provide a rigorous stochastic convergence analysis of the loss decay and additional convergence proofs in the convex regime. Using only $3.4\%$ of the parameters of ResNet-18 we achieve $90.2\%$ test accuracy on CIFAR-10, compared to $93.6\%$ using the dense network. Our algorithm also unveils an autoencoder architecture for a denoising task. The proposed framework also has a huge potential for integrating sparse backpropagation and resource-friendly training. Code is available at https://github.com/TimRoith/BregmanLearning."
"210570","Gaussian process regression: Optimality, robustness, and relationship with kernel ridge regression","Wenjia Wang, Bing-Yi Jing","https://jmlr.org//papers/volume23/21-0570/21-0570.pdf","","Gaussian process regression is widely used in many fields, for example, machine learning, reinforcement learning and uncertainty quantification. One key component of Gaussian process regression is the unknown correlation function, which needs to be specified. In this paper, we investigate what would happen if the correlation function is misspecified. We derive upper and lower error bounds for Gaussian process regression with possibly misspecified correlation functions.  We find that when the sampling scheme is quasi-uniform, the optimal convergence rate can be attained even if the smoothness of the imposed correlation function exceeds that of the true correlation function. We also obtain convergence rates of kernel ridge regression with misspecified kernel function, where the underlying truth is a deterministic function. Our study reveals a close connection between the convergence rates of Gaussian process regression and kernel ridge regression, which is aligned with the relationship between sample paths of Gaussian process and the corresponding reproducing kernel Hilbert space. This work establishes a bridge between Bayesian learning based on Gaussian process and frequentist kernel methods with reproducing kernel Hilbert space."
"210577","Uniform deconvolution for Poisson Point Processes","Anna Bonnet, Claire Lacour, Franck Picard, Vincent Rivoirard","https://jmlr.org//papers/volume23/21-0577/21-0577.pdf","","We focus on the estimation of the intensity of a Poisson process in the presence of a uniform noise. We propose a kernel-based procedure fully calibrated in theory and practice. We show that our adaptive estimator is optimal from the oracle and minimax points of view, and provide new lower bounds when the intensity belongs to a Sobolev ball.  By developing the Goldenshluger-Lepski methodology in the case of deconvolution for Poisson processes, we propose an optimal data-driven selection of the kernel bandwidth. Our method is illustrated on the spatial distribution of replication origins and sequence motifs along the human genome."
"210711","Distributed Bootstrap for Simultaneous Inference Under High Dimensionality","Yang Yu, Shih-Kang Chao, Guang Cheng","https://jmlr.org//papers/volume23/21-0711/21-0711.pdf","https://github.com/skchao74/Distributed-bootstrap","We propose a distributed bootstrap method for simultaneous inference on high-dimensional massive data that are stored and processed with many machines. The method produces an $\ell_\infty$-norm confidence region based on a communication-efficient de-biased lasso, and we propose an efficient cross-validation approach to tune the method at every iteration. We theoretically prove a lower bound on the number of communication rounds $\tau_{\min}$ that warrants the statistical accuracy and efficiency. Furthermore, $\tau_{\min}$ only increases logarithmically with the number of workers and the intrinsic dimensionality, while nearly invariant to the nominal dimensionality. We test our theory by extensive simulation studies, and a variable screening task on a semi-synthetic dataset based on the US Airline On-Time Performance dataset. The code to reproduce the numerical results is available in Supplementary Material."
"210716","Universal Approximation Theorems for Differentiable Geometric Deep Learning","Anastasis Kratsios, Léonie Papon","https://jmlr.org//papers/volume23/21-0716/21-0716.pdf","","This paper addresses the growing need to process non-Euclidean data, by introducing a geometric deep learning (GDL) framework for building universal feedforward-type models compatible with differentiable manifold geometries. We show that our GDL models can approximate any continuous target function uniformly on compact sets of a controlled maximum diameter. We obtain curvature-dependent lower-bounds on this maximum diameter and upper-bounds on the depth of our approximating GDL models. Conversely, we find that there is always a continuous function between any two non-degenerate compact manifolds that any ""locally-defined"" GDL model cannot uniformly approximate. Our last main result identifies data-dependent conditions guaranteeing that the GDL model implementing our approximation breaks ""the curse of dimensionality."" We find that any ""real-world"" (i.e. finite) dataset always satisfies our condition and, conversely, any dataset satisfies our requirement if the target function is smooth. As applications, we confirm the universal approximation capabilities of the following GDL models: Ganea et al. (2018)'s hyperbolic feedforward networks, the architecture implementing Krishnan et al. (2015)'s deep Kalman-Filter, and deep softmax classifiers. We build universal extensions/variants of: the SPD-matrix regressor of Meyer et al. (2011), and Fletcher (2003)'s Procrustean regressor. In the Euclidean setting, our results imply a quantitative version of Kidger and Lyons (2020)'s approximation theorem and a data-dependent version of Yarotsky and Zhevnerchuk (2019)'s uncursed approximation rates."
"210738","InterpretDL: Explaining Deep Models in PaddlePaddle","Xuhong Li, Haoyi Xiong, Xingjian Li, Xuanyu Wu, Zeyu Chen, Dejing Dou","https://jmlr.org//papers/volume23/21-0738/21-0738.pdf","https://github.com/PaddlePaddle/InterpretDL","Techniques to explain the predictions of deep neural networks (DNNs) have been largely required for gaining insights into the black boxes. We introduce InterpretDL, a toolkit of explanation algorithms based on PaddlePaddle, with uniformed programming interfaces and ""plug-and-play"" designs. A few lines of codes are needed to obtain the explanation results without modifying the structure of the model. InterpretDL currently contains 16 algorithms, explaining training phases, datasets, global and local behaviors of post-trained deep models. InterpretDL also provides a number of tutorial examples and showcases to demonstrate the capability of InterpretDL working on a wide range of deep learning models, e.g., Convolutional Neural Networks (CNNs), Multi-Layer Preceptors (MLPs), Transformers, etc., for various tasks in both Computer Vision (CV) and Natural Language Processing (NLP). Furthermore, InterpretDL modularizes the implementations, making efforts to support the compatibility across frameworks. The project is available at https://github.com/PaddlePaddle/InterpretDL."
"210739","Meta-analysis of heterogeneous data: integrative sparse regression in high-dimensions","Subha Maity, Yuekai Sun, Moulinath Banerjee","https://jmlr.org//papers/volume23/21-0739/21-0739.pdf","https://github.com/smaityumich/MrLasso","We consider the task of meta-analysis in high-dimensional settings in which the data sources  are similar but non-identical. To borrow strength across such heterogeneous datasets, we introduce a global parameter that emphasizes interpretability and statistical efficiency in the presence of heterogeneity. We also propose a one-shot estimator of the global parameter that preserves the anonymity of the data sources and converges at a rate that depends on the size of the combined dataset.   For high-dimensional linear model settings, we demonstrate the superiority of our  identification restrictions in adapting to a previously seen data distribution as well as predicting for a new/unseen data distribution. Finally, we demonstrate the benefits of our approach on a large-scale drug treatment dataset involving several different cancer cell-lines."
"210755","A Forward Approach for Sufficient Dimension Reduction in Binary Classification","Jongkyeong Kang, Seung Jun Shin","https://jmlr.org//papers/volume23/21-0755/21-0755.pdf","","Since the proposal of the seminal sliced inverse regression (SIR), inverse-type methods have proved to be canonical in sufficient dimension reduction (SDR). However, they often underperform in binary classification because the binary responses yield two slices at most. In this article, we develop a forward SDR approach in binary classification based on weighted large-margin classifiers. First, we show that the gradient of a large-margin classifier is unbiased for SDR as long as the corresponding loss function is Fisher consistent. This leads us to propose the weighted outer-product of gradients (wOPG) estimator. The wOPG estimator can recover the central subspace exhaustively without linearity (or constant variance) conditions, which despite being routinely required, they are untestable assumption. We propose the gradient-based formulation for the large-margin classifier to estimate the gradient function of the classifier directly. We also establish the consistency of the proposed wOPG estimator and demonstrate its promising finite-sample performance through both simulated and real data examples."
"210795","A Nonconvex Framework for Structured Dynamic Covariance Recovery","Katherine Tsai, Mladen Kolar, Oluwasanmi Koyejo","https://jmlr.org//papers/volume23/21-0795/21-0795.pdf","https://github.com/koyejo-lab/dynamicCov.git","We propose a flexible, yet interpretable model for high-dimensional data with time-varying second-order statistics, motivated and applied to functional neuroimaging data. Our approach implements the neuroscientific hypothesis of discrete cognitive processes by factorizing covariances into sparse spatial and smooth temporal components. Although this factorization results in parsimony and domain interpretability, the resulting estimation problem is nonconvex. We design a two-stage optimization scheme with a tailored spectral initialization, combined with iteratively refined alternating projected gradient descent. We  prove a linear convergence rate up to a nontrivial statistical error for the proposed descent scheme and establish sample complexity guarantees for the estimator. Empirical results using simulated data and brain imaging data illustrate that our approach outperforms existing baselines."
"210819","Three rates of convergence or separation via U-statistics in a dependent framework","Quentin Duchemin, Yohann De Castro, Claire Lacour","https://jmlr.org//papers/volume23/21-0819/21-0819.pdf","https://github.com/quentin-duchemin/goodness-of-fit-MC","Despite the ubiquity of U-statistics in modern Probability and Statistics, their non-asymptotic analysis in a dependent framework may have been overlooked. In a recent work, a new concentration inequality for U-statistics of order two for uniformly ergodic discrete time Markov chains has been proved. In this paper, we put this theoretical breakthrough into action by pushing further the current state of knowledge in three different active fields of research. First, we establish a new exponential inequality for the estimation of spectra of integral operators with MCMC methods. The novelty is that this result holds for kernels with positive and negative eigenvalues, which is new as far as we know. In addition, we investigate generalization performance of online algorithms working with pairwise loss functions and Markov chain samples. We provide an online-to-batch conversion result by showing how we can extract a low risk hypothesis from the sequence of hypotheses generated by any online learner. We finally give a non-asymptotic analysis of a goodness-of-fit test on the density of the stationary measure of a Markov chain. We identify some classes of alternatives over which our test based on the $L^2$ distance has a prescribed power."
"211060","abess: A Fast Best-Subset Selection Library in Python and R","Jin Zhu, Xueqin Wang, Liyuan Hu, Junhao Huang, Kangkang Jiang, Yanhang Zhang, Shiyun Lin, Junxian Zhu","https://jmlr.org//papers/volume23/21-1060/21-1060.pdf","https://github.com/abess-team/abess","We introduce a new library named abess that implements a unified framework of best-subset selection for solving diverse machine learning problems, e.g., linear regression, classification, and principal component analysis. Particularly, abess certifiably gets the optimal solution within polynomial time with high probability under the linear model. Our efficient implementation allows abess to attain the solution of best-subset selection problems as fast as or even 20x faster than existing competing variable (model) selection toolboxes. Furthermore, it supports common variants like best subset of groups selection and $\ell_2$ regularized best-subset selection. The core of the library is programmed in C++. For ease of use, a Python library is designed for convenient integration with scikit-learn, and it can be installed from the Python Package Index (PyPI). In addition, a user-friendly R library is available at the Comprehensive R Archive Network (CRAN). The source code is available at: https://github.com/abess-team/abess."
"211065","Testing Whether a Learning Procedure is Calibrated","Jon Cockayne, Matthew M. Graham, Chris J. Oates, T. J. Sullivan, Onur Teymur","https://jmlr.org//papers/volume23/21-1065/21-1065.pdf","","A learning procedure takes as input a dataset and performs inference for the parameters $\theta$ of a model that is assumed to have given rise to the dataset. Here we consider learning procedures whose output is a probability distribution, representing uncertainty about $\theta$ after seeing the dataset. Bayesian inference is a prime example of such a procedure, but one can also construct other learning procedures that return distributional output. This paper studies conditions for a learning procedure to be considered calibrated, in the sense that the true data-generating parameters are plausible as samples from its distributional output. A learning procedure whose inferences and predictions are systematically over- or under-confident will fail to be calibrated. On the other hand, a learning procedure that is calibrated need not be statistically efficient. A hypothesis-testing framework is developed in order to assess, using simulation, whether a learning procedure is calibrated. Several vignettes are presented to illustrate different aspects of the framework."
"211120","Selective Machine Learning of the Average Treatment Effect with an Invalid Instrumental Variable","Baoluo Sun, Yifan Cui, Eric Tchetgen Tchetgen","https://jmlr.org//papers/volume23/21-1120/21-1120.pdf","","Instrumental variable  methods have been widely used to identify causal effects in the presence of unmeasured confounding. A key identification condition known as the exclusion restriction states that the instrument cannot have a direct effect on the outcome which is not mediated by the exposure in view. In the health and social sciences, such an assumption is often not credible. To address this concern, we consider identification conditions of the population average treatment effect with an invalid instrumental variable which does not satisfy  the exclusion restriction, and derive the efficient influence function targeting the identifying functional under a nonparametric observed data model. We propose a novel multiply robust locally efficient estimator of the average treatment effect that is consistent in the union of multiple parametric nuisance models, as well as a multiply debiased machine learning estimator for which the nuisance parameters are estimated using generic machine learning methods, that effectively exploit various forms of linear or nonlinear structured sparsity in the nuisance parameter space. When one cannot be confident that any of these machine learners is consistent at sufficiently fast rates to ensure $\surd{n}$-consistency for the average treatment effect, we introduce new criteria for selective machine learning which leverage the multiple robustness property in order to ensure small bias. The proposed methods are illustrated through extensive simulations  and a data analysis evaluating the causal effect of 401(k) participation on savings."
"211128","Contraction rates for sparse variational approximations in Gaussian process regression","Dennis Nieman, Botond Szabo, Harry van Zanten","https://jmlr.org//papers/volume23/21-1128/21-1128.pdf","","We study the theoretical properties of a variational Bayes method in the Gaussian Process regression model. We consider the inducing variables method and derive sufficient conditions for obtaining contraction rates for the corresponding variational Bayes (VB) posterior. As examples we show that for three particular covariance kernels (Matérn, squared exponential, random series prior) the VB approach can achieve optimal, minimax contraction rates for a sufficiently large number of appropriately chosen inducing variables. The theoretical findings are demonstrated by numerical experiments."
"211146","Stochastic DCA with Variance Reduction and Applications in Machine Learning","Hoai An  Le Thi, Hoang Phuc Hau Luu, Hoai Minh Le, Tao Pham Dinh","https://jmlr.org//papers/volume23/21-1146/21-1146.pdf","","We design stochastic Difference-of-Convex-functions Algorithms (DCA) for solving a class of structured  Difference-of-Convex-functions (DC) problems. As the standard DCA requires the full information of (sub)gradients which could be expensive in large-scale settings, stochastic approaches rely upon stochastic information instead. However, stochastic estimations generate additional variance terms making stochastic algorithms unstable. Therefore, we integrate some novel variance reduction techniques including SVRG and SAGA into our design. The almost sure convergence to critical points of the proposed algorithms is established and the algorithms' complexities are analyzed. To study the efficiency of our algorithms, we apply them to three important problems in machine learning: nonnegative principal component analysis, group variable selection in multiclass logistic regression, and sparse linear regression. Numerical experiments have shown the merits of our proposed algorithms in comparison with other state-of-the-art stochastic methods for solving nonconvex large-sum problems."
"211148","Nonconvex Matrix Completion with Linearly Parameterized Factors","Ji Chen, Xiaodong Li, Zongming Ma","https://jmlr.org//papers/volume23/21-1148/21-1148.pdf","","Techniques of matrix completion aim to impute a large portion of missing entries in a data matrix through a small portion of observed ones. In practice, prior information and special structures are usually employed in order to improve the accuracy of matrix completion. In this paper, we propose a unified nonconvex optimization framework for matrix completion with linearly parameterized factors. In particular, by introducing a condition referred to as Correlated Parametric Factorization, we conduct a unified geometric analysis for the nonconvex objective by establishing uniform upper bounds for low-rank estimation resulting from any local minimizer. Perhaps surprisingly, the condition of Correlated Parametric Factorization holds for important examples including subspace-constrained matrix completion and skew-symmetric matrix completion. The effectiveness of our unified nonconvex optimization method is also empirically illustrated by extensive numerical simulations."
"211197","tntorch: Tensor Network Learning with PyTorch","Mikhail Usvyatsov, Rafael Ballester-Ripoll, Konrad Schindler","https://jmlr.org//papers/volume23/21-1197/21-1197.pdf","https://github.com/rballester/tntorch","We present tntorch, a tensor learning framework that supports multiple decompositions (including Candecomp/Parafac, Tucker, and Tensor Train) under a unified interface. With our library, the user can learn and handle low-rank tensors with automatic differentiation, seamless GPU support, and the convenience of PyTorch's API. Besides decomposition algorithms, tntorch implements differentiable tensor algebra, rank truncation, cross-approximation, batch processing, comprehensive tensor arithmetics, and more."
"211251","Ranking and Tuning Pre-trained Models: A New Paradigm for Exploiting Model Hubs","Kaichao You, Yong Liu, Ziyang Zhang, Jianmin Wang, Michael I. Jordan, Mingsheng Long","https://jmlr.org//papers/volume23/21-1251/21-1251.pdf","https://github.com/thuml/LogME","Model hubs with many pre-trained models (PTMs) have become a cornerstone of deep learning. Although built at a high cost, they remain under-exploited---practitioners usually pick one PTM from the provided model hub by popularity and then fine-tune the PTM to solve the target task. This naïve but common practice poses two obstacles to full exploitation of pre-trained model hubs: first, the PTM selection by popularity has no optimality guarantee, and second, only one PTM is used while the remaining PTMs are ignored. An alternative might be to consider all possible combinations of PTMs and extensively fine-tune each combination, but this would not only be prohibitive computationally but may also lead to statistical over-fitting. In this paper, we propose a new paradigm for exploiting model hubs that is intermediate between these extremes. The paradigm is characterized by two aspects: (1) We use an evidence maximization procedure to estimate the maximum value of label evidence given features extracted by pre-trained models. This procedure can rank all the PTMs in a model hub for various types of PTMs and tasks before fine-tuning. (2) The best ranked PTM can either be fine-tuned and deployed if we have no preference for the model's architecture or the target PTM can be tuned by the top $K$ ranked PTMs via a Bayesian procedure that we propose. This procedure, which we refer to as B-Tuning, not only improves upon specialized methods designed for tuning homogeneous PTMs, but also applies to the challenging problem of tuning heterogeneous PTMs where it yields a new level of benchmark performance."
"211262","A Unified Statistical Learning Model for Rankings and Scores with Application to Grant Panel Review","Michael Pearce, Elena A. Erosheva","https://jmlr.org//papers/volume23/21-1262/21-1262.pdf","","Rankings and scores are two common data types used by judges to express preferences and/or perceptions of quality in a collection of objects. Numerous models exist to study data of each type separately, but no unified statistical model captures both data types simultaneously without first performing data conversion. We propose the Mallows-Binomial model to close this gap, which combines a Mallows $\phi$ ranking model with Binomial score models through shared parameters that quantify object quality, a consensus ranking, and the level of consensus among judges. We propose an efficient tree-search algorithm to calculate the exact MLE of model parameters, study statistical properties of the model both analytically and through simulation, and apply our model to real data from an instance of grant panel review that collected both scores and partial rankings. Furthermore, we demonstrate how model outputs can be used to rank objects with confidence. The proposed model is shown to sensibly combine information from both scores and rankings to quantify object quality and measure consensus with appropriate levels of statistical uncertainty."
"211273","Efficient Inference for Dynamic Flexible Interactions of Neural Populations","Feng Zhou, Quyu Kong, Zhijie Deng, Jichao Kan, Yixuan Zhang, Cheng Feng, Jun Zhu","https://jmlr.org//papers/volume23/21-1273/21-1273.pdf","","The Hawkes process provides an effective statistical framework for analyzing the interactions of neural spiking activities. Although utilized in many real applications, the classic Hawkes process is incapable of modeling inhibitory interactions among neural populations. Instead, the nonlinear Hawkes process allows for modeling a more flexible influence pattern with excitatory or inhibitory interactions. This work proposes a flexible nonlinear Hawkes process variant based on sigmoid nonlinearity. To ease inference, three sets of auxiliary latent variables (Polya-Gamma variables, latent marked Poisson processes and sparsity variables) are augmented to make functional connection weights appear in a Gaussian form, which enables simple iterative algorithms with analytical updates. As a result, an efficient Gibbs sampler, expectation-maximization algorithm and mean-field approximation are derived to estimate the interactions among neural populations. Furthermore, to reconcile with time-varying neural systems, the proposed time-invariant model is extended to a dynamic version by introducing a Markov state process. Similarly, three analytical iterative inference algorithms (Gibbs sampler, EM algorithm and mean-field approximation) are derived. We compare the accuracy and efficiency of these inference algorithms on synthetic data, and further experiment on real neural recordings to demonstrate that the developed models achieve superior performance over the state-of-the-art competitors."
"21138","Multi-Agent Multi-Armed Bandits with Limited Communication","Mridul Agarwal, Vaneet Aggarwal, Kamyar Azizzadenesheli","https://jmlr.org//papers/volume23/21-138/21-138.pdf","","We consider the problem where $N$ agents collaboratively interact with an instance of a stochastic $K$ arm bandit problem for $K \gg N$. The agents aim to simultaneously minimize the cumulative regret over all the agents for a total of $T$ time steps, the number of communication rounds, and the number of bits in each communication round.  We present Limited Communication Collaboration - Upper Confidence Bound (LCC-UCB), a doubling-epoch based algorithm where each agent communicates only after the end of the epoch and shares the index of the best arm it knows. With our algorithm, LCC-UCB, each agent enjoys a regret of $\tilde{O}\left(\sqrt{({K/N}+ N)T}\right)$, communicates for $O(\log T)$ steps and broadcasts $O(\log K)$ bits in each communication step. We extend the work to sparse graphs with maximum degree $K_G$ and diameter $D$ to propose LCC-UCB-GRAPH which enjoys a regret bound of $\tilde{O}\left(D\sqrt{(K/N+ K_G)DT}\right)$. Finally, we empirically show that the LCC-UCB and the LCC-UCB-GRAPH algorithms perform well and outperform strategies that communicate through a central node."
"211413","Using Shapley Values and Variational Autoencoders to Explain Predictive Models with Dependent Mixed Features","Lars H. B. Olsen, Ingrid K. Glad, Martin Jullum, Kjersti Aas","https://jmlr.org//papers/volume23/21-1413/21-1413.pdf","https://github.com/LHBO/ShapleyValuesVAEAC","Shapley values are today extensively used as a model-agnostic explanation framework to explain complex predictive machine learning models. Shapley values have desirable theoretical properties and a sound mathematical foundation in the field of cooperative game theory. Precise Shapley value estimates for dependent data rely on accurate modeling of the dependencies between all feature combinations. In this paper, we use a variational autoencoder with arbitrary conditioning (VAEAC) to model all feature dependencies simultaneously. We demonstrate through comprehensive simulation studies that our VAEAC approach to Shapley value estimation outperforms the state-of-the-art methods for a wide range of settings for both continuous and mixed dependent features. For high-dimensional settings, our VAEAC approach with a non-uniform masking scheme significantly outperforms competing methods. Finally, we apply our VAEAC approach to estimate Shapley value explanations for the Abalone data set from the UCI Machine Learning Repository."
"211489","When is the Convergence Time of Langevin Algorithms Dimension Independent? A Composite Optimization Viewpoint","Yoav Freund, Yi-An Ma, Tong Zhang","https://jmlr.org//papers/volume23/21-1489/21-1489.pdf","","There has been a surge of works bridging MCMC sampling and optimization, with a specific focus on translating non-asymptotic convergence guarantees for optimization problems into the analysis of Langevin algorithms in MCMC sampling. A conspicuous distinction between the convergence analysis of Langevin sampling and that of optimization is that all known convergence rates for Langevin algorithms depend on the dimensionality of the problem, whereas the convergence rates for optimization are dimension-free for convex problems. Whether a dimension independent convergence rate can be achieved by the Langevin algorithm is thus a long-standing open problem. This paper provides an affirmative answer to this problem for the case of either Lipschitz or smooth convex functions with normal priors. By viewing Langevin algorithm as composite optimization, we develop a new analysis technique that leads to dimension independent convergence rates for such problems."
"211521","Learning Operators with Coupled Attention","Georgios Kissas, Jacob H. Seidman, Leonardo Ferreira Guilhoto, Victor M. Preciado, George J. Pappas, Paris Perdikaris","https://jmlr.org//papers/volume23/21-1521/21-1521.pdf","","Supervised operator learning is an emerging machine learning paradigm with applications to modeling the evolution of spatio-temporal dynamical systems and approximating general black-box relationships between functional data. We propose a novel operator learning method, LOCA (Learning Operators with Coupled Attention), motivated from the recent success of the attention mechanism. In our architecture, the input functions are mapped to a finite set of features which are then averaged with attention weights that depend on the output query locations. By coupling these attention weights together with an integral transform, LOCA is able to explicitly learn correlations in the target output functions, enabling us to approximate nonlinear operators even when the number of output function measurements in the training set is very small. Our formulation is accompanied by rigorous approximation theoretic guarantees on the universal expressiveness of the proposed model. Empirically, we evaluate the performance of LOCA on several operator learning scenarios involving systems governed by ordinary and partial differential equations, as well as a black-box climate prediction problem. Through these scenarios we demonstrate state of the art accuracy, robustness with respect to noisy input data, and a consistently small spread of errors over testing data sets, even for out-of-distribution prediction tasks."
"21493","Kernel Partial Correlation Coefficient --- a Measure of Conditional Dependence","Zhen Huang, Nabarun Deb, Bodhisattva Sen","https://jmlr.org//papers/volume23/21-493/21-493.pdf","","We propose and study a class of simple, nonparametric, yet interpretable measures of conditional dependence, which we call kernel partial correlation (KPC) coefficient, between two random variables $Y$ and $Z$ given a third variable $X$, all taking values in general topological spaces. The population KPC captures the strength of conditional dependence and it is 0 if and only if $Y$ is conditionally independent of $Z$ given $X$, and 1 if and only if $Y$ is a measurable function of $Z$ and $X$. We describe two consistent methods of estimating KPC. Our first method is based on the general framework of geometric graphs, including $K$-nearest neighbor graphs and minimum spanning trees. A sub-class of these estimators can be computed in near linear time and converges at a rate that adapts automatically to the intrinsic dimensionality of the underlying distributions. The second strategy involves direct estimation of conditional mean embeddings in the RKHS framework. Using these empirical measures we develop a fully model-free variable selection algorithm, and formally prove the consistency of the procedure under suitable sparsity assumptions. Extensive simulation and real-data examples illustrate the superior performance of our methods compared to existing procedures."
"220369","Smooth Robust Tensor Completion for Background/Foreground Separation with Missing Pixels: Novel Algorithm with Convergence Guarantee","Bo Shen, Weijun Xie, Zhenyu (James) Kong","https://jmlr.org//papers/volume23/22-0369/22-0369.pdf","https://github.com/BoShen0/Smooth-Robust-Tensor-Completion-for-Background-Foreground-Separation-with-Missing-Pixels","Robust PCA (RPCA) and its tensor extension, namely, Robust Tensor PCA (RTPCA), provide an effective framework for background/foreground separation by decomposing the data into low-rank and sparse components, which contain the background and the foreground (moving objects), respectively. However, in real-world applications, the presence of missing pixels is a very common and challenging issue due to errors in the acquisition process or manufacturer defects. RPCA and RTPCA are not able to recover the background and foreground simultaneously with missing pixels. This study aims to address the problem of background/foreground separation with missing pixels by combining video recovery and background/foreground separation into a single framework. To achieve this goal, a smooth robust tensor completion (SRTC) model is proposed to recover the data and decompose it into the static background and smooth foreground, respectively. An efficient algorithm based on tensor proximal alternating minimization (tenPAM) is implemented to solve the proposed model with a global convergence guarantee under very mild conditions. Extensive experiments on actual data demonstrate that the proposed method significantly outperforms the state-of-the-art approaches for background/foreground separation with missing pixels."
"220433","Learning Green's functions associated with time-dependent partial differential equations","Nicolas Boullé, Seick Kim, Tianyi Shi, Alex Townsend","https://jmlr.org//papers/volume23/22-0433/22-0433.pdf","","Neural operators are a popular technique in scientific machine learning to learn a mathematical model of the behavior of unknown physical systems from data. Neural operators are especially useful to learn solution operators associated with partial differential equations (PDEs) from pairs of forcing functions and solutions when numerical solvers are not available or the underlying physics is poorly understood. In this work, we attempt to provide theoretical foundations to understand the amount of training data needed to learn time-dependent PDEs. Given input-output pairs from a parabolic PDE in any spatial dimension $n\geq 1$, we derive the first theoretically rigorous scheme for learning the associated solution operator, which takes the form of a convolution with a Green's function $G$. Until now, rigorously learning Green's functions associated with time-dependent PDEs has been a major challenge in the field of scientific machine learning because $G$ may not be square-integrable when $n>1$, and time-dependent PDEs have transient dynamics. By combining the hierarchical low-rank structure of $G$ together with randomized numerical linear algebra, we construct an approximant to $G$ that achieves a relative error of $\smash{\mathcal{O}(\Gamma_\epsilon^{-1/2}\epsilon)}$ in the $L^1$-norm with high probability by using at most $\smash{\mathcal{O}(\epsilon^{-\frac{n+2}{2}}\log(1/\epsilon))}$ input-output training pairs, where $\Gamma_\epsilon$ is a measure of the quality of the training dataset for learning $G$, and $\epsilon>0$ is sufficiently small."
"19529","Structural Agnostic Modeling: Adversarial Learning of Causal Graphs","Diviyan Kalainathan, Olivier Goudet, Isabelle Guyon, David Lopez-Paz, Michèle Sebag","https://jmlr.org//papers/volume23/19-529/19-529.pdf","https://github.com/FenTechSolutions/CausalDiscoveryToolbox","A new causal discovery method, Structural Agnostic Modeling (SAM), is presented in this paper. Leveraging both conditional independencies and distributional asymmetries, SAM aims to find the underlying causal structure from observational data. The approach is based on a game between different players estimating each variable distribution conditionally to the others as a neural net, and an adversary aimed at discriminating the generated data against the original data. A learning criterion combining distribution estimation, sparsity and acyclicity constraints is used to enforce the  optimization of the graph structure and parameters through stochastic gradient descent. SAM is extensively experimentally validated on synthetic and real data."
"19939","Robust Distributed Accelerated Stochastic Gradient Methods for Multi-Agent Networks","Alireza Fallah, Mert Gürbüzbalaban, Asuman Ozdaglar, Umut Şimşekli, Lingjiong Zhu","https://jmlr.org//papers/volume23/19-939/19-939.pdf","","We study distributed stochastic gradient (D-SG) method and its accelerated variant (D-ASG) for solving decentralized strongly convex stochastic optimization problems where the objective function is distributed over several computational units, lying on a fixed but arbitrary connected communication graph, subject to local communication constraints where noisy estimates of the gradients are available. We develop a framework which allows to choose the stepsize and the momentum parameters of these algorithms in a way to optimize performance by systematically trading off the bias, variance and dependence to network effects. When gradients do not contain noise, we also prove that D-ASG can achieve acceleration, in the sense that it requires $\mathcal{O}(\sqrt{\kappa} \log(1/\varepsilon))$ gradient evaluations and $\mathcal{O}(\sqrt{\kappa} \log(1/\varepsilon))$ communications to converge to the same fixed point with the non-accelerated variant where $\kappa$ is the condition number and $\varepsilon$ is the target accuracy.  For quadratic functions, we also provide finer performance bounds that are tight with respect to bias and variance terms. Finally, we study a multistage version of D-ASG with parameters carefully varied over stages to ensure exact convergence to the optimal solution. It achieves optimal and accelerated $\mathcal{O}(-k/\sqrt{\kappa})$ linear decay in the bias term as well as optimal $\mathcal{O}(\sigma^2/k)$ in the variance term. We illustrate through numerical experiments that our approach results in accelerated practical algorithms that are robust to gradient noise and that can outperform existing methods."
"201038","Behavior Priors for Efficient Reinforcement Learning","Dhruva Tirumala, Alexandre Galashov, Hyeonwoo Noh, Leonard Hasenclever, Razvan Pascanu, Jonathan Schwarz, Guillaume Desjardins, Wojciech Marian Czarnecki, Arun Ahuja, Yee Whye Teh, Nicolas Heess","https://jmlr.org//papers/volume23/20-1038/20-1038.pdf","","As we deploy reinforcement learning agents to solve increasingly challenging problems, methods that allow us to inject prior knowledge about the structure of the world and effective solution strategies become increasingly important. In this work we consider how information and architectural constraints can be combined with ideas from the probabilistic modeling literature to learn behavior priors that capture the common movement and interaction patterns that are shared across a set of related tasks or contexts. For example, the day-to-day behavior of humans comprises distinctive locomotion and manipulation patterns that recur across many different situations and goals. We discuss how such behavior patterns can be captured using probabilistic trajectory models and how these can be integrated effectively into reinforcement learning schemes, e.g. to facilitate multi-task and transfer learning. We then extend these ideas to latent variable models and consider a formulation to learn hierarchical priors that capture different aspects of the behavior in reusable modules. We discuss how such latent variable formulations connect to related work on hierarchical reinforcement learning (HRL) and mutual information and curiosity based objectives, thereby offering an alternative perspective on existing ideas. We demonstrate the effectiveness of our framework by applying it to a range of simulated continuous control domains, videos of which can be found at the following url: https://sites.google.com/view/behavior-priors."
"201130","Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for Strongly Convex Decentralized Optimization","Huan Li, Zhouchen Lin, Yongchun Fang","https://jmlr.org//papers/volume23/20-1130/20-1130.pdf","","We study stochastic decentralized optimization for the problem of training machine learning models with large-scale distributed data. We extend the widely used EXTRA and DIGing methods with variance reduction (VR), and propose two methods: VR-EXTRA and VR-DIGing. The proposed VR-EXTRA requires the time of $O((\kappa_s+n)\log\frac{1}{\epsilon})$ stochastic gradient evaluations and $O((\kappa_b+\kappa_c)\log\frac{1}{\epsilon})$ communication rounds to reach precision $\epsilon$, which are the best complexities among the non-accelerated gradient-type methods, where $\kappa_s$ and $\kappa_b$ are the stochastic condition number and batch condition number for strongly convex and smooth problems, respectively, $\kappa_c$ is the condition number of the communication network, and $n$ is the sample size on each distributed node. The proposed VR-DIGing has a slightly higher communication cost of $O((\kappa_b+\kappa_c^2)\log\frac{1}{\epsilon})$. Our stochastic gradient computation complexities are the same as those of single-machine VR methods, such as SAG, SAGA, and SVRG, and our communication complexities remain the same as those of EXTRA and DIGing, respectively. To further speed up the convergence, we also propose the accelerated VR-EXTRA and VR-DIGing with both the optimal $O((\sqrt{n\kappa_s}+n)\log\frac{1}{\epsilon})$ stochastic gradient computation complexity and $O(\sqrt{\kappa_b\kappa_c}\log\frac{1}{\epsilon})$ communication complexity. Our stochastic gradient computation complexity is also the same as that of single-machine accelerated VR methods, such as Katyusha, and our communication complexity remains the same as that of accelerated full batch decentralized methods, such as MSDA. To the best of our knowledge, our accelerated methods are the first to achieve both the optimal stochastic gradient computation complexity and communication complexity in the class of gradient-type methods."
"201179","On Acceleration for Convex Composite Minimization with Noise-Corrupted Gradients and Approximate Proximal Mapping","Qiang Zhou, Sinno Jialin Pan","https://jmlr.org//papers/volume23/20-1179/20-1179.pdf","","The accelerated proximal methods (APM) have become one of the most important optimization tools for large-scale convex composite minimization problems, due to their wide range of applications and the optimal convergence rate in first-order algorithms. However, most existing theoretical results of APM are obtained by assuming that the gradient oracle is exact and the proximal mapping must be exactly solved, which may not hold in practice. This work presents a theoretical study of APM that allows the use of an inexact gradient oracle and an approximate proximal mapping. Specifically, we analyze inexact APM by improving the approximate duality gap technique (ADGT) which was originally designed for convergence analysis for first-order methods with both exact gradient oracle and proximal mapping. Our approach has several advantages: 1) we provide a unified convergence analysis that allows both inexact gradient oracle and approximate proximal mapping; 2) our proof is generic and naturally recovers the convergence rates of both accelerated and non-accelerated proximal methods, on top of which the advantages and the disadvantages of acceleration can be easily derived; 3) we derive the same convergence bound as previous methods in terms of inexact gradient oracle, but a tighter convergence bound in terms of approximate proximal mapping."
"201264","Getting Better from Worse:  Augmented Bagging and A Cautionary Tale of Variable Importance","Lucas Mentch, Siyu Zhou","https://jmlr.org//papers/volume23/20-1264/20-1264.pdf","","As the size, complexity, and availability of data continues to grow, scientists are increasingly relying upon black-box learning algorithms that can often provide accurate predictions with minimal a priori model specifications.  Tools like random forests have an established track record of off-the-shelf success and even offer various strategies for analyzing the underlying relationships among variables. Here, motivated by recent insights into random forest behavior, we introduce the simple idea of augmented bagging (AugBagg), a procedure that operates in an identical fashion to classical bagging and random forests, but which operates on a larger, augmented space containing additional randomly generated noise features. Surprisingly, we demonstrate that this simple act of including extra noise variables in the model can lead to dramatic improvements in out-of-sample predictive accuracy, sometimes outperforming even an optimally tuned traditional random forest.  As a result, intuitive notions of variable importance based on improved model accuracy may be deeply flawed, as even purely random noise can routinely register as statistically significant.  Numerous demonstrations on both real and synthetic data are provided along with a proposed solution."
"201304","Two-Sample Testing on Ranked Preference Data and the Role of Modeling Assumptions","Charvi Rastogi, Sivaraman Balakrishnan, Nihar B. Shah, Aarti Singh","https://jmlr.org//papers/volume23/20-1304/20-1304.pdf","","A number of applications require two-sample testing on ranked preference data. For instance, in crowdsourcing, there is a long-standing question of whether pairwise-comparison data provided by people is distributed identically to ratings-converted-to-comparisons. Other applications include sports data analysis and peer grading. In this paper, we design twosample tests for pairwise-comparison data and ranking data. For our two-sample test for pairwise-comparison data, we establish an upper bound on the sample complexity required to correctly test whether the distributions of the two sets of samples are identical. Our test requires essentially no assumptions on the distributions. We then prove complementary lower bounds showing that our results are tight (in the minimax sense) up to constant factors. We investigate the role of modeling assumptions by proving lower bounds for a range of pairwise-comparison models (WST, MST, SST, parameter-based such as BTL and Thurstone). We also provide tests and associated sample complexity bounds for partial (or total) ranking data. Furthermore, we empirically evaluate our results via extensive simulations as well as three real-world data sets consisting of pairwise-comparisons and rankings. By applying our two-sample test on real-world pairwise-comparison data, we conclude that ratings and rankings provided by people are indeed distributed differently."
"201335","Underspecification Presents Challenges for Credibility in Modern Machine Learning","Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, Rajiv Raman, Kim Ramasamy, Rory Sayres, Jessica Schrouff, Martin Seneviratne, Shannon Sequeira, Harini Suresh, Victor Veitch, Max Vladymyrov, Xuezhi Wang, Kellie Webster, Steve Yadlowsky, Taedong Yun, Xiaohua Zhai, D. Sculley","https://jmlr.org//papers/volume23/20-1335/20-1335.pdf","","Machine learning (ML) systems often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification in ML pipelines as a key reason for these failures. An ML pipeline is the full procedure followed to train and validate a predictor. Such a pipeline is underspecified when it can return many distinct predictors with equivalently strong test performance. Underspecification is common in modern ML pipelines that primarily validate predictors on held-out data that follow the same distribution as the training data. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. 
We provide evidence that underspecification has substantive implications for practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain."
"201365","Gaussian Process Parameter Estimation Using Mini-batch Stochastic Gradient Descent: Convergence Guarantees and Empirical Benefits","Hao Chen, Lili Zheng, Raed Al Kontar, Garvesh Raskutti","https://jmlr.org//papers/volume23/20-1365/20-1365.pdf","","Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for large-scale machine learning problems with independent samples due to their generalization performance and intrinsic computational advantage. However, the fact that the stochastic gradient is a biased estimator of the full gradient with correlated samples has led to the lack of theoretical understanding of how SGD behaves under correlated settings and hindered its use in such cases. In this paper, we focus on hyperparmeter estimation for the Gaussian process (GP) and take a step forward towards breaking the barrier by proving minibatch SGD converges to a critical point of the full log-likelihood loss function, and recovers model hyperparameters with rate $O(\frac{1}{K})$ for $K$ iterations, up to a statistical error term depending on the minibatch size. Our theoretical guarantees hold provided that the kernel functions exhibit exponential or polynomial eigendecay which is satisfied by a wide range of kernels commonly used in GPs. Numerical studies on both simulated and real datasets demonstrate that minibatch SGD has better generalization over state-of-the-art GP methods while reducing the computational burden and opening a new, previously unexplored, data size regime for GPs."
"201407","Asymptotic Study of Stochastic Adaptive Algorithms in Non-convex Landscape","Sébastien Gadat, Ioana Gavra","https://jmlr.org//papers/volume23/20-1407/20-1407.pdf","","This paper studies some asymptotic properties of adaptive algorithms widely used in optimization and machine learning, and among them Adagrad and Rmsprop, which are involved in most of the blackbox deep learning algorithms. Our setup is the non-convex landscape optimization point of view, we consider a one time scale parametrization and the situation where these algorithms may or may not be used with mini-batches. We adopt the point of view of stochastic algorithms and establish the almost sure convergence of these methods when using a decreasing step-size towards the set of critical points of the target function. With a mild extra assumption on the noise, we also obtain the convergence towards the set of minimizers of the function. Along our study, we also obtain a ""convergence rate” of the methods, namely a bound on the expected value of the gradient of the cost function along a finite number of iterations."
"201438","Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration","Congliang Chen, Li Shen, Fangyu Zou, Wei Liu","https://jmlr.org//papers/volume23/20-1438/20-1438.pdf","","Adam is one of the most influential adaptive stochastic algorithms for training deep neural networks, which has been pointed out to be divergent even in the simple convex setting via a few simple counterexamples.  Many attempts, such as decreasing an adaptive learning rate, adopting a big batch size, incorporating a temporal decorrelation technique, seeking an analogous surrogate, etc., have been tried to promote Adam-type algorithms to converge.  In contrast with existing approaches, we introduce an alternative easy-to-check sufficient condition, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam for solving large-scale non-convex stochastic optimization.  This observation, coupled with this sufficient condition, gives much deeper interpretations on the divergence of Adam. On the other hand, in practice, mini-Adam and distributed-Adam are widely used without any theoretical guarantee. We further give an analysis on how the batch size or the number of nodes in the distributed system affects the convergence of Adam, which theoretically shows that mini-batch and distributed Adam can be linearly accelerated by using a larger mini-batch size or a larger number of nodes. At last, we apply the generic Adam and mini-batch Adam with the sufficient condition for solving the counterexample and training several neural networks on various real-world datasets. Experimental results are exactly in accord with our theoretical analysis."
"201446","Multi-Task Dynamical Systems","Alex Bird, Christopher K. I. Williams, Christopher Hawthorne","https://jmlr.org//papers/volume23/20-1446/20-1446.pdf","","Time series datasets are often composed of a variety of sequences from the same domain, but from different entities, such as individuals, products, or organizations. We are interested in how time series models can be specialized to individual sequences (capturing the specific characteristics) while still retaining statistical power by sharing commonalities across the sequences. This paper describes the multi-task dynamical system (MTDS); a general methodology for extending multi-task learning (MTL) to time series models. Our approach endows dynamical systems with a set of hierarchical latent variables which can modulate all model parameters. To our knowledge, this is a novel development of MTL, and applies to time series both with and without control inputs. We apply the MTDS to motion-capture data of people walking in various styles using a multi-task recurrent neural network (RNN), and to patient drug-response data using a multi-task pharmacodynamic model."
"201469","Representation Learning for Maximization of MI, Nonlinear ICA and Nonlinear Subspaces with Robust Density Ratio Estimation","Hiroaki Sasaki, Takashi Takenouchi","https://jmlr.org//papers/volume23/20-1469/20-1469.pdf","","Unsupervised representation learning is one of the most important problems in machine learning. A recent promising approach is contrastive learning: A feature representation of data is learned by solving a pseudo classification problem where class labels are automatically generated from unlabelled data. However, it is not straightforward to understand what representation contrastive learning yields through the classification problem. In addition, most of practical methods for contrastive learning are based on the maximum likelihood estimation, which is often vulnerable to the contamination by outliers. In order to promote the understanding to contrastive learning, this paper first theoretically shows a connection to maximization of mutual information (MI). Our result indicates that density ratio estimation is necessary and sufficient for maximization of MI under some conditions. Since popular objective functions for classification can be regarded as estimating density ratios, contrastive learning related to density ratio estimation can be interpreted as maximizing MI. Next, in terms of density ratio estimation, we establish new recovery conditions for the latent source components in nonlinear independent component analysis (ICA). In contrast with existing work, the established conditions include a novel insight for the dimensionality of data, which is clearly supported by numerical experiments. Furthermore, inspired by nonlinear ICA, we propose a novel framework to estimate a nonlinear subspace for lower-dimensional latent source components, and some theoretical conditions for the subspace estimation are established with density ratio estimation. 
Motivated by the theoretical results, we propose a practical method through outlier-robust density ratio estimation, which can be seen as performing maximization of MI, nonlinear ICA or nonlinear subspace estimation.  Moreover, a sample-efficient nonlinear ICA method is also proposed based on a variational lower-bound of MI. Then, we theoretically investigate outlier-robustness of the proposed methods. Finally, we numerically demonstrate usefulness of the proposed methods in nonlinear ICA and through application to a downstream task for linear classification."
"20322","Gaussian Process Boosting","Fabio Sigrist","https://jmlr.org//papers/volume23/20-322/20-322.pdf","https://github.com/fabsig/GPBoost","We introduce a novel way to combine boosting with Gaussian process and mixed effects models. This allows for relaxing, first, the zero or linearity assumption for the prior mean function in Gaussian process and grouped random effects models in a flexible non-parametric way and, second, the independence assumption made in most boosting algorithms. The former is advantageous for prediction accuracy and for avoiding model misspecifications. The latter is important for efficient learning of the fixed effects predictor function and for obtaining probabilistic predictions. Our proposed algorithm is also a novel solution for handling high-cardinality categorical variables in tree-boosting. In addition, we present an extension that scales to large data using a Vecchia approximation for the Gaussian process model relying on novel results for covariance parameter inference. We obtain increased prediction accuracy compared to existing approaches on multiple simulated and real-world data sets."
"20527","An Efficient Sampling Algorithm for Non-smooth Composite Potentials","Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright, Peter L. Bartlett","https://jmlr.org//papers/volume23/20-527/20-527.pdf","","We consider the problem of sampling from a density of the form $p(x) \propto \exp(-f(x)- g(x))$, where $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is a smooth function and $g: \mathbb{R}^d \rightarrow \mathbb{R}$ is a convex and Lipschitz function. We propose a new algorithm based on the Metropolis--Hastings framework. Under certain isoperimetric inequalities on the target density, we prove that the algorithm mixes to within total variation (TV) distance $\varepsilon$ of the target density in at most $O(d \log (d/\varepsilon))$ iterations. This guarantee extends previous results on sampling from distributions with smooth log densities ($g = 0$) to the more general composite non-smooth case, with the same mixing time up to a multiple of the condition number.  Our method is based on a novel proximal-based proposal distribution that can be efficiently computed for a large class of non-smooth functions $g$. Simulation results on posterior sampling problems that arise from the Bayesian Lasso show empirical advantage over previous proposal distributions."
"20643","Change point localization in dependent dynamic nonparametric random dot product graphs","Oscar Hernan Madrid Padilla, Yi Yu, Carey E. Priebe","https://jmlr.org//papers/volume23/20-643/20-643.pdf","","In this paper, we study the offline change point localization problem in a sequence of dependent nonparametric random dot product graphs.  To be specific, assume that at every time point, a network is generated from a nonparametric random dot product graph model (see e.g. Athreya et al., 2018), where the latent positions are generated from unknown underlying distributions.  The underlying distributions are piecewise constant in time and change at unknown locations, called change points.  Most importantly, we allow for dependence among networks generated between two consecutive change points.  This setting incorporates edge-dependence within networks and temporal dependence between networks, which is the most flexible setting in the published literature. To accomplish the task of consistently localizing change points, we propose a novel change point detection algorithm, consisting of two steps.  First, we estimate the latent positions of the random dot product model, our theoretical result being a refined version of the state-of-the-art results, allowing the dimension of the latent positions to diverge.  Subsequently, we construct a nonparametric version of the CUSUM statistic (e.g. Page, 1954; Padilla et al., 2019a) that allows for temporal dependence.  Consistent localization is proved theoretically and supported by extensive numerical experiments, which illustrate state-of-the-art performance.  We also provide in depth discussion of possible extensions to give more understanding and insights."
"20655","Bounding the Error of Discretized Langevin Algorithms for Non-Strongly Log-Concave Targets","Arnak S. Dalalyan, Avetik Karagulyan, Lionel Riou-Durand","https://jmlr.org//papers/volume23/20-655/20-655.pdf","","In this paper, we provide non-asymptotic upper bounds on the error of sampling from a target density over  $\mathbb{R}^p$ using three schemes of discretized Langevin diffusions. The first scheme is the Langevin Monte Carlo (LMC) algorithm, the Euler discretization of the Langevin diffusion. The second and the third schemes are, respectively, the kinetic Langevin Monte Carlo (KLMC) for differentiable potentials and the kinetic Langevin Monte Carlo for twice-differentiable potentials (KLMC2). The main focus is on the target densities that are smooth and log-concave on $\mathbb{R}^p$, but not necessarily strongly log-concave. Bounds on the computational complexity are obtained under two types of smoothness assumption: the potential has a Lipschitz-continuous gradient and the potential has a Lipschitz-continuous Hessian matrix. The error of sampling is measured by Wasserstein-$q$ distances. We advocate for the use of a new dimension-adapted scaling in the definition of the computational complexity, when Wasserstein-$q$ distances are considered. The obtained results show that the number of iterations to achieve a scaled-error smaller than a prescribed value depends only polynomially in the dimension."
"20931","KoPA: Automated Kronecker Product Approximation","Chencheng Cai, Rong Chen, Han Xiao","https://jmlr.org//papers/volume23/20-931/20-931.pdf","","We consider the problem of matrix approximation and denoising induced by the Kronecker product decomposition. Specifically, we propose to approximate a given matrix by the sum of a few Kronecker products of matrices, which we refer to as the Kronecker product approximation (KoPA). Because the Kronecker product is an extensions of the outer product from vectors to matrices, KoPA extends the low rank matrix approximation, and includes it as a special case. Comparing with the latter, KoPA also offers a greater flexibility, since it allows the user to choose the configuration, which are the dimensions of the two smaller matrices forming the Kronecker product. On the other hand, the configuration to be used is usually unknown, and needs to be determined from the data in order to achieve the optimal balance between accuracy and parsimony. We propose to use extended information criteria to select the configuration. Under the paradigm of high dimensional analysis, we show that the proposed procedure is able to select the true configuration with probability tending to one, under suitable conditions on the signal-to-noise ratio. We demonstrate the superiority of KoPA over the low rank approximations through numerical studies, and several benchmark image examples."
"20963","Nonparametric Principal Subspace Regression","Yang Zhou, Mark Koudstaal, Dengdeng Yu, Dehan Kong, Fang Yao","https://jmlr.org//papers/volume23/20-963/20-963.pdf","","In scientific applications, multivariate observations often come in tandem with temporal or spatial covariates, with which the underlying signals vary smoothly. The standard approaches such as principal component analysis and factor analysis neglect the smoothness of the data, while multivariate linear or nonparametric regression fails to leverage the correlation information among multivariate response variables. We propose a novel approach named nonparametric principal subspace regression to overcome these issues. By decoupling the model discrepancy, a simple two-step estimation procedure is introduced, which takes advantage of the low-rank approximation while keeping smooth dynamics. The theoretical property of the proposed procedure is established under an increasing-dimension framework. We demonstrate the favorable  performance of our method in comparison with its counterpart, the conventional nonparametric regression, from both theoretical and numerical perspectives."
"20965","A Wasserstein Distance Approach for Concentration of Empirical Risk Estimates","Prashanth L.A., Sanjay P. Bhat","https://jmlr.org//papers/volume23/20-965/20-965.pdf","","This paper presents a unified approach based on Wasserstein distance to derive concentration bounds for empirical estimates for two  broad classes of risk measures defined in the paper. The classes of risk measures introduced include as special cases  well known risk measures from the finance literature such as conditional value at risk (CVaR), optimized certainty equivalent risk, spectral risk measures, utility-based shortfall risk, cumulative prospect theory (CPT) value, rank dependent expected utility and distorted risk measures. Two estimation schemes are considered, one for each class of risk measures. One estimation scheme involves applying the risk measure to the empirical distribution function formed from a collection of i.i.d. samples of the random variable (r.v.), while the second scheme involves applying the same procedure to a truncated sample. The bounds provided apply to three popular  classes of distributions, namely sub-Gaussian, sub-exponential and heavy-tailed distributions. The bounds are derived by first relating the estimation error to the Wasserstein distance between the true and empirical distributions, and then using recent concentration bounds for the latter. Previous concentration bounds are available only for specific risk measures such as CVaR and CPT-value. The bounds derived in this paper are shown to either match or improve upon previous bounds in cases where they are available. The usefulness of the bounds is illustrated through an algorithm and the corresponding regret bound for a stochastic bandit problem involving a general risk measure from each of the two classes introduced in the paper."
"210028","Simple and Optimal Stochastic Gradient Methods for Nonsmooth Nonconvex Optimization","Zhize Li, Jian Li","https://jmlr.org//papers/volume23/21-0028/21-0028.pdf","","We propose and analyze several stochastic gradient algorithms for finding stationary points or local minimum in nonconvex, possibly with nonsmooth regularizer, finite-sum and online optimization problems. First, we propose a simple proximal stochastic gradient algorithm based on variance reduction called ProxSVRG+.  We provide a clean and tight analysis of ProxSVRG+, which shows that it outperforms the deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes, hence solves an open problem proposed in Reddi et al. 2016. Also, ProxSVRG+ uses much less proximal oracle calls than ProxSVRG (Reddi et al. 2016) and extends to the online setting by avoiding full gradient computations. Then, we further propose an optimal algorithm, called SSRGD, based on ARAH (Nguyen et al. 2017) and show that SSRGD further improves the gradient complexity of ProxSVRG+ and achieves the the optimal upper bound, matching the known lower bound. Moreover, we show that both ProxSVRG+ and SSRGD enjoy automatic adaptation with local structure of the objective function such as the Polyak-Lojasiewicz (PL) condition for nonconvex functions in the finite-sum case, i.e., we prove that both of them can automatically switch to faster global linear convergence without any restart performed in prior work. Finally, we focus on the more challenging problem of finding an $(\epsilon, \delta)$-local minimum instead of just finding an $\epsilon$-approximate (first-order) stationary point  (which may be some bad unstable saddle points). We show that SSRGD can find an $(\epsilon, \delta)$-local minimum by simply adding some random perturbations. Our algorithm is almost as simple as its counterpart for finding stationary points, and achieves similar optimal rates."
"210053","MALTS: Matching After Learning to Stretch","Harsh Parikh, Cynthia Rudin, Alexander Volfovsky","https://jmlr.org//papers/volume23/21-0053/21-0053.pdf","https://github.com/almost-matching-exactly/MALTS","We introduce a flexible framework that produces high-quality almost-exact matches for causal inference. Most prior work in matching uses ad-hoc distance metrics, often leading to poor quality matches, particularly when there are irrelevant covariates. In this work, we learn an interpretable distance metric for matching, which leads to substantially higher quality matches. The learned distance metric stretches the covariate space according to each covariate's contribution to outcome prediction: this stretching means that mismatches on important covariates carry a larger penalty than mismatches on irrelevant covariates. Our ability to learn flexible distance metrics leads to matches that are interpretable and useful for the estimation of conditional average treatment effects."
"210080","Weakly Supervised Disentangled Generative Causal Representation Learning","Xinwei Shen, Furui Liu, Hanze Dong, Qing Lian, Zhitang Chen, Tong Zhang","https://jmlr.org//papers/volume23/21-0080/21-0080.pdf","https://github.com/xwshen51/DEAR","This paper proposes a Disentangled gEnerative cAusal Representation (DEAR) learning method under appropriate supervised information. Unlike existing disentanglement methods that enforce independence of the latent variables, we consider the general case where the underlying factors of interests can be causally related. We show that previous methods with independent priors fail to disentangle causally related factors even under supervision. Motivated by this finding, we propose a new disentangled learning method called DEAR that enables causal controllable generation and causal representation learning. The key ingredient of this new formulation is to use a structural causal model (SCM) as the prior distribution for a bidirectional generative model. The prior is then trained jointly with a generator and an encoder using a suitable GAN algorithm incorporated with supervised information on the ground-truth factors and their underlying causal structure. We provide theoretical justification on the identifiability and asymptotic convergence of the proposed method. We conduct extensive experiments on both synthesized and real data sets to demonstrate the effectiveness of DEAR in causal controllable generation, and the benefits of the learned representations for downstream tasks in terms of sample efficiency and distributional robustness."
"210102","Bayesian Covariate-Dependent Gaussian Graphical Models with Varying Structure","Yang Ni, Francesco C. Stingo, Veerabhadran Baladandayuthapani","https://jmlr.org//papers/volume23/21-0102/21-0102.pdf","","We introduce Bayesian Gaussian graphical models with covariates (GGMx), a class of  multivariate Gaussian distributions with covariate-dependent sparse precision matrix.  We propose a general construction of a functional mapping from the covariate space to the cone of sparse positive definite matrices, which encompasses many existing graphical models for heterogeneous settings. Our methodology is based on a novel mixture prior for precision matrices with a non-local component that admits attractive theoretical and empirical properties. The flexible formulation of GGMx allows both the strength and the sparsity pattern of the precision matrix (hence the graph structure) change with the covariates. Posterior inference is carried out with a carefully designed Markov chain Monte Carlo algorithm, which ensures the positive definiteness of sparse precision matrices  at any given  covariates' values.  Extensive simulations and a case study in cancer genomics demonstrate the utility of the proposed model."
"210105","Tree-based Node Aggregation in Sparse Graphical Models","Ines Wilms, Jacob Bien","https://jmlr.org//papers/volume23/21-0105/21-0105.pdf","https://github.com/ineswilms/taglasso","High-dimensional graphical models are often estimated using regularization that is aimed at reducing the number of edges in a network. In this work, we show how even simpler networks can be produced by aggregating the nodes of the graphical model.  We develop a new convex regularized method, called the  tree-aggregated graphical lasso or tag-lasso, that estimates graphical models that are both edge-sparse and node-aggregated.  The aggregation is performed in a data-driven fashion by leveraging side information in the form of a tree that encodes node similarity and facilitates the interpretation of the resulting aggregated nodes. We provide an efficient implementation of the tag-lasso by using the locally adaptive alternating direction method of multipliers and illustrate our proposal's practical advantages in simulation and in applications in finance and biology."
"210152","Mitigating the Effects of Non-Identifiability on Inference for Bayesian Neural Networks with Latent Variables","Yaniv Yacoby, Weiwei Pan, Finale Doshi-Velez","https://jmlr.org//papers/volume23/21-0152/21-0152.pdf","","Bayesian Neural Networks with Latent Variables (BNN+LVs) capture predictive uncertainty by explicitly modeling model uncertainty (via priors on network weights) and environmental stochasticity (via a latent input noise variable). In this work, we first show that BNN+LV suffers from a serious form of non-identifiability: explanatory power can be transferred between the model parameters and latent variables while fitting the data equally well. We demonstrate that as a result, in the limit of infinite data, the posterior mode over the network weights and latent variables is asymptotically biased away from the ground-truth. Due to this asymptotic bias, traditional inference methods may in practice yield parameters that generalize poorly and misestimate uncertainty. Next, we develop a novel inference procedure that explicitly mitigates the effects of likelihood non-identifiability during training and yields high-quality predictions as well as uncertainty estimates. We demonstrate that our inference method improves upon benchmark methods across a range of synthetic and real data-sets."
"210200","Mappings for Marginal Probabilities with Applications to Models in Statistical Physics","Mehdi Molkaraie","https://jmlr.org//papers/volume23/21-0200/21-0200.pdf","","We present local mappings that relate the marginal probabilities of a global probability mass function represented by its primal normal factor graph to the corresponding marginal probabilities in its dual normal factor  graph. The mapping is based on the Fourier transform of the local factors of the models. Details of the mapping are provided for the  Ising model, where it is proved that the local extrema of the fixed points are attained at the phase transition of the two-dimensional nearest-neighbor Ising model. The results are further extended to the Potts model, to the clock model, and to Gaussian Markov random fields.  By employing the mapping, we can transform  simultaneously all the estimated marginal probabilities from the dual domain to the primal domain (and vice versa), which  is advantageous if estimating the marginals  can be carried out more efficiently in the dual domain. An example of particular significance is the ferromagnetic Ising model in a positive  external magnetic field. For this model, there exists a rapidly mixing Markov chain (called the subgraphs--world process)  to generate configurations in the dual normal factor graph of the model. Our numerical experiments illustrate that the proposed procedure can provide more accurate  estimates of marginal probabilities of a global probability mass function in various settings."
"210247","Multivariate Boosted Trees and Applications to Forecasting and Control","Lorenzo Nespoli, Vasco Medici","https://jmlr.org//papers/volume23/21-0247/21-0247.pdf","https://github.com/supsi-dacd-isaac/mbtr","Gradient boosted trees are competition-winning, general-purpose, non-parametric regressors, which exploit sequential model fitting and gradient descent to minimize a specific loss function. The most popular implementations are tailored to univariate regression and classification tasks, precluding the possibility of capturing multivariate target cross-correlations and applying structured penalties to the predictions. In this paper, we present a computationally efficient algorithm for fitting multivariate boosted trees. We show that multivariate trees can outperform their univariate counterpart when the predictions are correlated. Furthermore, the algorithm allows to arbitrarily regularize the predictions, so that properties like smoothness, consistency and functional relations can be enforced. We present applications and numerical results related to forecasting and control."
"210309","Quantile regression with ReLU Networks: Estimators and minimax rates","Oscar Hernan Madrid Padilla, Wesley Tansey, Yanzhen Chen","https://jmlr.org//papers/volume23/21-0309/21-0309.pdf","https://github.com/tansey/quantile-regression","Quantile regression is the task of estimating a specified percentile response, such as the median (50th percentile), from a collection of known covariates. We study quantile regression with rectified linear unit (ReLU) neural networks as the chosen model class. We derive an upper bound on the expected mean squared error of a ReLU network used to estimate any quantile conditioning on a set of covariates. This upper bound only depends on the best possible approximation error, the number of layers in the network, and the number of nodes per layer. We further show upper bounds that are tight for two large classes of functions: compositions of Hölder functions and members of a Besov space. These tight bounds imply ReLU networks with quantile regression achieve minimax rates for broad collections of function types. Unlike existing work, the theoretical results hold under minimal assumptions and apply to general error distributions, including heavy-tailed distributions. Empirical simulations on a suite of synthetic response functions demonstrate the theoretical results translate to practical implementations of ReLU networks. Overall, the theoretical and empirical results provide insight into the strong performance of ReLU neural networks for quantile regression across a broad range of function classes and error distributions. All code for this paper is publicly available at https://github.com/tansey/quantile-regression."
"210335","Double Spike Dirichlet Priors for Structured Weighting","Huiming Lin, Meng Li","https://jmlr.org//papers/volume23/21-0335/21-0335.pdf","https://github.com/xylimeng/StructuredEnsemble","Assigning weights to a large pool of objects is a fundamental task in a wide variety of applications. In this article, we introduce the concept of structured high-dimensional probability simplexes, in which most components are zero or near zero and the remaining ones are close to each other. Such structure is well motivated by (i) high-dimensional weights that are common in modern applications, and (ii) ubiquitous examples in which equal weights---despite their simplicity---often achieve favorable or even state-of-the-art predictive performance. This particular structure, however, presents unique challenges partly because, unlike high-dimensional linear regression, the parameter space is a simplex and pattern switching between partial constancy and sparsity is unknown. To address these challenges, we propose a new class of double spike Dirichlet priors to shrink a probability simplex to one with the desired structure. When applied to ensemble learning, such priors lead to a Bayesian method for structured high-dimensional ensembles that is useful for forecast combination and improving random forests, while enabling uncertainty quantification. We design efficient Markov chain Monte Carlo algorithms for implementation. Posterior contraction rates are established to study large sample behaviors of the posterior distribution. We demonstrate the wide applicability and competitive performance of the proposed methods through simulations and two real data applications using the European Central Bank Survey of Professional Forecasters data set and a data set from the UC Irvine Machine Learning Repository (UCI)."
"210347","Projected Robust PCA with Application to Smooth Image Recovery","Long Feng, Junhui Wang","https://jmlr.org//papers/volume23/21-0347/21-0347.pdf","","Most high-dimensional matrix recovery problems are studied under the assumption that the target matrix has certain intrinsic structures. For image data related matrix recovery problems, approximate low-rankness and smoothness are the two most commonly imposed structures. For approximately low-rank matrix recovery, the robust principal component analysis (PCA) is well-studied and proved to be effective. For smooth matrix problem, 2d fused Lasso and other total variation based approaches have played a fundamental role. Although both low-rankness and smoothness are key assumptions for image data analysis, the two lines of research, however, have very limited interaction. Motivated by taking advantage of both features, we in this paper develop a framework named projected robust PCA (PRPCA), under which the low-rank matrices are projected onto a space of smooth matrices. Consequently, a large class of image matrices can be decomposed as a low-rank and smooth component plus a sparse component. A key advantage of this decomposition is that the dimension of the core low-rank component can be significantly reduced. Consequently, our framework is able to address a problematic bottleneck of many low-rank matrix problems: singular value decomposition (SVD) on large matrices. Theoretically, we provide explicit statistical recovery guarantees of PRPCA and include classical robust PCA as a special case."
"210354","Non-asymptotic Properties of Individualized Treatment Rules from Sequentially Rule-Adaptive Trials","Daiqi Gao, Yufeng Liu, Donglin Zeng","https://jmlr.org//papers/volume23/21-0354/21-0354.pdf","","Learning optimal individualized treatment rules (ITRs) has become increasingly important in the modern era of precision medicine. Many statistical and machine learning methods for learning optimal ITRs have been developed in the literature. However, most existing methods are based on data collected from traditional randomized controlled trials and thus cannot take advantage of the accumulative evidence when patients enter the trials sequentially. It is also ethically important that future patients should have a high probability to be treated optimally based on the updated knowledge so far. In this work, we propose a new design called sequentially rule-adaptive trials to learn optimal ITRs based on the contextual bandit framework, in contrast to the response-adaptive design in traditional adaptive trials. In our design, each entering patient will be allocated with a high probability to the current best treatment for this patient, which is estimated using the past data based on some machine learning algorithm (for example, outcome weighted learning in our implementation). We explore the tradeoff between training and test values of the estimated ITR in single-stage problems by proving theoretically that for a higher probability of following the estimated ITR, the training value converges to the optimal value at a faster rate, while the test value converges at a slower rate. This problem is different from traditional decision problems in the sense that the training data are generated sequentially and are dependent. We also develop a tool that combines martingale with empirical process to tackle the problem that cannot be solved by previous techniques for i.i.d. data. 
We show by numerical examples that without much loss of the test value, our proposed algorithm can improve the training value significantly as compared to existing methods. Finally, we use a real data study to illustrate the performance of the proposed method."
"210409","Using Active Queries to Infer Symmetric Node Functions of Graph Dynamical Systems","Abhijin Adiga, Chris J. Kuhlman, Madhav V. Marathe, S. S. Ravi, Daniel J. Rosenkrantz, Richard E. Stearns","https://jmlr.org//papers/volume23/21-0409/21-0409.pdf","","Developing techniques to infer the behavior of networked social systems has attracted a lot of attention in the literature. Using a discrete dynamical system to model a networked social system, the problem of inferring the behavior of the system can be formulated as the problem of learning the local functions of the dynamical system. We investigate the problem assuming an active form of interaction with the system through queries. We consider two classes of local functions (namely, symmetric and threshold functions) and two interaction modes, namely batch (where all the queries must be submitted together) and adaptive (where the set of queries submitted at a stage may rely on the answers to previous queries). We establish bounds on the number of queries under both batch and adaptive query modes using vertex coloring and probabilistic methods. Our results show that a small number of appropriately chosen queries are provably sufficient to correctly learn all the local functions. We develop complexity results which suggest that, in general, the problem of generating query sets of minimum size is computationally intractable. We present efficient heuristics that produce query sets under both batch and adaptive query modes. Also, we present a query compaction algorithm that identifies and removes redundant queries from a given query set. Our algorithms were evaluated through experiments on over 20 well-known networks."
"210468","A Closer Look at Embedding Propagation for Manifold Smoothing","Diego Velazquez, Pau Rodriguez, Josep M. Gonfaus, F. Xavier Roca, Jordi Gonzalez","https://jmlr.org//papers/volume23/21-0468/21-0468.pdf","","Supervised training of neural networks requires a large amount of manually annotated data and the resulting networks tend to be sensitive to out-of-distribution (OOD) data. Self- and semi-supervised training schemes reduce the amount of annotated data required during the training process. However, OOD generalization remains a major challenge for most methods. Strategies that promote smoother decision boundaries play an important role in out-of-distribution generalization. For example, embedding propagation (EP) for manifold smoothing has recently shown to considerably improve the OOD performance for few-shot classification. EP achieves smoother class manifolds by building a graph from sample embeddings and propagating information through the nodes in an unsupervised manner. In this work, we extend the original EP paper providing additional evidence and experiments showing that it attains smoother class embedding manifolds and improves results in settings beyond few-shot classification. Concretely, we show that EP improves the robustness of neural networks against multiple adversarial attacks as well as semi- and self-supervised learning performance."
"21054","Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences","Alan Chan, Hugo Silva, Sungsu Lim, Tadashi Kozuno, A. Rupam Mahmood, Martha White","https://jmlr.org//papers/volume23/21-054/21-054.pdf","","Approximate Policy Iteration (API) algorithms alternate between (approximate) policy evaluation and (approximate) greedification. Many different approaches have been explored for approximate policy evaluation, but less is understood about approximate greedification and what choices guarantee policy improvement. In this work, we investigate approximate greedification when reducing the KL divergence between the parameterized policy and the Boltzmann distribution over action values. In particular, we investigate the difference between the forward and reverse KL divergences, with varying degrees of entropy regularization; these are chosen because they underlie many existing policy optimization approaches, as we highlight in this work. We show that the reverse KL has stronger policy improvement guarantees, and that reducing the forward KL can result in a worse policy. We also demonstrate, however, that a large enough reduction of the forward KL can induce improvement under additional assumptions. Empirically, we show on simple continuous-action environments that the forward KL can induce more exploration, but at the cost of a more suboptimal policy. No significant differences were observed in the discrete-action setting or on a suite of benchmark problems. This work provides novel theoretical and empirical insights about the forward KL and reverse KL for greedification, and clear next steps for understanding and improving our policy optimization algorithms."
"210582","Adaptive Greedy Algorithm for Moderately Large Dimensions in Kernel Conditional Density Estimation","Minh-Lien Jeanne Nguyen, Claire Lacour, Vincent Rivoirard","https://jmlr.org//papers/volume23/21-0582/21-0582.pdf","","This paper studies the estimation of the conditional density $f(x,\cdot)$  of $Y_i$ given $X_i=x$, from the observation of an i.i.d. sample $(X_i,Y_i)\in \mathbb R^d$, $i\in \{1,\dots,n\}.$ We assume that $f$ depends only on $r$ unknown components with  typically $r\ll d$.We provide an adaptive fully-nonparametric strategy based on kernel rules to estimate $f$. To select the bandwidth of our kernel rule, we propose a new fast iterative algorithm inspired by the Rodeo algorithm (Wasserman and Lafferty, 2006) to detect the sparsity structure of $f$. More precisely, in the minimax setting, our pointwise estimator, which is adaptive to both the regularity and the sparsity,  achieves the quasi-optimal rate of convergence. Our results also hold for   (unconditional) density estimation. The computational complexity of our method is only $O(dn \log n)$. A deep numerical study shows nice performances of our approach."
"210773","Simple Agent, Complex Environment: Efficient Reinforcement Learning with Agent States","Shi Dong, Benjamin Van Roy, Zhengyuan Zhou","https://jmlr.org//papers/volume23/21-0773/21-0773.pdf","","We design a simple reinforcement learning (RL) agent that implements an optimistic version of $Q$-learning and establish through regret analysis that this agent can operate with some level of competence in any environment.  While we leverage concepts from the literature on provably efficient RL, we consider a general agent-environment interface and provide a novel agent design and analysis.  This level of generality positions our results to inform the design of future agents for operation in complex real environments.  We establish that, as time progresses, our agent performs competitively relative to policies that require longer times to evaluate.  The time it takes to approach asymptotic performance is polynomial in the complexity of the agent’s state representation and the time required to evaluate the best policy that the agent can represent.  Notably, there is no dependence on the complexity of the environment.  The ultimate per-period performance loss of the agent is bounded by a constant multiple of a measure of distortion introduced by the agent’s state representation.  This work is the first to establish that an algorithm approaches this asymptotic condition within a tractable time frame."
"210798","On Constraints in First-Order Optimization: A View from Non-Smooth Dynamical Systems","Michael Muehlebach, Michael I. Jordan","https://jmlr.org//papers/volume23/21-0798/21-0798.pdf","https://github.com/michaemu/OnConstraintsInFirstOrderOptimization.git","We introduce a class of first-order methods for smooth constrained optimization that are based on an analogy to non-smooth dynamical systems. Two distinctive features of our approach are that (i) projections or optimizations over the entire feasible set are avoided, in stark contrast to projected gradient methods or the Frank-Wolfe method, and (ii) iterates are allowed to become infeasible, which differs from active set or feasible direction methods, where the descent motion stops as soon as a new constraint is encountered. The resulting algorithmic procedure is simple to implement even when constraints are nonlinear, and is suitable for large-scale constrained optimization problems in which the feasible set fails to have a simple structure.  The key underlying idea is that constraints are expressed in terms of velocities instead of positions, which has the algorithmic consequence that optimizations over feasible sets at each iteration are replaced with optimizations over local, sparse convex approximations. In particular, this means that at each iteration only constraints that are violated are taken into account. The result is a simplified suite of algorithms and an expanded range of possible applications in machine learning."
"210879","Sparse Continuous Distributions and Fenchel-Young Losses","André F. T. Martins, Marcos Treviso, António Farinhas, Pedro M. Q. Aguiar, Mário A. T. Figueiredo, Mathieu Blondel, Vlad Niculae","https://jmlr.org//papers/volume23/21-0879/21-0879.pdf","https://github.com/deep-spin/sparse_continuous_distributions/","Exponential families are widely used in machine learning, including many distributions in continuous and discrete domains (e.g., Gaussian, Dirichlet, Poisson, and categorical distributions via the softmax transformation). Distributions in each of these families have fixed support. In contrast, for finite domains, recent work on sparse alternatives to softmax (e.g., sparsemax, $\alpha$-entmax, and fusedmax), has led to distributions with varying support. This paper develops sparse alternatives to continuous distributions, based on several technical contributions: First, we define $\Omega$-regularized prediction maps and Fenchel-Young losses for arbitrary domains (possibly countably infinite or continuous). For linearly parametrized families, we show that minimization of Fenchel-Young losses is equivalent to moment matching of the statistics, generalizing a fundamental property of exponential families. When $\Omega$ is a Tsallis negentropy with parameter $\alpha$, we obtain “deformed exponential families,” which include $\alpha$-entmax and sparsemax ($\alpha=2$) as particular cases. For quadratic energy functions, the resulting densities are $\beta$-Gaussians, an instance of elliptical distributions that contain as particular cases the Gaussian, biweight, triweight, and Epanechnikov densities, and for which we derive closed-form expressions for the variance, Tsallis entropy, and Fenchel-Young loss. When $\Omega$ is a total variation or Sobolev regularizer, we obtain a continuous version of the fusedmax. 
Finally, we introduce continuous-domain attention mechanisms, deriving efficient gradient backpropagation algorithms for $\alpha \in \{1,\frac{4}{3}, \frac{3}{2}, 2\}$.  Using these algorithms, we demonstrate our sparse continuous distributions for attention-based audio classification and visual question answering, showing that they allow attending to time intervals and compact regions."
"210885","Tree-Based Models for Correlated Data","Assaf Rabinowicz, Saharon Rosset","https://jmlr.org//papers/volume23/21-0885/21-0885.pdf","","This paper presents a new approach for regression tree-based models, such as simple regression tree, random forest and gradient boosting, in settings involving correlated data. We show the problems that arise when implementing standard regression tree-based models, which ignore the correlation structure. Our new approach explicitly takes the correlation structure into account in the splitting criterion, stopping rules and fitted values in the leaves, which induces some major modifications of standard methodology. The superiority of our new approach over tree-based models that do not account for the correlation, and over previous work that integrated some aspects of our approach, is supported by simulation experiments and real data analyses."
"210952","Learning Temporal Evolution of Spatial Dependence with Generalized Spatiotemporal Gaussian Process Models","Shiwei Lan","https://jmlr.org//papers/volume23/21-0952/21-0952.pdf","https://github.com/lanzithinking/TESD_gSTGP","A large number of scientific studies involve high-dimensional spatiotemporal data with complicated relationships.  In this paper, we focus on a type of space-time interaction named temporal evolution of spatial dependence (TESD), which is a zero time-lag spatiotemporal covariance. For this purpose, we propose a novel Bayesian nonparametric method based on non-stationary spatiotemporal Gaussian process (STGP). The classic STGP has a covariance kernel separable in space and time, failed to characterize TESD. More recent works on non-separable STGP treat location and time together as a joint variable, which is unnecessarily inefficient. We generalize STGP (gSTGP) to introduce time-dependence to the spatial kernel by varying its eigenvalues over time in the Mercer's representation. The resulting non-stationary non-separable covariance model bares a quasi Kronecker sum structure.  Finally, a hierarchical Bayesian model for the joint covariance is proposed to allow for full flexibility in learning TESD. A simulation study and a longitudinal neuroimaging analysis on Alzheimer's patients demonstrate that the proposed methodology is (statistically) effective and (computationally) efficient in characterizing TESD. Theoretic properties of gSTGP including posterior contraction (for covariance) are also studied."
"210962","A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions","Arnulf Jentzen, Adrian Riekert","https://jmlr.org//papers/volume23/21-0962/21-0962.pdf","","Gradient descent (GD) type optimization methods are the standard instrument to train artificial neural networks (ANNs) with rectified linear unit (ReLU) activation. Despite the great success of GD type optimization methods in numerical simulations for the training of ANNs with ReLU activation, it remains - even in the simplest situation of the plain vanilla GD optimization method and ANNs with one hidden layer - an open problem to prove (or disprove) the conjecture that the risk of the GD optimization method converges in the training of such ANNs to zero. In this article we establish in the situation where the probability distribution of the input data is equivalent to the continuous uniform distribution on a compact interval, where the probability distribution for the random initialization of the ANN parameters is the standard normal distribution, and where the target function under consideration is continuous and piecewise affine linear that the risk of the considered GD process converges exponentially fast to zero with a positive probability. Roughly speaking, the key ingredients in our mathematical convergence analysis are  (i) to prove that suitable sets of global minima of the risk functions are twice continuously differentiable submanifolds of the ANN parameter spaces, (ii) to prove that the Hessians of the risk functions on these sets of global minima satisfy an appropriate maximal rank condition, and, thereafter, (iii) to apply the machinery in [Fehrman, B., Gess, B., Jentzen, A., Convergence rates for the stochastic gradient descent method for non-convex objective functions. J. Mach. Learn. Res. 
21(136): 1-48, 2020] to establish local convergence of the GD optimization method. As a consequence, we obtain convergence of the risk to zero as the width of the ANNs, the number of independent random initializations, and the number of GD steps increase to infinity."
"210992","Auto-Sklearn 2.0: Hands-free AutoML via Meta-Learning","Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, Frank Hutter","https://jmlr.org//papers/volume23/21-0992/21-0992.pdf","https://github.com/automl/ASKL2.0_experiments","Automated Machine Learning (AutoML) supports practitioners and researchers with the tedious task of designing machine learning pipelines and has recently achieved substantial success. In this paper, we introduce new AutoML approaches motivated by our winning submission to the second ChaLearn AutoML challenge. We develop PoSH Auto-sklearn, which enables AutoML systems to work well on large datasets under rigid time limits by using a new, simple and meta-feature-free meta-learning technique and by employing a successful bandit strategy for budget allocation. However, PoSH Auto-sklearn introduces even more ways of running AutoML and might make it harder for users to set it up correctly. Therefore, we also go one step further and study the design space of AutoML itself, proposing a solution towards truly hands-free AutoML. Together, these changes give rise to the next generation of our AutoML system, Auto-sklearn 2.0 . We verify the improvements by these additions in an extensive experimental study on 39 AutoML benchmark datasets. We conclude the paper by comparing to other popular AutoML frameworks and Auto-sklearn 1.0 , reducing the relative error by up to a factor of 4.5, and yielding a performance in 10 minutes that is substantially better than what Auto-sklearn 1.0 achieves within an hour."
"210993","Estimation and inference on high-dimensional individualized treatment rule in observational data using split-and-pooled de-correlated score","Muxuan Liang, Young-Geun Choi, Yang Ning, Maureen A Smith, Ying-Qi Zhao","https://jmlr.org//papers/volume23/21-0993/21-0993.pdf","https://github.com/muxuanliang/ITRInference.git","With the increasing adoption of electronic health records,  there is an increasing interest in developing individualized treatment rules, which recommend treatments according to patients' characteristics, from large observational data. However,  there is a lack of valid inference procedures for such rules developed from this type of data in the presence of high-dimensional covariates. In this work, we develop a penalized doubly robust method to estimate the optimal individualized treatment rule from high-dimensional data.  We propose a split-and-pooled de-correlated score to construct hypothesis tests and confidence intervals. Our proposal adopts the data splitting to conquer the slow convergence rate of nuisance parameter estimations, such as non-parametric methods for outcome regression or propensity models. We establish the limiting distributions of the split-and-pooled de-correlated score test and the corresponding one-step estimator in high-dimensional setting. Simulation and real data analysis are conducted to demonstrate the superiority of the proposed method."
"211011","The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks","Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett","https://jmlr.org//papers/volume23/21-1011/21-1011.pdf","","The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of benign overfitting has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk when the covariates satisfy sub-Gaussianity and anti-concentration properties, and the noise is independent and sub-Gaussian. By leveraging recent results that characterize the implicit bias of this estimator, our bounds emphasize the role of both the quality of the initialization as well as the properties of the data covariance matrix in achieving low excess risk."
"211038","A Random Matrix Perspective on Random Tensors","José Henrique de M. Goulart, Romain Couillet, Pierre Comon","https://jmlr.org//papers/volume23/21-1038/21-1038.pdf","","Several machine learning problems such as latent variable model learning and community detection can be addressed by estimating a low-rank signal from a noisy tensor. Despite recent substantial progress on the fundamental limits of the corresponding estimators in the large-dimensional setting, some of the most significant results are based on spin glass theory, which is not easily accessible to non-experts. We propose a sharply distinct and more elementary approach, relying on tools from random matrix theory. The key idea is to study random matrices arising from contractions of a random tensor, which give access to its spectral properties. In particular, for a symmetric $d$th-order rank-one model with Gaussian noise, our approach yields a novel characterization of maximum likelihood (ML) estimation performance in terms of a fixed-point equation valid in the regime where weak recovery is possible. For $d=3$, the solution to this equation matches the existing results. We conjecture that the same holds for any order $d$, based on numerical evidence for $d \in \{4,5\}$. Moreover, our analysis illuminates certain properties of the large-dimensional ML landscape. Our approach can be extended to other models, including asymmetric and non-Gaussian ones."
"211062","Stochastic subgradient for composite  convex optimization with functional constraints","Ion Necoara, Nitesh Kumar Singh","https://jmlr.org//papers/volume23/21-1062/21-1062.pdf","","In this paper we consider optimization problems with  stochastic composite objective function subject to (possibly) infinite intersection of constraints. The objective function is   expressed in terms of expectation operator over a sum of two terms   satisfying a stochastic bounded gradient condition, with or without strong convexity type properties. In contrast to the classical approach, where the constraints are usually represented as intersection of simple sets, in this paper we consider that each constraint set is  given as the level set of a convex but not necessarily differentiable function.  Based on the flexibility offered by our general optimization model we consider a stochastic subgradient method with random  feasibility updates.  At each iteration, our algorithm takes a stochastic proximal (sub)gradient step aimed at  minimizing the objective function and then a subsequent subgradient step  minimizing the feasibility violation of the observed random constraint.  We analyze the convergence behavior of the proposed algorithm  for  diminishing stepsizes and for the case when the objective function is  convex or has a quadratic functional growth, unifying the nonsmooth and smooth cases.  We prove sublinear convergence rates for this stochastic subgradient algorithm, which are known to be optimal for subgradient  methods on  this class of problems. When the objective function has a linear least-square form and the constraints are polyhedral, it is shown that the algorithm converges linearly.   Numerical evidence supports the effectiveness of our method in real problems."
"211091","Functional Linear Regression with Mixed Predictors","Daren Wang, Zifeng Zhao, Yi Yu, Rebecca Willett","https://jmlr.org//papers/volume23/21-1091/21-1091.pdf","https://github.com/darenwang/functional_regression","We study a functional linear regression model that deals with functional responses and allows for both functional covariates and high-dimensional vector covariates. The proposed model is flexible and nests several functional regression models in the literature as special cases. Based on the theory of reproducing kernel Hilbert spaces (RKHS), we propose a penalized least squares estimator that can accommodate functional variables observed on discrete sample points. Besides a conventional smoothness penalty, a group Lasso-type penalty is further imposed to induce sparsity in the high-dimensional vector predictors. We derive finite sample theoretical guarantees and show that the excess prediction risk of our estimator is minimax optimal. Furthermore, our analysis reveals an interesting phase transition phenomenon that the optimal excess risk is determined jointly by the smoothness and the sparsity of the functional regression coefficients. A novel efficient optimization algorithm based on iterative coordinate descent is devised to handle the smoothness and group penalties simultaneously. Simulation studies and real data applications illustrate the promising performance of the proposed approach compared to the state-of-the-art methods in the literature."
"211127","Tianshou: A Highly Modularized Deep Reinforcement Learning Library","Jiayi Weng, Huayu Chen, Dong Yan, Kaichao You, Alexis Duburcq, Minghao Zhang, Yi Su, Hang Su, Jun Zhu","https://jmlr.org//papers/volume23/21-1127/21-1127.pdf","https://github.com/thu-ml/tianshou/","In this paper, we present Tianshou, a highly modularized Python library for deep reinforcement learning (DRL) that uses PyTorch as its backend. Tianshou intends to be research-friendly by providing a flexible and reliable infrastructure of DRL algorithms. It supports online and offline training with more than 20 classic algorithms through a unified interface. To facilitate related research and prove Tianshou's reliability, we have released Tianshou's benchmark of MuJoCo environments, covering eight classic algorithms with state-of-the-art performance. We open-sourced Tianshou at https://github.com/thu-ml/tianshou/."
"211129","A Computationally Efficient Framework for Vector Representation of Persistence Diagrams","Kit C Chan, Umar Islambekov, Alexey Luchinsky, Rebecca Sanders","https://jmlr.org//papers/volume23/21-1129/21-1129.pdf","","In Topological Data Analysis, a common way of quantifying the shape of data is to use a persistence diagram (PD). PDs are multisets of points in $R^2$ computed using tools of algebraic topology. However, this multi-set structure limits the utility of PDs in applications. Therefore, in recent years efforts have been directed towards extracting informative and efficient summaries from PDs to broaden the scope of their use for machine learning tasks. We propose a computationally efficient framework to convert a PD into a vector in $R^n$, called a vectorized persistence block (VPB). We show that our representation possesses many of the desired properties of vector-based summaries such as stability with respect to input noise, low computational cost and flexibility. Through simulation studies, we demonstrate the effectiveness of VPBs in terms of performance and computational cost for various learning tasks, namely clustering, classification and change point detection."
"211173","Learning linear non-Gaussian directed acyclic graph with diverging number of nodes","Ruixuan Zhao, Xin He, Junhui Wang","https://jmlr.org//papers/volume23/21-1173/21-1173.pdf","","An acyclic model, often depicted as a directed acyclic graph (DAG), has been widely employed to represent directional causal relations among collected nodes. In this article, we propose an efficient method to learn linear non-Gaussian DAG in high dimensional cases, where the noises can be of any continuous non-Gaussian distribution. The proposed method leverages the concept of topological layer to facilitate the DAG learning, and its theoretical justification in terms of exact DAG recovery is also established under mild conditions. Particularly, we show that the topological layers can be exactly reconstructed in a bottom-up fashion, and the parent-child relations among nodes can also be consistently established. The established asymptotic DAG recovery is in sharp contrast to that of many existing learning methods assuming parental faithfulness or ordered noise variances. The advantage of the proposed method is also supported by the numerical comparison against some popular competitors in various simulated examples as well as a real application on the global spread of COVID-19."
"211184","Minimax Mixing Time of the Metropolis-Adjusted Langevin Algorithm for Log-Concave Sampling","Keru Wu, Scott Schmidler, Yuansi Chen","https://jmlr.org//papers/volume23/21-1184/21-1184.pdf","","We study the mixing time of the Metropolis-adjusted Langevin algorithm (MALA) for sampling from a log-smooth and strongly log-concave distribution. We  establish its optimal minimax mixing time under a warm start. Our main contribution is two-fold. First, for a $d$-dimensional log-concave density with condition number $\kappa$, we show that MALA with a warm start mixes in $\tilde O(\kappa \sqrt{d})$ iterations up to logarithmic factors. This improves upon the previous work on the dependency of either the condition number $\kappa$ or the dimension $d$. Our proof relies on comparing the leapfrog integrator with the continuous Hamiltonian dynamics, where we establish a new concentration bound for the acceptance rate. Second, we prove a spectral gap based mixing time lower bound for reversible MCMC algorithms on general state spaces. We apply this lower bound result to construct a hard distribution for which MALA requires at least $\tilde\Omega(\kappa \sqrt{d})$ steps to mix. The lower bound for MALA matches our upper bound in terms of condition number and dimension. Finally, numerical experiments are included to validate our theoretical results."
"211265","Fast Stagewise Sparse Factor Regression","Kun Chen, Ruipeng Dong, Wanwan Xu, Zemin Zheng","https://jmlr.org//papers/volume23/21-1265/21-1265.pdf","","Sparse factorization of a large matrix is fundamental in modern statistical learning. In particular, the sparse singular value decomposition has been utilized in many multivariate regression methods. The appeal of this factorization is owing to its power in discovering a highly-interpretable latent association network. However, many existing methods are either ad hoc without a general performance guarantee, or are computationally intensive. We formulate the statistical problem as a sparse factor regression and tackle it with a two-stage “deflation + stagewise learning” approach. In the first stage, we consider both sequential and parallel approaches for simplifying the task into a set of co-sparse unit-rank estimation (CURE) problems, and establish the statistical underpinnings of these commonly-adopted and yet poorly understood deflation methods. In the second stage, we innovate a contended stagewise learning technique, consisting of a sequence of simple incremental updates, to efficiently trace out the whole solution paths of CURE. Our algorithm achieves a much lower computational complexity than alternating convex search, and it enables a flexible and principled tradeoff between statistical accuracy and computational efficiency. Our work is among the first to enable stagewise learning for non-convex problems, and the idea can be applicable in many multi-convex problems. Extensive simulation studies and an application in genetics demonstrate the effectiveness and scalability of our approach."
"211269","Communication-Constrained Distributed Quantile Regression with Optimal Statistical Guarantees","Kean Ming Tan, Heather Battey, Wen-Xin Zhou","https://jmlr.org//papers/volume23/21-1269/21-1269.pdf","","We address the problem of how to achieve optimal inference in distributed quantile regression without stringent scaling conditions. This is challenging due to the non-smooth nature of the quantile regression (QR) loss function, which invalidates the use of existing methodology. The difficulties are resolved through a double-smoothing approach that is applied to the local (at each data source) and global objective functions. Despite the reliance on a delicate combination of local and global smoothing parameters, the quantile regression model is fully parametric, thereby facilitating interpretation. In the low-dimensional regime, we establish a finite-sample theoretical framework for the sequentially defined distributed QR estimators. This reveals a trade-off between the communication cost and statistical error. We further discuss and compare several alternative confidence set constructions, based on inversion of Wald and score-type tests and resampling techniques, detailing an improvement that is effective for more extreme quantile coefficients. In high dimensions, a sparse framework is adopted, where the proposed doubly-smoothed objective function is complemented with an $\ell_1$-penalty. We show that the corresponding distributed penalized QR estimator achieves the global convergence rate after a near-constant number of communication rounds. A thorough simulation study further elucidates our findings."
"211328","The Weighted Generalised Covariance Measure","Cyrill Scheidegger, Julia Hörrmann, Peter Bühlmann","https://jmlr.org//papers/volume23/21-1328/21-1328.pdf","","We introduce a new test for conditional independence which is based on what we call the weighted generalised covariance measure (WGCM). It is an extension of the recently introduced generalised covariance measure (GCM). To test the null hypothesis of $X$ and $Y$ being conditionally independent given $Z$, our test statistic is a weighted form of the sample covariance between the residuals of nonlinearly regressing  $X$ and $Y$ on $Z$. We propose different variants of the test for both univariate and multivariate $X$ and $Y$. We give conditions under which the tests yield the correct type I error rate. Finally, we compare our novel tests to the original GCM using simulation and on real data sets. Typically, our tests have power against a wider class of alternatives compared to the GCM. This comes at the cost of having less power against alternatives for which the GCM already works well. In the special case of binary or categorical $X$ and $Y$, one of our tests has power against all alternatives."
"211342","CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms","Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, João G.M. Araújo","https://jmlr.org//papers/volume23/21-1342/21-1342.pdf","https://github.com/vwxyzjn/cleanrl","CleanRL is an open-source library that provides high-quality single-file implementations of Deep Reinforcement Learning (DRL) algorithms. These single-file implementations are self-contained algorithm variant files such as dqn.py, ppo.py, and ppo_atari.py that individually include all algorithm variant's implementation details. Such a paradigm significantly reduces the complexity and the lines of code (LOC) in each implemented variant, which makes them quicker and easier to understand. This paradigm gives the researchers the most fine-grained control over all aspects of the algorithm in a single file, allowing them to prototype novel features quickly. Despite having succinct implementations, CleanRL's codebase is thoroughly documented and benchmarked to ensure performance is on par with reputable sources. As a result, CleanRL produces a repository tailor-fit for two purposes: 1) understanding all implementation details of DRL algorithms and 2) quickly prototyping novel features.  CleanRL's source code can be found at https://github.com/vwxyzjn/cleanrl."
"211387","Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms","Yanwei Jia, Xun Yu Zhou","https://jmlr.org//papers/volume23/21-1387/21-1387.pdf","https://www.dropbox.com/sh/ezbuntcfje3d7kg/AABW-ndK-4j9N8E3kVTDZLz6a?dl=0","We study policy gradient (PG) for reinforcement learning in continuous time and space under the regularized exploratory formulation developed by Wang et al. (2020). We represent the gradient of the value function with respect to a given parameterized stochastic policy as the expected integration of an auxiliary running reward function that can be evaluated using samples and the current value function. This representation effectively turns PG into a policy evaluation (PE) problem, enabling us to apply the martingale approach recently developed by Jia and Zhou (2022a) for PE to solve our PG problem. Based on this analysis, we propose two types of actor-critic algorithms for RL, where we learn and update value functions and policies simultaneously and alternatingly. The first type is based directly on the aforementioned representation, which involves future trajectories and is offline. The second type, designed for online learning, employs the first-order condition of the policy gradient and turns it into martingale orthogonality conditions. These conditions are then incorporated using stochastic approximation when updating policies. Finally, we demonstrate the algorithms by simulations in two concrete examples."
"211404","Deep Network Approximation: Achieving Arbitrary Accuracy with Fixed Number of Neurons","Shijun Zhang, Zuowei Shen, Haizhao Yang","https://jmlr.org//papers/volume23/21-1404/21-1404.pdf","","This paper develops simple feed-forward neural networks that achieve the universal approximation property for all continuous functions with a fixed finite number of neurons. These neural networks are simple because they are designed with a simple, computable, and continuous activation function $\sigma$ leveraging a triangular-wave function and the softsign function. We first prove that  $\sigma$-activated networks with width $36d(2d+1)$ and depth $11$ can approximate any continuous function on a $d$-dimensional hypercube within an arbitrarily small error. Hence, for supervised learning and its related regression problems, the hypothesis space generated by these networks with a size not smaller than $36d(2d+1)\times 11$ is dense in the continuous function space $C([a,b]^d)$ and therefore dense in the Lebesgue spaces $L^p([a,b]^d)$ for $p\in [1,\infty)$. Furthermore, we show that classification functions arising from image and signal classification are in the hypothesis space generated by $\sigma$-activated networks with width $36d(2d+1)$ and depth $12$ when there exist pairwise disjoint bounded closed subsets of $\mathbb{R}^d$ such that the samples of the same class are located in the same subset. Finally, we use numerical experimentation to show that replacing the rectified linear unit (ReLU) activation function by ours would improve the experiment results."
"211443","Nonstochastic Bandits with Composite Anonymous Feedback","Nicolò Cesa-Bianchi, Tommaso Cesari, Roberto Colomboni, Claudio Gentile, Yishay Mansour","https://jmlr.org//papers/volume23/21-1443/21-1443.pdf","","We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over the subsequent rounds in an adversarial way. The instantaneous loss observed by the player at the end of each round is then a sum of many loss components of previously played actions. This setting encompasses as a special case the easier task of bandits with delayed feedback, a well-studied framework where the player observes the delayed losses individually.  Our first contribution is a general reduction transforming a standard bandit algorithm into one that can operate in the harder setting: We bound the regret of the transformed algorithm in terms of the stability and regret of the original algorithm. Then, we show that the transformation of a suitably tuned FTRL with Tsallis entropy has a regret of order $\sqrt{(d+1)KT}$, where $d$ is the maximum delay, $K$ is the number of arms, and $T$ is the time horizon. Finally, we show that our results cannot be improved in general by exhibiting a matching (up to a log factor) lower bound on the regret of any algorithm operating in this setting."
"211472","Jump Gaussian Process Model for Estimating Piecewise Continuous Regression Functions","Chiwoo Park","https://jmlr.org//papers/volume23/21-1472/21-1472.pdf","https://www.chiwoopark.net/code-and-dataset","This paper presents a Gaussian process (GP) model for estimating piecewise continuous regression functions. In many scientific and engineering applications of regression analysis, the underlying regression functions are often piecewise continuous in that data follow different continuous regression models for different input regions with discontinuities across regions. However, many conventional GP regression approaches are not designed for piecewise regression analysis. There are piecewise GP models to use explicit domain partitioning and pose independent GP models over partitioned regions. They are not flexible enough to model real datasets where data domains are divided by complex and curvy jump boundaries.   We propose a new GP modeling approach to estimate an unknown piecewise continuous regression function. The new GP model seeks a local GP estimate of an unknown regression function at each test location, using local data neighboring the test location. Considering the possibilities of the local data being from different regions, the proposed approach partitions the local data into pieces by a local data partitioning function. It uses only the local data likely from the same region as the test location for the regression estimate. Since we do not know which local data points come from the relevant region, we propose a data-driven approach to split and subset local data by a local partitioning function. We discuss several modeling choices of the local data partitioning function, including a locally linear function and a locally polynomial function. We also investigate an optimization problem to jointly optimize the partitioning function and other covariance parameters using a likelihood maximization criterion. 
Several advantages of using the proposed approach over the conventional GP and piecewise GP modeling approaches are shown by various simulated experiments and real data studies."
"211528","Convergence Guarantees for the Good-Turing Estimator","Amichai Painsky","https://jmlr.org//papers/volume23/21-1528/21-1528.pdf","","Consider a finite sample from an unknown distribution over a countable alphabet. The occupancy probability (OP) refers to the total probability of symbols that appear exactly k times in the sample. Estimating the OP is a basic problem in large alphabet modeling, with a variety of applications in machine learning, statistics and information theory. The Good-Turing (GT) framework is perhaps the most popular OP estimation scheme. Classical results show that the GT estimator converges to the OP, for every k independently. In this work we introduce new exact convergence guarantees for the GT estimator, based on worst-case mean squared error analysis. Our scheme improves upon currently known results. Further, we introduce a novel simultaneous convergence rate, for any desired set of occupancy probabilities. This allows us to quantify the unified performance of OP estimators, and introduce a novel estimation framework with favorable convergence guarantees."
"21820","Generalized Resubstitution for Classification Error Estimation","Parisa Ghane, Ulisses Braga-Neto","https://jmlr.org//papers/volume23/21-820/21-820.pdf","","We propose the family of generalized resubstitution classifier error estimators based on arbitrary empirical probability measures. These error estimators are computationally efficient and do not require retraining of classifiers. The plain resubstitution error estimator corresponds to choosing the standard empirical probability measure. Other choices of empirical probability measure lead to bolstered, posterior-probability, Gaussian-process, and Bayesian error estimators; in addition, we propose here bolstered posterior-probability error estimators, as a new family of generalized resubstitution estimators. In the two-class case, we show that a generalized resubstitution estimator is consistent and asymptotically unbiased, regardless of the distribution of the features and label, if the corresponding empirical probability measure converges uniformly to the standard empirical probability measure and the classification rule has finite VC dimension. A generalized resubstitution estimator typically has hyperparameters that can be tuned to control its bias and variance, which adds flexibility. We conducted extensive numerical experiments with various classification rules trained on synthetic data, which indicate that the new family of error estimators proposed here produces the best results overall, except in the case of very complex, overfitting classifiers, in which semi-bolstered resubstitution should be used instead. In addition, results of an image classification experiment using the LeNet-5 convolutional neural network and the MNIST data set show that naive-Bayes bolstered resubstitution with a simple data-driven calibration procedure produces excellent results, demonstrating the potential of this class of error estimators in deep learning for computer vision."
"220022","Nonparametric adaptive control and prediction: theory and randomized algorithms","Nicholas M. Boffi, Stephen Tu, Jean-Jacques E. Slotine","https://jmlr.org//papers/volume23/22-0022/22-0022.pdf","","A key assumption in the theory of nonlinear adaptive control is that the uncertainty of the system can be expressed in the linear span of a set of known basis functions. While this assumption leads to efficient algorithms, it limits applications to very specific classes of systems. We introduce a novel nonparametric adaptive algorithm that estimates an infinite-dimensional density over parameters online to learn an unknown dynamics in a reproducing kernel Hilbert space. Surprisingly, the resulting control input admits an analytical expression that enables its implementation despite its underlying infinite-dimensional structure. While this adaptive input is rich and expressive -- subsuming, for example, traditional linear parameterizations -- its computational complexity grows linearly with time, making it comparatively more expensive than its parametric counterparts. Leveraging the theory of random Fourier features, we provide an efficient randomized implementation that recovers the complexity of classical parametric methods while provably retaining the expressivity of the nonparametric input. In particular, our explicit bounds only depend polynomially on the underlying parameters of the system, allowing our proposed algorithms to efficiently scale to high-dimensional systems. As an illustration of the method, we demonstrate the ability of the randomized approximation algorithm to learn a predictive model of a 60-dimensional system consisting of ten point masses interacting through Newtonian gravitation. By reinterpretation as a gradient flow on a specific loss, we conclude with a natural extension of our kernel-based adaptive algorithms to deep neural networks. 
We show empirically that the extra expressivity afforded by deep representations can lead to improved performance at the expense of the closed-loop stability that is rigorously guaranteed and consistently observed for kernel machines."
"220056","On the Convergence Rates of Policy Gradient Methods","Lin Xiao","https://jmlr.org//papers/volume23/22-0056/22-0056.pdf","","We consider infinite-horizon discounted Markov decision problems with finite state and action spaces and study the convergence rates of the projected policy gradient method and a general class of policy mirror descent methods, all with direct parametrization in the policy space. First, we develop a theory of weak gradient-mapping dominance and use it to prove sharp sublinear convergence rate of the projected policy gradient method. Then we show that with geometrically increasing step sizes, a general class of policy mirror descent methods, including the natural policy gradient method and a projected Q-descent method, all enjoy a linear rate of convergence without relying on entropy or other strongly convex regularization. Finally, we also analyze the convergence rate of an inexact policy mirror descent method and estimate its sample complexity under a simple generative model."
"220140","De-Sequentialized Monte Carlo: a parallel-in-time particle smoother","Adrien Corenflos, Nicolas Chopin, Simo Särkkä","https://jmlr.org//papers/volume23/22-0140/22-0140.pdf","https://github.com/AdrienCorenflos/parallel-ps","Particle smoothers are SMC (Sequential Monte Carlo) algorithms designed to approximate the joint distribution of the states given observations from a state-space model. We propose dSMC (de-Sequentialized Monte Carlo), a new particle smoother that is able to process $T$ observations in $\mathcal{O}(\log_2 T)$ time on parallel architectures. This compares favorably with standard particle smoothers, the complexity of which is linear in $T$. We derive $\mathcal{L}_p$ convergence results for dSMC, with an explicit upper bound, polynomial in $T$. We then discuss how to reduce the variance of the smoothing estimates computed by dSMC by (i) designing good proposal distributions for sampling the particles at the initialization of the algorithm, as well as by (ii) using lazy resampling to increase the number of particles used in dSMC. Finally, we design a particle Gibbs sampler based on dSMC, which is able to perform parameter inference in a state-space model at a $\mathcal{O}(\log_2 T)$ cost on parallel hardware."
"220232","Exact Partitioning of High-order Models with a Novel Convex Tensor Cone Relaxation","Chuyang Ke, Jean Honorio","https://jmlr.org//papers/volume23/22-0232/22-0232.pdf","","In this paper we propose an algorithm for exact partitioning of high-order models. We define a general class of $m$-degree Homogeneous Polynomial Models, which subsumes several examples motivated from prior literature. Exact partitioning can be formulated as a tensor optimization problem. We relax this high-order combinatorial problem to a convex conic form problem. To this end, we carefully define the Carathéodory symmetric tensor cone, and show its convexity, and the convexity of its dual cone. This allows us to construct a primal-dual certificate to show that the solution of the convex relaxation is correct (equal to the unobserved true group assignment) and to analyze the statistical upper bound of exact partitioning."
"220281","Deepchecks: A Library for Testing and Validating Machine Learning Models and Data","Shir Chorev, Philip Tannor, Dan Ben Israel, Noam Bressler, Itay Gabbay, Nir Hutnik, Jonatan Liberman, Matan Perlmutter, Yurii Romanyshyn, Lior Rokach","https://jmlr.org//papers/volume23/22-0281/22-0281.pdf","https://github.com/deepchecks/deepchecks","This paper presents Deepchecks, a Python library for comprehensively validating machine learning models and data. Our goal is to provide an easy-to-use library comprising many checks related to various issues, such as model predictive performance, data integrity, data distribution mismatches, and more. The package is distributed under the GNU Affero General Public License and relies on core libraries from the scientific Python ecosystem: scikit-learn, PyTorch, NumPy, pandas, and SciPy."
"220297","Integral Autoencoder Network for Discretization-Invariant Learning","Yong Zheng Ong, Zuowei Shen, Haizhao Yang","https://jmlr.org//papers/volume23/22-0297/22-0297.pdf","https://github.com/IAE-Net/iae_net","Discretization invariant learning aims at learning in the infinite-dimensional function spaces with the capacity to process heterogeneous discrete representations of functions as inputs and/or outputs of a learning model. This paper proposes a novel deep learning framework based on integral autoencoders (IAE-Net) for discretization invariant learning. The basic building block of IAE-Net consists of an encoder and a decoder as integral transforms with data-driven kernels, and a fully connected neural network between the encoder and decoder. This basic building block is applied in parallel in a wide multi-channel structure, which is repeatedly composed to form a deep and densely connected neural network with skip connections as IAE-Net. IAE-Net is trained with randomized data augmentation that generates training data with heterogeneous structures to facilitate the performance of discretization invariant learning. The proposed IAE-Net is tested with various applications in predictive data science, solving forward and inverse problems in scientific computing, and signal/image processing. Compared with alternatives in the literature, IAE-Net achieves state-of-the-art performance in existing applications and creates a wide range of new applications where existing methods fail."
"220541","Information-Theoretic Characterization of the Generalization Error for Iterative Semi-Supervised Learning","Haiyun He, Hanshu Yan, Vincent Y. F. Tan","https://jmlr.org//papers/volume23/22-0541/22-0541.pdf","https://github.com/HerianHe/GenErrorSSL_2022.git","Using information-theoretic principles, we consider the generalization error (gen-error) of iterative semi-supervised learning (SSL) algorithms that iteratively generate pseudo-labels for a large amount of unlabelled data  to progressively refine the model parameters. In contrast to most previous works that bound the gen-error, we provide an exact expression for the gen-error and particularize it to  the binary Gaussian mixture model. Our theoretical results suggest that when the class conditional variances are not too large, the gen-error decreases with the number of iterations, but quickly saturates.  On the flip side, if the class conditional variances (and so amount of overlap between the classes) are large, the gen-error  increases with the number of iterations. To mitigate this undesirable effect, we show that regularization can reduce  the gen-error. The theoretical results are corroborated by extensive experiments on  the MNIST and CIFAR datasets in which we notice that  for easy-to-distinguish classes, the gen-error improves after several pseudo-labelling iterations, but saturates afterwards, and for more difficult-to-distinguish classes, regularization improves the generalization performance."
"220611","ReservoirComputing.jl: An Efficient and Modular Library for Reservoir Computing Models","Francesco Martinuzzi, Chris Rackauckas, Anas Abdelrehim, Miguel D. Mahecha, Karin Mora","https://jmlr.org//papers/volume23/22-0611/22-0611.pdf","https://github.com/SciML/ReservoirComputing.jl","We introduce ReservoirComputing.jl, an open source Julia library for reservoir computing models. It is designed for temporal or sequential tasks such as time series prediction and modeling complex dynamical systems. As such it is suited to process a range of complex spatio-temporal data sets, from mathematical models to climate data. The key ideas of reservoir computing are the model architecture, i.e. the reservoir, which embeds the input into a higher dimensional space, and the learning paradigm, where only the readout layer is trained. As a result the computational resources can be kept low, and only linear optimization is required for the training. Although reservoir computing has proven itself as a successful machine learning algorithm, the software implementations have lagged behind, hindering wide recognition, reproducibility, and uptake by general scientists.  ReservoirComputing.jl enhances this field by being intuitive, highly modular, and faster compared to alternative tools.  A variety of modular components from the literature are implemented, e.g. two reservoir types - echo state networks and cellular automata models, and multiple training methods including Gaussian and support vector regression. A comprehensive documentation, which includes reproduced experiments from the literature is provided. The code and documentation are hosted on Github under an MIT license https://github.com/SciML/ReservoirComputing.jl."
"18711","Estimating Causal Effects under Network Interference with Bayesian Generalized Propensity Scores","Laura Forastiere, Fabrizia Mealli, Albert Wu, Edoardo M. Airoldi","https://jmlr.org//papers/volume23/18-711/18-711.pdf","","Real-world systems are often comprised of interconnected units, and can be represented as networks, with nodes and edges. In a social system, for instance, individuals  may have social ties and financial relationships. In these settings, when a node (the unit analysis) is exposed to a treatment, its effects may spill over to connected units; then estimating both the direct effect of the treatment and its spillover effects presents several challenges. First, assumptions about the mechanism through which spillover effects occur along the observed network are required. Second, in observational studies, where the treatment assignment has not been randomized, confounding and homophily are further potential threats to the identification and to the estimation of causal effects, on networks. Here, we make two structural assumptions: (i) neighborhood interference, which assumes interference operates only through a function of the immediate neighbors' treatments, and (ii) unconfoundedness of the individual and neighborhood treatment, which rules out the presence of unmeasured confounding variables, including those driving homophily.  Under these assumptions we develop a new covariate-adjustment estimator for direct treatment and spillover effects in observational studies on networks. We proposed an estimation strategy based on a generalized propensity score that balances individual and neighborhood covariates across units under different levels of individual treatment and of exposure to neighbors' treatment. Adjustment for propensity score is performed using a penalized spline regression. 
Our inference strategy capitalizes on a three-step Bayesian procedure, which allows us to account for the uncertainty in the propensity score estimation, and avoids model feedback. The correlation among connected units is taken into account using a community detection algorithm, and incorporating random effects in the outcome model. All these sources of variability, including variability of treatment assignment, are accounted for in the posterior distribution of the finite-sample causal estimands we target. We design a simulation study to assess the performance of the proposed estimator on different network topologies, both on synthetic networks and on a real friendship network from the Add-Health study."
"201002","Regularized and Smooth Double Core Tensor Factorization for Heterogeneous Data","Davoud Ataee Tarzanagh, George Michailidis","https://jmlr.org//papers/volume23/20-1002/20-1002.pdf","https://github.com/Tarzanagh/DCOT","We introduce a general tensor model suitable for data analytic tasks for heterogeneous datasets, wherein there are joint low-rank structures within groups of observations, but also discriminative structures across different groups. To capture such complex structures, a double core tensor (DCOT) factorization model is introduced together with a family of smoothing loss functions. By leveraging the proposed smoothing function, the model accurately estimates the model factors, even in the presence of missing entries. A linearized ADMM method is employed to solve regularized versions of DCOT factorizations, that avoid large tensor operations and large memory storage requirements. Further, we establish theoretically its global convergence, together with consistency of the estimates of the model parameters. The effectiveness of the DCOT model is illustrated on several real-world examples including image completion, recommender systems, subspace clustering, and detecting modules in heterogeneous Omics multi-modal data, since it provides more insightful decompositions than conventional tensor methods."
"201104","Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays","Lukasz Kidzinski, Francis K.C. Hui, David I. Warton, Trevor J. Hastie","https://jmlr.org//papers/volume23/20-1104/20-1104.pdf","https://github.com/kidzik/gmf","Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses. However, current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large data sets with thousands of observational units or responses. In this article, we propose a new approach for fitting GLLVMs to high-dimensional data sets, based on approximating the model using penalized quasi-likelihood and then using a Newton method and Fisher scoring to learn the model parameters. Computationally, our method is noticeably faster and more stable, enabling GLLVM fits to much larger matrices than previously possible. We apply our method on a data set of 48,000 observational units with over 2,000 observed species in each unit and find that most of the variability can be explained with a handful of factors. We publish an easy-to-use implementation of our proposed fitting algorithms."
"201255","Two-mode Networks: Inference with as Many Parameters as Actors and Differential Privacy","Qiuping Wang, Ting Yan, Binyan Jiang, Chenlei Leng","https://jmlr.org//papers/volume23/20-1255/20-1255.pdf","","Many network data encountered are two-mode networks. These networks are characterized by having two sets of nodes and links are only made between nodes belonging to different sets. While their two-mode feature triggers interesting interactions, it also increases the risk of  privacy exposure, and it is essential to protect sensitive information from being disclosed when releasing these data. In this paper, we introduce a weak notion of edge differential privacy and propose to release the degree sequence of a two-mode network by adding non-negative Laplacian noises that satisfies this privacy definition. Under mild conditions for an exponential-family model for bipartite graphs in which each node is individually parameterized, we establish the consistency and Asymptotic normality of two differential privacy estimators, the first based on moment equations and the second after denoising the noisy sequence. For the latter, we develop an efficient algorithm which produces a readily useful synthetic bipartite graph. Numerical simulations and a real data application are carried out to verify  our theoretical results and demonstrate the usefulness of our proposal."
"201339","Expected Regret and Pseudo-Regret are Equivalent When the Optimal Arm is Unique","Daron Anderson, Douglas J. Leith","https://jmlr.org//papers/volume23/20-1339/20-1339.pdf","","In online linear optimisation with stochastic losses it is common to bound the pseudo-regret of an algorithm rather than the expected regret.  This is attributed to the expected fluctuations for i.i.d sums making expected regret bounds better than  $\Omega(\sqrt T)$ impossible.   In this paper we show that when there is a unique optimal action and the action set is a polytope the difference between pseudo-regret and expected regret is $o(1)$.   This means that the existing upper bounds on pseudo-regret in the literature can immediately be extended to also upper bound the expected regret.   Our results are independent of the algorithm used to select the actions and apply equally to the bandit and full-information settings."
"201405","Linearization and Identification of Multiple-Attractor Dynamical Systems through Laplacian Eigenmaps","Bernardo Fichera, Aude Billard","https://jmlr.org//papers/volume23/20-1405/20-1405.pdf","","Dynamical Systems (DS) are fundamental to the modeling and understanding time evolving phenomena, and have application in physics, biology and control. As determining an analytical description of the dynamics is often difficult, data-driven approaches are preferred for identifying and controlling nonlinear DS with multiple equilibrium points. Identification of such DS has been treated largely as a supervised learning problem. Instead, we focus on an unsupervised learning scenario where we know neither the number nor the type of dynamics. We propose a Graph-based spectral clustering method that takes advantage of a velocity-augmented kernel to connect data points belonging to the same dynamics, while preserving the natural temporal evolution. We study the eigenvectors and eigenvalues of the Graph Laplacian and show that they form a set of orthogonal embedding spaces, one for each sub-dynamics. We prove that there always exist a set of 2-dimensional embedding spaces in which the sub-dynamics are linear and n-dimensional embedding spaces where they are quasi-linear. We compare the clustering performance of our algorithm to Kernel K-Means, Spectral Clustering and Gaussian Mixtures and show that, even when these algorithms are provided with the correct number of sub-dynamics, they fail to cluster them correctly. We learn a diffeomorphism from the Laplacian embedding space to the original space and show that the Laplacian embedding leads to good reconstruction accuracy and a faster training time through an exponential decaying loss compared to the state-of-the-art diffeomorphism-based approaches."
"20296","Semiparametric Inference For Causal Effects In Graphical Models With Hidden Variables","Rohit Bhattacharya, Razieh Nabi, Ilya Shpitser","https://jmlr.org//papers/volume23/20-296/20-296.pdf","https://ananke.readthedocs.io/en/latest/","Identification theory for causal effects in causal models associated with hidden variable directed acyclic graphs (DAGs) is well studied. However, the corresponding algorithms are underused due to the complexity of estimating the identifying functionals they output. In this work, we bridge the gap between identification and estimation of population-level causal effects involving a single treatment and a single outcome. We derive influence function based estimators that exhibit double robustness for the identified effects in a large class of hidden variable DAGs where the treatment satisfies a simple graphical criterion; this class includes models yielding the adjustment and front-door functionals as special cases. We also provide necessary and sufficient conditions under which the statistical model of a hidden variable DAG is nonparametrically saturated and implies no equality constraints on the observed data distribution. Further, we derive an important class of hidden variable DAGs that imply observed data distributions observationally equivalent (up to equality constraints) to fully observed DAGs. In these classes of DAGs, we derive estimators that achieve the semiparametric efficiency bounds for the target of interest where the treatment satisfies our graphical criterion. Finally, we provide a sound and complete identification algorithm that directly yields a weight based estimation strategy for any identifiable effect in hidden variable causal models."
"20667","Stable Classification","Dimitris Bertsimas, Jack Dunn, Ivan Paskov","https://jmlr.org//papers/volume23/20-667/20-667.pdf","","We address the problem of instability of classification models: small changes in the training data leading to large changes in the resulting model and predictions. This phenomenon is especially well established for single tree based methods such as CART, however it is present  in all classification methods. We apply robust optimization to improve the stability of four of the most commonly used classification methods:  Random Forests, Logistic Regression, Support Vector Machines, and Optimal Classification Trees. Through experiments on 30 data sets with sizes ranging between 10^2 and 10^4 observations and features,  we show that our approach  (a) leads to improvements in stability, and in some cases accuracy, compared to the original methods, with the gains in stability being particularly significant (even, surprisingly, for those methods that were previously thought to be stable, such as Random Forests) and (b) has computational times comparable with (and indeed in some cases even faster than) the original methods allowing the method to be very scalable."
"210007","Handling Hard Affine SDP Shape Constraints in RKHSs","Pierre-Cyril Aubin-Frankowski, Zoltan Szabo","https://jmlr.org//papers/volume23/21-0007/21-0007.pdf","https://github.com/PCAubin/Handling-Hard-Affine-SDP-Shape-Constraints-in-RKHSs","Shape constraints, such as non-negativity, monotonicity, convexity  or supermodularity, play a key role in various applications of machine learning and statistics. However, incorporating this side information into predictive models in a hard way (for example at all points of an interval) for rich function classes is a notoriously challenging problem. We propose a unified and modular convex optimization framework, relying on second-order cone (SOC) tightening, to encode hard affine SDP constraints on function derivatives, for models belonging to vector-valued reproducing kernel Hilbert spaces (vRKHSs). The modular nature of the proposed approach allows to simultaneously handle multiple shape constraints, and to tighten an infinite number of constraints into finitely many. We prove the convergence of the proposed scheme and that of its adaptive variant, leveraging geometric properties of vRKHSs. Due to the covering-based construction of the tightening, the method is particularly well-suited to tasks with small to moderate input dimensions. The efficiency of the approach is illustrated in the context of shape optimization, safety-critical control, robotics and econometrics."
"210174","JsonGrinder.jl: automated differentiable neural architecture for embedding arbitrary JSON data","Šimon Mandlík, Matěj Račinský, Viliam Lisý, Tomáš Pevný","https://jmlr.org//papers/volume23/21-0174/21-0174.pdf","https://github.com/CTUAvastLab/JsonGrinder.jl","Standard machine learning (ML) problems are formulated on data converted into a suitable tensor representation. However, there are data sources, for example in cybersecurity, that are naturally represented in a unifying hierarchical structure, such as XML, JSON, and Protocol Buffers. Converting this data to a tensor representation is usually done by manual feature engineering, which is laborious, lossy, and prone to bias originating from the human inability to correctly judge the importance of particular features. JsonGrinder.jl is a library automating various ML tasks on these difficult sources. Starting with an arbitrary set of JSON samples, it automatically creates a differentiable ML model (called hmilnet), which embeds raw JSON samples into a fixed-size tensor representation. This embedding network can be naturally extended by an arbitrary  ML model expecting tensor inputs in order to perform classification, regression, or clustering."
"210369","Interpretable Classification of Categorical Time Series Using the Spectral Envelope and Optimal Scalings","Zeda Li, Scott A. Bruce, Tian Cai","https://jmlr.org//papers/volume23/21-0369/21-0369.pdf","https://github.com/zedali16/envsca","This article introduces a novel approach to the classification of categorical time series under the supervised learning paradigm. To construct meaningful features for categorical time series classification, we consider two relevant quantities: the spectral envelope and its corresponding set of optimal scalings. These quantities characterize oscillatory patterns in a categorical time series as the largest possible power at each frequency, or spectral envelope, obtained by assigning numerical values, or scalings, to categories that optimally emphasize oscillations at each frequency. Our procedure combines these two quantities to produce an interpretable and parsimonious feature-based classifier that can be used to accurately determine group membership for categorical time series. Classification consistency of the proposed method is investigated, and simulation studies are used to demonstrate accuracy in classifying categorical time series with various underlying group structures.  Finally, we use the proposed method to explore key differences in oscillatory patterns of sleep stage time series for patients with different sleep disorders and accurately classify patients accordingly.  The code for implementing the proposed method is available at https://github.com/zedali16/envsca."
"210494","More Powerful Conditional Selective Inference for Generalized Lasso by Parametric Programming","Vo Nguyen Le Duy, Ichiro Takeuchi","https://jmlr.org//papers/volume23/21-0494/21-0494.pdf","https://github.com/vonguyenleduy/parametric_generalized_lasso_selective_inference","Conditional selective inference (SI) has been studied intensively as a new statistical inference framework for data-driven hypotheses. The basic concept of conditional SI is to make the inference conditional on the selection event, which enables an exact and valid statistical inference to be conducted even when the hypothesis is selected based on the data. Conditional SI has mainly been studied in the context of model selection, such as vanilla lasso or generalized lasso. The main limitation of existing approaches is the low statistical power owing to over-conditioning, which is required for computational tractability. In this study, we propose a more powerful and general conditional SI method for a class of problems that can be converted into quadratic parametric programming, which includes generalized lasso. The key concept is to compute the continuum path of the optimal solution in the direction of the selected test statistic and to identify the subset of the data space that corresponds to the model selection event by following the solution path. The proposed parametric programming-based method not only avoids the aforementioned major drawback of over-conditioning, but also improves the performance and practicality of SI in various respects. We conducted several experiments to demonstrate the effectiveness and efficiency of our proposed method."
"210524","Theoretical Foundations of t-SNE for Visualizing High-Dimensional Clustered Data","T. Tony Cai, Rong Ma","https://jmlr.org//papers/volume23/21-0524/21-0524.pdf","","This paper investigates the theoretical foundations of the t-distributed stochastic neighbor embedding (t-SNE) algorithm, a popular nonlinear dimension reduction and data visualization method. A novel theoretical framework for the analysis of t-SNE based on the gradient descent approach is presented. For the early exaggeration stage of t-SNE, we show its asymptotic equivalence to power iterations based on the underlying graph Laplacian, characterize its limiting behavior, and uncover its deep connection to Laplacian spectral clustering, and fundamental principles including early stopping as implicit regularization. The results explain the intrinsic mechanism and the empirical benefits of such a computational strategy. For the embedding stage of t-SNE, we characterize the kinematics of the low-dimensional map throughout the iterations, and identify an amplification phase, featuring the intercluster repulsion and the expansive behavior of the low-dimensional map, and a stabilization phase. The general theory explains the fast convergence rate and the exceptional empirical performance of t-SNE for visualizing clustered data, brings forth the interpretations of the t-SNE visualizations, and provides theoretical guidance for applying t-SNE and selecting its tuning parameters in various applications."
"210614","On Instrumental Variable Regression for Deep Offline Policy Evaluation","Yutian Chen, Liyuan Xu, Caglar Gulcehre, Tom Le Paine, Arthur Gretton, Nando de Freitas, Arnaud Doucet","https://jmlr.org//papers/volume23/21-0614/21-0614.pdf","https://github.com/liyuan9988/IVOPEwithACME","We show that the popular reinforcement learning (RL) strategy of estimating the state-action value (Q-function) by  minimizing the mean squared Bellman error leads to a regression problem with confounding, the inputs and  output  noise  being  correlated. Hence, direct minimization of the Bellman error can result in significantly biased Q-function estimates. We explain why fixing the target Q-network in Deep Q-Networks and Fitted Q Evaluation provides a way of overcoming this confounding, thus shedding new light on this popular but not well understood trick in the deep RL literature. An alternative approach to address confounding is to leverage techniques developed in the causality literature, notably instrumental variables (IV). We bring together here the literature on IV and RL by investigating whether IV approaches can lead to improved Q-function estimates. This paper analyzes and compares a wide range of recent IV methods in the context of offline policy evaluation (OPE), where the goal is to estimate the value of a policy using logged data only. By applying different IV techniques to OPE, we are not only able to recover previously proposed OPE methods such as model-based techniques but also to obtain competitive new techniques. We find empirically that state-of-the-art OPE methods are closely matched in performance by some IV methods such as AGMM, which were not developed for OPE. We open-source all our code and datasets at https://github.com/liyuan9988/IVOPEwithACME."
"210644","Graph Partitioning and Sparse Matrix Ordering using Reinforcement Learning and Graph Neural Networks","Alice Gatti, Zhixiong Hu, Tess Smidt, Esmond G. Ng, Pieter Ghysels","https://jmlr.org//papers/volume23/21-0644/21-0644.pdf","https://github.com/alga-hopf/drl-graph-partitioning","We present a novel method for graph partitioning, based on reinforcement learning and graph convolutional neural networks. Our approach is to recursively partition coarser representations of a given graph. The neural network is implemented using SAGE graph convolution layers, and trained using an advantage actor critic (A2C) agent. We present two variants, one for finding ean edge separator that minimizes the normalized cut or quotient cut, and one that finds a small vertex separator. The vertex separators are then used to construct a nested dissection ordering to permute a sparse matrix so that its triangular factorization will incur less fill-in. The partitioning quality is compared with partitions obtained using METIS and SCOTCH, and the nested dissection ordering is evaluated in the sparse solver SuperLU. Our results show that the proposed method achieves similar partitioning quality as METIS, SCOTCH and spectral partitioning. Furthermore, the method generalizes across different classes of graphs, and works well on a variety of graphs from the SuiteSparse sparse matrix collection."
"210696","Variational Inference in high-dimensional linear regression","Sumit Mukherjee, Subhabrata Sen","https://jmlr.org//papers/volume23/21-0696/21-0696.pdf","","We study high-dimensional bayesian linear regression with product priors. Using the nascent theory of “non-linear large deviations"" (Chatterjee and Dembo, 2016), we derive sufficient conditions for the leading-order correctness of the naive mean-field approximation to the log-normalizing constant of the posterior distribution. Subsequently, assuming a true  linear model for the observed data, we derive a limiting infinite dimensional variational formula for the log normalizing constant for the posterior. Furthermore, we establish that under an additional “separation"" condition, the variational problem has a unique optimizer, and this optimizer governs the probabilistic properties of the posterior distribution. We provide intuitive sufficient conditions for the validity of this “separation"" condition. Finally, we illustrate our results on concrete examples with specific design matrices."
"210722","Tree-Values: Selective Inference for Regression Trees","Anna C. Neufeld, Lucy L. Gao, Daniela M. Witten","https://jmlr.org//papers/volume23/21-0722/21-0722.pdf","https://github.com/anna-neufeld/treevalues-simulations","We consider conducting inference on the output of the Classification and Regression Tree (CART) (Breiman et al., 1984) algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting inference on a fitted CART tree. In a nutshell, we condition on the fact that the tree was estimated from the data. We propose a test for the difference in the mean response between a pair of terminal nodes that controls the selective Type 1 error rate, and a confidence interval for the mean response within a single terminal node that attains the nominal selective coverage. Efficient algorithms for computing the necessary conditioning sets are provided. We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake."
"210889","Pathfinder:  Parallel quasi-Newton variational inference","Lu Zhang, Bob Carpenter, Andrew Gelman, Aki Vehtari","https://jmlr.org//papers/volume23/21-0889/21-0889.pdf","https://github.com/LuZhangstat/Pathfinder","We propose Pathfinder, a variational method for approximately sampling from differentiable probability densities.  Starting from a random initialization, Pathfinder locates normal approximations to the target density along a quasi-Newton optimization path, with local covariance estimated using the inverse Hessian estimates produced by the optimizer.  Pathfinder returns draws from the approximation with the lowest estimated Kullback-Leibler (KL) divergence to the target distribution. We evaluate Pathfinder on a wide range of posterior distributions, demonstrating that its approximate draws are better than those from automatic differentiation variational inference (ADVI) and comparable to those produced by short chains of dynamic Hamiltonian Monte Carlo (HMC), as measured by 1-Wasserstein distance.  Compared to ADVI and short dynamic HMC runs, Pathfinder requires one to two orders of magnitude fewer log density and gradient evaluations, with greater reductions for more challenging posteriors.  Importance resampling over multiple runs of Pathfinder improves the diversity of approximate draws, reducing 1-Wasserstein distance further and providing a measure of robustness to optimization failures on plateaus, saddle points, or in minor modes.  The Monte Carlo KL divergence estimates are embarrassingly parallelizable in the core Pathfinder algorithm, as are multiple runs in the resampling version, further increasing Pathfinder's speed advantage with multiple cores."
"210946","Learning from Noisy Pairwise Similarity and Unlabeled Data","Songhua Wu, Tongliang Liu, Bo Han, Jun Yu, Gang Niu, Masashi Sugiyama","https://jmlr.org//papers/volume23/21-0946/21-0946.pdf","https://github.com/scifancier/Learning-from-Noisy-Pairwise-Similarity-and-Unlabeled-Data","SU classification employs similar (S) data pairs (two examples belong to the same class) and unlabeled (U) data points to build a classifier, which can serve as an alternative to the standard supervised trained classifiers requiring data points with class labels. SU classification is advantageous because in the era of big data, more attention has been paid to data privacy. Datasets with specific class labels are often difficult to obtain in real-world classification applications regarding privacy-sensitive matters, such as politics and religion, which can be a bottleneck in supervised classification. Fortunately, similarity labels do not reveal the explicit information and inherently protect the privacy, e.g., collecting answers to “With whom do you share the same opinion on issue $\mathcal{I}$?"" instead of “What is your opinion on issue $\mathcal{I}$?"". Nevertheless, SU classification still has an obvious limitation: respondents might answer these questions in a manner that is viewed favorably by others instead of answering truthfully. Therefore, there exist some dissimilar data pairs labeled as similar, which significantly degenerates the performance of SU classification. In this paper, we study how to learn from noisy similar (nS) data pairs and unlabeled (U) data, which is called nSU classification. Specifically, we carefully model the similarity noise and estimate the noise rate by using the mixture proportion estimation technique. Then, a clean classifier can be learned by minimizing a denoised and unbiased classification risk estimator, which only involves the noisy data. Moreover, we further derive a theoretical generalization error bound for the proposed method. 
Experimental results demonstrate the effectiveness of the proposed algorithm on several benchmark datasets."
"211070","On Regularized Square-root Regression Problems: Distributionally Robust Interpretation and Fast Computations","Hong T.M. Chu, Kim-Chuan Toh, Yangjing Zhang","https://jmlr.org//papers/volume23/21-1070/21-1070.pdf","","Square-root (loss) regularized models have recently become popular in linear regression due to their nice statistical properties. Moreover, some of these models can be interpreted as the distributionally robust optimization  counterparts of the traditional least-squares regularized models. In this paper, we give a unified proof to show that any square-root regularized model whose penalty function being the sum of a simple norm and a seminorm can be interpreted as the distributionally robust optimization (DRO) formulation of the corresponding least-squares problem. In particular, the optimal transport cost in the DRO formulation is given by a certain dual form of the penalty. To solve the resulting square-root regularized model whose loss function and penalty function are both nonsmooth, we design a proximal point dual semismooth Newton algorithm and demonstrate its efficiency when the penalty is the  sparse group Lasso penalty or the  fused Lasso penalty. Extensive experiments demonstrate that our algorithm is highly efficient for solving the square-root sparse group Lasso problems and the square-root fused Lasso problems."
"211079","The Separation Capacity of Random Neural Networks","Sjoerd Dirksen, Martin Genzel, Laurent Jacques, Alexander Stollenwerk","https://jmlr.org//papers/volume23/21-1079/21-1079.pdf","","Neural networks with random weights appear in a variety of machine learning applications, most prominently as the initialization of many deep learning algorithms and as a computationally cheap alternative to fully learned neural networks. In the present article, we enhance the theoretical understanding of random neural networks by addressing the following data separation problem: under what conditions can a random neural network make two classes $\mathcal{X}^-, \mathcal{X}^+ \subset \mathbb{R}^d$ (with positive distance) linearly separable? We show that a sufficiently large two-layer ReLU-network with standard Gaussian weights and uniformly distributed biases can solve this problem with high probability. Crucially, the number of required neurons is explicitly linked to geometric properties of the underlying sets $\mathcal{X}^-, \mathcal{X}^+$ and their mutual arrangement. This instance-specific viewpoint allows us to overcome the usual curse of dimensionality (exponential width of the layers) in non-pathological situations where the data carries low-complexity structure. We quantify the relevant structure of the data in terms of a novel notion of mutual complexity (based on a localized version of Gaussian mean width), which leads to sound and informative separation guarantees. We connect our result with related lines of work on approximation, memorization, and generalization."
"211234","Detecting Latent Communities in Network Formation Models","Shujie Ma, Liangjun Su, Yichong Zhang","https://jmlr.org//papers/volume23/21-1234/21-1234.pdf","","This paper proposes a logistic undirected network formation model which allows for assortative matching on observed individual characteristics and the presence of edge-wise fixed effects. We model the coefficients of observed characteristics to have a latent community structure and the edge-wise fixed effects to be of low rank. We propose a multi-step estimation procedure involving nuclear norm regularization, sample splitting, iterative logistic regression and spectral clustering to detect the latent communities. We show that the latent communities can be exactly recovered when the expected degree of the network is of order logn or higher, where n is the number of nodes in the network. The finite sample performance of the new estimation and inference methods is illustrated through both simulated and real datasets."
"211259","Toward Understanding Convolutional Neural Networks from Volterra Convolution Perspective","Tenghui Li, Guoxu Zhou, Yuning Qiu, Qibin Zhao","https://jmlr.org//papers/volume23/21-1259/21-1259.pdf","https://github.com/tenghuilee/nnvolterra.git","We make an attempt to understand convolutional neural network by exploring the relationship between (deep) convolutional neural networks and Volterra convolutions. We propose a novel approach to explain and  study the overall characteristics of neural networks without being disturbed by the horribly complex architectures. Specifically, we  attempt to convert the basic structures of a convolutional neural network (CNN) and their combinations to the form of Volterra convolutions. The results show that most of convolutional neural networks can be approximated in the form of Volterra convolution, where the approximated proxy kernels preserve the characteristics of the original network. Analyzing these proxy kernels may give valuable insight about the original network. Based on this setup, we present methods to approximate the order-zero and order-one proxy kernels, and verify the correctness and effectiveness of our results."
"211341","Nystrom Regularization for Time Series Forecasting","Zirui Sun, Mingwei Dai, Yao Wang, Shao-Bo Lin","https://jmlr.org//papers/volume23/21-1341/21-1341.pdf","","This paper focuses on learning rate analysis of Nystrom regularization with sequential sub-sampling for $\tau$-mixing time series. Using a recently developed Banach-valued Bernstein inequality for $\tau$-mixing sequences and an integral operator approach based on second-order decomposition, we succeed in deriving almost optimal learning rates of Nystrom regularization with sequential sub-sampling for $\tau$-mixing time series. A series of numerical experiments are carried out to verify our theoretical results, showing the  excellent learning performance of Nystrom regularization with sequential sub-sampling in learning massive  time series data. All these results extend the applicable range of  Nystr\""{o}m regularization  from i.i.d. samples to non-i.i.d. sequences."
"211483","Intrinsic Dimension Estimation Using Wasserstein Distance","Adam Block, Zeyu Jia, Yury Polyanskiy, Alexander Rakhlin","https://jmlr.org//papers/volume23/21-1483/21-1483.pdf","","It has long been thought that high-dimensional data encountered in many practical machine learning tasks have low-dimensional structure, i.e., the manifold hypothesis holds. A natural question, thus, is to estimate the intrinsic dimension of a given population distribution from a finite sample. We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guarantees. We then apply our techniques to get new sample complexity bounds for Generative Adversarial Networks (GANs) depending only on the intrinsic dimension of the data."
"211507","Oracle Complexity in Nonsmooth Nonconvex Optimization","Guy Kornowski, Ohad Shamir","https://jmlr.org//papers/volume23/21-1507/21-1507.pdf","","It is well-known that given a smooth, bounded-from-below, and possibly nonconvex function, standard gradient-based methods can find $\epsilon$-stationary points (with gradient norm less than $\epsilon$) in $\mathcal{O}(1/\epsilon^2)$ iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently not smooth, making these results inapplicable. In this paper, we study nonsmooth nonconvex optimization from an oracle complexity viewpoint, where the algorithm is assumed to be given access only to local information about the function at various points. We provide two main results: First, we consider the problem of getting near $\epsilon$-stationary points. This is perhaps the most natural relaxation of finding $\epsilon$-stationary points, which is impossible in the nonsmooth nonconvex case. We prove that this relaxed goal cannot be achieved efficiently, for any distance and $\epsilon$ smaller than some constants. Our second result deals with the possibility of tackling nonsmooth nonconvex optimization by reduction to smooth optimization: Namely, applying smooth optimization methods on a smooth approximation of the objective function. For this approach, we prove under a mild assumption an inherent trade-off between oracle complexity and smoothness: On the one hand, smoothing a nonsmooth nonconvex function can be done very efficiently (e.g., by randomized smoothing), but with dimension-dependent factors in the smoothness parameter, which can strongly affect iteration complexity when plugging into standard smooth optimization methods. On the other hand, these dimension factors can be  eliminated with suitable smoothing methods, but only by making the oracle complexity of the smoothing process exponentially large."
"220017","d3rlpy: An Offline Deep Reinforcement Learning Library","Takuma Seno, Michita Imai","https://jmlr.org//papers/volume23/22-0017/22-0017.pdf","https://github.com/takuseno/d3rlpy","In this paper, we introduce d3rlpy, an open-sourced offline deep reinforcement learning (RL) library for Python. d3rlpy supports a set of offline deep RL algorithms as well as off-policy online algorithms via a fully documented plug-and-play API. To address a reproducibility issue, we conduct a large-scale benchmark with D4RL and Atari 2600 dataset to ensure implementation quality and provide experimental scripts and full tables of results. The d3rlpy source code can be found on GitHub: https://github.com/takuseno/d3rlpy."
"220185","WarpDrive: Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU","Tian Lan, Sunil Srinivasa, Huan Wang, Stephan Zheng","https://jmlr.org//papers/volume23/22-0185/22-0185.pdf","https://github.com/salesforce/warp-drive","WarpDrive is a flexible, lightweight, and easy-to-use open-source framework for end-to-end deep multi-agent reinforcement learning (MARL) on a Graphics Processing Unit (GPU), available at https://github.com/salesforce/warp-drive. It addresses key system bottlenecks when applying MARL to complex environments with high-dimensional state, observation, or action spaces. For example, WarpDrive eliminates data copying between the CPU and GPU and runs thousands of simulations and agents in parallel. It also enables distributed training on multiple GPUs and scales to millions of agents. In all, WarpDrive enables orders-of-magnitude faster MARL compared to common CPU-GPU implementations. For example, WarpDrive yields 2.9 million environment steps/second with 2000 environments and 1000 agents (at least 100× faster than a CPU version) in a 2d-Tag simulation. It is user-friendly: e.g., it provides a lightweight, extendable Python interface and flexible environment wrappers. It is also compatible with PyTorch. In all, WarpDrive offers a platform to significantly accelerate reinforcement learning research and development."
"220207","Nonparametric Neighborhood Selection in Graphical Models","Hao Dong, Yuedong Wang","https://jmlr.org//papers/volume23/22-0207/22-0207.pdf","","The neighborhood selection method directly explores the conditional dependence structure and has been widely used to construct undirected graphical models. However, except for some special cases with discrete data, there is little research on nonparametric methods for neighborhood selection with mixed data. This paper develops a fully nonparametric neighborhood selection method under a consolidated smoothing spline ANOVA (SS ANOVA) decomposition framework. The proposed model is flexible and contains many existing models as special cases. The proposed method provides a unified framework for mixed data without any restrictions on the type of each random variable. We detect edges by applying an L1 regularization to interactions in the SS ANOVA decomposition. We propose an iterative procedure to compute the estimates and establish the convergence rates for conditional density and interactions. Simulations indicate that the proposed methods perform well under Gaussian and non-Gaussian settings. We illustrate the proposed methods using two real data examples."
"220293","Hamilton-Jacobi equations on graphs with applications to semi-supervised learning and data depth","Jeff Calder, Mahmood Ettehad","https://jmlr.org//papers/volume23/22-0293/22-0293.pdf","https://github.com/jwcalder/peikonal","Shortest path graph distances are widely used in data science and machine learning, since they can approximate the underlying geodesic distance on the data manifold. However, the shortest path distance is highly sensitive to the addition of corrupted edges in the graph, either through noise or an adversarial perturbation. In this paper we study a family of Hamilton-Jacobi equations on graphs that we call the $p$-eikonal equation. We show that the $p$-eikonal equation with $p=1$ is a provably robust distance-type function on a graph, and the $p\to \infty$ limit recovers shortest path distances. While the $p$-eikonal equation does not correspond to a shortest-path graph distance, we nonetheless show that the continuum limit of the $p$-eikonal equation on a random geometric graph recovers a geodesic density weighted distance in the continuum. We consider applications of the $p$-eikonal equation to data depth and semi-supervised learning, and use the continuum limit to prove asymptotic consistency results for both applications. Finally, we show the results of experiments with data depth and semi-supervised learning on real image datasets, including MNIST, FashionMNIST and CIFAR-10, which show that the $p$-eikonal equation offers significantly better results compared to shortest path distances."
"220529","Self-Healing Robust Neural Networks via Closed-Loop Control","Zhuotong Chen, Qianxiao Li, Zheng Zhang","https://jmlr.org//papers/volume23/22-0529/22-0529.pdf","https://github.com/zhuotongchen/Self-Healing-Robust-Neural-Networks-via-Closed-Loop-Control","Despite the wide applications of neural networks, there have been increasing concerns about their vulnerability issue. While numerous attack and defense techniques have been developed, this work investigates the robustness issue from a new angle: can we design a self-healing neural network that can automatically detect and fix the vulnerability issue by itself? A typical self-healing mechanism is the immune system of a human body. This biology-inspired idea has been used in many engineering designs but has rarely been investigated in deep learning.  This paper considers the post-training self-healing of a neural network, and proposes a closed-loop control formulation to automatically detect and fix the errors caused by various attacks or perturbations. We provide a margin-based analysis to explain how this formulation can improve the robustness of a classifier. To speed up the inference, we convert the optimal control problem to Pontryagon's Maximum Principle and solve it via the method of successive approximation. Lastly, we present an error estimation of the proposed framework for neural networks with nonlinear activation functions. We validate the performance of several network architectures against various perturbations. Since the self-healing method does not need a-priori information about data perturbations or attacks, it can handle a broad class of unforeseen perturbations."
"220681","Network Regression with Graph Laplacians","Yidong Zhou, Hans-Georg Müller","https://jmlr.org//papers/volume23/22-0681/22-0681.pdf","https://github.com/yidongzhou/Network-Regression-with-Graph-Laplacians","Network data are increasingly available in various research fields, motivating statistical analysis for populations of networks, where a network as a whole is viewed as a data point. The study of how a network changes as a function of covariates is often of paramount interest. However, due to the non-Euclidean nature of networks, basic statistical tools available for scalar and vector data are no longer applicable. This motivates an extension of the notion of regression to the case where responses are network data. Here we propose to adopt conditional Fréchet means implemented as M-estimators that depend on weights derived from both global and local least squares regression, extending the Fréchet regression framework to networks that are quantified by their graph Laplacians. The challenge is to characterize the space of graph Laplacians to justify the application of Fréchet regression. This characterization then leads to asymptotic rates of convergence for the corresponding M-estimators by applying empirical process methods. We demonstrate the usefulness and good practical performance of the proposed framework with simulations and with network data arising from resting-state fMRI in neuroimaging, as well as New York taxi records."
"19795","On Low-rank Trace Regression under General Sampling Distribution","Nima Hamidi, Mohsen Bayati","https://jmlr.org//papers/volume23/19-795/19-795.pdf","https://github.com/mohsenbayati/cv-impute","In this paper, we study the trace regression when a matrix of parameters $\mathbf{B}^\star$ is estimated via the convex relaxation of a rank-regularized regression or via regularized non-convex optimization. It is known that these estimators satisfy near-optimal error bounds under assumptions on the rank, coherence, and spikiness of $\mathbf{B}^\star$. We start by introducing a general notion of spikiness for $\mathbf{B}^\star$ that provides a generic recipe to prove the restricted strong convexity of the sampling operator of the trace regression and obtain near-optimal and non-asymptotic error bounds for the estimation error. Similar to the existing literature, these results require the regularization parameter to be above a certain theory-inspired threshold that depends on observation noise that may be unknown in practice. Next, we extend the error bounds to cases where the regularization parameter is chosen via cross-validation. This result is significant in that existing theoretical results on cross-validated estimators (Kale et al., 2011; Kumar et al., 2013; Abou-Moustafa and Szepesvari, 2017) do not apply to our setting since the estimators we study are not known to satisfy their required notion of stability. Finally, using simulations on synthetic and real data, we show that the cross-validated estimator selects a near-optimal penalty parameter and outperforms the theory-inspired approach of selecting the parameter."
"201036","Community detection in sparse latent space models","Fengnan Gao, Zongming Ma, Hongsong Yuan","https://jmlr.org//papers/volume23/20-1036/20-1036.pdf","","We show that a simple community detection algorithm originated from stochastic blockmodel literature achieves consistency, and even optimality, for a broad and flexible class of sparse latent space models. The class of models includes latent eigenmodels (Hoff, 2008). The community detection algorithm is based on spectral clustering followed by local refinement via normalized edge counting. It is easy to implement and attains high accuracy with a low computational budget. The proof of its optimality depends on a neat equivalence between likelihood ratio test and edge counting in a simple vs. simple hypothesis testing problem that underpins the refinement step, which could be of independent interest."
"201129","Convergence Rates for Gaussian Mixtures of Experts","Nhat Ho, Chiao-Yu Yang, Michael I. Jordan","https://jmlr.org//papers/volume23/20-1129/20-1129.pdf","","We provide a theoretical treatment of over-specified Gaussian mixtures of experts with covariate-free gating networks. We establish the convergence rates of the maximum likelihood estimation (MLE) for these models. Our proof technique is based on a novel notion of algebraic independence of the expert functions. Drawing on optimal transport, we establish a connection between the algebraic independence of the expert functions and a certain class of partial differential equations (PDEs) with respect to the parameters. Exploiting this connection allows us to derive convergence rates for parameter estimation."
"201319","Improving Bayesian Network Structure Learning in the Presence of Measurement Error","Yang Liu, Anthony C. Constantinou, Zhigao Guo","https://jmlr.org//papers/volume23/20-1319/20-1319.pdf","https://github.com/Enderlogic/Spurious-Edge-Detection","Structure learning algorithms that learn the graph of a Bayesian network from observational data often do so by assuming the data correctly reflect the true distribution of the variables. However, this assumption does not hold in the presence of measurement error, which can lead to spurious edges. This is one of the reasons why the synthetic performance of these algorithms often overestimates real-world performance. This paper describes a heuristic algorithm that can be added as an additional learning phase at the end of any structure learning algorithm, and serves as a correction learning phase that removes potential false positive edges. The results show that the proposed correction algorithm successfully improves the graphical score of five well-established structure learning algorithms spanning different classes of learning in the presence of measurement error."
"201385","On Mixup Regularization","Luigi Carratino, Moustapha Cissé, Rodolphe Jenatton, Jean-Philippe Vert","https://jmlr.org//papers/volume23/20-1385/20-1385.pdf","","Mixup is a data augmentation technique that creates new examples as convex combinations of training points and labels. This simple technique has empirically shown to improve the accuracy of many state-of-the-art models in different settings and applications, but the reasons behind this empirical success remain poorly understood. In this paper we take a substantial step in explaining the theoretical foundations of Mixup, by clarifying its regularization effects. We show that Mixup can be interpreted as standard empirical risk minimization estimator subject to a combination of data transformation and random perturbation of the transformed data. We gain two core insights from this new interpretation. First, the data transformation suggests that, at test time, a model trained with Mixup should also be applied to transformed data, a one-line change in code that we show empirically to improve both accuracy and calibration of the prediction. Second, we show how the random perturbation of the new interpretation of Mixup induces multiple known regularization schemes, including label smoothing and reduction of the Lipschitz constant of the estimator. These schemes interact synergistically with each other, resulting in a self calibrated and effective regularization effect that prevents overfitting and overconfident predictions. We corroborate our theoretical analysis with experiments that support our conclusions."
"201424","Project and Forget: Solving Large-Scale Metric Constrained Problems","Rishi Sonthalia, Anna C. Gilbert","https://jmlr.org//papers/volume23/20-1424/20-1424.pdf","https://github.com/rsonthal/ProjectAndForget","Many important machine learning problems can be formulated as highly constrained convex optimization problems. One important example is metric constrained problems. In this paper, we show that standard optimization techniques can not be used to solve metric constrained problem. To solve such problems, we provide a general active set framework, called Project and Forget, and several variants thereof that use Bregman projections. Project and Forget is a general purpose method that can be used to solve highly constrained convex problems with many (possibly exponentially) constraints. We provide a theoretical analysis of Project and Forget and prove that our algorithms converge to the global optimal solution and have a linear rate of convergence. We demonstrate that using our method, we can solve large problem instances of general weighted correlation clustering, metric nearness, information theoretic metric learning and quadratically regularized optimal transport; in each case, out-performing the state of the art methods with respect to CPU times and problem sizes."
"20442","Kernel Autocovariance Operators of Stationary Processes: Estimation and Convergence","Mattes Mollenhauer, Stefan Klus, Christof Schütte, Péter Koltai","https://jmlr.org//papers/volume23/20-442/20-442.pdf","","We consider autocovariance operators of a stationary stochastic process on a Polish space that is embedded into a reproducing kernel Hilbert space. We investigate how empirical estimates of these operators converge along realizations of the process under various conditions. In particular, we examine ergodic and strongly mixing processes and obtain several asymptotic results as well as finite sample error bounds. We provide applications of our theory in terms of consistency results for kernel PCA with dependent data and the conditional mean embedding of transition probabilities. Finally, we use our approach to examine the nonparametric estimation of Markov transition operators and highlight how our theory can give a consistency analysis for a large family of spectral analysis methods including kernel-based dynamic mode decomposition."
"20899","Distributed Stochastic Gradient Descent: Nonconvexity, Nonsmoothness, and Convergence to Local Minima","Brian Swenson, Ryan Murray, H. Vincent Poor, Soummya Kar","https://jmlr.org//papers/volume23/20-899/20-899.pdf","","Gradient-descent (GD) based algorithms are an indispensable tool for optimizing modern machine learning models. The paper considers distributed stochastic GD (D-SGD)--a network-based variant of GD. Distributed algorithms play an important role in large-scale machine learning problems as well as the Internet of Things (IoT) and related applications. The paper considers two main issues. First, we study convergence of D-SGD to critical points when the loss function is nonconvex and nonsmooth. We consider a broad range of nonsmooth loss functions including those of practical interest in modern deep learning. It is shown that, for each fixed initialization, D-SGD converges to critical points of the loss with probability one. Next, we consider the problem of avoiding saddle points. It is well known that classical GD avoids saddle points; however, analogous results have been absent for distributed variants of GD. For this problem, we again assume that loss functions may be nonconvex and nonsmooth, but are smooth in a neighborhood of a saddle point. It is shown that, for any fixed initialization, D-SGD avoids such saddle points with probability one. Results are proved by studying the underlying (distributed) gradient flow, using the ordinary differential equation (ODE) method of stochastic approximation."
"210166","Joint Continuous and Discrete Model Selection via Submodularity","Jonathan Bunton, Paulo Tabuada","https://jmlr.org//papers/volume23/21-0166/21-0166.pdf","","In model selection problems for machine learning, the desire for a well-performing model with meaningful structure is typically expressed through a regularized optimization problem.  In many scenarios, however, the meaningful structure is specified in some discrete space, leading to difficult nonconvex optimization problems.  In this paper, we connect the model selection problem with structure-promoting regularizers to submodular function minimization with continuous and discrete arguments.  In particular, we leverage the theory of submodular functions to identify a class of these problems that can be solved exactly and efficiently with an agnostic combination of discrete and continuous optimization routines.  We show how simple continuous or discrete constraints can also be handled for certain problem classes and extend these ideas to a robust optimization framework.  We also show how some problems outside of this class can be embedded into the class, further extending the class of problems our framework can accommodate.  Finally, we numerically validate our theoretical results with several proof-of-concept examples with synthetic and real-world data, comparing against state-of-the-art algorithms."
"210182","ALMA: Alternating Minimization Algorithm for Clustering Mixture Multilayer Network","Xing Fan, Marianna Pensky, Feng Yu, Teng Zhang","https://jmlr.org//papers/volume23/21-0182/21-0182.pdf","","The paper considers a Mixture Multilayer Stochastic Block Model (MMLSBM), where layers can be partitioned into groups of similar networks, and networks in each group are equipped with a distinct Stochastic Block Model. The goal is to partition the multilayer network into clusters of similar layers, and to identify communities in those layers. Jing et al. (2020) introduced the MMLSBM and developed a clustering methodology, TWIST, based on regularized tensor decomposition. The present paper proposes a different technique, an alternating minimization algorithm (ALMA), that aims at simultaneous recovery of the layer partition, together with estimation of the matrices of connection probabilities of the distinct layers. Compared to TWIST, ALMA achieves higher accuracy, both theoretically and numerically."
"210420","The Geometry of Uniqueness, Sparsity and Clustering in Penalized Estimation","Ulrike Schneider, Patrick Tardivel","https://jmlr.org//papers/volume23/21-0420/21-0420.pdf","","We provide a necessary and sufficient condition for the uniqueness of penalized least-squares estimators whose penalty term is given by a norm with a polytope unit ball, covering a wide range of methods including SLOPE, PACS, fused, clustered and classical LASSO as well as the related method of basis pursuit. We consider a strong type of uniqueness that is relevant for statistical problems. The uniqueness condition is geometric and involves how the row span of the design matrix intersects the faces of the dual norm unit ball, which for SLOPE is given by the signed permutahedron. Further considerations based this condition also allow to derive results on sparsity and clustering features. In particular, we define the notion of a SLOPE pattern to describe both sparsity and clustering properties of this method and also provide a geometric characterization of accessible SLOPE patterns."
"210506","Maximum sampled conditional likelihood for informative subsampling","HaiYing Wang, Jae Kwang Kim","https://jmlr.org//papers/volume23/21-0506/21-0506.pdf","","Subsampling is a computationally effective approach to extract information from massive data sets when computing resources are limited. After a subsample is taken from the full data, most available methods use an inverse probability weighted (IPW) objective function to estimate the model parameters. The IPW estimator does not fully utilize the information in the selected subsample. In this paper, we propose to use the maximum sampled conditional likelihood estimator (MSCLE) based on the sampled data. We established the asymptotic normality of the MSCLE and prove that its asymptotic variance covariance matrix is the smallest among a class of asymptotically unbiased estimators, including the IPW estimator. We further discuss the asymptotic results with the L-optimal subsampling probabilities and illustrate the estimation procedure with generalized linear models. Numerical experiments are provided to evaluate the practical performance of the proposed method."
"210585","Distributional Random Forests: Heterogeneity Adjustment and Multivariate Distributional Regression","Domagoj Cevid, Loris Michel, Jeffrey Näf, Peter Bühlmann, Nicolai Meinshausen","https://jmlr.org//papers/volume23/21-0585/21-0585.pdf","","Random Forest is a successful and widely used regression and classification algorithm. Part of its appeal and reason for its versatility is its (implicit) construction of a kernel-type weighting function on training data, which can also be used for targets other than the original mean estimation. We propose a novel forest construction for multivariate responses based on their joint conditional distribution, independent of the estimation target and the data model. It uses a new splitting criterion based on the MMD distributional metric, which is suitable for detecting heterogeneity in multivariate distributions. The induced weights define an estimate of the full conditional distribution, which in turn can be used for arbitrary and potentially complicated targets of interest. The method is very versatile and convenient to use, as we illustrate on a wide range of examples. The code is available as Python and R packages drf."
"210618","Fully General Online Imitation Learning","Michael K. Cohen, Marcus Hutter, Neel Nanda","https://jmlr.org//papers/volume23/21-0618/21-0618.pdf","","In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way they would have if the demonstrator had been acting the whole time. In general, one mistake during learning can lead to completely different events. In the special setting of environments that restart, existing work provides formal guidance in how to imitate so that events unfold similarly, but outside that setting, no formal guidance exists. We address a fully general setting, in which the (stochastic) environment and demonstrator never reset, not even for training purposes, and we allow our imitator to learn online from the demonstrator. Our new conservative Bayesian imitation learner underestimates the probabilities of each available action, and queries for more data with the remaining probability. Our main result: if an event would have been unlikely had the demonstrator acted the whole time, that event's likelihood can be bounded above when running the (initially totally ignorant) imitator instead. Meanwhile, queries to the demonstrator rapidly diminish in frequency. If any such event qualifies as ""dangerous"", our imitator would have the notable distinction of being relatively ""safe""."
"210656","Causal Aggregation: Estimation and Inference of Causal Effects by Constraint-Based Data Fusion","Jaime Roquero Gimenez, Dominik Rothenhäusler","https://jmlr.org//papers/volume23/21-0656/21-0656.pdf","","In causal inference, it is common to estimate the causal effect of a single treatment variable on an outcome. However, practitioners may also be interested in the effect of simultaneous interventions on multiple covariates of a fixed target variable. We propose a novel method that allows to estimate the effect of joint interventions using data from different experiments in which only very few variables are manipulated. If there is only little randomized data or no randomized data at all, one can use observational data sets if certain parental sets are known or instrumental variables are available. If the joint causal effect is linear, the proposed method can be used for estimation and inference of joint causal effects, and we characterize conditions for identifiability. In the overidentified case, we indicate how to leverage all the available causal information across multiple data sets to efficiently estimate the causal effects. If the dimension of the covariate vector is large, we may only have a few samples in each data set. Under a sparsity assumption, we derive an estimator of the causal effects in this high-dimensional scenario. In addition, we show how to deal with the case where a lack of experimental constraints prevents direct estimation of the causal effects. When the joint causal effects are non-linear, we characterize conditions under which identifiability holds, and propose a non-linear causal aggregation methodology for experimental data sets similar to the gradient boosting algorithm where in each iteration we combine weak learners trained on different datasets using only unconfounded samples. We demonstrate the effectiveness of the proposed method on simulated and semi-synthetic data."
"210709","Faster Randomized Interior Point Methods for Tall/Wide Linear Programs","Agniva Chowdhury, Gregory Dexter, Palma London, Haim Avron, Petros Drineas","https://jmlr.org//papers/volume23/21-0709/21-0709.pdf","","Linear programming (LP) is an extremely useful tool which has been successfully applied to solve various problems in a wide range of areas, including operations research, engineering, economics, or even more abstract mathematical areas such as combinatorics. It is also used in many machine learning applications, such as $\ell_1$-regularized SVMs, basis pursuit, nonnegative matrix factorization, etc.  Interior Point Methods (IPMs) are one of the most popular methods to solve LPs both in theory and in practice. Their underlying complexity is dominated by the cost of solving a system of linear equations at each iteration.  In this paper, we consider both feasible and infeasible IPMs for the special case where the number of variables is much larger than the number of constraints. Using tools from Randomized Linear Algebra, we present a preconditioning technique that, when combined with the iterative solvers such as Conjugate Gradient or Chebyshev Iteration, provably guarantees that IPM algorithms (suitably modified to account for the error incurred by the approximate solver), converge to a feasible, approximately optimal solution, without increasing their iteration complexity. Our empirical evaluations verify our theoretical results on both real-world and synthetic data."
"210766","Statistical Optimality and Computational Efficiency of Nystrom Kernel PCA","Nicholas Sterge, Bharath K. Sriperumbudur","https://jmlr.org//papers/volume23/21-0766/21-0766.pdf","","Kernel methods provide an elegant framework for developing nonlinear learning algorithms from simple linear methods. Though these methods have superior empirical performance in several real data applications, their usefulness is inhibited by the significant computational burden incurred in large sample situations. Various approximation schemes have been proposed in the literature to alleviate these computational issues, and the approximate kernel machines are shown to retain the empirical performance. However, the theoretical properties of these approximate kernel machines are less well understood. In this work, we theoretically study the trade-off between computational complexity and statistical accuracy in Nystrom approximate kernel principal component analysis (KPCA), wherein we show that the Nystrom approximate KPCA matches the statistical performance of (non-approximate) KPCA while remaining computationally beneficial. Additionally, we show that Nystrom approximate KPCA outperforms the statistical behavior of another popular approximation scheme, the random feature approximation, when applied to KPCA."
"210917","Interval-censored Hawkes processes","Marian-Andrei Rizoiu, Alexander Soen, Shidi Li, Pio Calderon, Leanne J. Dong, Aditya Krishna Menon, Lexing Xie","https://jmlr.org//papers/volume23/21-0917/21-0917.pdf","","Interval-censored data solely records the aggregated counts of events during specific time intervals -- such as the number of patients admitted to the hospital or the volume of vehicles passing traffic loop detectors -- and not the exact occurrence time of the events. It is currently not understood how to fit the Hawkes point processes to this kind of data. Its typical loss function (the point process log-likelihood) cannot be computed without exact event times. Furthermore, it does not have the independent increments property to use the Poisson likelihood. This work builds a novel point process, a set of tools, and approximations for fitting Hawkes processes within interval-censored data scenarios. First, we define the Mean Behavior Poisson process (MBPP), a novel Poisson process with a direct parameter correspondence to the popular self-exciting Hawkes process. We fit MBPP in the interval-censored setting using an interval-censored Poisson log-likelihood (IC-LL). We use the parameter equivalence to uncover the parameters of the associated Hawkes process. Second, we introduce two novel exogenous functions to distinguish the exogenous from the endogenous events. We propose the multi-impulse exogenous function -- for when the exogenous events are observed as event time -- and the latent homogeneous Poisson process exogenous function -- for when the exogenous events are presented as interval-censored volumes. Third, we provide several approximation methods to estimate the intensity and compensator function of MBPP when no analytical solution exists. Fourth and finally, we connect the interval-censored loss of MBPP to a broader class of Bregman divergence-based functions. 
Using the connection, we show that the popularity estimation algorithm Hawkes Intensity Process (HIP) is a particular case of the MBPP. We verify our models through empirical testing on synthetic data and real-world data. We find that our MBPP outperforms HIP on real-world datasets for the task of popularity prediction. This work makes it possible to efficiently fit the Hawkes process to interval-censored data."
"210983","Early Stopping for Iterative Regularization with General Loss Functions","Ting Hu, Yunwen Lei","https://jmlr.org//papers/volume23/21-0983/21-0983.pdf","","In this paper, we investigate the early stopping strategy for the iterative regularization technique, which is based on gradient descent of convex loss functions in reproducing kernel Hilbert spaces without an explicit regularization term. This work shows that projecting the last iterate of the stopping time produces an estimator that can improve the generalization ability. Using the upper bound of the generalization errors, we  establish a close link between the iterative regularization and Tikhonov regularization scheme and explain theoretically why the two schemes have similar regularization paths in the existing  numerical simulations. We introduce a data-dependent way based on cross-validation to select the  stopping time.  We prove that the  a-posteriori selection way can retain the comparable generalization errors to those obtained by our stopping rules with a-prior parameters."
"211078","Fundamental Limits and Tradeoffs in Invariant Representation Learning","Han Zhao, Chen Dan, Bryon Aragam, Tommi S. Jaakkola, Geoffrey J. Gordon, Pradeep Ravikumar","https://jmlr.org//papers/volume23/21-1078/21-1078.pdf","","A wide range of machine learning applications such as privacy-preserving learning, algorithmic fairness, and domain adaptation/generalization among others, involve learning invariant representations of the data that aim to achieve two competing goals: (a) maximize information or accuracy with respect to a target response, and (b) maximize invariance or independence with respect to a set of protected features (e.g.\ for fairness, privacy, etc). Despite their wide applicability, theoretical understanding of the optimal tradeoffs --- with respect to accuracy, and invariance --- achievable by invariant representations is still severely lacking. In this paper, we provide an information theoretic analysis of such tradeoffs under both classification and regression settings. More precisely, we provide a geometric characterization of the accuracy and invariance achievable by any representation of the data; we term this feasible region the information plane. We provide an inner bound for this feasible region for the classification case, and an exact characterization for the regression case, which allows us to either bound or exactly characterize the Pareto optimal frontier between accuracy and invariance. Although our contributions are mainly theoretical, a key practical application of our results is in certifying the potential sub-optimality of any given representation learning algorithm for either classification or regression tasks. Our results shed new light on the fundamental interplay between accuracy and invariance, and may be useful in guiding the design of future representation learning algorithms."
"211150","Information-theoretic Classification Accuracy: A Criterion that Guides Data-driven Combination of Ambiguous Outcome Labels in Multi-class Classification","Chihao Zhang, Yiling Elaine Chen, Shihua Zhang, Jingyi Jessica Li","https://jmlr.org//papers/volume23/21-1150/21-1150.pdf","https://github.com/JSB-UCLA/ITCA","Outcome labeling ambiguity and subjectivity are ubiquitous in real-world datasets. While practitioners commonly combine ambiguous outcome labels for all data points (instances) in an ad hoc way to improve the accuracy of multi-class classification, there lacks a principled approach to guide the label combination for all data points by any optimality criterion. To address this problem, we propose the information-theoretic classification accuracy (ITCA), a criterion that balances the trade-off between prediction accuracy (how well do predicted labels agree with actual labels) and classification resolution (how many labels are predictable), to guide practitioners on how to combine ambiguous outcome labels. To find the optimal label combination indicated by ITCA, we propose two search strategies: greedy search and breadth-first search. ITCA and the two search strategies are adaptive to all machine-learning classification algorithms. Coupled with a classification algorithm and a search strategy, ITCA has two uses: improving prediction accuracy and identifying ambiguous labels. We first verify that ITCA achieves high accuracy with both search strategies in finding the correct label combinations on synthetic and real data. Then we demonstrate the effectiveness of ITCA in diverse applications, including medical prognosis, cancer survival prediction, user demographics prediction, and cell type classification. We also provide theoretical insights into ITCA by studying the oracle and the linear discriminant analysis classification algorithms. Python package itca (available at https://github.com/JSB-UCLA/ITCA) implements ITCA and the search strategies."
"211240","SGD with Coordinate Sampling: Theory and Practice","Rémi Leluc, François Portier","https://jmlr.org//papers/volume23/21-1240/21-1240.pdf","https://github.com/RemiLELUC/SCGD-Musketeer","While classical forms of stochastic gradient descent algorithm treat the different coordinates in the same way, a framework allowing for adaptive (non uniform) coordinate sampling is developed to leverage structure in data. In a non-convex setting and including zeroth-order gradient estimate, almost sure convergence as well as non-asymptotic bounds are established. Within the proposed framework, we develop an algorithm, MUSKETEER, based on a reinforcement strategy: after collecting information on the noisy gradients, it samples the most promising coordinate (all for one); then it moves along the one direction yielding an important decrease of the objective (one for all). Numerical experiments on both synthetic and real data examples confirm the effectiveness of MUSKETEER in large scale problems."
"211306","Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor Critic under State Distribution Mismatch","Shangtong Zhang, Remi Tachet des Combes, Romain Laroche","https://jmlr.org//papers/volume23/21-1306/21-1306.pdf","https://github.com/ShangtongZhang/DeepRL","In this paper,  we establish the global optimality and convergence rate of an off-policy actor critic algorithm in the tabular setting without using density ratio to correct the discrepancy between the state distribution of the behavior policy and that of the target policy.  Our work goes beyond existing works on the optimality of policy gradient methods in that  existing works use the exact policy gradient for updating the policy parameters   while we use an approximate and stochastic update step.  Our update step is not a gradient update because we do not use a density ratio to correct the state distribution,  which aligns well with what practitioners do.  Our update is approximate because we use a learned critic instead of the true value function.  Our update is stochastic because at each step the update is done for only the current state action pair.  Moreover,  we remove several restrictive assumptions from existing works in our analysis.  Central to our work is the finite sample analysis of a generic stochastic approximation algorithm with time-inhomogeneous update operators on time-inhomogeneous Markov chains,  based on its uniform contraction properties."
"211357","Vector-Valued Least-Squares Regression under Output Regularity Assumptions","Luc Brogat-Motte, Alessandro Rudi, Céline Brouard, Juho Rousu, Florence d'Alché-Buc","https://jmlr.org//papers/volume23/21-1357/21-1357.pdf","","We propose and analyse a reduced-rank method for solving least-squares regression problems with infinite dimensional output. We derive learning bounds for our method, and study under which setting statistical performance is improved in comparison to full-rank method. Our analysis extends the interest of reduced-rank regression beyond the standard low-rank setting to more general output regularity assumptions. We illustrate our theoretical insights on synthetic least-squares problems. Then, we propose a surrogate structured prediction method derived from this reduced-rank method.  We assess its benefits on three different problems: image reconstruction, multi-label classification, and metabolite identification."
"211484","Constraint Reasoning Embedded Structured Prediction","Nan Jiang, Maosen Zhang, Willem-Jan van Hoeve, Yexiang Xue","https://jmlr.org//papers/volume23/21-1484/21-1484.pdf","https://jiangnanhugo.github.io/CORE-SP/","Many real-world structured prediction problems need machine learning to capture data distribution and constraint reasoning to ensure structure validity. Nevertheless, constrained structured prediction is still limited in real-world applications because of the lack of tools to bridge constraint satisfaction and machine learning. In this paper, we propose COnstraint REasoning embedded Structured Prediction (Core-Sp), a scalable constraint reasoning and machine learning integrated approach for learning over structured domains. We propose to embed decision diagrams, a popular constraint reasoning tool, as a fully-differentiable module into deep neural networks for structured prediction. We also propose an iterative search algorithm to automate the searching process of the best Core-Sp structure. We evaluate Core-Sp on three applications: vehicle dispatching service planning, if-then program synthesis, and text2SQL generation. The proposed Core-Sp module demonstrates superior performance over state-of-the-art approaches in all three applications. The structures generated with Core-Sp satisfy 100% of the constraints when using exact decision diagrams. In addition, Core-Sp boosts learning performance by reducing the modeling space via constraint satisfaction."
"211519","Minimax optimal approaches to the label shift problem in non-parametric settings","Subha Maity, Yuekai Sun, Moulinath Banerjee","https://jmlr.org//papers/volume23/21-1519/21-1519.pdf","","We study the minimax rates of the label shift problem in non-parametric classification. In addition to the unsupervised setting in which the learner only has access to unlabeled examples from the target domain, we also consider the setting in which a small number of labeled examples from the target domain is available to the learner. Our study reveals a difference in the difficulty of the label shift problem in the two settings, and we attribute this difference to the availability of data from the target domain to estimate the class conditional distributions in the latter setting. We also show that a class proportion estimation approach is minimax rate-optimal in the unsupervised setting."
"220026","Existence, Stability and Scalability of Orthogonal Convolutional Neural Networks","El Mehdi Achour, François Malgouyres, Franck Mamalet","https://jmlr.org//papers/volume23/22-0026/22-0026.pdf","https://github.com/deel-ai/deel-lip","Imposing orthogonality on the layers of neural networks is known to facilitate the learning by limiting the exploding/vanishing of the gradient; decorrelate the features; improve the robustness. This paper studies the theoretical properties of orthogonal convolutional layers. We establish necessary and sufficient conditions on the layer architecture guaranteeing the existence of an orthogonal convolutional transform. The conditions prove that orthogonal convolutional transforms exist for almost all architectures used in practice for 'circular' padding. We also exhibit limitations with 'valid' boundary conditions and 'same' boundary conditions with zero-padding. Recently, a regularization term imposing the orthogonality of convolutional layers has been proposed, and impressive empirical results have been obtained in different applications (Wang et al., 2020). The second motivation of the present paper is to specify the theory behind this. We make the link between this regularization term and orthogonality measures. In doing so, we show that this regularization strategy is stable with respect to numerical and optimization errors and that, in the presence of small errors and when the size of the signal/image is large, the convolutional layers remain close to isometric. The theoretical results are confirmed with experiments and the landscape of the regularization term is studied. Experiments on real data sets show that when orthogonality is used to enforce robustness, the parameter multiplying the regularization term can be used to tune a tradeoff between accuracy and orthogonality, for the benefit of both accuracy and robustness. Altogether, the study guarantees that the regularization proposed in Wang et al. 
(2020) is an efficient, flexible and stable numerical strategy to learn orthogonal convolutional layers."
"220204","Scalable Gaussian-process regression and variable selection using Vecchia approximations","Jian Cao, Joseph Guinness, Marc G. Genton, Matthias Katzfuss","https://jmlr.org//papers/volume23/22-0204/22-0204.pdf","https://github.com/katzfuss-group/Vecchia_GPR_var_select","Gaussian process (GP) regression is a flexible, nonparametric approach to regression that naturally quantifies uncertainty. In many applications, the number of responses and covariates are both large, and a goal is to select covariates that are related to the response. For this setting, we propose a novel, scalable algorithm, coined VGPR, which optimizes a penalized GP log-likelihood based on the Vecchia GP approximation, an ordered conditional approximation from spatial statistics that implies a sparse Cholesky factor of the precision matrix. We traverse the regularization path from strong to weak penalization, sequentially adding candidate covariates based on the gradient of the log-likelihood and deselecting irrelevant covariates via a new quadratic constrained coordinate descent algorithm. We propose Vecchia-based mini-batch subsampling, which provides unbiased gradient estimators. The resulting procedure is scalable to millions of responses and thousands of covariates. Theoretical analysis and numerical studies demonstrate the improved scalability and accuracy relative to existing methods."
"220277","OMLT: Optimization & Machine Learning Toolkit","Francesco Ceccon, Jordan Jalving, Joshua Haddad, Alexander Thebelt, Calvin Tsay, Carl D Laird, Ruth Misener","https://jmlr.org//papers/volume23/22-0277/22-0277.pdf","https://github.com/cog-imperial/OMLT","The optimization and machine learning toolkit (OMLT) is an open-source software package incorporating neural network and gradient-boosted tree surrogate models, which have been trained using machine learning, into larger optimization problems. We discuss the advances in optimization technology that made OMLT possible and show how OMLT seamlessly integrates with the algebraic modeling language Pyomo. We demonstrate how to use OMLT for solving decision-making problems in both computer science and engineering."
"220383","Approximate Bayesian Computation via Classification","Yuexi Wang, Tetsuya Kaji, Veronika Rockova","https://jmlr.org//papers/volume23/22-0383/22-0383.pdf","","Approximate Bayesian Computation (ABC) enables statistical inference in simulator-based models whose likelihoods are difficult to calculate but easy to simulate from. ABC constructs  a kernel-type approximation to the posterior distribution through an accept/reject mechanism which compares summary statistics of real and simulated data.  To obviate the need for summary statistics, we directly compare empirical distributions  with a Kullback-Leibler (KL) divergence estimator obtained via contrastive learning. In particular, we blend flexible machine learning classifiers within ABC to automate fake/real data comparisons. We consider the traditional accept/reject kernel as well as  an exponential weighting scheme which does not require the ABC acceptance threshold. Our theoretical results show that the rate at which our ABC posterior distributions concentrate  around the true parameter depends on the estimation error of the classifier. We derive  limiting posterior shape results and find that, with a properly scaled exponential kernel, asymptotic normality holds.  We demonstrate the usefulness of our approach on simulated examples as well as real data in the context of stock volatility estimation."
"220658","Metrics of Calibration for Probabilistic Predictions","Imanol Arrieta-Ibarra, Paman Gujral, Jonathan Tannen, Mark Tygert, Cherie Xu","https://jmlr.org//papers/volume23/22-0658/22-0658.pdf","https://github.com/facebookresearch/ecevecce","Many predictions are probabilistic in nature; for example, a prediction could be for precipitation tomorrow, but with only a 30 percent chance. Given such probabilistic predictions together with the actual outcomes, “reliability diagrams” (also known as “calibration plots”) help detect and diagnose statistically significant discrepancies—so-called “miscalibration”—between the predictions and the outcomes. The canonical reliability diagrams are based on histogramming the observed and expected values of the predictions; replacing the hard histogram binning with soft kernel density estimation using smooth convolutional kernels is another common practice. But, which widths of bins or kernels are best? Plots of the cumulative differences between the observed and expected values largely avoid this question, by displaying miscalibration directly as the slopes of secant lines for the graphs. Slope is easy to perceive with quantitative precision, even when the constant offsets of the secant lines are irrelevant; there is no need to bin or perform kernel density estimation. The existing standard metrics of miscalibration each summarize a reliability diagram into a single scalar statistic. The cumulative plots naturally lead to scalar metrics for the deviation of the graph of cumulative differences away from zero; good calibration corresponds to a horizontal, flat graph which deviates little from zero. The cumulative approach is currently unconventional, yet offers many favorable statistical properties, guaranteed via mathematical theory backed by rigorous proofs and illustrative numerical examples. 
In particular, metrics based on binning or kernel density estimation unavoidably must trade-off statistical confidence for the ability to resolve variations as a function of the predicted probability or vice versa. Widening the bins or kernels averages away random noise while giving up some resolving power. Narrowing the bins or kernels enhances resolving power while not averaging away as much noise. The cumulative methods do not impose such an explicit trade-off. Considering these results, practitioners probably should adopt the cumulative approach as a standard for best practices."
