{
  "original": [
    "In this report, we describe a Theano-based AlexNet (Krizhevsky et al., 2012) implementation and its naive data parallelism on multiple GPUs. Our performance on 2 GPUs is comparable with the state-of-art Caffe library (Jia et al., 2014) run on 1 GPU. To the best of our knowledge, this is the first open-source Python-based AlexNet implementation to-date.",
    "We show that deep narrow Boltzmann machines are universal approximators of probability distributions on the activities of their visible units, provided they have sufficiently many hidden layers, each containing the same number of units as the visible layer. We show that, within certain parameter domains, deep Boltzmann machines can be studied as feedforward networks. We provide upper and lower bounds on the sufficient depth and width of universal approximators. These results settle various intuitions regarding undirected networks and, in particular, they show that deep narrow Boltzmann machines are at least as compact universal approximators as narrow sigmoid belief networks and restricted Boltzmann machines, with respect to the currently available bounds for those models.",
    "Leveraging advances in variational inference, we propose to enhance recurrent neural networks with latent variables, resulting in Stochastic Recurrent Networks (STORNs). The model i) can be trained with stochastic gradient methods, ii) allows structured and multi-modal conditionals at each time step, iii) features a reliable estimator of the marginal likelihood and iv) is a generalisation of deterministic recurrent neural networks. We evaluate the method on four polyphonic musical data sets and motion capture data.",
    "We describe a general framework for online adaptation of optimization hyperparameters by `hot swapping' their values during learning. We investigate this approach in the context of adaptive learning rate selection using an explore-exploit strategy from the multi-armed bandit literature. Experiments on a benchmark neural network show that the hot swapping approach leads to consistently better solutions compared to well-known alternatives such as AdaDelta and stochastic gradient with exhaustive hyperparameter search.",
    "Many modern multiclass and multilabel problems are characterized by increasingly large output spaces. For these problems, label embeddings have been shown to be a useful primitive that can improve computational and statistical efficiency. In this work we utilize a correspondence between rank constrained estimation and low dimensional label embeddings that uncovers a fast label embedding algorithm which works in both the multiclass and multilabel settings. The result is a randomized algorithm for partial least squares, whose running time is exponentially faster than naive algorithms. We demonstrate our techniques on two large-scale public datasets, from the Large Scale Hierarchical Text Challenge and the Open Directory Project, where we obtain state of the art results.",
    "Accurate representational learning of both the explicit and implicit relationships within data is critical to the ability of machines to perform more complex and abstract reasoning tasks. We describe the efficient weakly supervised learning of such inferences by our Dynamic Adaptive Network Intelligence (DANI) model. We report state-of-the-art results for DANI over question answering tasks in the bAbI dataset that have proved difficult for contemporary approaches to learning representation (Weston et al., 2015).",
    "Automatic speech recognition systems usually rely on spectral-based features, such as MFCC of PLP. These features are extracted based on prior knowledge such as, speech perception or/and speech production. Recently, convolutional neural networks have been shown to be able to estimate phoneme conditional probabilities in a completely data-driven manner, i.e. using directly temporal raw speech signal as input. This system was shown to yield similar or better performance than HMM/ANN based system on phoneme recognition task and on large scale continuous speech recognition task, using less parameters. Motivated by these studies, we investigate the use of simple linear classifier in the CNN-based framework. Thus, the network learns linearly separable features from raw speech. We show that such system yields similar or better performance than MLP based system using cepstral-based features as input.",
    "We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machines. In order to be as hardware-agnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is to average the neural network parameters periodically (typically every minute or two), and redistribute the averaged parameters to the machines for further training. Each machine sees different data. By itself, this method does not work very well. However, we have another method, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which seems to allow our periodic-averaging method to work well, as well as substantially improving the convergence of SGD on a single machine.",
    "We develop a new method for visualizing and refining the invariances of learned representations. Specifically, we test for a general form of invariance, linearization, in which the action of a transformation is confined to a low-dimensional subspace. Given two reference images (typically, differing by some transformation), we synthesize a sequence of images lying on a path between them that is of minimal length in the space of the representation (a \"representational geodesic\"). If the transformation relating the two reference images is linearized by the representation, this sequence should follow the gradual evolution of this transformation. We use this method to assess the invariance properties of a state-of-the-art image classification network and find that geodesics generated for image pairs differing by translation, rotation, and dilation do not evolve according to their associated transformations. Our method also suggests a remedy for these failures, and following this prescription, we show that the modified representation is able to linearize a variety of geometric image transformations.",
    "Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.",
    "We present a novel architecture, the \"stacked what-where auto-encoders\" (SWWAE), which integrates discriminative and generative pathways and provides a unified approach to supervised, semi-supervised and unsupervised learning without relying on sampling during training. An instantiation of SWWAE uses a convolutional net (Convnet) (LeCun et al. (1998)) to encode the input, and employs a deconvolutional net (Deconvnet) (Zeiler et al. (2010)) to produce the reconstruction. The objective function includes reconstruction terms that induce the hidden states in the Deconvnet to be similar to those of the Convnet. Each pooling layer produces two sets of variables: the \"what\" which are fed to the next layer, and its complementary variable \"where\" that are fed to the corresponding layer in the generative decoder.",
    "We investigate the problem of inducing word embeddings that are tailored for a particular bilexical relation. Our learning algorithm takes an existing lexical vector space and compresses it such that the resulting word embeddings are good predictors for a target bilexical relation. In experiments we show that task-specific embeddings can benefit both the quality and efficiency in lexical prediction tasks.",
    "A generative model is developed for deep (multi-layered) convolutional dictionary learning. A novel probabilistic pooling operation is integrated into the deep model, yielding efficient bottom-up (pretraining) and top-down (refinement) probabilistic learning. Experimental results demonstrate powerful capabilities of the model to learn multi-layer features from images, and excellent classification results are obtained on the MNIST and Caltech 101 datasets.",
    "Motivated by the recent progress in generative models, we introduce a model that generates images from natural language descriptions. The proposed model iteratively draws patches on a canvas, while attending to the relevant words in the description. After training on Microsoft COCO, we compare our model with several baseline generative models on image generation and retrieval tasks. We demonstrate that our model produces higher quality samples than other approaches and generates images with novel scene compositions corresponding to previously unseen captions in the dataset.",
    "Convolutional neural networks (CNNs) work well on large datasets. But labelled data is hard to collect, and in some applications larger amounts of data are not available. The problem then is how to use CNNs with small data -- as CNNs overfit quickly. We present an efficient Bayesian CNN, offering better robustness to over-fitting on small data than traditional approaches. This is by placing a probability distribution over the CNN's kernels. We approximate our model's intractable posterior with Bernoulli variational distributions, requiring no additional model parameters.   On the theoretical side, we cast dropout network training as approximate inference in Bayesian neural networks. This allows us to implement our model using existing tools in deep learning with no increase in time complexity, while highlighting a negative result in the field. We show a considerable improvement in classification accuracy compared to standard techniques and improve on published state-of-the-art results for CIFAR-10.",
    "We propose a new method for creating computationally efficient convolutional neural networks (CNNs) by using low-rank representations of convolutional filters. Rather than approximating filters in previously-trained networks with more efficient versions, we learn a set of small basis filters from scratch; during training, the network learns to combine these basis filters into more complex filters that are discriminative for image classification. To train such networks, a novel weight initialization scheme is used. This allows effective initialization of connection weights in convolutional layers composed of groups of differently-shaped filters. We validate our approach by applying it to several existing CNN architectures and training these networks from scratch using the CIFAR, ILSVRC and MIT Places datasets. Our results show similar or higher accuracy than conventional CNNs with much less compute. Applying our method to an improved version of VGG-11 network using global max-pooling, we achieve comparable validation accuracy using 41% less compute and only 24% of the original VGG-11 model parameters; another variant of our method gives a 1 percentage point increase in accuracy over our improved VGG-11 model, giving a top-5 center-crop validation accuracy of 89.7% while reducing computation by 16% relative to the original VGG-11 model. Applying our method to the GoogLeNet architecture for ILSVRC, we achieved comparable accuracy with 26% less compute and 41% fewer model parameters. Applying our method to a near state-of-the-art network for CIFAR, we achieved comparable accuracy with 46% less compute and 55% fewer parameters.",
    "Distributed representations of words have boosted the performance of many Natural Language Processing tasks. However, usually only one representation per word is obtained, not acknowledging the fact that some words have multiple meanings. This has a negative effect on the individual word representations and the language model as a whole. In this paper we present a simple model that enables recent techniques for building word vectors to represent distinct senses of polysemic words. In our assessment of this model we show that it is able to effectively discriminate between words' senses and to do so in a computationally efficient manner.",
    "We propose Diverse Embedding Neural Network (DENN), a novel architecture for language models (LMs). A DENNLM projects the input word history vector onto multiple diverse low-dimensional sub-spaces instead of a single higher-dimensional sub-space as in conventional feed-forward neural network LMs. We encourage these sub-spaces to be diverse during network training through an augmented loss function. Our language modeling experiments on the Penn Treebank data set show the performance benefit of using a DENNLM.",
    "A standard approach to Collaborative Filtering (CF), i.e. prediction of user ratings on items, relies on Matrix Factorization techniques. Representations for both users and items are computed from the observed ratings and used for prediction. Unfortunatly, these transductive approaches cannot handle the case of new users arriving in the system, with no known rating, a problem known as user cold-start. A common approach in this context is to ask these incoming users for a few initialization ratings. This paper presents a model to tackle this twofold problem of (i) finding good questions to ask, (ii) building efficient representations from this small amount of information. The model can also be used in a more standard (warm) context. Our approach is evaluated on the classical CF problem and on the cold-start problem on four different datasets showing its ability to improve baseline performance in both cases.",
    "We propose a deep learning framework for modeling complex high-dimensional densities called Non-linear Independent Component Estimation (NICE). It is based on the idea that a good representation is one in which the data has a distribution that is easy to model. For this purpose, a non-linear deterministic transformation of the data is learned that maps it to a latent space so as to make the transformed data conform to a factorized distribution, i.e., resulting in independent latent variables. We parametrize this transformation so that computing the Jacobian determinant and inverse transform is trivial, yet we maintain the ability to learn complex non-linear transformations, via a composition of simple building blocks, each based on a deep neural network. The training criterion is simply the exact log-likelihood, which is tractable. Unbiased ancestral sampling is also easy. We show that this approach yields good generative models on four image datasets and can be used for inpainting.",
    "We introduce Deep Linear Discriminant Analysis (DeepLDA) which learns linearly separable latent representations in an end-to-end fashion. Classic LDA extracts features which preserve class separability and is used for dimensionality reduction for many classification problems. The central idea of this paper is to put LDA on top of a deep neural network. This can be seen as a non-linear extension of classic LDA. Instead of maximizing the likelihood of target labels for individual samples, we propose an objective function that pushes the network to produce feature distributions which: (a) have low variance within the same class and (b) high variance between different classes. Our objective is derived from the general LDA eigenvalue problem and still allows to train with stochastic gradient descent and back-propagation. For evaluation we test our approach on three different benchmark datasets (MNIST, CIFAR-10 and STL-10). DeepLDA produces competitive results on MNIST and CIFAR-10 and outperforms a network trained with categorical cross entropy (same architecture) on a supervised setting of STL-10.",
    "Layer-sequential unit-variance (LSUV) initialization - a simple method for weight initialization for deep net learning - is proposed. The method consists of the two steps. First, pre-initialize weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one.   Experiment with different activation functions (maxout, ReLU-family, tanh) show that the proposed initialization leads to learning of very deep nets that (i) produces networks with test accuracy better or equal to standard methods and (ii) is at least as fast as the complex schemes proposed specifically for very deep nets such as FitNets (Romero et al. (2015)) and Highway (Srivastava et al. (2015)).   Performance is evaluated on GoogLeNet, CaffeNet, FitNets and Residual nets and the state-of-the-art, or very close to it, is achieved on the MNIST, CIFAR-10/100 and ImageNet datasets.",
    "We introduce a parametric nonlinear transformation that is well-suited for Gaussianizing data from natural images. The data are linearly transformed, and each component is then normalized by a pooled activity measure, computed by exponentiating a weighted sum of rectified and exponentiated components and a constant. We optimize the parameters of the full transformation (linear transform, exponents, weights, constant) over a database of natural images, directly minimizing the negentropy of the responses. The optimized transformation substantially Gaussianizes the data, achieving a significantly smaller mutual information between transformed components than alternative methods including ICA and radial Gaussianization. The transformation is differentiable and can be efficiently inverted, and thus induces a density model on images. We show that samples of this model are visually similar to samples of natural image patches. We demonstrate the use of the model as a prior probability density that can be used to remove additive noise. Finally, we show that the transformation can be cascaded, with each layer optimized using the same Gaussianization objective, thus offering an unsupervised method of optimizing a deep network architecture.",
    "We present flattened convolutional neural networks that are designed for fast feedforward execution. The redundancy of the parameters, especially weights of the convolutional filters in convolutional neural networks has been extensively studied and different heuristics have been proposed to construct a low rank basis of the filters after training. In this work, we train flattened networks that consist of consecutive sequence of one-dimensional filters across all directions in 3D space to obtain comparable performance as conventional convolutional networks. We tested flattened model on different datasets and found that the flattened layer can effectively substitute for the 3D filters without loss of accuracy. The flattened convolution pipelines provide around two times speed-up during feedforward pass compared to the baseline model due to the significant reduction of learning parameters. Furthermore, the proposed method does not require efforts in manual tuning or post processing once the model is trained.",
    "In this paper, we introduce a novel deep learning framework, termed Purine. In Purine, a deep network is expressed as a bipartite graph (bi-graph), which is composed of interconnected operators and data tensors. With the bi-graph abstraction, networks are easily solvable with event-driven task dispatcher. We then demonstrate that different parallelism schemes over GPUs and/or CPUs on single or multiple PCs can be universally implemented by graph composition. This eases researchers from coding for various parallelization schemes, and the same dispatcher can be used for solving variant graphs. Scheduled by the task dispatcher, memory transfers are fully overlapped with other computations, which greatly reduce the communication overhead and help us achieve approximate linear acceleration.",
    "In this paper we propose a model that combines the strengths of RNNs and SGVB: the Variational Recurrent Auto-Encoder (VRAE). Such a model can be used for efficient, large scale unsupervised learning on time series data, mapping the time series data to a latent vector representation. The model is generative, such that data can be generated from samples of the latent space. An important contribution of this work is that the model can make use of unlabeled data in order to facilitate supervised training of RNNs by initialising the weights and network state.",
    "Current work in lexical distributed representations maps each word to a point vector in low-dimensional space. Mapping instead to a density provides many interesting advantages, including better capturing uncertainty about a representation and its relationships, expressing asymmetries more naturally than dot product or cosine similarity, and enabling more expressive parameterization of decision boundaries. This paper advocates for density-based distributed embeddings and presents a method for learning representations in the space of Gaussian distributions. We compare performance on various word embedding benchmarks, investigate the ability of these embeddings to model entailment and other asymmetric relationships, and explore novel properties of the representation.",
    "Multipliers are the most space and power-hungry arithmetic operators of the digital implementation of deep neural networks. We train a set of state-of-the-art neural networks (Maxout networks) on three benchmark datasets: MNIST, CIFAR-10 and SVHN. They are trained with three distinct formats: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those formats, we assess the impact of the precision of the multiplications on the final error after training. We find that very low precision is sufficient not just for running trained networks but also for training them. For example, it is possible to train Maxout networks with 10 bits multiplications.",
    "Multiple instance learning (MIL) can reduce the need for costly annotation in tasks such as semantic segmentation by weakening the required degree of supervision. We propose a novel MIL formulation of multi-class semantic segmentation learning by a fully convolutional network. In this setting, we seek to learn a semantic segmentation model from just weak image-level labels. The model is trained end-to-end to jointly optimize the representation while disambiguating the pixel-image label assignment. Fully convolutional training accepts inputs of any size, does not need object proposal pre-processing, and offers a pixelwise loss map for selecting latent instances. Our multi-class MIL loss exploits the further supervision given by images with multiple labels. We evaluate this approach through preliminary experiments on the PASCAL VOC segmentation challenge.",
    "Recently, nested dropout was proposed as a method for ordering representation units in autoencoders by their information content, without diminishing reconstruction cost. However, it has only been applied to training fully-connected autoencoders in an unsupervised setting. We explore the impact of nested dropout on the convolutional layers in a CNN trained by backpropagation, investigating whether nested dropout can provide a simple and systematic way to determine the optimal representation size with respect to the desired accuracy and desired task and data complexity.",
    "Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.",
    "When a three-dimensional object moves relative to an observer, a change occurs on the observer's image plane and in the visual representation computed by a learned model. Starting with the idea that a good visual representation is one that transforms linearly under scene motions, we show, using the theory of group representations, that any such representation is equivalent to a combination of the elementary irreducible representations. We derive a striking relationship between irreducibility and the statistical dependency structure of the representation, by showing that under restricted conditions, irreducible representations are decorrelated. Under partial observability, as induced by the perspective projection of a scene onto the image plane, the motion group does not have a linear action on the space of images, so that it becomes necessary to perform inference over a latent representation that does transform linearly. This idea is demonstrated in a model of rotating NORB objects that employs a latent representation of the non-commutative 3D rotation group SO(3).",
    "Efficient Maximum Inner Product Search (MIPS) is an important task that has a wide applicability in recommendation systems and classification with a large number of classes. Solutions based on locality-sensitive hashing (LSH) as well as tree-based solutions have been investigated in the recent literature, to perform approximate MIPS in sublinear time. In this paper, we compare these to another extremely simple approach for solving approximate MIPS, based on variants of the k-means clustering algorithm. Specifically, we propose to train a spherical k-means, after having reduced the MIPS problem to a Maximum Cosine Similarity Search (MCSS). Experiments on two standard recommendation system benchmarks as well as on large vocabulary word embeddings, show that this simple approach yields much higher speedups, for the same retrieval precision, than current state-of-the-art hashing-based and tree-based methods. This simple method also yields more robust retrievals when the query is corrupted by noise.",
    "The variational autoencoder (VAE; Kingma, Welling (2014)) is a recently proposed generative model pairing a top-down generative network with a bottom-up recognition network which approximates posterior inference. It typically makes strong assumptions about posterior inference, for instance that the posterior distribution is approximately factorial, and that its parameters can be approximated with nonlinear regression from the observations. As we show empirically, the VAE objective can lead to overly simplified representations which fail to use the network's entire modeling capacity. We present the importance weighted autoencoder (IWAE), a generative model with the same architecture as the VAE, but which uses a strictly tighter log-likelihood lower bound derived from importance weighting. In the IWAE, the recognition network uses multiple samples to approximate the posterior, giving it increased flexibility to model complex posteriors which do not fit the VAE modeling assumptions. We show empirically that IWAEs learn richer latent space representations than VAEs, leading to improved test log-likelihood on density estimation benchmarks.",
    "This work investigates how using reduced precision data in Convolutional Neural Networks (CNNs) affects network accuracy during classification. More specifically, this study considers networks where each layer may use different precision data. Our key result is the observation that the tolerance of CNNs to reduced precision data not only varies across networks, a well established observation, but also within networks. Tuning precision per layer is appealing as it could enable energy and performance improvements. In this paper we study how error tolerance across layers varies and propose a method for finding a low precision configuration for a network while maintaining high accuracy. A diverse set of CNNs is analyzed showing that compared to a conventional implementation using a 32-bit floating-point representation for all layers, and with less than 1% loss in relative accuracy, the data footprint required by these networks can be reduced by an average of 74% and up to 92%.",
    "The efficiency of graph-based semi-supervised algorithms depends on the graph of instances on which they are applied. The instances are often in a vectorial form before a graph linking them is built. The construction of the graph relies on a metric over the vectorial space that help define the weight of the connection between entities. The classic choice for this metric is usually a distance measure or a similarity measure based on the euclidean norm. We claim that in some cases the euclidean norm on the initial vectorial space might not be the more appropriate to solve the task efficiently. We propose an algorithm that aims at learning the most appropriate vectorial representation for building a graph on which the task at hand is solved efficiently.",
    "Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images. In this paper we advocate for explicitly modeling the partial order structure of this hierarchy. Towards this goal, we introduce a general method for learning ordered representations, and show how it can be applied to a variety of tasks involving images and language. We show that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval.",
    "We propose local distributional smoothness (LDS), a new notion of smoothness for statistical model that can be used as a regularization term to promote the smoothness of the model distribution. We named the LDS based regularization as virtual adversarial training (VAT). The LDS of a model at an input datapoint is defined as the KL-divergence based robustness of the model distribution against local perturbation around the datapoint. VAT resembles adversarial training, but distinguishes itself in that it determines the adversarial direction from the model distribution alone without using the label information, making it applicable to semi-supervised learning. The computational cost for VAT is relatively low. For neural network, the approximated gradient of the LDS can be computed with no more than three pairs of forward and back propagations. When we applied our technique to supervised and semi-supervised learning for the MNIST dataset, it outperformed all the training methods other than the current state of the art method, which is based on a highly advanced generative model. We also applied our method to SVHN and NORB, and confirmed our method's superior performance over the current state of the art semi-supervised method applied to these datasets.",
    "The availability of large labeled datasets has allowed Convolutional Network models to achieve impressive recognition results. However, in many settings manual annotation of the data is impractical; instead our data has noisy labels, i.e. there is some freely available label for each image which may or may not be accurate. In this paper, we explore the performance of discriminatively-trained Convnets when trained on such noisy data. We introduce an extra noise layer into the network which adapts the network outputs to match the noisy label distribution. The parameters of this noise layer can be estimated as part of the training process and involve simple modifications to current training infrastructures for deep networks. We demonstrate the approaches on several datasets, including large scale experiments on the ImageNet classification benchmark.",
    "We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.",
    "Discourse relations bind smaller linguistic elements into coherent texts. However, automatically identifying discourse relations is difficult, because it requires understanding the semantics of the linked sentences. A more subtle challenge is that it is not enough to represent the meaning of each sentence of a discourse relation, because the relation may depend on links between lower-level elements, such as entity mentions. Our solution computes distributional meaning representations by composition up the syntactic parse tree. A key difference from previous work on compositional distributional semantics is that we also compute representations for entity mentions, using a novel downward compositional pass. Discourse relations are predicted not only from the distributional representations of the sentences, but also of their coreferent entity mentions. The resulting system obtains substantial improvements over the previous state-of-the-art in predicting implicit discourse relations in the Penn Discourse Treebank.",
    "In this work, we propose a new method to integrate two recent lines of work: unsupervised induction of shallow semantics (e.g., semantic roles) and factorization of relations in text and knowledge bases. Our model consists of two components: (1) an encoding component: a semantic role labeling model which predicts roles given a rich set of syntactic and lexical features; (2) a reconstruction component: a tensor factorization model which relies on roles to predict argument fillers. When the components are estimated jointly to minimize errors in argument reconstruction, the induced roles largely correspond to roles defined in annotated resources. Our method performs on par with most accurate role induction methods on English, even though, unlike these previous approaches, we do not incorporate any prior linguistic knowledge about the language.",
    "The notion of metric plays a key role in machine learning problems such as classification, clustering or ranking. However, it is worth noting that there is a severe lack of theoretical guarantees that can be expected on the generalization capacity of the classifier associated to a given metric. The theoretical framework of $(\\epsilon, \\gamma, \\tau)$-good similarity functions (Balcan et al., 2008) has been one of the first attempts to draw a link between the properties of a similarity function and those of a linear classifier making use of it. In this paper, we extend and complete this theory by providing a new generalization bound for the associated classifier based on the algorithmic robustness framework.",
    "We present the multiplicative recurrent neural network as a general model for compositional meaning in language, and evaluate it on the task of fine-grained sentiment analysis. We establish a connection to the previously investigated matrix-space models for compositionality, and show they are special cases of the multiplicative recurrent net. Our experiments show that these models perform comparably or better than Elman-type additive recurrent neural networks and outperform matrix-space models on a standard fine-grained sentiment analysis corpus. Furthermore, they yield comparable results to structural deep models on the recently published Stanford Sentiment Treebank without the need for generating parse trees.",
    "Finding minima of a real valued non-convex function over a high dimensional space is a major challenge in science. We provide evidence that some such functions that are defined on high dimensional domains have a narrow band of values whose pre-image contains the bulk of its critical points. This is in contrast with the low dimensional picture in which this band is wide. Our simulations agree with the previous theoretical work on spin glasses that proves the existence of such a band when the dimension of the domain tends to infinity. Furthermore our experiments on teacher-student networks with the MNIST dataset establish a similar phenomenon in deep networks. We finally observe that both the gradient descent and the stochastic gradient descent methods can reach this level within the same number of steps.",
    "We develop a new statistical model for photographic images, in which the local responses of a bank of linear filters are described as jointly Gaussian, with zero mean and a covariance that varies slowly over spatial position. We optimize sets of filters so as to minimize the nuclear norms of matrices of their local activations (i.e., the sum of the singular values), thus encouraging a flexible form of sparsity that is not tied to any particular dictionary or coordinate system. Filters optimized according to this objective are oriented and bandpass, and their responses exhibit substantial local correlation. We show that images can be reconstructed nearly perfectly from estimates of the local filter response covariances alone, and with minimal degradation (either visual or MSE) from low-rank approximations of these covariances. As such, this representation holds much promise for use in applications such as denoising, compression, and texture representation, and may form a useful substrate for hierarchical decompositions.",
    "Most modern convolutional neural networks (CNNs) used for object recognition are built using the same principles: Alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks. Following this finding -- and building on other recent work for finding simple network structures -- we propose a new architecture that consists solely of convolutional layers and yields competitive or state of the art performance on several object recognition datasets (CIFAR-10, CIFAR-100, ImageNet). To analyze the network we introduce a new variant of the \"deconvolution approach\" for visualizing features learned by CNNs, which can be applied to a broader range of network structures than existing approaches.",
    "Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.",
    "This paper introduces a greedy parser based on neural networks, which leverages a new compositional sub-tree representation. The greedy parser and the compositional procedure are jointly trained, and tightly depends on each-other. The composition procedure outputs a vector representation which summarizes syntactically (parsing tags) and semantically (words) sub-trees. Composition and tagging is achieved over continuous (word or tag) representations, and recurrent neural networks. We reach F1 performance on par with well-known existing parsers, while having the advantage of speed, thanks to the greedy nature of the parser. We provide a fully functional implementation of the method described in this paper.",
    "Suitable lateral connections between encoder and decoder are shown to allow higher layers of a denoising autoencoder (dAE) to focus on invariant representations. In regular autoencoders, detailed information needs to be carried through the highest layers but lateral connections from encoder to decoder relieve this pressure. It is shown that abstract invariant features can be translated to detailed reconstructions when invariant features are allowed to modulate the strength of the lateral connection. Three dAE structures with modulated and additive lateral connections, and without lateral connections were compared in experiments using real-world images. The experiments verify that adding modulated lateral connections to the model 1) improves the accuracy of the probability model for inputs, as measured by denoising performance; 2) results in representations whose degree of invariance grows faster towards the higher layers; and 3) supports the formation of diverse invariant poolings.",
    "We develop a new method for visualizing and refining the invariances of learned representations. Specifically, we test for a general form of invariance, linearization, in which the action of a transformation is confined to a low-dimensional subspace. Given two reference images (typically, differing by some transformation), we synthesize a sequence of images lying on a path between them that is of minimal length in the space of the representation (a \"representational geodesic\"). If the transformation relating the two reference images is linearized by the representation, this sequence should follow the gradual evolution of this transformation. We use this method to assess the invariance properties of a state-of-the-art image classification network and find that geodesics generated for image pairs differing by translation, rotation, and dilation do not evolve according to their associated transformations. Our method also suggests a remedy for these failures, and following this prescription, we show that the modified representation is able to linearize a variety of geometric image transformations.",
    "Genomics are rapidly transforming medical practice and basic biomedical research, providing insights into disease mechanisms and improving therapeutic strategies, particularly in cancer. The ability to predict the future course of a patient's disease from high-dimensional genomic profiling will be essential in realizing the promise of genomic medicine, but presents significant challenges for state-of-the-art survival analysis methods. In this abstract we present an investigation in learning genomic representations with neural networks to predict patient survival in cancer. We demonstrate the advantages of this approach over existing survival analysis methods using brain tumor data.",
    "Existing approaches to combine both additive and multiplicative neural units either use a fixed assignment of operations or require discrete optimization to determine what function a neuron should perform. However, this leads to an extensive increase in the computational complexity of the training procedure.   We present a novel, parameterizable transfer function based on the mathematical concept of non-integer functional iteration that allows the operation each neuron performs to be smoothly and, most importantly, differentiablely adjusted between addition and multiplication. This allows the decision between addition and multiplication to be integrated into the standard backpropagation training procedure.",
    "One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaing isometry, one exact and one stochastic. Preliminary experiments show that for both determinant and scale-normalization effectively speeds up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.",
    "We extend Stochastic Gradient Variational Bayes to perform posterior inference for the weights of Stick-Breaking processes. This development allows us to define a Stick-Breaking Variational Autoencoder (SB-VAE), a Bayesian nonparametric version of the variational autoencoder that has a latent representation with stochastic dimensionality. We experimentally demonstrate that the SB-VAE, and a semi-supervised variant, learn highly discriminative latent representations that often outperform the Gaussian VAE's.",
    "Unsupervised learning on imbalanced data is challenging because, when given imbalanced data, current model is often dominated by the major category and ignores the categories with small amount of data. We develop a latent variable model that can cope with imbalanced data by dividing the latent space into a shared space and a private space. Based on Gaussian Process Latent Variable Models, we propose a new kernel formulation that enables the separation of latent space and derives an efficient variational inference method. The performance of our model is demonstrated with an imbalanced medical image dataset.",
    "Generative adversarial networks (GANs) are successful deep generative models. GANs are based on a two-player minimax game. However, the objective function derived in the original motivation is changed to obtain stronger gradients when learning the generator. We propose a novel algorithm that repeats the density ratio estimation and f-divergence minimization. Our algorithm offers a new perspective toward the understanding of GANs and is able to make use of multiple viewpoints obtained in the research of density ratio estimation, e.g. what divergence is stable and relative density ratio is useful.",
    "This paper shows how one can directly apply natural language processing (NLP) methods to classification problems in cheminformatics. Connection between these seemingly separate fields is shown by considering standard textual representation of compound, SMILES. The problem of activity prediction against a target protein is considered, which is a crucial part of computer aided drug design process. Conducted experiments show that this way one can not only outrank state of the art results of hand crafted representations but also gets direct structural insights into the way decisions are made.",
    "We introduce a neural network architecture and a learning algorithm to produce factorized symbolic representations. We propose to learn these concepts by observing consecutive frames, letting all the components of the hidden representation except a small discrete set (gating units) be predicted from the previous frame, and let the factors of variation in the next frame be represented entirely by these discrete gated units (corresponding to symbolic representations). We demonstrate the efficacy of our approach on datasets of faces undergoing 3D transformations and Atari 2600 games.",
    "We look at the eigenvalues of the Hessian of a loss function before and after training. The eigenvalue distribution is seen to be composed of two parts, the bulk which is concentrated around zero, and the edges which are scattered away from zero. We present empirical evidence for the bulk indicating how over-parametrized the system is, and for the edges that depend on the input data.",
    "We introduce a parametric nonlinear transformation that is well-suited for Gaussianizing data from natural images. The data are linearly transformed, and each component is then normalized by a pooled activity measure, computed by exponentiating a weighted sum of rectified and exponentiated components and a constant. We optimize the parameters of the full transformation (linear transform, exponents, weights, constant) over a database of natural images, directly minimizing the negentropy of the responses. The optimized transformation substantially Gaussianizes the data, achieving a significantly smaller mutual information between transformed components than alternative methods including ICA and radial Gaussianization. The transformation is differentiable and can be efficiently inverted, and thus induces a density model on images. We show that samples of this model are visually similar to samples of natural image patches. We demonstrate the use of the model as a prior probability density that can be used to remove additive noise. Finally, we show that the transformation can be cascaded, with each layer optimized using the same Gaussianization objective, thus offering an unsupervised method of optimizing a deep network architecture.",
    "Approximate variational inference has shown to be a powerful tool for modeling unknown complex probability distributions. Recent advances in the field allow us to learn probabilistic models of sequences that actively exploit spatial and temporal structure. We apply a Stochastic Recurrent Network (STORN) to learn robot time series data. Our evaluation demonstrates that we can robustly detect anomalies both off- and on-line.",
    "We develop a general problem setting for training and testing the ability of agents to gather information efficiently. Specifically, we present a collection of tasks in which success requires searching through a partially-observed environment, for fragments of information which can be pieced together to accomplish various goals. We combine deep architectures with techniques from reinforcement learning to develop agents that solve our tasks. We shape the behavior of these agents by combining extrinsic and intrinsic rewards. We empirically demonstrate that these agents learn to search actively and intelligently for new information to reduce their uncertainty, and to exploit information they have already acquired.",
    "We propose an extension to neural network language models to adapt their prediction to the recent history. Our model is a simplified version of memory augmented networks, which stores past hidden activations as memory and accesses them through a dot product with the current hidden activation. This mechanism is very efficient and scales to very large memory sizes. We also draw a link between the use of external memory in neural network and cache models used with count based language models. We demonstrate on several language model datasets that our approach performs significantly better than recent memory augmented networks.",
    "Motivated by the recent progress in generative models, we introduce a model that generates images from natural language descriptions. The proposed model iteratively draws patches on a canvas, while attending to the relevant words in the description. After training on Microsoft COCO, we compare our model with several baseline generative models on image generation and retrieval tasks. We demonstrate that our model produces higher quality samples than other approaches and generates images with novel scene compositions corresponding to previously unseen captions in the dataset.",
    "We propose a framework for training multiple neural networks simultaneously. The parameters from all models are regularised by the tensor trace norm, so that each neural network is encouraged to reuse others' parameters if possible -- this is the main motivation behind multi-task learning. In contrast to many deep multi-task learning models, we do not predefine a parameter sharing strategy by specifying which layers have tied parameters. Instead, our framework considers sharing for all shareable layers, and the sharing strategy is learned in a data-driven way.",
    "This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems. To achieve this, the paper introduces several innovations, including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.",
    "We present a novel framework for generating pop music. Our model is a hierarchical Recurrent Neural Network, where the layers and the structure of the hierarchy encode our prior knowledge about how pop music is composed. In particular, the bottom layers generate the melody, while the higher levels produce the drums and chords. We conduct several human studies that show strong preference of our generated music over that produced by the recent method by Google. We additionally show two applications of our framework: neural dancing and karaoke, as well as neural story singing.",
    "Many machine learning classifiers are vulnerable to adversarial perturbations. An adversarial perturbation modifies an input to change a classifier's prediction without causing the input to seem substantially different to human perception. We deploy three methods to detect adversarial images. Adversaries trying to bypass our detectors must make the adversarial image less pathological or they will fail trying. Our best detection method reveals that adversarial images place abnormal emphasis on the lower-ranked principal components from PCA. Other detectors and a colorful saliency map are in an appendix.",
    "We propose a new method for creating computationally efficient convolutional neural networks (CNNs) by using low-rank representations of convolutional filters. Rather than approximating filters in previously-trained networks with more efficient versions, we learn a set of small basis filters from scratch; during training, the network learns to combine these basis filters into more complex filters that are discriminative for image classification. To train such networks, a novel weight initialization scheme is used. This allows effective initialization of connection weights in convolutional layers composed of groups of differently-shaped filters. We validate our approach by applying it to several existing CNN architectures and training these networks from scratch using the CIFAR, ILSVRC and MIT Places datasets. Our results show similar or higher accuracy than conventional CNNs with much less compute. Applying our method to an improved version of VGG-11 network using global max-pooling, we achieve comparable validation accuracy using 41% less compute and only 24% of the original VGG-11 model parameters; another variant of our method gives a 1 percentage point increase in accuracy over our improved VGG-11 model, giving a top-5 center-crop validation accuracy of 89.7% while reducing computation by 16% relative to the original VGG-11 model. Applying our method to the GoogLeNet architecture for ILSVRC, we achieved comparable accuracy with 26% less compute and 41% fewer model parameters. Applying our method to a near state-of-the-art network for CIFAR, we achieved comparable accuracy with 46% less compute and 55% fewer parameters.",
    "Layer-sequential unit-variance (LSUV) initialization - a simple method for weight initialization for deep net learning - is proposed. The method consists of the two steps. First, pre-initialize weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one.   Experiment with different activation functions (maxout, ReLU-family, tanh) show that the proposed initialization leads to learning of very deep nets that (i) produces networks with test accuracy better or equal to standard methods and (ii) is at least as fast as the complex schemes proposed specifically for very deep nets such as FitNets (Romero et al. (2015)) and Highway (Srivastava et al. (2015)).   Performance is evaluated on GoogLeNet, CaffeNet, FitNets and Residual nets and the state-of-the-art, or very close to it, is achieved on the MNIST, CIFAR-10/100 and ImageNet datasets.",
    "This paper builds off recent work from Kiperwasser & Goldberg (2016) using neural attention in a simple graph-based dependency parser. We use a larger but more thoroughly regularized parser than other recent BiLSTM-based approaches, with biaffine classifiers to predict arcs and labels. Our parser gets state of the art or near state of the art performance on standard treebanks for six different languages, achieving 95.7% UAS and 94.1% LAS on the most popular English PTB dataset. This makes it the highest-performing graph-based parser on this benchmark---outperforming Kiperwasser Goldberg (2016) by 1.8% and 2.2%---and comparable to the highest performing transition-based parser (Kuncoro et al., 2016), which achieves 95.8% UAS and 94.6% LAS. We also show which hyperparameter choices had a significant effect on parsing accuracy, allowing us to achieve large gains over other graph-based approaches.",
    "Accurate representational learning of both the explicit and implicit relationships within data is critical to the ability of machines to perform more complex and abstract reasoning tasks. We describe the efficient weakly supervised learning of such inferences by our Dynamic Adaptive Network Intelligence (DANI) model. We report state-of-the-art results for DANI over question answering tasks in the bAbI dataset that have proved difficult for contemporary approaches to learning representation (Weston et al., 2015).",
    "Spherical data is found in many applications. By modeling the discretized sphere as a graph, we can accommodate non-uniformly distributed, partial, and changing samplings. Moreover, graph convolutions are computationally more efficient than spherical convolutions. As equivariance is desired to exploit rotational symmetries, we discuss how to approach rotation equivariance using the graph neural network introduced in Defferrard et al. (2016). Experiments show good performance on rotation-invariant learning problems. Code and examples are available at https://github.com/SwissDataScienceCenter/DeepSphere",
    "High computational complexity hinders the widespread usage of Convolutional Neural Networks (CNNs), especially in mobile devices. Hardware accelerators are arguably the most promising approach for reducing both execution time and power consumption. One of the most important steps in accelerator development is hardware-oriented model approximation. In this paper we present Ristretto, a model approximation framework that analyzes a given CNN with respect to numerical resolution used in representing weights and outputs of convolutional and fully connected layers. Ristretto can condense models by using fixed point arithmetic and representation instead of floating point. Moreover, Ristretto fine-tunes the resulting fixed point network. Given a maximum error tolerance of 1%, Ristretto can successfully condense CaffeNet and SqueezeNet to 8-bit. The code for Ristretto is available.",
    "The diversity of painting styles represents a rich visual vocabulary for the construction of an image. The degree to which one may learn and parsimoniously capture this visual vocabulary measures our understanding of the higher level features of paintings, if not images in general. In this work we investigate the construction of a single, scalable deep network that can parsimoniously capture the artistic style of a diversity of paintings. We demonstrate that such a network generalizes across a diversity of artistic styles by reducing a painting to a point in an embedding space. Importantly, this model permits a user to explore new painting styles by arbitrarily combining the styles learned from individual paintings. We hope that this work provides a useful step towards building rich models of paintings and offers a window on to the structure of the learned representation of artistic style.",
    "Sum-Product Networks (SPNs) are a class of expressive yet tractable hierarchical graphical models. LearnSPN is a structure learning algorithm for SPNs that uses hierarchical co-clustering to simultaneously identifying similar entities and similar features. The original LearnSPN algorithm assumes that all the variables are discrete and there is no missing data. We introduce a practical, simplified version of LearnSPN, MiniSPN, that runs faster and can handle missing data and heterogeneous features common in real applications. We demonstrate the performance of MiniSPN on standard benchmark datasets and on two datasets from Google's Knowledge Graph exhibiting high missingness rates and a mix of discrete and continuous features.",
    "Recent research on deep neural networks has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple DNN architectures that achieve that accuracy level. With equivalent accuracy, smaller DNN architectures offer at least three advantages: (1) Smaller DNNs require less communication across servers during distributed training. (2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car. (3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory. To provide all of these advantages, we propose a small DNN architecture called SqueezeNet. SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Additionally, with model compression techniques we are able to compress SqueezeNet to less than 0.5MB (510x smaller than AlexNet).   The SqueezeNet architecture is available for download here: https://github.com/DeepScale/SqueezeNet",
    "In this paper, we study the problem of question answering when reasoning over multiple facts is required. We propose Query-Reduction Network (QRN), a variant of Recurrent Neural Network (RNN) that effectively handles both short-term (local) and long-term (global) sequential dependencies to reason over multiple facts. QRN considers the context sentences as a sequence of state-changing triggers, and reduces the original query to a more informed query as it observes each trigger (context sentence) through time. Our experiments show that QRN produces the state-of-the-art results in bAbI QA and dialog tasks, and in a real goal-oriented dialog dataset. In addition, QRN formulation allows parallelization on RNN's time axis, saving an order of magnitude in time complexity for training and inference.",
    "We propose a language-agnostic way of automatically generating sets of semantically similar clusters of entities along with sets of \"outlier\" elements, which may then be used to perform an intrinsic evaluation of word embeddings in the outlier detection task. We used our methodology to create a gold-standard dataset, which we call WikiSem500, and evaluated multiple state-of-the-art embeddings. The results show a correlation between performance on this dataset and performance on sentiment analysis.",
    "Recurrent neural nets are widely used for predicting temporal data. Their inherent deep feedforward structure allows learning complex sequential patterns. It is believed that top-down feedback might be an important missing ingredient which in theory could help disambiguate similar patterns depending on broader context. In this paper we introduce surprisal-driven recurrent networks, which take into account past error information when making new predictions. This is achieved by continuously monitoring the discrepancy between most recent predictions and the actual observations. Furthermore, we show that it outperforms other stochastic and fully deterministic approaches on enwik8 character level prediction task achieving 1.37 BPC on the test portion of the text.",
    "Although Generative Adversarial Networks achieve state-of-the-art results on a variety of generative tasks, they are regarded as highly unstable and prone to miss modes. We argue that these bad behaviors of GANs are due to the very particular functional shape of the trained discriminators in high dimensional spaces, which can easily make training stuck or push probability mass in the wrong direction, towards that of higher concentration than that of the data generating distribution. We introduce several ways of regularizing the objective, which can dramatically stabilize the training of GAN models. We also show that our regularizers can help the fair distribution of probability mass across the modes of the data generating distribution, during the early phases of training and thus providing a unified solution to the missing modes problem.",
    "Sample complexity and safety are major challenges when learning policies with reinforcement learning for real-world tasks, especially when the policies are represented using rich function approximators like deep neural networks. Model-based methods where the real-world target domain is approximated using a simulated source domain provide an avenue to tackle the above challenges by augmenting real data with simulated data. However, discrepancies between the simulated source domain and the target domain pose a challenge for simulated training. We introduce the EPOpt algorithm, which uses an ensemble of simulated source domains and a form of adversarial training to learn policies that are robust and generalize to a broad range of possible target domains, including unmodeled effects. Further, the probability distribution over source domains in the ensemble can be adapted using data from target domain and approximate Bayesian methods, to progressively make it a better approximation. Thus, learning on a model ensemble, along with source domain adaptation, provides the benefit of both robustness and learning/adaptation.",
    "We introduce Divnet, a flexible technique for learning networks with diverse neurons. Divnet models neuronal diversity by placing a Determinantal Point Process (DPP) over neurons in a given layer. It uses this DPP to select a subset of diverse neurons and subsequently fuses the redundant neurons into the selected ones. Compared with previous approaches, Divnet offers a more principled, flexible technique for capturing neuronal diversity and thus implicitly enforcing regularization. This enables effective auto-tuning of network architecture and leads to smaller network sizes without hurting performance. Moreover, through its focus on diversity and neuron fusing, Divnet remains compatible with other procedures that seek to reduce memory footprints of networks. We present experimental results to corroborate our claims: for pruning neural networks, Divnet is seen to be notably superior to competing approaches.",
    "The efficiency of graph-based semi-supervised algorithms depends on the graph of instances on which they are applied. The instances are often in a vectorial form before a graph linking them is built. The construction of the graph relies on a metric over the vectorial space that help define the weight of the connection between entities. The classic choice for this metric is usually a distance measure or a similarity measure based on the euclidean norm. We claim that in some cases the euclidean norm on the initial vectorial space might not be the more appropriate to solve the task efficiently. We propose an algorithm that aims at learning the most appropriate vectorial representation for building a graph on which the task at hand is solved efficiently.",
    "One major challenge in training Deep Neural Networks is preventing overfitting. Many techniques such as data augmentation and novel regularizers such as Dropout have been proposed to prevent overfitting without requiring a massive amount of training data. In this work, we propose a new regularizer called DeCov which leads to significantly reduced overfitting (as indicated by the difference between train and val performance), and better generalization. Our regularizer encourages diverse or non-redundant representations in Deep Neural Networks by minimizing the cross-covariance of hidden activations. This simple intuition has been explored in a number of past works but surprisingly has never been applied as a regularizer in supervised learning. Experiments across a range of datasets and network architectures show that this loss always reduces overfitting while almost always maintaining or increasing generalization performance and often improving performance over Dropout.",
    "Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.",
    "We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs. We motivate the choice of our convolutional architecture via a localized first-order approximation of spectral graph convolutions. Our model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes. In a number of experiments on citation networks and on a knowledge graph dataset we demonstrate that our approach outperforms related methods by a significant margin.",
    "We introduce the \"Energy-based Generative Adversarial Network\" model (EBGAN) which views the discriminator as an energy function that attributes low energies to the regions near the data manifold and higher energies to other regions. Similar to the probabilistic GANs, a generator is seen as being trained to produce contrastive samples with minimal energies, while the discriminator is trained to assign high energies to these generated samples. Viewing the discriminator as an energy function allows to use a wide variety of architectures and loss functionals in addition to the usual binary classifier with logistic output. Among them, we show one instantiation of EBGAN framework as using an auto-encoder architecture, with the energy being the reconstruction error, in place of the discriminator. We show that this form of EBGAN exhibits more stable behavior than regular GANs during training. We also show that a single-scale architecture can be trained to generate high-resolution images.",
    "Recent research in the deep learning field has produced a plethora of new architectures. At the same time, a growing number of groups are applying deep learning to new applications. Some of these groups are likely to be composed of inexperienced deep learning practitioners who are baffled by the dizzying array of architecture choices and therefore opt to use an older architecture (i.e., Alexnet). Here we attempt to bridge this gap by mining the collective knowledge contained in recent deep learning research to discover underlying principles for designing neural network architectures. In addition, we describe several architectural innovations, including Fractal of FractalNet network, Stagewise Boosting Networks, and Taylor Series Networks (our Caffe code and prototxt files is available at https://github.com/iPhysicist/CNNDesignPatterns). We hope others are inspired to build on our preliminary work.",
    "Machine comprehension (MC), answering a query about a given context paragraph, requires modeling complex interactions between the context and the query. Recently, attention mechanisms have been successfully extended to MC. Typically these methods use attention to focus on a small portion of the context and summarize it with a fixed-size vector, couple attentions temporally, and/or often form a uni-directional attention. In this paper we introduce the Bi-Directional Attention Flow (BIDAF) network, a multi-stage hierarchical process that represents the context at different levels of granularity and uses bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization. Our experimental evaluations show that our model achieves the state-of-the-art results in Stanford Question Answering Dataset (SQuAD) and CNN/DailyMail cloze test.",
    "Though with progress, model learning and performing posterior inference still remains a common challenge for using deep generative models, especially for handling discrete hidden variables. This paper is mainly concerned with algorithms for learning Helmholz machines, which is characterized by pairing the generative model with an auxiliary inference model. A common drawback of previous learning algorithms is that they indirectly optimize some bounds of the targeted marginal log-likelihood. In contrast, we successfully develop a new class of algorithms, based on stochastic approximation (SA) theory of the Robbins-Monro type, to directly optimize the marginal log-likelihood and simultaneously minimize the inclusive KL-divergence. The resulting learning algorithm is thus called joint SA (JSA). Moreover, we construct an effective MCMC operator for JSA. Our results on the MNIST datasets demonstrate that the JSA's performance is consistently superior to that of competing algorithms like RWS, for learning a range of difficult models.",
    "Object detection with deep neural networks is often performed by passing a few thousand candidate bounding boxes through a deep neural network for each image. These bounding boxes are highly correlated since they originate from the same image. In this paper we investigate how to exploit feature occurrence at the image scale to prune the neural network which is subsequently applied to all bounding boxes. We show that removing units which have near-zero activation in the image allows us to significantly reduce the number of parameters in the network. Results on the PASCAL 2007 Object Detection Challenge demonstrate that up to 40% of units in some fully-connected layers can be entirely eliminated with little change in the detection result.",
    "Modeling interactions between features improves the performance of machine learning solutions in many domains (e.g. recommender systems or sentiment analysis). In this paper, we introduce Exponential Machines (ExM), a predictor that models all interactions of every order. The key idea is to represent an exponentially large tensor of parameters in a factorized format called Tensor Train (TT). The Tensor Train format regularizes the model and lets you control the number of underlying parameters. To train the model, we develop a stochastic Riemannian optimization procedure, which allows us to fit tensors with 2^160 entries. We show that the model achieves state-of-the-art performance on synthetic data with high-order interactions and that it works on par with high-order factorization machines on a recommender system dataset MovieLens 100K.",
    "We introduce Deep Variational Bayes Filters (DVBF), a new method for unsupervised learning and identification of latent Markovian state space models. Leveraging recent advances in Stochastic Gradient Variational Bayes, DVBF can overcome intractable inference distributions via variational inference. Thus, it can handle highly nonlinear input data with temporal and spatial dependencies such as image sequences without domain knowledge. Our experiments show that enabling backpropagation through transitions enforces state space assumptions and significantly improves information content of the latent embedding. This also enables realistic long-term prediction.",
    "Traditional dialog systems used in goal-oriented applications require a lot of domain-specific handcrafting, which hinders scaling up to new domains. End-to-end dialog systems, in which all components are trained from the dialogs themselves, escape this limitation. But the encouraging success recently obtained in chit-chat dialog may not carry over to goal-oriented settings. This paper proposes a testbed to break down the strengths and shortcomings of end-to-end dialog systems in goal-oriented applications. Set in the context of restaurant reservation, our tasks require manipulating sentences and symbols, so as to properly conduct conversations, issue API calls and use the outputs of such calls. We show that an end-to-end dialog system based on Memory Networks can reach promising, yet imperfect, performance and learn to perform non-trivial operations. We confirm those results by comparing our system to a hand-crafted slot-filling baseline on data from the second Dialog State Tracking Challenge (Henderson et al., 2014a). We show similar result patterns on data extracted from an online concierge service.",
    "Adversarial training provides a means of regularizing supervised learning algorithms while virtual adversarial training is able to extend supervised learning algorithms to the semi-supervised setting. However, both methods require making small perturbations to numerous entries of the input vector, which is inappropriate for sparse high-dimensional inputs such as one-hot word representations. We extend adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself. The proposed method achieves state of the art results on multiple benchmark semi-supervised and purely supervised tasks. We provide visualizations and analysis showing that the learned word embeddings have improved in quality and that while training, the model is less prone to overfitting. Code is available at https://github.com/tensorflow/models/tree/master/research/adversarial_text.",
    "Unsupervised learning of probabilistic models is a central yet challenging problem in machine learning. Specifically, designing models with tractable learning, sampling, inference and evaluation is crucial in solving this task. We extend the space of such models using real-valued non-volume preserving (real NVP) transformations, a set of powerful invertible and learnable transformations, resulting in an unsupervised learning algorithm with exact log-likelihood computation, exact sampling, exact inference of latent variables, and an interpretable latent space. We demonstrate its ability to model natural images on four datasets through sampling, log-likelihood evaluation and latent variable manipulations.",
    "This paper is focused on studying the view-manifold structure in the feature spaces implied by the different layers of Convolutional Neural Networks (CNN). There are several questions that this paper aims to answer: Does the learned CNN representation achieve viewpoint invariance? How does it achieve viewpoint invariance? Is it achieved by collapsing the view manifolds, or separating them while preserving them? At which layer is view invariance achieved? How can the structure of the view manifold at each layer of a deep convolutional neural network be quantified experimentally? How does fine-tuning of a pre-trained CNN on a multi-view dataset affect the representation at each layer of the network? In order to answer these questions we propose a methodology to quantify the deformation and degeneracy of view manifolds in CNN layers. We apply this methodology and report interesting results in this paper that answer the aforementioned questions.",
    "Bilinear models provide rich representations compared with linear models. They have been applied in various visual tasks, such as object recognition, segmentation, and visual question-answering, to get state-of-the-art performances taking advantage of the expanded representations. However, bilinear representations tend to be high-dimensional, limiting the applicability to computationally complex tasks. We propose low-rank bilinear pooling using Hadamard product for an efficient attention mechanism of multimodal learning. We show that our model outperforms compact bilinear pooling in visual question-answering tasks with the state-of-the-art results on the VQA dataset, having a better parsimonious property.",
    "The standard interpretation of importance-weighted autoencoders is that they maximize a tighter lower bound on the marginal likelihood than the standard evidence lower bound. We give an alternate interpretation of this procedure: that it optimizes the standard variational lower bound, but using a more complex distribution. We formally derive this result, present a tighter lower bound, and visualize the implicit importance-weighted distribution.",
    "We present a generalization bound for feedforward neural networks in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The generalization bound is derived using a PAC-Bayes analysis.",
    "In this paper, we propose to equip Generative Adversarial Networks with the ability to produce direct energy estimates for samples.Specifically, we propose a flexible adversarial training framework, and prove this framework not only ensures the generator converges to the true data distribution, but also enables the discriminator to retain the density information at the global optimal. We derive the analytic form of the induced solution, and analyze the properties. In order to make the proposed framework trainable in practice, we introduce two effective approximation techniques. Empirically, the experiment results closely match our theoretical analysis, verifying the discriminator is able to recover the energy of data distribution.",
    "In this work we perform outlier detection using ensembles of neural networks obtained by variational approximation of the posterior in a Bayesian neural network setting. The variational parameters are obtained by sampling from the true posterior by gradient descent. We show our outlier detection results are comparable to those obtained using other efficient ensembling methods.",
    "We present two simple ways of reducing the number of parameters and accelerating the training of large Long Short-Term Memory (LSTM) networks: the first one is \"matrix factorization by design\" of LSTM matrix into the product of two smaller matrices, and the second one is partitioning of LSTM matrix, its inputs and states into the independent groups. Both approaches allow us to train large LSTM networks significantly faster to the near state-of the art perplexity while using significantly less RNN parameters.",
    "We present observations and discussion of previously unreported phenomena discovered while training residual networks. The goal of this work is to better understand the nature of neural networks through the examination of these new empirical results. These behaviors were identified through the application of Cyclical Learning Rates (CLR) and linear network interpolation. Among these behaviors are counterintuitive increases and decreases in training loss and instances of rapid training. For example, we demonstrate how CLR can produce greater testing accuracy than traditional training despite using large learning rates. Files to replicate these results are available at https://github.com/lnsmith54/exploring-loss",
    "Machine learning models are often used at test-time subject to constraints and trade-offs not present at training-time. For example, a computer vision model operating on an embedded device may need to perform real-time inference, or a translation model operating on a cell phone may wish to bound its average compute time in order to be power-efficient. In this work we describe a mixture-of-experts model and show how to change its test-time resource-usage on a per-input basis using reinforcement learning. We test our method on a small MNIST-based example.",
    "Adversarial examples have been shown to exist for a variety of deep learning architectures. Deep reinforcement learning has shown promising results on training agent policies directly on raw inputs such as image pixels. In this paper we present a novel study into adversarial attacks on deep reinforcement learning polices. We compare the effectiveness of the attacks using adversarial examples vs. random noise. We present a novel method for reducing the number of times adversarial examples need to be injected for a successful attack, based on the value function. We further explore how re-training on random noise and FGSM perturbations affects the resilience against adversarial examples.",
    "This paper develops variational continual learning (VCL), a simple but general framework for continual learning that fuses online variational inference (VI) and recent advances in Monte Carlo VI for neural networks. The framework can successfully train both deep discriminative models and deep generative models in complex continual learning settings where existing tasks evolve over time and entirely new tasks emerge. Experimental results show that VCL outperforms state-of-the-art continual learning methods on a variety of tasks, avoiding catastrophic forgetting in a fully automatic way.",
    "Automatically determining the optimal size of a neural network for a given task without prior information currently requires an expensive global search and training many networks from scratch. In this paper, we address the problem of automatically finding a good network size during a single training cycle. We introduce *nonparametric neural networks*, a non-probabilistic framework for conducting optimization over all possible network sizes and prove its soundness when network growth is limited via an L_p penalty. We train networks under this framework by continuously adding new units while eliminating redundant units via an L_2 penalty. We employ a novel optimization algorithm, which we term *adaptive radial-angular gradient descent* or *AdaRad*, and obtain promising results.",
    "Natural Language Inference (NLI) task requires an agent to determine the logical relationship between a natural language premise and a natural language hypothesis. We introduce Interactive Inference Network (IIN), a novel class of neural network architectures that is able to achieve high-level understanding of the sentence pair by hierarchically extracting semantic features from interaction space. We show that an interaction tensor (attention weight) contains semantic information to solve natural language inference, and a denser interaction tensor contains richer semantic information. One instance of such architecture, Densely Interactive Inference Network (DIIN), demonstrates the state-of-the-art performance on large scale NLI copora and large-scale NLI alike corpus. It's noteworthy that DIIN achieve a greater than 20% error reduction on the challenging Multi-Genre NLI (MultiNLI) dataset with respect to the strongest published system.",
    "The ability to deploy neural networks in real-world, safety-critical systems is severely limited by the presence of adversarial examples: slightly perturbed inputs that are misclassified by the network. In recent years, several techniques have been proposed for increasing robustness to adversarial examples --- and yet most of these have been quickly shown to be vulnerable to future attacks. For example, over half of the defenses proposed by papers accepted at ICLR 2018 have already been broken. We propose to address this difficulty through formal verification techniques. We show how to construct provably minimally distorted adversarial examples: given an arbitrary neural network and input sample, we can construct adversarial examples which we prove are of minimal distortion. Using this approach, we demonstrate that one of the recent ICLR defense proposals, adversarial retraining, provably succeeds at increasing the distortion required to construct adversarial examples by a factor of 4.2.",
    "We extend Stochastic Gradient Variational Bayes to perform posterior inference for the weights of Stick-Breaking processes. This development allows us to define a Stick-Breaking Variational Autoencoder (SB-VAE), a Bayesian nonparametric version of the variational autoencoder that has a latent representation with stochastic dimensionality. We experimentally demonstrate that the SB-VAE, and a semi-supervised variant, learn highly discriminative latent representations that often outperform the Gaussian VAE's.",
    "We propose a framework for training multiple neural networks simultaneously. The parameters from all models are regularised by the tensor trace norm, so that each neural network is encouraged to reuse others' parameters if possible -- this is the main motivation behind multi-task learning. In contrast to many deep multi-task learning models, we do not predefine a parameter sharing strategy by specifying which layers have tied parameters. Instead, our framework considers sharing for all shareable layers, and the sharing strategy is learned in a data-driven way.",
    "This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems. To achieve this, the paper introduces several innovations, including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.",
    "Many machine learning classifiers are vulnerable to adversarial perturbations. An adversarial perturbation modifies an input to change a classifier's prediction without causing the input to seem substantially different to human perception. We deploy three methods to detect adversarial images. Adversaries trying to bypass our detectors must make the adversarial image less pathological or they will fail trying. Our best detection method reveals that adversarial images place abnormal emphasis on the lower-ranked principal components from PCA. Other detectors and a colorful saliency map are in an appendix.",
    "We propose a principled method for kernel learning, which relies on a Fourier-analytic characterization of translation-invariant or rotation-invariant kernels. Our method produces a sequence of feature maps, iteratively refining the SVM margin. We provide rigorous guarantees for optimality and generalization, interpreting our algorithm as online equilibrium-finding dynamics in a certain two-player min-max game. Evaluations on synthetic and real-world datasets demonstrate scalability and consistent improvements over related random features-based methods.",
    "State-of-the-art deep reading comprehension models are dominated by recurrent neural nets. Their sequential nature is a natural fit for language, but it also precludes parallelization within an instances and often becomes the bottleneck for deploying such models to latency critical scenarios. This is particularly problematic for longer texts. Here we present a convolutional architecture as an alternative to these recurrent architectures. Using simple dilated convolutional units in place of recurrent ones, we achieve results comparable to the state of the art on two question answering tasks, while at the same time achieving up to two orders of magnitude speedups for question answering.",
    "This report has several purposes. First, our report is written to investigate the reproducibility of the submitted paper On the regularization of Wasserstein GANs (2018). Second, among the experiments performed in the submitted paper, five aspects were emphasized and reproduced: learning speed, stability, robustness against hyperparameter, estimating the Wasserstein distance, and various sampling method. Finally, we identify which parts of the contribution can be reproduced, and at what cost in terms of resources. All source code for reproduction is open to the public.",
    "Variational Autoencoders (VAEs) were originally motivated (Kingma & Welling, 2014) as probabilistic generative models in which one performs approximate Bayesian inference. The proposal of $\\beta$-VAEs (Higgins et al., 2017) breaks this interpretation and generalizes VAEs to application domains beyond generative modeling (e.g., representation learning, clustering, or lossy data compression) by introducing an objective function that allows practitioners to trade off between the information content (\"bit rate\") of the latent representation and the distortion of reconstructed data (Alemi et al., 2018). In this paper, we reconsider this rate/distortion trade-off in the context of hierarchical VAEs, i.e., VAEs with more than one layer of latent variables. We identify a general class of inference models for which one can split the rate into contributions from each layer, which can then be tuned independently. We derive theoretical bounds on the performance of downstream tasks as functions of the individual layers' rates and verify our theoretical findings in large-scale experiments. Our results provide guidance for practitioners on which region in rate-space to target for a given application.",
    "Methods that learn representations of nodes in a graph play a critical role in network analysis since they enable many downstream learning tasks. We propose Graph2Gauss - an approach that can efficiently learn versatile node embeddings on large scale (attributed) graphs that show strong performance on tasks such as link prediction and node classification. Unlike most approaches that represent nodes as point vectors in a low-dimensional continuous space, we embed each node as a Gaussian distribution, allowing us to capture uncertainty about the representation. Furthermore, we propose an unsupervised method that handles inductive learning scenarios and is applicable to different types of graphs: plain/attributed, directed/undirected. By leveraging both the network structure and the associated node attributes, we are able to generalize to unseen nodes without additional training. To learn the embeddings we adopt a personalized ranking formulation w.r.t. the node distances that exploits the natural ordering of the nodes imposed by the network structure. Experiments on real world networks demonstrate the high performance of our approach, outperforming state-of-the-art network embedding methods on several different tasks. Additionally, we demonstrate the benefits of modeling uncertainty - by analyzing it we can estimate neighborhood diversity and detect the intrinsic latent dimensionality of a graph.",
    "This paper explores the use of self-ensembling for visual domain adaptation problems. Our technique is derived from the mean teacher variant (Tarvainen et al., 2017) of temporal ensembling (Laine et al;, 2017), a technique that achieved state of the art results in the area of semi-supervised learning. We introduce a number of modifications to their approach for challenging domain adaptation scenarios and evaluate its effectiveness. Our approach achieves state of the art results in a variety of benchmarks, including our winning entry in the VISDA-2017 visual domain adaptation challenge. In small image benchmarks, our algorithm not only outperforms prior art, but can also achieve accuracy that is close to that of a classifier trained in a supervised fashion.",
    "Most machine learning classifiers, including deep neural networks, are vulnerable to adversarial examples. Such inputs are typically generated by adding small but purposeful modifications that lead to incorrect outputs while imperceptible to human eyes. The goal of this paper is not to introduce a single method, but to make theoretical steps towards fully understanding adversarial examples. By using concepts from topology, our theoretical analysis brings forth the key reasons why an adversarial example can fool a classifier ($f_1$) and adds its oracle ($f_2$, like human eyes) in such analysis. By investigating the topological relationship between two (pseudo)metric spaces corresponding to predictor $f_1$ and oracle $f_2$, we develop necessary and sufficient conditions that can determine if $f_1$ is always robust (strong-robust) against adversarial examples according to $f_2$. Interestingly our theorems indicate that just one unnecessary feature can make $f_1$ not strong-robust, and the right feature representation learning is the key to getting a classifier that is both accurate and strong-robust.",
    "We develop a general problem setting for training and testing the ability of agents to gather information efficiently. Specifically, we present a collection of tasks in which success requires searching through a partially-observed environment, for fragments of information which can be pieced together to accomplish various goals. We combine deep architectures with techniques from reinforcement learning to develop agents that solve our tasks. We shape the behavior of these agents by combining extrinsic and intrinsic rewards. We empirically demonstrate that these agents learn to search actively and intelligently for new information to reduce their uncertainty, and to exploit information they have already acquired.",
    "We propose an extension to neural network language models to adapt their prediction to the recent history. Our model is a simplified version of memory augmented networks, which stores past hidden activations as memory and accesses them through a dot product with the current hidden activation. This mechanism is very efficient and scales to very large memory sizes. We also draw a link between the use of external memory in neural network and cache models used with count based language models. We demonstrate on several language model datasets that our approach performs significantly better than recent memory augmented networks.",
    "Generative adversarial networks (GANs) are successful deep generative models. GANs are based on a two-player minimax game. However, the objective function derived in the original motivation is changed to obtain stronger gradients when learning the generator. We propose a novel algorithm that repeats the density ratio estimation and f-divergence minimization. Our algorithm offers a new perspective toward the understanding of GANs and is able to make use of multiple viewpoints obtained in the research of density ratio estimation, e.g. what divergence is stable and relative density ratio is useful.",
    "We present a novel framework for generating pop music. Our model is a hierarchical Recurrent Neural Network, where the layers and the structure of the hierarchy encode our prior knowledge about how pop music is composed. In particular, the bottom layers generate the melody, while the higher levels produce the drums and chords. We conduct several human studies that show strong preference of our generated music over that produced by the recent method by Google. We additionally show two applications of our framework: neural dancing and karaoke, as well as neural story singing.",
    "We look at the eigenvalues of the Hessian of a loss function before and after training. The eigenvalue distribution is seen to be composed of two parts, the bulk which is concentrated around zero, and the edges which are scattered away from zero. We present empirical evidence for the bulk indicating how over-parametrized the system is, and for the edges that depend on the input data.",
    "In this paper, we propose a new feature extraction technique for program execution logs. First, we automatically extract complex patterns from a program's behavior graph. Then, we embed these patterns into a continuous space by training an autoencoder. We evaluate the proposed features on a real-world malicious software detection task. We also find that the embedding space captures interpretable structures in the space of pattern parts.",
    "We compared the efficiency of the FlyHash model, an insect-inspired sparse neural network (Dasgupta et al., 2017), to similar but non-sparse models in an embodied navigation task. This requires a model to control steering by comparing current visual inputs to memories stored along a training route. We concluded the FlyHash model is more efficient than others, especially in terms of data encoding.",
    "In peer review, reviewers are usually asked to provide scores for the papers. The scores are then used by Area Chairs or Program Chairs in various ways in the decision-making process. The scores are usually elicited in a quantized form to accommodate the limited cognitive ability of humans to describe their opinions in numerical values. It has been found that the quantized scores suffer from a large number of ties, thereby leading to a significant loss of information. To mitigate this issue, conferences have started to ask reviewers to additionally provide a ranking of the papers they have reviewed. There are however two key challenges. First, there is no standard procedure for using this ranking information and Area Chairs may use it in different ways (including simply ignoring them), thereby leading to arbitrariness in the peer-review process. Second, there are no suitable interfaces for judicious use of this data nor methods to incorporate it in existing workflows, thereby leading to inefficiencies. We take a principled approach to integrate the ranking information into the scores. The output of our method is an updated score pertaining to each review that also incorporates the rankings. Our approach addresses the two aforementioned challenges by: (i) ensuring that rankings are incorporated into the updates scores in the same manner for all papers, thereby mitigating arbitrariness, and (ii) allowing to seamlessly use existing interfaces and workflows designed for scores. We empirically evaluate our method on synthetic datasets as well as on peer reviews from the ICLR 2017 conference, and find that it reduces the error by approximately 30% as compared to the best performing baseline on the ICLR 2017 data.",
    "Many recent studies have probed status bias in the peer-review process of academic journals and conferences. In this article, we investigated the association between author metadata and area chairs' final decisions (Accept/Reject) using our compiled database of 5,313 borderline submissions to the International Conference on Learning Representations (ICLR) from 2017 to 2022. We carefully defined elements in a cause-and-effect analysis, including the treatment and its timing, pre-treatment variables, potential outcomes and causal null hypothesis of interest, all in the context of study units being textual data and under Neyman and Rubin's potential outcomes (PO) framework. We found some weak evidence that author metadata was associated with articles' final decisions. We also found that, under an additional stability assumption, borderline articles from high-ranking institutions (top-30% or top-20%) were less favored by area chairs compared to their matched counterparts. The results were consistent in two different matched designs (odds ratio = 0.82 [95% CI: 0.67 to 1.00] in a first design and 0.83 [95% CI: 0.64 to 1.07] in a strengthened design). We discussed how to interpret these results in the context of multiple interactions between a study unit and different agents (reviewers and area chairs) in the peer-review system.",
    "We present a variational approximation to the information bottleneck of Tishby et al. (1999). This variational approach allows us to parameterize the information bottleneck model using a neural network and leverage the reparameterization trick for efficient training. We call this method \"Deep Variational Information Bottleneck\", or Deep VIB. We show that models trained with the VIB objective outperform those that are trained with other forms of regularization, in terms of generalization performance and robustness to adversarial attack.",
    "Attention networks have proven to be an effective approach for embedding categorical inference within a deep neural network. However, for many tasks we may want to model richer structural dependencies without abandoning end-to-end training. In this work, we experiment with incorporating richer structural distributions, encoded using graphical models, within deep networks. We show that these structured attention networks are simple extensions of the basic attention procedure, and that they allow for extending attention beyond the standard soft-selection approach, such as attending to partial segmentations or to subtrees. We experiment with two different classes of structured attention networks: a linear-chain conditional random field and a graph-based parsing model, and describe how these models can be practically implemented as neural network layers. Experiments show that this approach is effective for incorporating structural biases, and structured attention networks outperform baseline attention models on a variety of synthetic and real tasks: tree transduction, neural machine translation, question answering, and natural language inference. We further find that models trained in this way learn interesting unsupervised hidden representations that generalize simple attention.",
    "We are proposing to use an ensemble of diverse specialists, where speciality is defined according to the confusion matrix. Indeed, we observed that for adversarial instances originating from a given class, labeling tend to be done into a small subset of (incorrect) classes. Therefore, we argue that an ensemble of specialists should be better able to identify and reject fooling instances, with a high entropy (i.e., disagreement) over the decisions in the presence of adversaries. Experimental results obtained confirm that interpretation, opening a way to make the system more robust to adversarial examples through a rejection mechanism, rather than trying to classify them properly at any cost.",
    "In this paper, we present Neural Phrase-based Machine Translation (NPMT). Our method explicitly models the phrase structures in output sequences using Sleep-WAke Networks (SWAN), a recently proposed segmentation-based sequence modeling method. To mitigate the monotonic alignment requirement of SWAN, we introduce a new layer to perform (soft) local reordering of input sequences. Different from existing neural machine translation (NMT) approaches, NPMT does not use attention-based decoding mechanisms. Instead, it directly outputs phrases in a sequential order and can decode in linear time. Our experiments show that NPMT achieves superior performances on IWSLT 2014 German-English/English-German and IWSLT 2015 English-Vietnamese machine translation tasks compared with strong NMT baselines. We also observe that our method produces meaningful phrases in output languages.",
    "We present LR-GAN: an adversarial image generation model which takes scene structure and context into account. Unlike previous generative adversarial networks (GANs), the proposed GAN learns to generate image background and foregrounds separately and recursively, and stitch the foregrounds on the background in a contextually relevant manner to produce a complete natural image. For each foreground, the model learns to generate its appearance, shape and pose. The whole model is unsupervised, and is trained in an end-to-end manner with gradient descent methods. The experiments demonstrate that LR-GAN can generate more natural images with objects that are more human recognizable than DCGAN.",
    "We describe a simple scheme that allows an agent to learn about its environment in an unsupervised manner. Our scheme pits two versions of the same agent, Alice and Bob, against one another. Alice proposes a task for Bob to complete; and then Bob attempts to complete the task. In this work we will focus on two kinds of environments: (nearly) reversible environments and environments that can be reset. Alice will \"propose\" the task by doing a sequence of actions and then Bob must undo or repeat them, respectively. Via an appropriate reward structure, Alice and Bob automatically generate a curriculum of exploration, enabling unsupervised training of the agent. When Bob is deployed on an RL task within the environment, this unsupervised training reduces the number of supervised episodes needed to learn, and in some cases converges to a higher reward.",
    "Maximum entropy modeling is a flexible and popular framework for formulating statistical models given partial knowledge. In this paper, rather than the traditional method of optimizing over the continuous density directly, we learn a smooth and invertible transformation that maps a simple distribution to the desired maximum entropy distribution. Doing so is nontrivial in that the objective being maximized (entropy) is a function of the density itself. By exploiting recent developments in normalizing flow networks, we cast the maximum entropy problem into a finite-dimensional constrained optimization, and solve the problem by combining stochastic optimization with the augmented Lagrangian method. Simulation results demonstrate the effectiveness of our method, and applications to finance and computer vision show the flexibility and accuracy of using maximum entropy flow networks.",
    "With machine learning successfully applied to new daunting problems almost every day, general AI starts looking like an attainable goal. However, most current research focuses instead on important but narrow applications, such as image classification or machine translation. We believe this to be largely due to the lack of objective ways to measure progress towards broad machine intelligence. In order to fill this gap, we propose here a set of concrete desiderata for general AI, together with a platform to test machines on how well they satisfy such desiderata, while keeping all further complexities to a minimum.",
    "Neural networks that compute over graph structures are a natural fit for problems in a variety of domains, including natural language (parse trees) and cheminformatics (molecular graphs). However, since the computation graph has a different shape and size for every input, such networks do not directly support batched training or inference. They are also difficult to implement in popular deep learning libraries, which are based on static data-flow graphs. We introduce a technique called dynamic batching, which not only batches together operations between different input graphs of dissimilar shape, but also between different nodes within a single input graph. The technique allows us to create static graphs, using popular libraries, that emulate dynamic computation graphs of arbitrary shape and size. We further present a high-level library of compositional blocks that simplifies the creation of dynamic graph models. Using the library, we demonstrate concise and batch-wise parallel implementations for a variety of models from the literature.",
    "Although deep learning models have proven effective at solving problems in natural language processing, the mechanism by which they come to their conclusions is often unclear. As a result, these models are generally treated as black boxes, yielding no insight of the underlying learned patterns. In this paper we consider Long Short Term Memory networks (LSTMs) and demonstrate a new approach for tracking the importance of a given input to the LSTM for a given output. By identifying consistently important patterns of words, we are able to distill state of the art LSTMs on sentiment analysis and question answering into a set of representative phrases. This representation is then quantitatively validated by using the extracted phrases to construct a simple, rule-based classifier which approximates the output of the LSTM.",
    "Deep reinforcement learning has achieved many impressive results in recent years. However, tasks with sparse rewards or long horizons continue to pose significant challenges. To tackle these important problems, we propose a general framework that first learns useful skills in a pre-training environment, and then leverages the acquired skills for learning faster in downstream tasks. Our approach brings together some of the strengths of intrinsic motivation and hierarchical methods: the learning of useful skill is guided by a single proxy reward, the design of which requires very minimal domain knowledge about the downstream tasks. Then a high-level policy is trained on top of these skills, providing a significant improvement of the exploration and allowing to tackle sparse rewards in the downstream tasks. To efficiently pre-train a large span of skills, we use Stochastic Neural Networks combined with an information-theoretic regularizer. Our experiments show that this combination is effective in learning a wide span of interpretable skills in a sample-efficient way, and can significantly boost the learning performance uniformly across a wide range of downstream tasks.",
    "Deep generative models have achieved impressive success in recent years. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), as emerging families for generative model learning, have largely been considered as two distinct paradigms and received extensive independent studies respectively. This paper aims to establish formal connections between GANs and VAEs through a new formulation of them. We interpret sample generation in GANs as performing posterior inference, and show that GANs and VAEs involve minimizing KL divergences of respective posterior and inference distributions with opposite directions, extending the two learning phases of classic wake-sleep algorithm, respectively. The unified view provides a powerful tool to analyze a diverse set of existing model variants, and enables to transfer techniques across research lines in a principled way. For example, we apply the importance weighting method in VAE literatures for improved GAN learning, and enhance VAEs with an adversarial mechanism that leverages generated samples. Experiments show generality and effectiveness of the transferred techniques.",
    "We consider the problem of detecting out-of-distribution images in neural networks. We propose ODIN, a simple and effective method that does not require any change to a pre-trained neural network. Our method is based on the observation that using temperature scaling and adding small perturbations to the input can separate the softmax score distributions between in- and out-of-distribution images, allowing for more effective detection. We show in a series of experiments that ODIN is compatible with diverse network architectures and datasets. It consistently outperforms the baseline approach by a large margin, establishing a new state-of-the-art performance on this task. For example, ODIN reduces the false positive rate from the baseline 34.7% to 4.3% on the DenseNet (applied to CIFAR-10) when the true positive rate is 95%.",
    "A framework is presented for unsupervised learning of representations based on infomax principle for large-scale neural populations. We use an asymptotic approximation to the Shannon's mutual information for a large neural population to demonstrate that a good initial approximation to the global information-theoretic optimum can be obtained by a hierarchical infomax method. Starting from the initial solution, an efficient algorithm based on gradient descent of the final objective function is proposed to learn representations from the input datasets, and the method works for complete, overcomplete, and undercomplete bases. As confirmed by numerical experiments, our method is robust and highly efficient for extracting salient features from input datasets. Compared with the main existing methods, our algorithm has a distinct advantage in both the training speed and the robustness of unsupervised representation learning. Furthermore, the proposed method is easily extended to the supervised or unsupervised model for training deep structure networks.",
    "Recurrent Neural Networks (RNNs) continue to show outstanding performance in sequence modeling tasks. However, training RNNs on long sequences often face challenges like slow inference, vanishing gradients and difficulty in capturing long term dependencies. In backpropagation through time settings, these issues are tightly coupled with the large, sequential computational graph resulting from unfolding the RNN in time. We introduce the Skip RNN model which extends existing RNN models by learning to skip state updates and shortens the effective size of the computational graph. This model can also be encouraged to perform fewer state updates through a budget constraint. We evaluate the proposed model on various tasks and show how it can reduce the number of required RNN updates while preserving, and sometimes even improving, the performance of the baseline RNN models. Source code is publicly available at https://imatge-upc.github.io/skiprnn-2017-telecombcn/ .",
    "Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial warm restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence in accelerated gradient schemes to deal with ill-conditioned functions. In this paper, we propose a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks. We empirically study its performance on the CIFAR-10 and CIFAR-100 datasets, where we demonstrate new state-of-the-art results at 3.14% and 16.21%, respectively. We also demonstrate its advantages on a dataset of EEG recordings and on a downsampled version of the ImageNet dataset. Our source code is available at https://github.com/loshchil/SGDR",
    "Policy gradient methods have achieved remarkable successes in solving challenging reinforcement learning problems. However, it still often suffers from the large variance issue on policy gradient estimation, which leads to poor sample efficiency during training. In this work, we propose a control variate method to effectively reduce variance for policy gradient methods. Motivated by the Stein's identity, our method extends the previous control variate methods used in REINFORCE and advantage actor-critic by introducing more general action-dependent baseline functions. Empirical studies show that our method significantly improves the sample efficiency of the state-of-the-art policy gradient approaches.",
    "Skip connections made the training of very deep networks possible and have become an indispensable component in a variety of neural architectures. A completely satisfactory explanation for their success remains elusive. Here, we present a novel explanation for the benefits of skip connections in training very deep networks. The difficulty of training deep networks is partly due to the singularities caused by the non-identifiability of the model. Several such singularities have been identified in previous works: (i) overlap singularities caused by the permutation symmetry of nodes in a given layer, (ii) elimination singularities corresponding to the elimination, i.e. consistent deactivation, of nodes, (iii) singularities generated by the linear dependence of the nodes. These singularities cause degenerate manifolds in the loss landscape that slow down learning. We argue that skip connections eliminate these singularities by breaking the permutation symmetry of nodes, by reducing the possibility of node elimination and by making the nodes less linearly dependent. Moreover, for typical initializations, skip connections move the network away from the \"ghosts\" of these singularities and sculpt the landscape around them to alleviate the learning slow-down. These hypotheses are supported by evidence from simplified models, as well as from experiments with deep networks trained on real-world datasets.",
    "We have tried to reproduce the results of the paper \"Natural Language Inference over Interaction Space\" submitted to ICLR 2018 conference as part of the ICLR 2018 Reproducibility Challenge. Initially, we were not aware that the code was available, so we started to implement the network from scratch. We have evaluated our version of the model on Stanford NLI dataset and reached 86.38% accuracy on the test set, while the paper claims 88.0% accuracy. The main difference, as we understand it, comes from the optimizers and the way model selection is performed.",
    "We have successfully implemented the \"Learn to Pay Attention\" model of attention mechanism in convolutional neural networks, and have replicated the results of the original paper in the categories of image classification and fine-grained recognition.",
    "Computing universal distributed representations of sentences is a fundamental task in natural language processing. We propose a method to learn such representations by encoding the suffixes of word sequences in a sentence and training on the Stanford Natural Language Inference (SNLI) dataset. We demonstrate the effectiveness of our approach by evaluating it on the SentEval benchmark, improving on existing approaches on several transfer tasks.",
    "In many neural models, new features as polynomial functions of existing ones are used to augment representations. Using the natural language inference task as an example, we investigate the use of scaled polynomials of degree 2 and above as matching features. We find that scaling degree 2 features has the highest impact on performance, reducing classification error by 5% in the best models.",
    "We present a generalization bound for feedforward neural networks in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The generalization bound is derived using a PAC-Bayes analysis.",
    "In this work, we investigate Batch Normalization technique and propose its probabilistic interpretation. We propose a probabilistic model and show that Batch Normalization maximazes the lower bound of its marginalized log-likelihood. Then, according to the new probabilistic model, we design an algorithm which acts consistently during train and test. However, inference becomes computationally inefficient. To reduce memory and computational cost, we propose Stochastic Batch Normalization -- an efficient approximation of proper inference procedure. This method provides us with a scalable uncertainty estimation technique. We demonstrate the performance of Stochastic Batch Normalization on popular architectures (including deep convolutional architectures: VGG-like and ResNets) for MNIST and CIFAR-10 datasets.",
    "It is widely believed that the success of deep convolutional networks is based on progressively discarding uninformative variability about the input with respect to the problem at hand. This is supported empirically by the difficulty of recovering images from their hidden representations, in most commonly used network architectures. In this paper we show via a one-to-one mapping that this loss of information is not a necessary condition to learn representations that generalize well on complicated problems, such as ImageNet. Via a cascade of homeomorphic layers, we build the i-RevNet, a network that can be fully inverted up to the final projection onto the classes, i.e. no information is discarded. Building an invertible architecture is difficult, for one, because the local inversion is ill-conditioned, we overcome this by providing an explicit inverse. An analysis of i-RevNets learned representations suggests an alternative explanation for the success of deep networks by a progressive contraction and linear separation with depth. To shed light on the nature of the model learned by the i-RevNet we reconstruct linear interpolations between natural image representations.",
    "Deep latent variable models are powerful tools for representation learning. In this paper, we adopt the deep information bottleneck model, identify its shortcomings and propose a model that circumvents them. To this end, we apply a copula transformation which, by restoring the invariance properties of the information bottleneck method, leads to disentanglement of the features in the latent space. Building on that, we show how this transformation translates to sparsity of the latent space in the new model. We evaluate our method on artificial and real data.",
    "We introduce a variant of the MAC model (Hudson and Manning, ICLR 2018) with a simplified set of equations that achieves comparable accuracy, while training faster. We evaluate both models on CLEVR and CoGenT, and show that, transfer learning with fine-tuning results in a 15 point increase in accuracy, matching the state of the art. Finally, in contrast, we demonstrate that improper fine-tuning can actually reduce a model's accuracy as well.",
    "Adaptive Computation Time for Recurrent Neural Networks (ACT) is one of the most promising architectures for variable computation. ACT adapts to the input sequence by being able to look at each sample more than once, and learn how many times it should do it. In this paper, we compare ACT to Repeat-RNN, a novel architecture based on repeating each sample a fixed number of times. We found surprising results, where Repeat-RNN performs as good as ACT in the selected tasks. Source code in TensorFlow and PyTorch is publicly available at https://imatge-upc.github.io/danifojo-2018-repeatrnn/",
    "Generative adversarial networks (GANs) are able to model the complex highdimensional distributions of real-world data, which suggests they could be effective for anomaly detection. However, few works have explored the use of GANs for the anomaly detection task. We leverage recently developed GAN models for anomaly detection, and achieve state-of-the-art performance on image and network intrusion datasets, while being several hundred-fold faster at test time than the only published GAN-based method.",
    "Natural Language Inference (NLI) task requires an agent to determine the logical relationship between a natural language premise and a natural language hypothesis. We introduce Interactive Inference Network (IIN), a novel class of neural network architectures that is able to achieve high-level understanding of the sentence pair by hierarchically extracting semantic features from interaction space. We show that an interaction tensor (attention weight) contains semantic information to solve natural language inference, and a denser interaction tensor contains richer semantic information. One instance of such architecture, Densely Interactive Inference Network (DIIN), demonstrates the state-of-the-art performance on large scale NLI copora and large-scale NLI alike corpus. It's noteworthy that DIIN achieve a greater than 20% error reduction on the challenging Multi-Genre NLI (MultiNLI) dataset with respect to the strongest published system.",
    "The ability to deploy neural networks in real-world, safety-critical systems is severely limited by the presence of adversarial examples: slightly perturbed inputs that are misclassified by the network. In recent years, several techniques have been proposed for increasing robustness to adversarial examples --- and yet most of these have been quickly shown to be vulnerable to future attacks. For example, over half of the defenses proposed by papers accepted at ICLR 2018 have already been broken. We propose to address this difficulty through formal verification techniques. We show how to construct provably minimally distorted adversarial examples: given an arbitrary neural network and input sample, we can construct adversarial examples which we prove are of minimal distortion. Using this approach, we demonstrate that one of the recent ICLR defense proposals, adversarial retraining, provably succeeds at increasing the distortion required to construct adversarial examples by a factor of 4.2.",
    "Deep neural networks (DNNs) have achieved impressive predictive performance due to their ability to learn complex, non-linear relationships between variables. However, the inability to effectively visualize these relationships has led to DNNs being characterized as black boxes and consequently limited their applications. To ameliorate this problem, we introduce the use of hierarchical interpretations to explain DNN predictions through our proposed method, agglomerative contextual decomposition (ACD). Given a prediction from a trained DNN, ACD produces a hierarchical clustering of the input features, along with the contribution of each cluster to the final prediction. This hierarchy is optimized to identify clusters of features that the DNN learned are predictive. Using examples from Stanford Sentiment Treebank and ImageNet, we show that ACD is effective at diagnosing incorrect predictions and identifying dataset bias. Through human experiments, we demonstrate that ACD enables users both to identify the more accurate of two DNNs and to better trust a DNN's outputs. We also find that ACD's hierarchy is largely robust to adversarial perturbations, implying that it captures fundamental aspects of the input and ignores spurious noise.",
    "In this work, we address the problem of musical timbre transfer, where the goal is to manipulate the timbre of a sound sample from one instrument to match another instrument while preserving other musical content, such as pitch, rhythm, and loudness. In principle, one could apply image-based style transfer techniques to a time-frequency representation of an audio signal, but this depends on having a representation that allows independent manipulation of timbre as well as high-quality waveform generation. We introduce TimbreTron, a method for musical timbre transfer which applies \"image\" domain style transfer to a time-frequency representation of the audio signal, and then produces a high-quality waveform using a conditional WaveNet synthesizer. We show that the Constant Q Transform (CQT) representation is particularly well-suited to convolutional architectures due to its approximate pitch equivariance. Based on human perceptual evaluations, we confirmed that TimbreTron recognizably transferred the timbre while otherwise preserving the musical content, for both monophonic and polyphonic samples.",
    "We consider the task of word-level language modeling and study the possibility of combining hidden-states-based short-term representations with medium-term representations encoded in dynamical weights of a language model. Our work extends recent experiments on language models with dynamically evolving weights by casting the language modeling problem into an online learning-to-learn framework in which a meta-learner is trained by gradient-descent to continuously update a language model weights.",
    "GANS are powerful generative models that are able to model the manifold of natural images. We leverage this property to perform manifold regularization by approximating the Laplacian norm using a Monte Carlo approximation that is easily computed with the GAN. When incorporated into the feature-matching GAN of Improved GAN, we achieve state-of-the-art results for GAN-based semi-supervised learning on the CIFAR-10 dataset, with a method that is significantly easier to implement than competing methods.",
    "We identify a class of over-parameterized deep neural networks with standard activation functions and cross-entropy loss which provably have no bad local valley, in the sense that from any point in parameter space there exists a continuous path on which the cross-entropy loss is non-increasing and gets arbitrarily close to zero. This implies that these networks have no sub-optimal strict local minima.",
    "Visual Question Answering (VQA) models have struggled with counting objects in natural images so far. We identify a fundamental problem due to soft attention in these models as a cause. To circumvent this problem, we propose a neural network component that allows robust counting from object proposals. Experiments on a toy task show the effectiveness of this component and we obtain state-of-the-art accuracy on the number category of the VQA v2 dataset without negatively affecting other categories, even outperforming ensemble models with our single model. On a difficult balanced pair metric, the component gives a substantial improvement in counting over a strong baseline by 6.6%.",
    "One of the challenges in the study of generative adversarial networks is the instability of its training. In this paper, we propose a novel weight normalization technique called spectral normalization to stabilize the training of the discriminator. Our new normalization technique is computationally light and easy to incorporate into existing implementations. We tested the efficacy of spectral normalization on CIFAR10, STL-10, and ILSVRC2012 dataset, and we experimentally confirmed that spectrally normalized GANs (SN-GANs) is capable of generating images of better or equal quality relative to the previous training stabilization techniques.",
    "Embedding graph nodes into a vector space can allow the use of machine learning to e.g. predict node classes, but the study of node embedding algorithms is immature compared to the natural language processing field because of a diverse nature of graphs. We examine the performance of node embedding algorithms with respect to graph centrality measures that characterize diverse graphs, through systematic experiments with four node embedding algorithms, four or five graph centralities, and six datasets. Experimental results give insights into the properties of node embedding algorithms, which can be a basis for further research on this topic.",
    "We introduce a new dataset of logical entailments for the purpose of measuring models' ability to capture and exploit the structure of logical expressions against an entailment prediction task. We use this task to compare a series of architectures which are ubiquitous in the sequence-processing literature, in addition to a new model class---PossibleWorldNets---which computes entailment as a \"convolution over possible worlds\". Results show that convolutional networks present the wrong inductive bias for this class of problems relative to LSTM RNNs, tree-structured neural networks outperform LSTM RNNs due to their enhanced ability to exploit the syntax of logic, and PossibleWorldNets outperform all benchmarks.",
    "Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance.   We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the \"lottery ticket hypothesis:\" dense, randomly-initialized, feed-forward networks contain subnetworks (\"winning tickets\") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.   We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.",
    "We characterize the singular values of the linear transformation associated with a standard 2D multi-channel convolutional layer, enabling their efficient computation. This characterization also leads to an algorithm for projecting a convolutional layer onto an operator-norm ball. We show that this is an effective regularizer; for example, it improves the test error of a deep residual network using batch normalization on CIFAR-10 from 6.2\\% to 5.3\\%.",
    "Understanding theoretical properties of deep and locally connected nonlinear network, such as deep convolutional neural network (DCNN), is still a hard problem despite its empirical success. In this paper, we propose a novel theoretical framework for such networks with ReLU nonlinearity. The framework explicitly formulates data distribution, favors disentangled representations and is compatible with common regularization techniques such as Batch Norm. The framework is built upon teacher-student setting, by expanding the student forward/backward propagation onto the teacher's computational graph. The resulting model does not impose unrealistic assumptions (e.g., Gaussian inputs, independence of activation, etc). Our framework could help facilitate theoretical analysis of many practical issues, e.g. overfitting, generalization, disentangled representations in deep networks.",
    "We present a Neural Program Search, an algorithm to generate programs from natural language description and a small number of input/output examples. The algorithm combines methods from Deep Learning and Program Synthesis fields by designing rich domain-specific language (DSL) and defining efficient search algorithm guided by a Seq2Tree model on it. To evaluate the quality of the approach we also present a semi-synthetic dataset of descriptions with test examples and corresponding programs. We show that our algorithm significantly outperforms a sequence-to-sequence model with attention baseline.",
    "Most state-of-the-art neural machine translation systems, despite being different in architectural skeletons (e.g. recurrence, convolutional), share an indispensable feature: the Attention. However, most existing attention methods are token-based and ignore the importance of phrasal alignments, the key ingredient for the success of phrase-based statistical machine translation. In this paper, we propose novel phrase-based attention methods to model n-grams of tokens as attention entities. We incorporate our phrase-based attentions into the recently proposed Transformer network, and demonstrate that our approach yields improvements of 1.3 BLEU for English-to-German and 0.5 BLEU for German-to-English translation tasks on WMT newstest2014 using WMT'16 training data.",
    "We introduce the problem of learning distributed representations of edits. By combining a \"neural editor\" with an \"edit encoder\", our models learn to represent the salient information of an edit and can be used to apply edits to new inputs. We experiment on natural language and source code edit data. Our evaluation yields promising results that suggest that our neural network models learn to capture the structure and semantics of edits. We hope that this interesting task and data source will inspire other researchers to work further on this problem.",
    "We propose a principled method for kernel learning, which relies on a Fourier-analytic characterization of translation-invariant or rotation-invariant kernels. Our method produces a sequence of feature maps, iteratively refining the SVM margin. We provide rigorous guarantees for optimality and generalization, interpreting our algorithm as online equilibrium-finding dynamics in a certain two-player min-max game. Evaluations on synthetic and real-world datasets demonstrate scalability and consistent improvements over related random features-based methods.",
    "This paper develops variational continual learning (VCL), a simple but general framework for continual learning that fuses online variational inference (VI) and recent advances in Monte Carlo VI for neural networks. The framework can successfully train both deep discriminative models and deep generative models in complex continual learning settings where existing tasks evolve over time and entirely new tasks emerge. Experimental results show that VCL outperforms state-of-the-art continual learning methods on a variety of tasks, avoiding catastrophic forgetting in a fully automatic way.",
    "This report has several purposes. First, our report is written to investigate the reproducibility of the submitted paper On the regularization of Wasserstein GANs (2018). Second, among the experiments performed in the submitted paper, five aspects were emphasized and reproduced: learning speed, stability, robustness against hyperparameter, estimating the Wasserstein distance, and various sampling method. Finally, we identify which parts of the contribution can be reproduced, and at what cost in terms of resources. All source code for reproduction is open to the public.",
    "In this paper, we propose a new feature extraction technique for program execution logs. First, we automatically extract complex patterns from a program's behavior graph. Then, we embed these patterns into a continuous space by training an autoencoder. We evaluate the proposed features on a real-world malicious software detection task. We also find that the embedding space captures interpretable structures in the space of pattern parts.",
    "We propose a single neural probabilistic model based on variational autoencoder that can be conditioned on an arbitrary subset of observed features and then sample the remaining features in \"one shot\". The features may be both real-valued and categorical. Training of the model is performed by stochastic variational Bayes. The experimental evaluation on synthetic data, as well as feature imputation and image inpainting problems, shows the effectiveness of the proposed approach and diversity of the generated samples.",
    "Variational Autoencoders (VAEs) were originally motivated (Kingma & Welling, 2014) as probabilistic generative models in which one performs approximate Bayesian inference. The proposal of $\\beta$-VAEs (Higgins et al., 2017) breaks this interpretation and generalizes VAEs to application domains beyond generative modeling (e.g., representation learning, clustering, or lossy data compression) by introducing an objective function that allows practitioners to trade off between the information content (\"bit rate\") of the latent representation and the distortion of reconstructed data (Alemi et al., 2018). In this paper, we reconsider this rate/distortion trade-off in the context of hierarchical VAEs, i.e., VAEs with more than one layer of latent variables. We identify a general class of inference models for which one can split the rate into contributions from each layer, which can then be tuned independently. We derive theoretical bounds on the performance of downstream tasks as functions of the individual layers' rates and verify our theoretical findings in large-scale experiments. Our results provide guidance for practitioners on which region in rate-space to target for a given application.",
    "Understanding and characterizing the subspaces of adversarial examples aid in studying the robustness of deep neural networks (DNNs) to adversarial perturbations. Very recently, Ma et al. (ICLR 2018) proposed to use local intrinsic dimensionality (LID) in layer-wise hidden representations of DNNs to study adversarial subspaces. It was demonstrated that LID can be used to characterize the adversarial subspaces associated with different attack methods, e.g., the Carlini and Wagner's (C&W) attack and the fast gradient sign attack.   In this paper, we use MNIST and CIFAR-10 to conduct two new sets of experiments that are absent in existing LID analysis and report the limitation of LID in characterizing the corresponding adversarial subspaces, which are (i) oblivious attacks and LID analysis using adversarial examples with different confidence levels; and (ii) black-box transfer attacks. For (i), we find that the performance of LID is very sensitive to the confidence parameter deployed by an attack, and the LID learned from ensembles of adversarial examples with varying confidence levels surprisingly gives poor performance. For (ii), we find that when adversarial examples are crafted from another DNN model, LID is ineffective in characterizing their adversarial subspaces. These two findings together suggest the limited capability of LID in characterizing the subspaces of adversarial examples.",
    "Generative adversarial networks (GANs) form a generative modeling approach known for producing appealing samples, but they are notably difficult to train. One common way to tackle this issue has been to propose new formulations of the GAN objective. Yet, surprisingly few studies have looked at optimization methods designed for this adversarial training. In this work, we cast GAN optimization problems in the general variational inequality framework. Tapping into the mathematical programming literature, we counter some common misconceptions about the difficulties of saddle point optimization and propose to extend techniques designed for variational inequalities to the training of GANs. We apply averaging, extrapolation and a computationally cheaper variant that we call extrapolation from the past to the stochastic gradient method (SGD) and Adam.",
    "Neural message passing algorithms for semi-supervised classification on graphs have recently achieved great success. However, for classifying a node these methods only consider nodes that are a few propagation steps away and the size of this utilized neighborhood is hard to extend. In this paper, we use the relationship between graph convolutional networks (GCN) and PageRank to derive an improved propagation scheme based on personalized PageRank. We utilize this propagation procedure to construct a simple model, personalized propagation of neural predictions (PPNP), and its fast approximation, APPNP. Our model's training time is on par or faster and its number of parameters on par or lower than previous models. It leverages a large, adjustable neighborhood for classification and can be easily combined with any neural network. We show that this model outperforms several recently proposed methods for semi-supervised classification in the most thorough study done so far for GCN-like models. Our implementation is available online.",
    "We identify obfuscated gradients, a kind of gradient masking, as a phenomenon that leads to a false sense of security in defenses against adversarial examples. While defenses that cause obfuscated gradients appear to defeat iterative optimization-based attacks, we find defenses relying on this effect can be circumvented. We describe characteristic behaviors of defenses exhibiting the effect, and for each of the three types of obfuscated gradients we discover, we develop attack techniques to overcome it. In a case study, examining non-certified white-box-secure defenses at ICLR 2018, we find obfuscated gradients are a common occurrence, with 7 of 9 defenses relying on obfuscated gradients. Our new attacks successfully circumvent 6 completely, and 1 partially, in the original threat model each paper considers.",
    "Methods that learn representations of nodes in a graph play a critical role in network analysis since they enable many downstream learning tasks. We propose Graph2Gauss - an approach that can efficiently learn versatile node embeddings on large scale (attributed) graphs that show strong performance on tasks such as link prediction and node classification. Unlike most approaches that represent nodes as point vectors in a low-dimensional continuous space, we embed each node as a Gaussian distribution, allowing us to capture uncertainty about the representation. Furthermore, we propose an unsupervised method that handles inductive learning scenarios and is applicable to different types of graphs: plain/attributed, directed/undirected. By leveraging both the network structure and the associated node attributes, we are able to generalize to unseen nodes without additional training. To learn the embeddings we adopt a personalized ranking formulation w.r.t. the node distances that exploits the natural ordering of the nodes imposed by the network structure. Experiments on real world networks demonstrate the high performance of our approach, outperforming state-of-the-art network embedding methods on several different tasks. Additionally, we demonstrate the benefits of modeling uncertainty - by analyzing it we can estimate neighborhood diversity and detect the intrinsic latent dimensionality of a graph.",
    "Convolutional Neural Networks (CNNs) have become the method of choice for learning problems involving 2D planar images. However, a number of problems of recent interest have created a demand for models that can analyze spherical images. Examples include omnidirectional vision for drones, robots, and autonomous cars, molecular regression problems, and global weather and climate modelling. A naive application of convolutional networks to a planar projection of the spherical signal is destined to fail, because the space-varying distortions introduced by such a projection will make translational weight sharing ineffective.   In this paper we introduce the building blocks for constructing spherical CNNs. We propose a definition for the spherical cross-correlation that is both expressive and rotation-equivariant. The spherical correlation satisfies a generalized Fourier theorem, which allows us to compute it efficiently using a generalized (non-commutative) Fast Fourier Transform (FFT) algorithm. We demonstrate the computational efficiency, numerical accuracy, and effectiveness of spherical CNNs applied to 3D model recognition and atomization energy regression.",
    "This paper shows how one can directly apply natural language processing (NLP) methods to classification problems in cheminformatics. Connection between these seemingly separate fields is shown by considering standard textual representation of compound, SMILES. The problem of activity prediction against a target protein is considered, which is a crucial part of computer aided drug design process. Conducted experiments show that this way one can not only outrank state of the art results of hand crafted representations but also gets direct structural insights into the way decisions are made.",
    "The inclusion of Computer Vision and Deep Learning technologies in Agriculture aims to increase the harvest quality, and productivity of farmers. During postharvest, the export market and quality evaluation are affected by assorting of fruits and vegetables. In particular, apples are susceptible to a wide range of defects that can occur during harvesting or/and during the post-harvesting period. This paper aims to help farmers with post-harvest handling by exploring if recent computer vision and deep learning methods such as the YOLOv3 (Redmon & Farhadi (2018)) can help in detecting healthy apples from apples with defects.",
    "We present two simple ways of reducing the number of parameters and accelerating the training of large Long Short-Term Memory (LSTM) networks: the first one is \"matrix factorization by design\" of LSTM matrix into the product of two smaller matrices, and the second one is partitioning of LSTM matrix, its inputs and states into the independent groups. Both approaches allow us to train large LSTM networks significantly faster to the near state-of the art perplexity while using significantly less RNN parameters.",
    "State-of-the-art deep reading comprehension models are dominated by recurrent neural nets. Their sequential nature is a natural fit for language, but it also precludes parallelization within an instances and often becomes the bottleneck for deploying such models to latency critical scenarios. This is particularly problematic for longer texts. Here we present a convolutional architecture as an alternative to these recurrent architectures. Using simple dilated convolutional units in place of recurrent ones, we achieve results comparable to the state of the art on two question answering tasks, while at the same time achieving up to two orders of magnitude speedups for question answering.",
    "In this work, we analyze the reinstatement mechanism introduced by Ritter et al. (2018) to reveal two classes of neurons that emerge in the agent's working memory (an epLSTM cell) when trained using episodic meta-RL on an episodic variant of the Harlow visual fixation task. Specifically, Abstract neurons encode knowledge shared across tasks, while Episodic neurons carry information relevant for a specific episode's task.",
    "The rate-distortion-perception function (RDPF; Blau and Michaeli, 2019) has emerged as a useful tool for thinking about realism and distortion of reconstructions in lossy compression. Unlike the rate-distortion function, however, it is unknown whether encoders and decoders exist that achieve the rate suggested by the RDPF. Building on results by Li and El Gamal (2018), we show that the RDPF can indeed be achieved using stochastic, variable-length codes. For this class of codes, we also prove that the RDPF lower-bounds the achievable rate",
    "In this paper, we present Neural Phrase-based Machine Translation (NPMT). Our method explicitly models the phrase structures in output sequences using Sleep-WAke Networks (SWAN), a recently proposed segmentation-based sequence modeling method. To mitigate the monotonic alignment requirement of SWAN, we introduce a new layer to perform (soft) local reordering of input sequences. Different from existing neural machine translation (NMT) approaches, NPMT does not use attention-based decoding mechanisms. Instead, it directly outputs phrases in a sequential order and can decode in linear time. Our experiments show that NPMT achieves superior performances on IWSLT 2014 German-English/English-German and IWSLT 2015 English-Vietnamese machine translation tasks compared with strong NMT baselines. We also observe that our method produces meaningful phrases in output languages.",
    "It is by now well-known that small adversarial perturbations can induce classification errors in deep neural networks (DNNs). In this paper, we make the case that sparse representations of the input data are a crucial tool for combating such attacks. For linear classifiers, we show that a sparsifying front end is provably effective against $\\ell_{\\infty}$-bounded attacks, reducing output distortion due to the attack by a factor of roughly $K / N$ where $N$ is the data dimension and $K$ is the sparsity level. We then extend this concept to DNNs, showing that a \"locally linear\" model can be used to develop a theoretical foundation for crafting attacks and defenses. Experimental results for the MNIST dataset show the efficacy of the proposed sparsifying front end.",
    "We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU formulates and solves a constrained optimization problem in the non-parameterized proximal policy space. Using supervised regression, it then converts the optimal non-parameterized policy to a parameterized policy, from which it draws new samples. The methodology is general in that it applies to both discrete and continuous action spaces, and can handle a wide variety of proximity constraints for the non-parameterized optimization problem. We show how the Natural Policy Gradient and Trust Region Policy Optimization (NPG/TRPO) problems, and the Proximal Policy Optimization (PPO) problem can be addressed by this methodology. The SPU implementation is much simpler than TRPO. In terms of sample efficiency, our extensive experiments show SPU outperforms TRPO in Mujoco simulated robotic tasks and outperforms PPO in Atari video game tasks.",
    "We present a parameterized synthetic dataset called Moving Symbols to support the objective study of video prediction networks. Using several instantiations of the dataset in which variation is explicitly controlled, we highlight issues in an existing state-of-the-art approach and propose the use of a performance metric with greater semantic meaning to improve experimental interpretability. Our dataset provides canonical test cases that will help the community better understand, and eventually improve, the representations learned by such networks in the future. Code is available at https://github.com/rszeto/moving-symbols ."
  ],
  "sampled": [
    "This report details the implementation of AlexNet (Krizhevsky et al., 2012) using Theano and its straightforward data parallelism across multiple GPUs. The performance achieved with 2 GPUs is similar to the state-of-the-art results obtained with the Caffe library (Jia et al., 2014) on a single GPU. As far as we know, this is the first open-source AlexNet implementation written in Python.",
    "Deep narrow Boltzmann machines (DNBMs) have the impressive ability to universally approximate probability distributions on their visible layer activities, as long as they possess enough hidden layers, each mirroring the visible layer's unit count. Intriguingly, within specific parameter ranges, DNBMs can be analyzed as if they were feedforward networks. We have established upper and lower bounds for the necessary depth and width required for these machines to function as universal approximators. These findings affirm various intuitions about undirected networks and, notably, demonstrate that DNBMs are just as efficient in terms of compactness as narrow sigmoid belief networks and restricted Boltzmann machines, according to the latest available bounds for these models.",
    "We propose enhancing recurrent neural networks with latent variables to create Stochastic Recurrent Networks (STORNs), utilizing advances in variational inference. STORNs can be trained with stochastic gradient methods, handle structured and multi-modal conditionals at each time step, reliably estimate marginal likelihood, and generalize deterministic recurrent neural networks. We evaluate this method on four polyphonic musical datasets and motion capture data.",
    "We present a framework for online adaptation of optimization hyperparameters by swapping their values during learning. Using an explore-exploit strategy from multi-armed bandit literature, we focus on adaptive learning rate selection. Experiments on a benchmark neural network demonstrate that this approach outperforms AdaDelta and stochastic gradient methods with exhaustive hyperparameter search.",
    "Modern problems involving multiple classes or labels often have very large sets of possible outputs. Label embeddings help by making these problems more computationally and statistically efficient. In this study, we discover a connection between limited rank estimation and low-dimensional label embeddings, leading to a fast embedding algorithm for both multiclass and multilabel situations. Our approach uses a randomized algorithm for partial least squares that is much quicker than traditional methods. We test our techniques on two large public datasets—from the Large Scale Hierarchical Text Challenge and the Open Directory Project—and achieve top results.",
    "Accurate representational learning of both explicit and implicit relationships within data is crucial for enabling machines to perform more complex and abstract reasoning tasks. We present our Dynamic Adaptive Network Intelligence (DANI) model, which efficiently employs weakly supervised learning to make such inferences. Our results demonstrate that DANI achieves state-of-the-art performance on question answering tasks within the bAbI dataset, surpassing the capabilities of contemporary representation learning approaches (Weston et al., 2015).",
    "Automatic speech recognition systems often depend on spectral-based features like MFCC or PLP, which are derived from established knowledge about speech perception and production. However, the game has changed with the advent of convolutional neural networks (CNNs), which can now estimate phoneme conditional probabilities purely through a data-driven approach, directly using the raw temporal speech signal as input. Remarkably, these CNN-based systems have demonstrated comparable or superior performance to the traditional HMM/ANN systems in both phoneme recognition and large-scale continuous speech recognition tasks, all while utilizing fewer parameters.\n\nInspired by these advancements, we explore the potential of integrating a simple linear classifier within the CNN framework. This innovative approach enables the network to learn linearly separable features directly from raw speech. Our experiments reveal that this system achieves performance on par with, or even better than, multi-layer perceptron (MLP) based systems that rely on cepstral-based features.",
    "In the realm of the Kaldi speech recognition toolkit, we unveil an innovative neural-network training framework designed specifically for Deep Neural Networks (DNNs) to handle vast amounts of data using the combined power of multiple GPU-equipped or multi-core systems. Striving for hardware versatility, we crafted a solution to harness multiple machines efficiently without overwhelming network bandwidth. Our approach involves periodically synchronizing and averaging the neural network parameters—typically every minute or two—and then redistributing these averaged parameters across the machines to continue the training process. Each machine processes distinct data sets.\n\nAlthough this alone might not yield stellar results, the introduction of an efficient implementation of the Natural Gradient for Stochastic Gradient Descent (NG-SGD) transforms our method's performance. This innovative blend not only enhances the effectiveness of our periodic parameter averaging strategy but also significantly accelerates the convergence rate of SGD on individual machines, creating a harmonious and powerful training ecosystem.",
    "We have developed a novel method to visualize and refine the invariances of learned representations. Our approach tests for \"linearization invariance,\" where transformations are confined to a low-dimensional subspace. By synthesizing a sequence of minimal-length images (\"representational geodesics\") between two reference images, we can observe the gradual evolution of transformations if linearized correctly.\n\nTesting this method on a state-of-the-art image classification network, we found that transformations—such as translation, rotation, and dilation—do not show the expected gradual evolution. Our method not only identifies these failures but also offers a solution. Implementing this remedy, we demonstrated that the modified representation successfully linearizes various geometric image transformations.",
    "Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We explore these questions using group theory, introducing a novel theoretical approach to understanding Deep Learning.\n\nA significant reason for the recent advancements in Deep Learning is a crucial algorithmic step known as **pretraining**. This involves first discovering a good generative model for the input samples and then iteratively repeating this process layer by layer. We reveal deeper implications of this principle by connecting it to the relationship between orbits and stabilizers in group actions. Although neural networks themselves may not constitute groups, we demonstrate the existence of **shadow groups** whose elements closely approximate the networks' behavior.\n\nWithin these shadow groups, the pretraining step, initially designed to better initialize a network, transforms into a search for features with minimal orbits, which are intuitively the simplest features. This explains why a deep learning network first learns simple features. Furthermore, we show that this principle, when applied to deeper layers, can capture higher-order representations. Consequently, representation complexity increases as the layers deepen.",
    "Introducing the \"stacked what-where auto-encoders\" (SWWAE) – a cutting-edge architecture that seamlessly blends discriminative and generative pathways for a unified take on supervised, semi-supervised, and unsupervised learning, all without the need for sampling during training. Imagine using a convolutional net (Convnet) (LeCun et al., 1998) to smartly encode your input, while a deconvolutional net (Deconvnet) (Zeiler et al., 2010) works its magic to reconstruct it. Our goal? To make hidden states in the Deconvnet mirror those in the Convnet. Each pooling layer dishes out two sets of variables: the \"what\" that moves forward to the next layer, and the \"where\" that heads to the matching layer in the generative decoder. Get ready to see learning in a whole new light with SWWAE!",
    "We look into creating word embeddings designed for specific word pairs. Our method starts with current word vectors and refines them to predict these word pairs better. Our experiments show that these specialized embeddings can improve both accuracy and speed in predicting word relationships.",
    "A sophisticated generative model has been designed for deep convolutional dictionary learning, which employs multiple layers. This model incorporates an innovative probabilistic pooling operation, enhancing its functionality. This addition facilitates efficient probabilistic learning processes in both bottom-up directions for pretraining and top-down directions for refinement. Experimental evaluations highlight the model's impressive ability to extract multi-layer features from images. It achieves outstanding classification performance, as demonstrated on the MNIST and Caltech 101 datasets.",
    "Inspired by advances in generative models, we present a model that creates images from natural language descriptions. Our model iteratively draws patches on a canvas, focusing on pertinent words from the description. Trained on Microsoft COCO, we compare it with various baseline models on image generation and retrieval tasks. Our results show that our model produces higher quality images and generates scenes based on novel, previously unseen captions.",
    "Convolutional neural networks (CNNs) excel on large datasets, but collecting labeled data can be challenging, especially when extensive amounts of data are unavailable. This raises the issue of how to effectively use CNNs with small datasets, as they tend to overfit easily. We introduce a Bayesian CNN that enhances robustness against overfitting on limited data compared to traditional methods. By placing a probability distribution over the CNN’s kernels, we tackle this issue. We approximate our model's complex posterior distribution using Bernoulli variational distributions, which necessitate no additional model parameters.\n\nTheoretically, we interpret dropout network training as approximate inference in Bayesian neural networks. This approach allows us to leverage existing deep learning tools without increasing time complexity and reveals a significant limitation in the field. Our model demonstrates a notable improvement in classification accuracy over standard techniques and surpasses existing state-of-the-art results for the CIFAR-10 dataset.",
    "We introduce a novel approach for designing computationally efficient convolutional neural networks (CNNs) through the use of low-rank representations for convolutional filters. Rather than optimizing filters in pre-trained networks with more efficient counterparts, we build a set of compact basis filters from the ground up. During the training process, the network learns to combine these basis filters to form complex, discriminative filters suitable for image classification tasks. A new weight initialization strategy is employed to effectively set the initial connection weights in convolutional layers that consist of groups of diverse filter shapes. We validate our technique by implementing it on several established CNN architectures and training these models from scratch using the CIFAR, ILSVRC, and MIT Places datasets. Our findings demonstrate that our approach achieves similar or superior accuracy compared to traditional CNNs while significantly reducing computational requirements. \n\nFor instance, applying our method to an enhanced version of the VGG-11 network with global max-pooling, we obtain comparable validation accuracy while using 41% less computational resources and only 24% of the parameters of the original VGG-11 model. Another variation of our approach results in a 1% increase in accuracy over the improved VGG-11 model, achieving a top-5 center-crop validation accuracy of 89.7% with a 16% reduction in computation compared to the original VGG-11 model. Additionally, integrating our method with the GoogLeNet architecture for the ILSVRC dataset, we achieve comparable accuracy with 26% less computation and 41% fewer model parameters. Similarly, applying our approach to a near state-of-the-art network for the CIFAR dataset, we realize comparable accuracy with 46% less computation and a 55% reduction in parameters.",
    "Distributed representations of words have revolutionized the performance of numerous Natural Language Processing tasks. Yet, traditional methods often capture only a single representation per word, overlooking the crucial aspect that many words possess multiple meanings. This simplification undermines not just individual word representations, but the entire language model's effectiveness. In this paper, we introduce an innovative yet straightforward model that enhances modern techniques for constructing word vectors by distinguishing the various senses of polysemous words. Our evaluation demonstrates that this model successfully differentiates between word senses with remarkable computational efficiency.",
    "Introducing the Diverse Embedding Neural Network (DENN), an innovative twist on traditional language models (LMs). Unlike typical feed-forward neural networks that map input word history vectors into a single high-dimensional space, DENNLMs scatter these vectors across multiple diverse, low-dimensional sub-spaces. This diversity is actively fostered during training with an enhanced loss function. When put to the test on the Penn Treebank dataset, DENNLMs demonstrated a clear performance edge.",
    "Collaborative Filtering (CF) aims to predict user ratings on items, often using Matrix Factorization techniques. These methods compute representations for both users and items from observed ratings to make predictions. However, traditional approaches struggle with the \"cold-start\" problem, where new users have no initial ratings. Typically, users are asked to provide a few ratings to overcome this issue. This paper introduces a model designed to address two challenges: (i) identifying the best questions to ask new users and (ii) creating effective representations from limited data. Additionally, the model performs well in standard (warm) scenarios. Our approach is tested on four datasets, demonstrating improved performance over baselines in both traditional CF and cold-start contexts.",
    "We introduce a sophisticated deep learning framework for capturing intricate, high-dimensional probability distributions, termed Non-linear Independent Component Estimation (NICE). This methodology hinges on the principle that an effective representation renders the data's distribution straightforward to model. To this end, a non-linear deterministic transformation of the data is developed, mapping it to a latent space to induce a factorized distribution, thereby ensuring the independence of the latent variables. \n\nThe transformation is carefully parametrized to facilitate the computation of both the Jacobian determinant and its inverse, all while retaining the capacity to learn intricate non-linear mappings through a compositional approach, utilizing simple yet powerful building blocks based on deep neural networks. The training objective is derived from the exact log-likelihood, ensuring tractability. Furthermore, unbiased ancestral sampling is effortlessly achievable.\n\nOur empirical evaluations demonstrate that this method produces robust generative models across four image datasets, with the additional capability of performing image inpainting.",
    "Introducing Deep Linear Discriminant Analysis (DeepLDA), our novel approach that develops linearly separable latent representations in a seamless, end-to-end manner. Traditional LDA focuses on extracting features that maintain class separability and is commonly employed for dimensionality reduction in various classification tasks. This research builds on that foundation by integrating LDA with a deep neural network, effectively creating a non-linear enhancement of classic LDA.\n\nRather than simply maximizing the likelihood of target labels for individual samples, we propose an objective function designed to drive the network toward generating feature distributions that exhibit low intra-class variance and high inter-class variance. This objective is formulated based on the general LDA eigenvalue problem, allowing the model to be trained using stochastic gradient descent and back-propagation.\n\nWe validate our method across three benchmark datasets: MNIST, CIFAR-10, and STL-10. Our findings demonstrate that DeepLDA achieves competitive performance on MNIST and CIFAR-10 datasets and surpasses a network trained with categorical cross-entropy (using the same architecture) in a supervised setting on the STL-10 dataset.",
    "Hey there!\n\nWe’re excited to introduce a new and straightforward method for initializing weights in deep learning called Layer-sequential unit-variance (LSUV) initialization. Here’s how it works: \n\n1. First, you start by pre-initializing the weights of each convolutional or fully-connected layer with orthonormal matrices.\n2. Next, you move from the first to the last layer, normalizing the variance of each layer’s output to equal one.\n\nWe tried out this method with different activation functions like maxout, ReLU-family, and tanh, and found that it helps in training very deep networks. The results were pretty awesome – not only did we get test accuracy as good as or better than standard methods, but it also matched the speed of other complex techniques designed for very deep nets, like FitNets and Highway.\n\nWe tested LSUV on models like GoogLeNet, CaffeNet, FitNets, and Residual nets. The performance was top-notch, or very close to it, on datasets such as MNIST, CIFAR-10/100, and ImageNet.\n\nGive it a shot and see how it works for you!",
    "We present a parametric nonlinear transformation ideal for Gaussianizing data from natural images. The process includes linear transformation, followed by normalization of each component using a pooled activity measure, derived from an exponentiated weighted sum of rectified and exponentiated components plus a constant. We refine the parameters (linear transform, exponents, weights, constant) by minimizing the negentropy of responses over a natural image database. The resulting transformation substantially Gaussianizes the data, outperforming methods like ICA and radial Gaussianization in reducing mutual information between transformed components. This differentiable, efficiently invertible transformation also defines a density model on images. Samples from this model visually resemble natural image patches. Additionally, as a prior probability density, it effectively removes additive noise. The transformation can be cascaded, optimizing each layer with the Gaussianization objective, offering an unsupervised method to enhance deep network architectures.",
    "We present flattened convolutional neural networks optimized for fast feedforward execution. By using a sequence of one-dimensional filters in all 3D directions during training, our flattened networks achieve performance comparable to traditional convolutional networks. Testing on various datasets shows that flattened layers can replace 3D filters without sacrificing accuracy. These networks offer approximately twice the speed in feedforward passes by significantly reducing the number of learned parameters. Moreover, no manual tuning or post-processing is needed after training.",
    "In this paper, we present a groundbreaking deep learning framework called Purine. Purine reimagines a deep network as a bipartite graph (bi-graph), which consists of interconnected operators and data tensors. This bi-graph abstraction allows networks to be efficiently handled by an event-driven task dispatcher. We showcase that various parallelism strategies across GPUs and/or CPUs on single or multiple PCs can be universally implemented through graph composition. This significantly simplifies the coding required for different parallelization schemes, as the same dispatcher can manage diverse graphs. By scheduling with the task dispatcher, memory transfers seamlessly overlap with other computations, significantly reducing communication overhead and enabling near-linear speedup.",
    "In this study, we introduce a novel model that leverages the advantages of both RNNs and SGVB, termed the Variational Recurrent Auto-Encoder (VRAE). This model is particularly well-suited for scalable and efficient unsupervised learning with time series data, enabling the transformation of such data into a latent vector representation. As a generative model, it allows for the creation of new data instances by sampling from the latent space. A key contribution of this research is the model's ability to utilize unlabeled data, thereby enhancing the supervised training of RNNs by initializing the weights and network state.",
    "Present-day research on lexical distributed embeddings typically maps individual words to point vectors within a low-dimensional space. In contrast, mapping to a density offers several notable benefits, such as enhanced representation of uncertainty and relationships, more natural expression of asymmetries compared to dot product or cosine similarity, and the potential for more sophisticated parameterization of decision boundaries. This paper champions the use of density-based distributed embeddings and introduces a technique for deriving representations within the domain of Gaussian distributions. We evaluate the approach on various word embedding benchmarks, examine how well these embeddings model entailment and other asymmetric relationships, and explore unique properties of the representation.",
    "Multipliers are the most space and power-consuming arithmetic operators in the digital implementation of deep neural networks. We train a set of advanced neural networks (Maxout networks) on three benchmark datasets: MNIST, CIFAR-10, and SVHN. These networks are trained using three different formats: floating point, fixed point, and dynamic fixed point. For each dataset and each format, we evaluate the effect of multiplication precision on the final error after training. Our findings indicate that very low precision is adequate not only for running trained networks but also for training them. For instance, it is possible to train Maxout networks using 10-bit multiplications.",
    "Multiple instance learning (MIL) can help reduce the expensive need for detailed annotations in tasks like semantic segmentation by using less precise supervision. We introduce a new MIL approach for multi-class semantic segmentation using a fully convolutional network. This method aims to train a semantic segmentation model using only weak image-level labels. The training process, which is end-to-end, optimizes both the representation and the assignment of pixel labels. Fully convolutional training works with inputs of any size, avoids the need for object proposal preprocessing, and provides a pixelwise loss map to identify hidden instances. Our multi-class MIL loss takes advantage of the extra supervision provided by images with multiple labels. We tested this approach with preliminary experiments on the PASCAL VOC segmentation challenge.",
    "Nested dropout is a newly suggested technique for organizing the units in autoencoders based on their information value, while still maintaining the reconstruction efficiency. So far, this method has been used exclusively to train fully-connected autoencoders in an unsupervised manner. In our study, we analyze how nested dropout affects convolutional layers within a CNN that is trained using backpropagation. We aim to determine whether nested dropout offers an efficient and structured approach to identify the ideal representation size that balances accuracy, task requirements, and data complexity.",
    "Stochastic gradient algorithms have been pivotal in addressing large-scale learning challenges, leading to significant advancements in machine learning. The convergence of Stochastic Gradient Descent (SGD) is contingent upon the meticulous selection of the learning rate and the degree of noise present in the stochastic estimates of the gradients. In this paper, we introduce a novel adaptive learning rate algorithm that leverages curvature information to automatically adjust learning rates. This curvature information is derived from the local statistics of the stochastic first-order gradients, providing element-wise insights into the loss function. Additionally, we propose a new variance reduction technique designed to accelerate convergence. Our preliminary experiments with deep neural networks demonstrate superior performance when compared to widely used stochastic gradient algorithms.",
    "Imagine watching a 3D object as it moves around you. Your view of it changes, right? Well, when this happens, both the image you see and the visual model a computer has learned get updated. Here's a fascinating idea: a great visual model adapts smoothly to movements in the scene. Applying the theory of group representations, we've discovered that this kind of model can actually break down into a mix of simple, fundamental parts.\n\nWe've also found something amazing: these basic parts, or 'irreducible representations,' have unique statistical properties—they’re decorrelated under certain conditions. When you only see part of a scene, like through a camera, things get trickier. The movement doesn't transform the images in a straightforward way. So, we need to make educated guesses about a deeper representation that does change predictably.\n\nTo bring this concept to life, look at our model of rotating NORB objects. It uses hidden representations of the intricate 3D rotation group, SO(3), to make sense of the movement. This approach helps the computer understand and represent the object's rotation accurately, even when viewed from different angles.",
    "Finding the most efficient Maximum Inner Product Search (MIPS) is crucial for applications in recommendation systems and in handling vast numbers of classes in classification tasks. Recent studies have explored solutions such as locality-sensitive hashing (LSH) and tree-based methods to achieve approximate MIPS in sublinear time. In this paper, we introduce a straightforward alternative approach using adaptations of the k-means clustering algorithm. Specifically, we apply spherical k-means clustering after transforming the MIPS problem into a Maximum Cosine Similarity Search (MCSS). Our experiments on two standard benchmarks for recommendation systems and on sizeable word embedding vocabularies demonstrate that this uncomplicated method significantly accelerates search processes while maintaining retrieval precision, outperforming current top-tier hashing-based and tree-based methods. Additionally, this method delivers more reliable retrievals in the presence of noisy queries.",
    "The variational autoencoder (VAE), introduced by Kingma and Welling in 2014, is a generative model consisting of two networks: a top-down generative network and a bottom-up recognition network. The recognition network approximates posterior inference, often assuming that the posterior distribution can be factored and its parameters can be estimated through nonlinear regression from observations. However, we find that the VAE objective can result in overly simplified representations that do not fully utilize the network's modeling capacity.\n\nWe introduce the importance weighted autoencoder (IWAE), which has the same architecture as the VAE but uses a more accurate log-likelihood lower bound obtained through importance weighting. In the IWAE, the recognition network uses multiple samples to better approximate the posterior, giving it more flexibility to model complex distributions that the VAE cannot handle. Our empirical results show that IWAEs learn more detailed latent space representations than VAEs, which leads to better performance on density estimation benchmarks.",
    "This work explores how using lower precision data in Convolutional Neural Networks (CNNs) affects their accuracy during classification. Specifically, it looks at networks where different layers can use different precision levels. The main finding is that CNNs' tolerance to lower precision data varies not only between different networks but also within individual networks. Using varied precision per layer could lead to energy and performance benefits. This study examines how error tolerance varies across layers and proposes a method to find a low precision setup that maintains high accuracy. By analyzing various CNNs, it shows that, compared to a standard implementation using 32-bit floating-point data for all layers, and with less than 1% decrease in relative accuracy, the data footprint of these networks can be reduced by an average of 74% and up to 92%.",
    "The effectiveness of graph-based semi-supervised algorithms hinges on the quality of the graph constructed from the given instances. Typically, these instances are initially represented in vector form before being used to form a graph. The graph's construction depends on a metric derived from the vector space, which determines the weights of connections between entities. Traditionally, a distance or similarity measure based on the Euclidean norm is used for this purpose. However, we argue that the Euclidean norm may not always be the optimal choice for efficiently solving the task. Therefore, we propose an algorithm designed to learn the most suitable vector representation for constructing a graph that can address the task more effectively.",
    "Hypernymy, textual entailment, and image captioning can be viewed as specific instances of a unified visual-semantic hierarchy encompassing words, sentences, and images. In this paper, we propose explicitly modeling the partial order structure inherent in this hierarchy. To achieve this, we present a general method for learning ordered representations and demonstrate its applicability to various tasks involving images and language. Our results show that the resulting representations enhance performance in hypernym prediction and image-caption retrieval compared to current approaches.",
    "We introduce Local Distributional Smoothness (LDS), a groundbreaking smoothness criterion for statistical models that can be used as a regularization term to enhance the model's distributional smoothness. We call this LDS-based regularization technique Virtual Adversarial Training (VAT). The LDS at a given input point is defined as the KL-divergence robustness of the model's distribution against localized perturbations around that point. VAT is akin to adversarial training but stands out by determining the adversarial direction based solely on the model distribution, independent of label information, making it ideal for semi-supervised learning.\n\nVAT is computationally efficient: for neural networks, the approximated gradient of LDS can be computed with just three pairs of forward and backward propagations. When applied to supervised and semi-supervised learning on the MNIST dataset, our technique surpassed all but the most advanced generative model-based state-of-the-art method. Additionally, our method demonstrated superior performance over the current state-of-the-art semi-supervised methods on the SVHN and NORB datasets.",
    "Large labeled datasets have enabled Convolutional Network models to achieve impressive recognition results. However, manual annotation is often impractical, and our data may have noisy labels, meaning the labels may not be accurate. In this paper, we examine the performance of discriminatively-trained ConvNets on such noisy data. We introduce an additional noise layer in the network that adapts the outputs to match the noisy label distribution. The parameters of this noise layer can be estimated during training with simple modifications to existing deep network training infrastructures. We demonstrate our approach on several datasets, including large-scale experiments on the ImageNet classification benchmark.",
    "We offer innovative, guaranteed methods for training feedforward neural networks with sparse connections. Building on established techniques for learning linear networks, we demonstrate their effective application to non-linear networks. By focusing on the moments related to the label and the input's score function, we prove that their factorization can reliably produce the first layer's weight matrix in a deep network under reasonable conditions. In practical terms, our method's results can serve as highly effective initializers for gradient descent.",
    "Discourse relations connect smaller pieces of language to form clear and meaningful texts. Identifying these relations automatically is tough because it requires understanding the meaning of the linked sentences. Furthermore, it’s challenging because it’s not enough just to understand each sentence individually; the connections between smaller elements, like mentions of entities, also matter.\n\nOur solution involves creating meaning representations by building up the syntactic parse tree. Unlike previous approaches, we also create representations for entity mentions using a new method that works in a downward manner. We predict discourse relations using these representations not only for sentences but also for the key entities within them. This approach significantly improves the accuracy of predicting implicit discourse relations in the Penn Discourse Treebank compared to previous methods.",
    "In this study, we introduce a new approach for combining two recent research areas: unsupervised learning of shallow semantics, such as semantic roles, and the factorization of relations in text and knowledge bases. Our model includes two key components: (1) an encoding component, which uses a semantic role labeling model to predict roles based on a comprehensive set of syntactic and lexical features; (2) a reconstruction component, which uses a tensor factorization model to predict argument fillers based on the roles. When these components are jointly optimized to reduce errors in argument reconstruction, the resulting roles align closely with those defined in annotated resources. Our method achieves performance comparable to the best role induction methods for English, despite not using any prior linguistic knowledge about the language.",
    "The concept of a metric is crucial in machine learning tasks like classification, clustering, and ranking. However, it should be noted that there is a significant lack of theoretical guarantees regarding the generalization ability of a classifier linked to a specific metric. The theoretical framework of $(\\epsilon, \\gamma, \\tau)$-good similarity functions (Balcan et al., 2008) was among the first efforts to connect the attributes of a similarity function with those of a linear classifier utilizing it. In this paper, we build upon and complete this theory by offering a new generalization bound for the associated classifier based on the algorithmic robustness framework.",
    "We introduce the multiplicative recurrent neural network as a general framework for understanding compositional meaning in language, and we assess its performance on fine-grained sentiment analysis. We draw a link to previously studied matrix-space models for compositionality, demonstrating that these are specific instances of the multiplicative recurrent network. Our experiments indicate that these models perform as well as or better than Elman-type additive recurrent neural networks and surpass matrix-space models on a standard fine-grained sentiment analysis dataset. Additionally, they achieve results comparable to structural deep models on the recently released Stanford Sentiment Treebank, without the necessity of generating parse trees.",
    "Finding minima of real-valued non-convex functions in high-dimensional spaces is a major scientific challenge. We show that some high-dimensional functions have a narrow range of values containing most of their critical points, unlike the wider range seen in low-dimensional cases. Our simulations support previous theoretical work on spin glasses that proves the existence of such a range as the dimension increases. Additionally, our experiments with teacher-student networks on the MNIST dataset reveal a similar phenomenon in deep networks. Finally, we observe that both gradient descent and stochastic gradient descent methods reach this range within the same number of steps.",
    "We've come up with a new way to look at photos using a fresh statistical model. The idea is that how local areas respond to a bunch of linear filters is described as jointly Gaussian with zero mean, and the covariance changes slowly over different parts of the image. We focus on optimizing these filters to minimize the nuclear norms of their local activations, which basically means we're encouraging a flexible form of sparsity without being tied to a specific dictionary or system.\n\nWhen we optimize the filters this way, they tend to be oriented and bandpass, and their responses are highly locally correlated. Interestingly, we found that you can almost perfectly reconstruct images just from these local filter response covariances, even when using low-rank approximations, without much visual loss or increase in MSE.\n\nThis approach looks really promising for various tasks like denoising, compressing images, and representing textures. It could also be a great foundation for more complex hierarchical decompositions in the future.",
    "Modern convolutional neural networks (CNNs) for object recognition typically use alternating convolution and max-pooling layers followed by a few fully connected layers. We question the necessity of these components and find that max-pooling can be replaced by a convolutional layer with increased stride, maintaining accuracy on several benchmarks. Building on this and other recent work, we propose a new architecture consisting solely of convolutional layers that achieves competitive or state-of-the-art performance on datasets like CIFAR-10, CIFAR-100, and ImageNet. We also introduce a new variant of the \"deconvolution approach\" for visualizing features learned by CNNs, applicable to a wider range of network structures.",
    "Artificial neural networks usually use a fixed, non-linear activation function for each neuron. We've come up with a new type of piecewise linear activation function that is learned separately for each neuron through gradient descent. This adaptive activation function lets us enhance deep neural network models that typically use static rectified linear units. As a result, we've achieved top-notch performance on datasets like CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics related to Higgs boson decay modes.",
    "This paper presents an innovative greedy parser predicated on neural networks, which utilizes an advanced compositional sub-tree representation. This parser and the compositional mechanism are jointly trained, exhibiting a high degree of interdependence. The composition mechanism yields a vector representation that encapsulates both syntactic (parsing tags) and semantic (words) information of sub-trees. The composition and tagging processes are executed over continuous (word or tag) representations via recurrent neural networks. Our methodology achieves F1 performance comparable to that of established parsers, while benefiting from enhanced speed due to the inherent efficiency of the greedy algorithm. We also offer a fully operational implementation of the described method.",
    "Implementing suitable lateral connections between the encoder and decoder in a denoising autoencoder (dAE) allows higher layers to concentrate on invariant representations. In regular autoencoders, detailed information must flow through the highest layers; lateral connections from the encoder to the decoder relieve this burden. This setup enables abstract invariant features to be effectively translated into detailed reconstructions by modulating the strength of the lateral connections. Experiments comparing three dAE structures (with modulated lateral connections, additive lateral connections, and no lateral connections) on real-world images demonstrate that modulated lateral connections 1) enhance the accuracy of the input probability model, as evidenced by improved denoising performance; 2) lead to a faster progression of invariance in representations towards the higher layers; and 3) promote the development of diverse invariant poolings.",
    "We have developed a new method for visualizing and improving the invariances in learned representations. Specifically, we examine a general form of invariance called linearization, where the effect of a transformation is restricted to a low-dimensional subspace. Using two reference images that typically differ due to some transformation, we generate a sequence of images that trace a minimal-length path between them in the representation space, known as a \"representational geodesic.\" If the representation linearizes the transformation between the reference images, this sequence should progressively reflect the transformation.\n\nWe apply this method to evaluate the invariance properties of a cutting-edge image classification network. Our findings indicate that the geodesics for image pairs differing by translation, rotation, and dilation do not correspond to their respective transformations. Our method also suggests a remedy for these failures. Applying this remedy, we demonstrate that the modified representation can effectively linearize various geometric image transformations.",
    "Genomics is transforming medical practice and research, offering insights into disease mechanisms and improving cancer therapies. Predicting patient outcomes from genomic data is crucial but challenging. This abstract explores the use of neural networks to predict cancer survival from genomic data and, using brain tumor data, shows benefits over current methods.",
    "Current methods for merging additive and multiplicative neural units typically involve either assigning operations in a predetermined manner or relying on discrete optimization to decide the neuron's function. This, however, significantly escalates the computational demands of the training process. We introduce an innovative, parameterizable transfer function inspired by non-integer functional iteration, enabling smooth and differentiable adjustments of each neuron's operation between addition and multiplication. This integration allows the addition/multiplication decision to be seamlessly incorporated into the conventional backpropagation training process.",
    "Training deep neural networks often faces challenges due to improper scaling between layers, leading to detrimental exploding or vanishing gradient problems. Traditionally, these issues have been managed through meticulous scale-preserving initialization. However, our research delves deeper into the significance of maintaining scale, or isometry, beyond just the initial weights. We introduce two innovative methods to uphold isometry: one precise and the other stochastic. Early experiments reveal that both determinant and scale normalization significantly accelerate the learning process. These findings underscore that preserving isometry is critical early in the learning phase, and consistently maintaining it results in faster, more efficient training.",
    "We've advanced Stochastic Gradient Variational Bayes to enable posterior inference for the weights in Stick-Breaking processes, leading to the creation of the Stick-Breaking Variational Autoencoder (SB-VAE). This novel, Bayesian nonparametric variant of the traditional variational autoencoder boasts a latent representation with dynamic dimensionality. Our experiments reveal that both the SB-VAE and its semi-supervised counterpart excel in learning highly discriminative latent representations, frequently surpassing the performance of Gaussian VAEs.",
    "Unsupervised learning on imbalanced data presents significant challenges, as current models frequently become dominated by the majority category, thereby neglecting those categories represented by smaller amounts of data. To address this issue, we have developed a latent variable model capable of managing imbalanced data by partitioning the latent space into a shared space and a private space. Utilizing Gaussian Process Latent Variable Models, we propose an innovative kernel formulation that facilitates the separation of latent space and derive an efficient method for variational inference. The efficacy of our model is demonstrated through its application to an imbalanced medical image dataset.",
    "Generative adversarial networks (GANs) are effective deep learning models. GANs work like a game between two players. To improve learning, the original objective function is adjusted to provide stronger gradients. We introduce a new algorithm that repeatedly estimates density ratios and minimizes f-divergence. This new method gives fresh insights into GANs and benefits from multiple perspectives on density ratio estimation, such as stable divergences and useful relative density ratios.",
    "This paper demonstrates the application of natural language processing (NLP) to classification problems in cheminformatics using SMILES for chemical representation. It addresses activity prediction against target proteins, essential for computer-aided drug design. The experiments reveal that this approach surpasses traditional methods and provides structural insights into decision-making.",
    "We present a neural network architecture and a learning algorithm designed to generate factorized symbolic representations. Our method involves learning concepts by observing sequential frames. All components of the hidden representation, except for a small set of discrete gating units, are predicted from the previous frame. These discrete gating units then exclusively represent the factors of variation in the next frame, corresponding to symbolic representations. We showcase the effectiveness of our approach using datasets consisting of faces undergoing 3D transformations and Atari 2600 games.",
    "We examine the eigenvalues of the Hessian of a loss function before and after training. The eigenvalue distribution consists of two parts: the bulk, which is concentrated around zero, and the edges, which are spread away from zero. We provide empirical evidence showing that the bulk reflects how over-parameterized the system is, while the edges depend on the input data.",
    "We present a parametric nonlinear transformation specifically designed to Gaussianize data from natural images. This process begins with a linear transformation of the data, followed by normalizing each component using a pooled activity measure. This measure is derived by exponentiating a weighted sum of the rectified and exponentiated components, plus a constant.\n\nTo optimize the transformation parameters—which include the linear transform, exponents, weights, and constant—we use a database of natural images and directly minimize the negentropy of the responses. The result is a transformation that significantly Gaussianizes the data, reducing the mutual information between transformed components more effectively than alternative methods such as Independent Component Analysis (ICA) and radial Gaussianization.\n\nOur transformation is fully differentiable and can be efficiently inverted, enabling it to induce a density model on images. Samples generated from this model closely resemble patches of natural images. We also demonstrate the utility of this model as a prior probability density for tasks such as additive noise removal.\n\nAdditionally, the transformation can be cascaded into multiple layers, with each layer optimized using the same Gaussianization objective. This provides an unsupervised approach to optimizing a deep network architecture.",
    "Approximate variational inference has emerged as a formidable technique for unraveling intricate and unknown probability distributions. Today’s cutting-edge breakthroughs empower us to construct probabilistic models for sequences that deftly harness spatial and temporal dynamics. By employing a Stochastic Recurrent Network (STORN), we can effectively decode robotic time series data. Our findings reveal a robust capability to detect anomalies, both in real-time and post-analysis.",
    "We create a general framework to train and test agents on efficiently gathering information. Our framework includes tasks where agents must search through partially-observed environments to find and assemble information to achieve goals. We use deep learning and reinforcement learning techniques to develop agents capable of completing these tasks. By combining external and internal rewards, we guide the agents' behavior. Our experiments show that these agents learn to actively and intelligently search for new information to reduce uncertainty and use the information they have already obtained effectively.",
    "We propose an enhancement to neural network language models that adapts their predictions based on recent history. Our model is a streamlined variant of memory-augmented networks, retaining past hidden activations as memory and accessing them via a dot product with the current hidden activation. This mechanism is highly efficient and scalable to extensive memory sizes. Furthermore, we establish a connection between external memory usage in neural networks and cache models employed in count-based language models. Through extensive experiments on various language model datasets, we demonstrate that our approach significantly outperforms recent memory-augmented networks.",
    "Inspired by recent advancements in generative models, we present a novel model that creates images based on natural language descriptions. This model works by progressively adding patches to a canvas, focusing on pertinent words in the description. After training on the Microsoft COCO dataset, we evaluated our model against several baseline generative models for both image generation and image retrieval tasks. Our results show that our model not only produces higher quality images than existing methods but also generates unique scene compositions that align with previously unseen captions in the dataset.",
    "We've come up with a way to train several neural networks at once. By using the tensor trace norm, we regularize the parameters across all the models so each network can borrow from the others if it makes sense—this is the main idea behind multi-task learning. Unlike many deep multi-task learning models where you have to decide ahead of time which layers will share parameters, our method takes a more flexible approach. It looks at all possible layers for sharing and figures out the best strategy based on the data.",
    "This paper presents an advanced actor-critic deep reinforcement learning agent that uses experience replay. The agent is stable, efficient with samples, and excels in challenging environments such as the 57-game Atari domain and various continuous control problems. Key innovations introduced in this paper include truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.",
    "We introduce an innovative framework for generating pop music, utilizing a hierarchical Recurrent Neural Network (RNN). This hierarchical structure is designed to incorporate our prior knowledge of pop music composition. Specifically, the lower layers of the network are responsible for generating the melody, while the upper layers handle the creation of drum patterns and chord progressions. Our human studies indicate a strong preference for music generated by our model compared to music produced by Google's recent method. Furthermore, we demonstrate two compelling applications of our framework: neural dancing and karaoke, and neural story singing.",
    "Many machine learning classifiers can be easily fooled by adversarial perturbations. These perturbations alter inputs in a way that misleads classifiers while remaining barely noticeable to human observers. We have implemented three methods to identify adversarial images. To evade our detectors, adversaries must produce less noticeable adversarial images; otherwise, they will be detected. Our most effective detection technique shows that adversarial images disproportionately emphasize the lower-ranked principal components in PCA. Additional detection methods and a detailed saliency map are provided in the appendix.",
    "We propose a method to create computationally efficient convolutional neural networks (CNNs) using low-rank representations of convolutional filters. Instead of approximating pre-trained filters, we learn small basis filters from scratch, which the network combines into complex filters during training. A novel weight initialization scheme allows effective weight initialization for convolutional layers with differently shaped filters. We validated our approach on existing CNN architectures using the CIFAR, ILSVRC, and MIT Places datasets, achieving similar or higher accuracy with significantly less compute.\n\nFor an improved VGG-11 network with global max-pooling, we attained comparable validation accuracy using 41% less compute and only 24% of the original parameters. Another variant showed a 1% increase in accuracy, achieving 89.7% top-5 center-crop validation accuracy while reducing computation by 16%. For GoogLeNet on ILSVRC, we achieved comparable accuracy with 26% less compute and 41% fewer parameters. For a near state-of-the-art CIFAR network, we achieved comparable accuracy with 46% less compute and 55% fewer parameters.",
    "Layer-sequential unit-variance (LSUV) initialization is a straightforward method for initializing weights in deep neural networks. This method involves two key steps. First, each convolutional or fully connected layer's weights are pre-initialized with orthonormal matrices. Second, progressing from the initial to the final layer, the output variance of each layer is normalized to one.\n\nExperiments with various activation functions, including maxout, ReLU-family, and tanh, demonstrate that LSUV initialization facilitates the training of very deep networks. These networks achieve test accuracies that are comparable to or better than those obtained using standard methods and are trained at least as quickly as complex schemes designed for very deep networks, such as FitNets (Romero et al., 2015) and Highway Networks (Srivastava et al., 2015).\n\nThe performance of LSUV initialization is evaluated on architectures such as GoogLeNet, CaffeNet, FitNets, and Residual networks. The results show state-of-the-art or near state-of-the-art performance on datasets like MNIST, CIFAR-10/100, and ImageNet.",
    "This paper expands on recent research by Kiperwasser & Goldberg (2016) that utilizes neural attention in a straightforward graph-based dependency parser. Our approach features a more extensive and rigorously regularized parser compared to other contemporary BiLSTM-based methods, employing biaffine classifiers to predict arcs and labels. Our parser attains state-of-the-art or near state-of-the-art performance across standard treebanks for six different languages, recording 95.7% UAS and 94.1% LAS on the widely-used English PTB dataset. This marks it as the highest-performing graph-based parser for this benchmark, surpassing Kiperwasser & Goldberg (2016) by 1.8% in UAS and 2.2% in LAS, and comparing closely to the top transition-based parser by Kuncoro et al. (2016), which achieves 95.8% UAS and 94.6% LAS. Additionally, we highlight the hyperparameter choices that significantly impacted parsing accuracy, enabling us to achieve substantial improvements over other graph-based methods.",
    "Understanding both the clear and hidden connections within data is crucial for machines to handle more complex reasoning tasks. Our Dynamic Adaptive Network Intelligence (DANI) model does just that through efficient weakly supervised learning. We're excited to share that DANI has achieved top-notch results on challenging question-answering tasks in the bAbI dataset, outperforming current representation learning methods (Weston et al., 2015).",
    "Spherical data appears in numerous applications. By representing the discretized sphere as a graph, we can handle non-uniform distributions, incomplete data, and dynamic sampling. Additionally, graph convolutions offer greater computational efficiency compared to spherical convolutions. Equivariance is crucial for leveraging rotational symmetries, so we explore achieving rotation equivariance using the graph neural network proposed by Defferrard et al. (2016). Our experiments demonstrate strong performance on tasks requiring rotation-invariant learning. You can find the code and examples at https://github.com/SwissDataScienceCenter/DeepSphere",
    "The high computational complexity of Convolutional Neural Networks (CNNs) poses a significant challenge for their widespread adoption, particularly in mobile devices. Hardware accelerators offer a promising solution to mitigate both execution time and power consumption. A critical component of accelerator development is the hardware-oriented approximation of models. This paper introduces Ristretto, a model approximation framework that evaluates CNNs based on the numerical resolution utilized for representing weights and outputs in both convolutional and fully connected layers. Ristretto optimizes models by employing fixed-point arithmetic and representation in place of traditional floating-point methods. Additionally, Ristretto fine-tunes the resultant fixed-point network. Under a maximum error tolerance of 1%, Ristretto effectively condenses CaffeNet and SqueezeNet to 8-bit precision. The Ristretto codebase is publicly available.",
    "The variety of painting styles offers a rich visual vocabulary for image creation. Our ability to learn and efficiently capture this visual vocabulary reflects our comprehension of the complex features of paintings, and possibly images at large. This study explores the development of a single, scalable deep network capable of efficiently encapsulating the artistic styles found in diverse paintings. We show that this network generalizes across various artistic styles by mapping each painting to a specific point in an embedding space. Notably, our model allows users to discover new painting styles by blending the styles learned from individual paintings. We aim for this work to advance the development of comprehensive painting models and provide insights into the structured representation of artistic styles.",
    "Sum-Product Networks (SPNs) are an expressive yet tractable class of hierarchical graphical models. LearnSPN, a structure learning algorithm for SPNs, employs hierarchical co-clustering to identify similar entities and features concurrently. The original LearnSPN algorithm assumes all variables are discrete and there is no missing data. We present MiniSPN, a practical and simplified version of LearnSPN, which runs faster and handles missing data and heterogeneous features, common in real-world applications. We demonstrate MiniSPN's performance on standard benchmark datasets and on two datasets from Google's Knowledge Graph that exhibit high rates of missingness and a mix of discrete and continuous features.",
    "Recent research has focused on improving the accuracy of deep neural networks (DNNs). For a given accuracy, multiple DNN architectures can be identified. Smaller DNNs with equivalent accuracy offer three main advantages: (1) reduced server communication during distributed training, (2) lower bandwidth for exporting models to autonomous cars, and (3) better feasibility for deployment on hardware with limited memory. We propose SqueezeNet, a small DNN architecture that achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Model compression techniques reduce SqueezeNet to less than 0.5MB (510x smaller than AlexNet). Download SqueezeNet here: https://github.com/DeepScale/SqueezeNet",
    "In this paper, we tackle the challenge of question answering that demands reasoning across multiple facts. We introduce the Query-Reduction Network (QRN), a variant of the Recurrent Neural Network (RNN) that handles both the short-term (local) and long-term (global) sequential dependencies necessary for multi-fact reasoning. QRN views context sentences as a sequence of state-altering triggers, progressively refining the original query into a more informed version as each context sentence is observed over time. Our experiments demonstrate that QRN achieves state-of-the-art performance on the bAbI QA and dialog tasks, as well as on a real-world goal-oriented dialog dataset. Additionally, QRN's design supports parallelization along the RNN's time axis, reducing training and inference time complexity by an order of magnitude.",
    "We propose a language-agnostic methodology for automatically generating sets of semantically similar clusters of entities, along with corresponding sets of \"outlier\" elements. This approach facilitates the intrinsic evaluation of word embeddings within the context of outlier detection tasks. Utilizing this framework, we have developed a gold-standard dataset named WikiSem500, and conducted evaluations on multiple state-of-the-art embeddings. The findings demonstrate a correlation between the performance on this dataset and the performance on sentiment analysis tasks.",
    "Recurrent neural networks (RNNs) have garnered considerable use in the domain of temporal data prediction due to their capacity to learn intricate sequential patterns through their intrinsic deep feedforward architectures. However, there is a growing consensus that top-down feedback could serve as a crucial yet missing component, which theoretically has the potential to resolve ambiguities between similar patterns by leveraging broader contextual information. In this research paper, we propose an innovative model called surprisal-driven recurrent networks. This model incorporates historical error data to inform its future predictions by perpetually assessing the deviation between its most recent predictions and the actual observed data. Our experimental results indicate that this approach significantly surpasses the performance of both stochastic and fully deterministic methods. Specifically, our model achieves a test performance of 1.37 bits per character (BPC) in the challenging enwik8 character-level prediction task.",
    "Generative Adversarial Networks (GANs) excel at producing cutting-edge results across various generative tasks, but they are notoriously unstable and prone to bypass critical data patterns. We assert that these adverse behaviors of GANs stem from the unique functional forms of trained discriminators in high-dimensional spaces, which can either stagnate the training process or misdirect probability mass, causing it to concentrate more densely than in the original data distribution. To counter these issues, we propose several methods for regularizing the objective function, significantly enhancing the stability of GAN training. Additionally, our regularizers ensure a more equitable distribution of probability mass among the data modes during the early training stages, offering a cohesive solution to the problem of missing modes.",
    "Learning policies with reinforcement learning for real-world tasks faces two major challenges: high sample complexity and ensuring safety. These challenges are especially pronounced when using complex models like deep neural networks. Model-based methods can help by using simulations to approximate the real world, thus supplementing real data with simulated data. However, differences between the simulation and the real world can hinder training.\n\nWe present the EPOpt algorithm, which addresses this by using multiple simulated environments and an adversarial training approach. This helps create policies that are robust and can adapt to various real-world conditions, including unexpected ones. Additionally, the algorithm adjusts the probability distribution over these simulated environments based on real-world data, using approximate Bayesian methods to make the distribution more accurate over time. This combination of model ensemble learning and adaptation improves both robustness and learning efficiency.",
    "Divnet represents an innovative approach to enhancing neural networks by leveraging a variety of neurons. Unlike traditional methods, Divnet employs a Determinantal Point Process (DPP) to prioritize neuronal diversity within each layer. This strategy involves selecting a diverse subset of neurons through the DPP and integrating redundant neurons into these selections. By doing so, Divnet not only captures neuronal diversity more effectively but also implicitly implements regularization, facilitating automatic fine-tuning of network architecture. This results in more compact networks without compromising performance. Furthermore, Divnet's emphasis on diversity and neuron integration ensures compatibility with other techniques aimed at minimizing the memory usage of networks. Our experimental findings support these assertions, demonstrating Divnet's clear advantage over other pruning methods in neural networks.",
    "The efficiency of graph-based semi-supervised algorithms depends on the graph of instances to which they are applied. These instances typically exist in a vector form before creating a graph to link them. The graph construction relies on a metric in the vector space to define the connection weights between entities. Commonly, this metric is a distance or similarity measure based on the Euclidean norm. We argue that, in some cases, the Euclidean norm in the initial vector space may not be the most effective for solving the task efficiently. Therefore, we propose an algorithm designed to learn the most suitable vector representation for building a graph, thereby enhancing task performance.",
    "Training Deep Neural Networks faces a significant challenge: preventing overfitting. To address this, various techniques like data augmentation and innovative regularizers such as Dropout have been developed, eliminating the need for massive datasets. In our work, we introduce a groundbreaking regularizer called DeCov, which drastically reduces overfitting, as evidenced by the difference between training and validation performance, and enhances generalization. DeCov promotes diverse, non-redundant representations in Deep Neural Networks by minimizing the cross-covariance of hidden activations. Although this straightforward concept has been explored in previous research, it has surprisingly never been utilized as a regularizer in supervised learning. Our experiments on multiple datasets and network architectures demonstrate that DeCov consistently reduces overfitting while almost always maintaining or boosting generalization performance, often surpassing the effectiveness of Dropout.",
    "Deep neural networks are often trained using stochastic, non-convex optimization methods that rely on gradient information derived from small subsets (batches) of the dataset. While it's widely acknowledged that batch size is a critical parameter for offline tuning, the advantages of online batch selection are not well understood. In our study, we explore online batch selection strategies for two leading stochastic gradient-based optimization techniques, AdaDelta and Adam. Since the overall loss function for the dataset is an aggregation of the loss functions of individual data points, it seems logical that data points with the highest losses should be selected more frequently in batches. However, the limitations of this assumption and the optimal regulation of selection pressure over time are still unresolved issues. We propose a straightforward strategy: rank all data points according to their most recent loss values, and the probability of each data point being selected decays exponentially with its rank. Our experiments on the MNIST dataset demonstrate that this batch selection method accelerates both AdaDelta and Adam by approximately a factor of five.",
    "We introduce a user-friendly method for semi-supervised learning with graph data using an efficient type of convolutional neural network that works directly on graphs. Our approach is inspired by a simplified version of spectral graph convolutions, focused on the local structure. Our model is highly efficient, scaling with the number of graph edges, and it learns hidden representations that capture both the local structure and node features. Through experiments on citation networks and a knowledge graph dataset, we show that our method significantly outperforms other similar techniques.",
    "We present the \"Energy-based Generative Adversarial Network\" (EBGAN) model, which conceptualizes the discriminator as an energy function. This function assigns low energies to areas near the data manifold and higher energies to other regions. Like probabilistic GANs, the generator in EBGAN is trained to produce samples with minimal energy, while the discriminator's goal is to assign high energies to these generated samples. This energy function perspective allows for a broader range of architectures and loss functions beyond the typical binary classifier with a logistic output. For instance, we demonstrate one EBGAN variant utilizing an auto-encoder architecture where the energy corresponds to the reconstruction error instead of the traditional discriminator. This approach leads to more stable training compared to regular GANs. Additionally, we show that a single-scale architecture can effectively generate high-resolution images.",
    "Recent research in the field of deep learning has yielded numerous new architectures. Concurrently, an increasing number of groups are applying deep learning to novel applications. Some of these groups, likely composed of inexperienced deep learning practitioners, may feel overwhelmed by the vast array of architectural options and consequently choose to use older architectures (such as AlexNet). In this work, we aim to address this gap by extracting the collective knowledge from recent deep learning research to uncover fundamental principles for designing neural network architectures. Additionally, we detail several architectural innovations, including the Fractal of FractalNet network, Stagewise Boosting Networks, and Taylor Series Networks. Our Caffe code and prototxt files are available at https://github.com/iPhysicist/CNNDesignPatterns. We hope this preliminary work inspires others to build upon it.",
    "Machine comprehension (MC) involves answering a question based on a provided context paragraph and requires understanding the intricate interactions between the context and the query. Recently, attention mechanisms have been effectively applied to MC. These methods typically employ attention to zero in on a specific part of the context, summarizing it with a fixed-size vector, coordinating attentions over time, and often creating a uni-directional attention. In this paper, we present the Bi-Directional Attention Flow (BIDAF) network, which employs a multi-stage hierarchical process to represent the context at varying levels of granularity. This model uses a bi-directional attention flow mechanism to generate a query-aware context representation without premature summarization. Our experiments demonstrate that our model achieves state-of-the-art performance on both the Stanford Question Answering Dataset (SQuAD) and the CNN/DailyMail cloze test.",
    "Despite significant advances, the challenges of model learning and posterior inference persist in the use of deep generative models, particularly when dealing with discrete hidden variables. This paper primarily addresses algorithms for training Helmholtz machines, which are distinguished by the integration of the generative model with an auxiliary inference model. A common limitation of previous learning algorithms is their indirect optimization of bounds on the desired marginal log-likelihood. In contrast, we introduce a novel class of algorithms based on the stochastic approximation (SA) theory of the Robbins-Monro type, which directly optimizes the marginal log-likelihood while simultaneously minimizing the inclusive KL-divergence. We term this new learning algorithm joint SA (JSA). Additionally, we develop an efficient Markov Chain Monte Carlo (MCMC) operator for JSA. Our empirical results on the MNIST dataset demonstrate that JSA consistently outperforms competing algorithms, such as RWS, in learning a variety of complex models.",
    "Detecting objects using deep neural networks typically involves processing thousands of candidate bounding boxes for each image. These boxes are closely related because they all come from the same image. In this study, we explore how to use patterns at the whole image level to streamline the neural network that processes these boxes. By identifying and removing units that show almost no activity across the image, we can significantly cut down the network's size. Our findings, based on the PASCAL 2007 Object Detection Challenge, reveal that we can eliminate up to 40% of units in certain fully-connected layers without significantly affecting the detection results.",
    "Amp up your machine learning solutions with Exponential Machines (ExM)! By modeling interactions between features, ExM takes performance to a whole new level across various domains like recommender systems and sentiment analysis. ExM is a game-changing predictor that captures interactions of every order, using an innovative Tensor Train (TT) format to represent a massive tensor of parameters efficiently. This approach not only streamlines the model but also gives you fine-grained control over the number of parameters. We've engineered a robust stochastic Riemannian optimization procedure, empowering us to handle tensors with a mind-blowing 2^160 entries. Our results speak for themselves: ExM hits state-of-the-art performance with high-order interactions in synthetic data and matches the prowess of high-order factorization machines on the MovieLens 100K recommender system dataset. Get ready to supercharge your models with ExM!",
    "We introduce Deep Variational Bayes Filters (DVBF), a method for unsupervised learning of latent Markovian state space models. Using Stochastic Gradient Variational Bayes for variational inference, DVBF handles nonlinear input data with temporal and spatial dependencies like image sequences, without domain knowledge. Experiments show that backpropagation through transitions enhances latent embedding and enables realistic long-term prediction.",
    "Traditional dialog systems used in goal-oriented applications require extensive domain-specific handcrafting, which hinders scaling to new domains. End-to-end dialog systems, where all components are trained directly from the dialogs, overcome this limitation. However, the recent success in chit-chat dialog may not translate to goal-oriented settings. This paper introduces a testbed to evaluate the strengths and weaknesses of end-to-end dialog systems in goal-oriented applications. Focusing on restaurant reservations, our tasks involve manipulating sentences and symbols to conduct conversations, issue API calls, and use their outputs. We demonstrate that an end-to-end dialog system based on Memory Networks can achieve promising, though imperfect, performance and learn to perform complex operations. We validate these results by comparing our system to a handcrafted slot-filling baseline using data from the second Dialog State Tracking Challenge (Henderson et al., 2014a). We observe similar result patterns on data from an online concierge service.",
    "Adversarial training regularizes supervised learning algorithms, and virtual adversarial training extends them to semi-supervised settings. However, both methods involve small perturbations to many entries of the input vector, which is unsuitable for sparse, high-dimensional inputs like one-hot word representations. We adapt adversarial and virtual adversarial training for text by perturbing word embeddings in a recurrent neural network instead of the original input. This method achieves state-of-the-art results on various benchmark semi-supervised and purely supervised tasks. Visualizations and analyses indicate improved quality of learned word embeddings and reduced overfitting during training. Code is available at https://github.com/tensorflow/models/tree/master/research/adversarial_text.",
    "Unsupervised learning of probabilistic models is a fundamental yet difficult challenge in machine learning. Crucial to addressing this challenge is the design of models that allow for tractable learning, sampling, inference, and evaluation. We enhance the range of such models by employing real-valued non-volume preserving (real NVP) transformations, which are powerful, invertible, and learnable transformations. This approach results in an unsupervised learning algorithm capable of exact log-likelihood computation, exact sampling, precise inference of latent variables, and an interpretable latent space. We showcase its effectiveness in modeling natural images across four datasets through sampling, log-likelihood evaluation, and latent variable manipulations.",
    "This paper investigates the view-manifold structure within the feature spaces of different layers in Convolutional Neural Networks (CNNs). It addresses several key questions: Does the learned CNN representation achieve viewpoint invariance? How is this invariance achieved—by collapsing the view manifolds or by separating them while preserving their structure? At which layer is view invariance attained? How can the structure of the view manifold at each layer of a deep CNN be quantified experimentally? Moreover, how does fine-tuning a pre-trained CNN on a multi-view dataset influence the representation at each layer of the network? To answer these questions, we propose a methodology to quantify the deformation and degeneracy of view manifolds across CNN layers. By applying this methodology, we obtain and report results that provide insights into these questions.",
    "Bilinear models offer richer representations than linear models and have achieved state-of-the-art performance in tasks like object recognition, segmentation, and visual question-answering. However, their high-dimensional nature limits their use in complex tasks. We introduce low-rank bilinear pooling using the Hadamard product as an efficient attention mechanism for multimodal learning. Our model surpasses compact bilinear pooling and achieves state-of-the-art results on the VQA dataset with better efficiency.",
    "Importance-weighted autoencoders are typically seen as maximizing a tighter lower bound on the marginal likelihood than the standard evidence lower bound. We offer a different perspective: this procedure optimizes the standard variational lower bound using a more complex distribution. We formally derive this, present a tighter lower bound, and visualize the implicit importance-weighted distribution.",
    "We derive a generalization bound for feedforward neural networks based on the product of the layers' spectral norm and weights' Frobenius norm, using PAC-Bayes analysis.",
    "In this paper, we propose an enhancement to Generative Adversarial Networks (GANs) by enabling them to produce direct energy estimates for samples. Specifically, we introduce a flexible adversarial training framework and demonstrate that this framework not only ensures that the generator converges to the true data distribution but also allows the discriminator to retain density information at the global optimum. We derive the analytic form of the induced solution and analyze its properties. To render the proposed framework practically trainable, we incorporate two effective approximation techniques. Empirically, our experimental results closely align with our theoretical analysis, thereby verifying that the discriminator is capable of recovering the energy of the data distribution.",
    "In this study, we jazz up outlier detection by using ensembles of neural networks created through a variational approximation in a Bayesian setup. We tune the variational parameters by sampling from the true posterior using the magic of gradient descent. Our findings reveal that our outlier detection is right up there with the best, holding its own against other top-notch ensembling techniques.",
    "We propose two straightforward methods to reduce parameters and accelerate training in large LSTM networks. The first method uses \"matrix factorization by design\" to decompose the LSTM matrix into the product of two smaller matrices. The second method partitions the LSTM matrix, its inputs, and states into independent groups. These approaches enable significantly faster training of large LSTM networks, achieving near state-of-the-art perplexity with substantially fewer RNN parameters.",
    "In this work, we unveil striking new phenomena unearthed during the training of residual networks. Our aim is to gain deeper insights into the essence of neural networks by dissecting these fresh empirical findings. These intriguing behaviors emerged through the use of Cyclical Learning Rates (CLR) and linear network interpolation. Among the surprises were unexpected fluctuations in training loss and periods of extraordinarily fast training. For instance, we show that CLR can outperform traditional training methods in terms of testing accuracy, even when employing large learning rates. You can replicate our results using the files available at https://github.com/lnsmith54/exploring-loss.",
    "Machine learning models often face different constraints during testing compared to training. For instance, a computer vision model on an embedded device might need to work in real-time, or a translation model on a cell phone might need to limit its compute time to save power. In this study, we present a mixture-of-experts model and demonstrate how to adjust its resource usage for each input using reinforcement learning. We validate our method with a simple example based on MNIST.",
    "Adversarial examples pose a serious threat across various deep learning architectures. Remarkably, deep reinforcement learning has yielded impressive results in training agent policies directly from raw inputs like image pixels. In this groundbreaking paper, we delve into adversarial attacks targeting deep reinforcement learning policies. Our comprehensive study demonstrates the superior effectiveness of adversarial examples over random noise in crippling these policies. We introduce an innovative technique that minimizes the frequency of adversarial example injections necessary for a successful attack, leveraging the value function. Additionally, we offer insights into how re-training on random noise and FGSM perturbations can significantly bolster resilience against such adversarial threats. This research not only highlights the vulnerability of current systems but also provides actionable strategies to fortify them against sophisticated attacks.",
    "This paper introduces Variational Continual Learning (VCL), a versatile and robust framework designed for continual learning. VCL seamlessly integrates online variational inference (VI) with recent advancements in Monte Carlo VI for neural networks. This framework effectively trains both deep discriminative and generative models, even in dynamic, evolving continual learning scenarios where existing tasks change and new tasks arise. Experimental results demonstrate that VCL surpasses state-of-the-art continual learning methods across various tasks, mitigating catastrophic forgetting autonomously.",
    "Determining the best size for a neural network usually involves costly searches and training many different networks from scratch. In this paper, we tackle the problem of finding a good network size in just one training cycle. We introduce *nonparametric neural networks*, a straightforward method for optimizing network sizes. We prove our approach works well when we control network growth using an L_p penalty. During training, we continuously add new units and remove unnecessary ones with an L_2 penalty. We use a new optimization algorithm called *adaptive radial-angular gradient descent* (or *AdaRad*), and our results are promising.",
    "The Natural Language Inference (NLI) task requires determining the logical relationship between a natural language premise and hypothesis. We present Interactive Inference Network (IIN), a novel neural network architecture that hierarchically extracts semantic features from interaction spaces for high-level understanding of sentence pairs. An interaction tensor (attention weight) holds semantic information crucial for solving NLI, with denser tensors containing richer information. One such architecture, Densely Interactive Inference Network (DIIN), achieves state-of-the-art performance on large-scale NLI corpora. Notably, DIIN reduces errors by over 20% on the challenging Multi-Genre NLI (MultiNLI) dataset compared to the best published system.",
    "Get excited because we're tackling one of the biggest challenges in deploying neural networks in real-world, safety-critical systems: the pesky adversarial examples! These slight input perturbations can throw off a network's classification, but we've got groundbreaking news. In the past few years, numerous techniques have been explored to enhance robustness against these adversarial examples. The twist? Many of these defenses have quickly crumbled under subsequent attacks – in fact, over half of the proposed defenses at ICLR 2018 have already been compromised. But don't worry, we have a game-changing solution: formal verification techniques! We're thrilled to reveal that we've developed a method to construct provably minimally distorted adversarial examples. With this revolutionary approach, we can take any neural network and input sample and create adversarial examples that we guarantee are minimally distorted. Even better, our approach shows that the recent ICLR defense strategy of adversarial retraining significantly amplifies the distortion required for crafting adversarial examples by an astounding factor of 4.2! Let's transform the future of neural networks together!",
    "Sure! Here's a more dynamic version of the text:\n\n\"We've taken Stochastic Gradient Variational Bayes to the next level by using it for posterior inference of Stick-Breaking process weights. This breakthrough lets us introduce the Stick-Breaking Variational Autoencoder (SB-VAE)—a cutting-edge, Bayesian nonparametric twist on the variational autoencoder. What sets the SB-VAE apart is its ability to use a latent representation with a stochastic dimensionality. Our experiments reveal that both the SB-VAE and its semi-supervised variant excel in learning highly discriminative latent representations, frequently surpassing the performance of traditional Gaussian VAEs.\"",
    "We introduce a method for training multiple neural networks at the same time. By using the tensor trace norm, we encourage each network to reuse parameters from the others, which is the core idea of multi-task learning. Unlike many existing models, we don't set a fixed parameter sharing strategy. Instead, our approach allows all layers to potentially share parameters, and the sharing strategy is determined based on the data.",
    "This paper unveils a groundbreaking actor-critic deep reinforcement learning agent with experience replay, exhibiting unparalleled stability, sample efficiency, and exceptional performance across formidable environments, notably the discrete 57-game Atari domain and multiple continuous control challenges. Key innovations feature truncated importance sampling with bias correction, revolutionary stochastic dueling network architectures, and an advanced trust region policy optimization method.",
    "Many machine learning classifiers are susceptible to adversarial perturbations, which subtly alter inputs to change classifier predictions without noticeable differences to human perception. We use three methods to detect these adversarial images. Adversaries must reduce the pathology of their images to evade our detectors, or they will fail. Our top detection method shows that adversarial images disproportionately emphasize lower-ranked principal components from PCA. Additional detectors and a colorful saliency map are provided in the appendix.",
    "We introduce a robust method for kernel learning grounded in a Fourier-analytic framework that characterizes translation-invariant or rotation-invariant kernels. This technique generates a succession of feature maps that iteratively enhance the SVM margin. We offer solid guarantees for both optimality and generalization, and interpret our algorithm as online equilibrium-finding dynamics within a specific two-player min-max game. Extensive evaluations on both synthetic and real-world datasets reveal that our approach scales effectively and consistently outperforms existing methods based on random features.",
    "Current top reading comprehension models use recurrent neural networks (RNNs). These networks work well with language because they process information in order. However, they can't handle multiple pieces of information at once, causing delays, especially with longer texts. This makes them less suitable for scenarios where quick responses are needed. We propose a new method using a convolutional architecture instead of RNNs. By using dilated convolutions, we achieve similar results to the best models in two question-answering tasks, but with much faster processing speeds.",
    "This report serves several purposes. First, it investigates the reproducibility of the paper \"On the regularization of Wasserstein GANs\" (2018). Second, it reproduces five key aspects from the paper's experiments: learning speed, stability, robustness against hyperparameters, accuracy in estimating the Wasserstein distance, and different sampling methods. Finally, it identifies which contributions can be reproduced and the associated resource costs. All source code for reproduction is publicly available.",
    "Variational Autoencoders (VAEs) started out (thanks to Kingma & Welling, 2014) as fancy probabilistic models where you do some complex Bayesian inference stuff. Then along came $\\beta$-VAEs (shoutout to Higgins et al., 2017) which changed the game. They expanded VAEs beyond just generative modeling to cool stuff like representation learning, clustering, and lossy data compression by letting you balance between the info content (\"bit rate\") of the latent representation and the distortion of the reconstructed data (thanks Alemi et al., 2018). \n\nIn this paper, we take a fresh look at this rate/distortion trade-off but for hierarchical VAEs, which are VAEs with multiple layers of latent variables. We found a way to split the rate into parts for each layer, so you can tweak them separately. We also figured out some theoretical limits on how well these layers perform in downstream tasks and backed up our theories with big experiments. Our results give useful tips for practitioners on where to aim in the rate-space for whatever you're working on.",
    "Methods for learning representations of nodes within a graph are pivotal in network analysis because they facilitate a broad range of subsequent learning tasks. In this context, we introduce Graph2Gauss, an innovative approach designed to efficiently learn versatile node embeddings on large-scale (attributed) graphs, exhibiting robust performance in tasks such as link prediction and node classification. Unlike most existing methods that depict nodes as point vectors within a low-dimensional continuous space, our approach embeds each node as a Gaussian distribution. This unique representation enables the capture of uncertainty regarding the node's representation.\n\nMoreover, we propose an unsupervised method that adeptly handles inductive learning scenarios and is versatile across various types of graphs, including plain/attributed and directed/undirected graphs. By leveraging both the network structure and the supplementary node attributes, our approach can generalize to previously unseen nodes without the need for further training. To derive these embeddings, we employ a personalized ranking framework concerning the node distances, which utilizes the natural ordering imposed by the network structure on the nodes.\n\nExtensive experiments conducted on real-world networks validate the superior performance of our method, demonstrating that it outperforms state-of-the-art network embedding techniques across several tasks. Additionally, the advantages of incorporating uncertainty into the model are evident. Through the analysis of this uncertainty, we can estimate the diversity within neighborhoods and detect the intrinsic latent dimensionality of a graph.",
    "This study investigates the application of self-ensembling techniques to visual domain adaptation challenges. Our method is based on the mean teacher model (Tarvainen et al., 2017) within the temporal ensembling framework (Laine et al., 2017), which has previously set new benchmarks in semi-supervised learning. We propose several modifications to enhance their approach for complex domain adaptation scenarios and assess its performance. Our technique demonstrates state-of-the-art results across various benchmarks, including our winning submission to the VISDA-2017 visual domain adaptation competition. In smaller image benchmarks, our method not only surpasses existing techniques but also achieves accuracy levels comparable to supervised classifiers.",
    "Machine learning models, including popular ones like deep neural networks, can be easily tricked by what's called adversarial examples. These are inputs that have tiny, deliberate changes that make the model produce wrong results, even though they look normal to humans. This paper isn't about creating a new method to avoid such problems; instead, it dives deep into understanding why these tricky inputs work. By using concepts from topology (a branch of math), we explain why a classifier (think of it as a decision-making model) can be fooled by adversarial examples. We compare the classifier ($f_1$) to an oracle ($f_2$), which could be something like human judgment, to find out how they're related. Our study lays out the exact conditions that make a classifier either weak or strong against these adversarial tricks, according to the human-like oracle. Surprisingly, we found that even a single unnecessary feature in the data can make the classifier not strong-robust. So, learning the right features is crucial for building a model that is both accurate and strong-robust against adversarial examples.",
    "We're excited to introduce a fun and engaging way to train and test the ability of agents to gather information efficiently. We've created a series of tasks where success means exploring a partially-observed environment to find bits of information that can be pieced together to achieve various goals. By blending deep learning architectures with reinforcement learning techniques, we've developed agents capable of solving these tasks. To guide their behavior, we use a mix of extrinsic and intrinsic rewards. Our experiments show that these agents learn to actively and smartly search for new information to reduce uncertainty, while making the most of the information they already have.",
    "We're introducing an enhancement to neural network language models that allows them to adjust their predictions based on recent history. Our approach simplifies memory-augmented networks by storing past hidden activations as memory and retrieving them via a dot product with the current hidden activation. This method is highly efficient and can handle large memory sizes. Additionally, we highlight similarities between the use of external memory in neural networks and cache models used in count-based language models. Our testing on various language model datasets shows that our method significantly outperforms current memory-augmented networks.",
    "Generative adversarial networks (GANs) are highly successful deep generative models that rely on a two-player minimax game framework. However, the original objective function is modified to ensure stronger gradients during the generator's learning process. We propose a novel algorithm that alternates between density ratio estimation and f-divergence minimization. This algorithm provides a fresh perspective on understanding GANs and leverages insights from density ratio estimation research, such as the stability of different divergences and the utility of relative density ratios.",
    "We introduce an innovative framework for the generation of pop music. Our model employs a hierarchical Recurrent Neural Network, meticulously designed to encapsulate our extensive understanding of pop music composition through its layered architecture. Specifically, the basal layers are dedicated to melody creation, whereas the superior layers are responsible for the generation of drums and chords. Through a series of rigorous human studies, our generated music has demonstrated a marked preference over the outputs from Google's latest methodology. Furthermore, we showcase two intriguing applications of our framework: neural dancing and karaoke, and neural story singing.",
    "We analyze the eigenvalues of the Hessian of a loss function before and after training. The eigenvalue distribution comprises two distinct parts: the bulk, centered around zero, and the edges, dispersed away from zero. Our empirical evidence reveals that the bulk reflects the degree of over-parameterization in the system, while the edges are influenced by the input data.",
    "In this paper, we unveil an innovative feature extraction method designed to analyze program execution logs. Our approach begins with the automatic extraction of intricate patterns from a program's behavior graph. These patterns are then transformed into a continuous space via an autoencoder. We put our proposed features to the test on a real-world malicious software detection task, finding that the embedding space not only enhances detection but also reveals meaningful, interpretable structures within the pattern space.",
    "We evaluated the efficiency of the FlyHash model, which is an insect-inspired sparse neural network (Dasgupta et al., 2017), against comparable but non-sparse models in an embodied navigation task. This task involves a model controlling steering by comparing current visual inputs with memories stored from a training route. Our findings indicate that the FlyHash model outperforms the others, particularly in terms of data encoding efficiency.",
    "In the peer review process, reviewers are usually tasked with assigning scores to the papers they evaluate. These scores play a crucial role in aiding Area Chairs or Program Chairs in making decisions. To facilitate reviewers in quantifying their opinions despite cognitive limitations, the scores are typically provided in a quantized format. However, this approach often results in a significant number of tied scores, leading to considerable information loss. To address this issue, some conferences now also request reviewers to rank the papers they review in addition to scoring them.\n\nThis dual-scoring system introduces two main challenges. Firstly, there is no standardized method for using the ranking information, leading Area Chairs to handle it differently, or sometimes not at all, which can inject arbitrariness into the review process. Secondly, there are no effective interfaces or methods to incorporate the ranking information into existing workflows, causing inefficiencies.\n\nWe propose a systematic approach to integrate the ranking data with the traditional scores. Our method generates an updated score for each reviewed paper that includes the ranking information. This approach tackles the two primary challenges by (i) ensuring consistent incorporation of rankings into the scores for all papers to eliminate arbitrariness, and (ii) allowing the continued use of existing interfaces and workflows optimized for scores.\n\nOur method has been empirically evaluated on synthetic datasets and peer reviews from the ICLR 2017 conference. Results show a reduction in error by approximately 30% compared to the best-performing baseline on the ICLR 2017 data, demonstrating its effectiveness.",
    "Recent studies have explored status bias in academic peer review. This article analyzes the impact of author metadata on area chairs' decisions using a database of 5,313 borderline submissions to ICLR from 2017 to 2022. Employing a cause-and-effect analysis within Neyman and Rubin's potential outcomes framework, we found weak evidence linking author metadata to final decisions. Moreover, under additional assumptions, submissions from top-30% or top-20% institutions were less favored compared to their peers. This result held across two matched designs (odds ratio = 0.82 [95% CI: 0.67 to 1.00] and 0.83 [95% CI: 0.64 to 1.07]). We discussed these findings in the context of interactions between study units and peer-review agents.",
    "We introduce a variational approximation to the information bottleneck method established by Tishby et al. (1999). This novel approach enables the parameterization of the information bottleneck model through neural networks, incorporating the reparameterization trick to facilitate efficient training. We designate this method as the \"Deep Variational Information Bottleneck\" (Deep VIB). Our empirical evidence demonstrates that models trained using the VIB objective surpass those trained with alternative regularization techniques in terms of generalization performance and resilience against adversarial attacks.",
    "Attention networks have revolutionized deep neural networks by effectively embedding categorical inference. However, many tasks require capturing richer structural dependencies without sacrificing the benefits of end-to-end training. In this study, we enhance deep networks by integrating richer structural distributions through graphical models. Our structured attention networks build upon the basic attention mechanism, enabling us to extend attention beyond standard soft-selection to include partial segmentations and subtrees.\n\nWe introduce two innovative structured attention network classes: linear-chain conditional random fields and graph-based parsing models, detailing practical implementation as neural network layers. Experimental results demonstrate that these networks not only effectively incorporate structural biases but also significantly outperform baseline attention models in various synthetic and real-world tasks such as tree transduction, neural machine translation, question answering, and natural language inference. Moreover, models trained using this approach discover intriguing unsupervised hidden representations, showcasing the potential to generalize traditional attention mechanisms.",
    "We propose using a group of diverse specialists, each defined based on the confusion matrix. We noticed that adversarial instances from a specific class tend to be mislabeled into a small group of incorrect classes. Thus, we believe that a team of specialists can better identify and reject misleading instances by showing high disagreement in their decisions when faced with adversarial examples. Our experimental results support this idea and suggest that enhancing the system's ability to reject these instances can make it more robust against adversarial attacks, rather than attempting to classify them correctly at all costs.",
    "In this paper, we introduce Neural Phrase-based Machine Translation (NPMT). Our method uses Sleep-WAke Networks (SWAN) to clearly model phrase structures in the output text. To address SWAN's need for monotonic alignment, we add a new layer that allows for flexible (soft) local rearrangement of input sequences. Unlike other neural machine translation (NMT) methods that rely on attention-based decoding, NPMT outputs phrases directly in order and works in linear time. Our experiments show that NPMT performs better than strong NMT baselines on the IWSLT 2014 German-English/English-German and IWSLT 2015 English-Vietnamese translation tasks. We also found that our method generates meaningful phrases in the output languages.",
    "We introduce LR-GAN, an adversarial image generation model that incorporates scene structure and context. Unlike previous generative adversarial networks (GANs), LR-GAN learns to generate image backgrounds and foregrounds separately and recursively, seamlessly integrating the foregrounds with the background to produce a cohesive and natural image. For each foreground, the model independently generates its appearance, shape, and pose. The entire model operates in an unsupervised manner and is trained end-to-end using gradient descent methods. Our experiments show that LR-GAN produces more natural images with objects that are more easily recognizable by humans than those generated by DCGAN.",
    "We present a simple method that enables an agent to learn about its environment in an unsupervised manner. Our approach involves two versions of the same agent, named Alice and Bob, competing against each other. Alice proposes a task, and Bob tries to complete it. In this work, we focus on two types of environments: (nearly) reversible environments and environments that can be reset. Alice sets the task by performing a sequence of actions, which Bob must either undo or replicate, depending on the environment. Through a suitable reward system, Alice and Bob create a curriculum of exploration that facilitates unsupervised training of the agent. When Bob is later used for a reinforcement learning task within the environment, this unsupervised training reduces the number of supervised episodes required, and in some cases, leads to higher rewards.",
    "Maximum entropy modeling is a flexible and popular framework for creating statistical models using partial knowledge. Instead of the traditional method of directly optimizing over the continuous density, this paper introduces a new approach. We learn a smooth and invertible transformation that maps a simple distribution to the desired maximum entropy distribution. This process is complex because the objective (entropy) depends on the density itself. By using recent advancements in normalizing flow networks, we transform the maximum entropy problem into a finite-dimensional constrained optimization problem. We then solve this using a combination of stochastic optimization and the augmented Lagrangian method. Simulation results show our method is effective, and applications in finance and computer vision demonstrate the flexibility and accuracy of using maximum entropy flow networks.",
    "As machine learning continues to conquer new and formidable challenges on a daily basis, the vision of achieving general AI begins to seem within reach. Despite this, the bulk of current research is dedicated to pivotal yet specialized tasks like image classification and machine translation. We attribute this focus to the absence of objective metrics for tracking progress towards broad machine intelligence. To address this gap, we introduce a comprehensive set of criteria for general AI, along with a streamlined platform designed to evaluate machines on their ability to meet these criteria, minimizing any additional complexities.",
    "Neural networks designed to process graph structures are particularly suited for a range of tasks across different domains, such as natural language processing (e.g., parse trees) and cheminformatics (e.g., molecular graphs). Nevertheless, because each input results in a computation graph with a unique shape and size, these networks do not naturally accommodate batched training or inference. This also poses a challenge when implementing them in widely-used deep learning libraries that rely on static data-flow graphs. We introduce a method known as dynamic batching, which enables the batching of operations not only across various input graphs with different shapes but also across different nodes within a single graph. This approach allows the creation of static graphs that simulate dynamic computation graphs of any shape and size using popular deep learning libraries. Additionally, we present a high-level library containing compositional blocks, which simplifies the construction of dynamic graph models. Employing this library, we demonstrate concise, batch-wise parallel implementations for a variety of models from existing literature.",
    "Deep learning models have shown great success in natural language processing, but their decision-making processes remain largely opaque. Consequently, these models are often viewed as black boxes, offering little insight into the underlying patterns they learn. In this paper, we focus on Long Short Term Memory networks (LSTMs) and introduce a novel method for tracing the influence of a specific input on the LSTM’s output. By pinpointing consistently crucial word patterns, we are able to distill cutting-edge LSTMs used for sentiment analysis and question answering into a collection of representative phrases. We then quantitatively validate this representation by using the extracted phrases to build a straightforward, rule-based classifier that approximates the LSTM’s performance.",
    "Deep reinforcement learning has seen impressive results but struggles with sparse rewards and long horizons. To address this, we propose a framework that first learns useful skills in a pre-training environment and then uses these skills for faster learning in downstream tasks. Our approach combines intrinsic motivation and hierarchical methods, with skill learning guided by a single proxy reward needing minimal domain knowledge. A high-level policy is then trained on these skills, improving exploration and tackling sparse rewards. We use Stochastic Neural Networks with an information-theoretic regularizer for efficient pre-training. Experiments show this method effectively and sample-efficiently learns a wide range of skills, boosting learning performance across various downstream tasks.",
    "Deep generative models, including Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), have seen impressive success and are often studied independently as distinct paradigms. This paper aims to connect GANs and VAEs through a new formulation, interpreting sample generation in GANs as posterior inference. It reveals that GANs and VAEs minimize KL divergences of their respective posterior and inference distributions in opposite directions, extending the two learning phases of the classic wake-sleep algorithm. This unified view helps analyze various existing model variants and transfer techniques across research lines effectively. For instance, the importance weighting method from VAE literature is applied to improve GAN learning, and VAEs are enhanced with an adversarial mechanism using generated samples. Experiments demonstrate the generality and effectiveness of these transferred techniques.",
    "We address the challenge of detecting out-of-distribution images in neural networks and present ODIN, a straightforward and powerful method that doesn’t require altering a pre-trained network. ODIN leverages temperature scaling and tiny input perturbations to distinguish between in-distribution and out-of-distribution images based on their softmax scores. Through various experiments, we demonstrate that ODIN works well with different network architectures and datasets, consistently outperforming standard methods by a significant margin. Notably, ODIN reduces the false positive rate from 34.7% to 4.3% on DenseNet when tested with the CIFAR-10 dataset, while maintaining a 95% true positive rate.",
    "We introduce a fresh approach for unsupervised learning of representations inspired by the infomax principle, tailored for large-scale neural networks. By leveraging an asymptotic approximation of Shannon's mutual information in extensive neural populations, we show that a promising initial estimate of the global information-theoretic optimum is achievable through a hierarchical infomax strategy. Building on this initial solution, we propose an effective gradient descent algorithm to derive representations from input datasets. This method is versatile, working with complete, overcomplete, and undercomplete bases. Our numerical experiments confirm the robustness and high efficiency of this approach in identifying key features from input datasets. Compared to existing methods, our algorithm stands out for its faster training speed and enhanced robustness in unsupervised representation learning. Additionally, this method can be seamlessly extended to both supervised and unsupervised models for training deep structural networks.",
    "Recurrent Neural Networks (RNNs) have consistently demonstrated exceptional performance in sequence modeling tasks. Nonetheless, training RNNs on extended sequences frequently encounters challenges such as slow inference, vanishing gradients, and difficulties in capturing long-term dependencies. These issues are closely tied to the expansive, sequential computational graph that arises from unfolding the RNN over time during backpropagation through time settings. We present the Skip RNN model, an enhancement of existing RNN models that learns to skip state updates, thereby reducing the effective size of the computational graph. Additionally, the model can be incentivized to perform fewer state updates through the implementation of a budget constraint. We evaluate the proposed model across various tasks, illustrating its capability to decrease the number of necessary RNN updates while maintaining, and in some cases enhancing, the performance of baseline RNN models. The source code is publicly accessible at https://imatge-upc.github.io/skiprnn-2017-telecombcn/.",
    "Restart techniques are commonly employed in gradient-free optimization to address multimodal functions. In gradient-based optimization, partial warm restarts have been gaining traction to enhance convergence rates within accelerated gradient schemes, specifically for ill-conditioned functions. In this paper, we introduce a straightforward warm restart technique for stochastic gradient descent aimed at improving its performance during deep neural network training. We conduct an empirical evaluation on the CIFAR-10 and CIFAR-100 datasets, achieving new state-of-the-art results of 3.14% and 16.21%, respectively. Additionally, we highlight the technique's efficacy on a dataset of EEG recordings and a downsampled version of the ImageNet dataset. Our source code is accessible at https://github.com/loshchil/SGDR.",
    "Policy gradient methods have been highly successful in addressing difficult reinforcement learning problems. Nevertheless, they frequently encounter large variance issues in policy gradient estimation, resulting in poor sample efficiency during training. In this work, we introduce a control variate method to effectively reduce variance in policy gradient methods. Inspired by Stein's identity, our method enhances previous control variate techniques used in REINFORCE and advantage actor-critic by incorporating more general action-dependent baseline functions. Empirical studies demonstrate that our method substantially enhances the sample efficiency of leading policy gradient approaches.",
    "The introduction of skip connections has revolutionized the training of very deep neural networks, making it feasible and leading to their widespread adoption in various neural network architectures. Despite their success, a comprehensive explanation for why skip connections are so effective has remained unclear. In this work, we offer a new perspective on the benefits of skip connections when training deep networks.\n\nOne major challenge in training deep networks is the presence of singularities arising due to the model's non-identifiability. Previous research has identified several types of such singularities: \n\n1. Overlap singularities, which result from the nodes' permutation symmetry within a layer.\n2. Elimination singularities, which occur when nodes are consistently deactivated.\n3. Singularities caused by the linear dependence among nodes.\n\nThese singularities introduce degenerate manifolds in the loss landscape, impeding the learning process. We propose that skip connections mitigate these singularities by disrupting the permutation symmetry of nodes, reducing the likelihood of node elimination, and decreasing linear dependence among nodes. Additionally, typical initializations of networks with skip connections help to avoid these problematic singularities and reshape the loss landscape, thereby easing the learning slowdown.\n\nOur hypotheses are supported by evidence from both simplified models and experiments on deep networks trained with real-world datasets, demonstrating the efficacy of skip connections in overcoming these training difficulties.",
    "As part of the ICLR 2018 Reproducibility Challenge, we set out to replicate the findings from the paper \"Natural Language Inference over Interaction Space,\" which was submitted to the ICLR 2018 conference. At first, we didn't realize that the code for the paper was available, so we ambitiously began crafting the network from the ground up. After some intense effort, we evaluated our handmade version of the model on the Stanford NLI dataset and achieved a respectable 86.38% accuracy on the test set. The original paper boasts an 88.0% accuracy, and we believe the gap stems from variations in optimizers and model selection techniques.",
    "We have successfully implemented the \"Learn to Pay Attention\" model for attention mechanisms in convolutional neural networks, replicating the original paper's results in both image classification and fine-grained recognition categories.",
    "Crafting universal distributed representations of sentences is a cornerstone of natural language processing. We've developed an innovative method to generate these representations by encoding the suffixes of word sequences within sentences, and we fine-tuned this technique using the Stanford Natural Language Inference (SNLI) dataset. Our approach has proven its strength, outperforming current methods across various transfer tasks when assessed with the SentEval benchmark.",
    "In various neural network models, enhancing representations is achieved by incorporating new features formed as polynomial functions of the existing ones. Taking the natural language inference task as an illustrative case, we explore the impact of employing scaled polynomial features of degree 2 and higher for matching purposes. Our findings indicate that scaling features of degree 2 significantly boosts performance, resulting in a 5% reduction in classification error in the top-performing models.",
    "We present a generalization bound for feedforward neural networks based on the product of the layers' spectral norms and the weights' Frobenius norms, derived using PAC-Bayes analysis.",
    "In this work, we explore Batch Normalization and introduce a new probabilistic interpretation. We develop a probabilistic model showing that Batch Normalization maximizes the lower bound of its marginalized log-likelihood. Using this model, we design an algorithm that remains consistent during training and testing, but suffers from computational inefficiency in inference. To address this, we propose Stochastic Batch Normalization, an efficient approximation that reduces memory and computational costs while offering scalable uncertainty estimation. Our experiments on popular architectures, including VGG-like and ResNets, demonstrate the effectiveness of Stochastic Batch Normalization on MNIST and CIFAR-10 datasets.",
    "Many people think that deep convolutional networks are successful because they gradually filter out unimportant variations in the input data. This idea is backed by how hard it is to reconstruct images from their hidden layers in most common network setups. In this paper, we show that losing information isn't actually necessary to create representations that work well for complex tasks like ImageNet. We introduce the i-RevNet, a network built from a series of homeomorphic layers, which can be fully reversed up to the final class projection, meaning no information is lost. Building an invertible network is challenging partly because local inversion is tricky, but we solve this by providing an explicit inverse. Our analysis of i-RevNet's learned representations suggests that the success of deep networks could be due to a progressive tightening and linear separation with depth. To better understand the model learned by i-RevNet, we also reconstruct linear interpolations between natural image representations.",
    "Deep latent variable models serve as robust instruments for representation learning. In this paper, we adopt the deep information bottleneck model, identify its limitations, and propose an enhanced model that addresses these deficiencies. We incorporate a copula transformation to restore the invariance properties of the information bottleneck method, which facilitates the disentanglement of features in the latent space. Consequently, this transformation fosters sparsity within the latent space of the new model. We rigorously evaluate our approach using both synthetic and real-world datasets.",
    "We present an enhanced variant of the MAC model (Hudson and Manning, ICLR 2018) featuring a streamlined set of equations that maintains comparable accuracy and offers accelerated training times. Both models are evaluated on the CLEVR and CoGenT datasets, demonstrating that transfer learning with fine-tuning yields a 15-point increase in accuracy, thereby achieving state-of-the-art performance. Conversely, we also illustrate that inadequate fine-tuning can significantly diminish a model's accuracy.",
    "Adaptive Computation Time for Recurrent Neural Networks (ACT) stands out as one of the most groundbreaking architectures for variable computation. What makes ACT remarkable is its ability to adapt to the input sequence by examining each sample multiple times and learning the optimal number of iterations. This paper introduces a comparative study between ACT and Repeat-RNN, a novel architecture designed to repeat each sample a fixed number of times. The findings are astonishing, revealing that Repeat-RNN performs just as well as ACT across the selected tasks. You can explore the source code for both TensorFlow and PyTorch at https://imatge-upc.github.io/danifojo-2018-repeatrnn/.",
    "Generative adversarial networks (GANs) have the capability to model the complex, high-dimensional distributions of real-world data, indicating their potential effectiveness for anomaly detection. Despite this, there has been limited research on using GANs specifically for anomaly detection tasks. In our work, we utilize newly developed GAN models for anomaly detection and achieve state-of-the-art performance on both image and network intrusion datasets. Additionally, our method is several hundred times faster at test time compared to the only other published GAN-based approach.",
    "The Natural Language Inference (NLI) task involves determining the logical relationship between a natural language premise and hypothesis. We introduce the Interactive Inference Network (IIN), a new type of neural network that hierarchically extracts semantic features from interactions for high-level sentence pair understanding. We demonstrate that interaction tensors (attention weights) contain crucial semantic information for solving NLI, and denser tensors provide richer information. One such architecture, the Densely Interactive Inference Network (DIIN), achieves state-of-the-art performance on large-scale NLI corpora. Notably, DIIN reduces error by over 20% on the challenging Multi-Genre NLI (MultiNLI) dataset compared to the previous best system.",
    "Deploying neural networks in real-world, safety-critical systems is significantly hindered by adversarial examples—inputs that are subtly altered to cause the network to misclassify them. In recent years, numerous techniques have been proposed to enhance robustness against these adversarial examples. However, many of these methods quickly prove vulnerable to subsequent attacks. For instance, more than half of the defenses introduced in papers accepted at ICLR 2018 have already been compromised. To tackle this challenge, we propose using formal verification techniques. Our approach involves constructing provably minimally distorted adversarial examples: for any given neural network and input sample, we can generate adversarial examples with the least possible distortion. Utilizing this method, we demonstrate that a recent defense proposal from ICLR, known as adversarial retraining, effectively increases the distortion needed to produce adversarial examples by a factor of 4.2.",
    "Deep neural networks (DNNs) excel at predicting outcomes by learning intricate, non-linear relationships among variables. However, their complex nature makes these relationships hard to visualize, earning DNNs the label of \"black boxes\" and limiting their use. To address this issue, we present a method called agglomerative contextual decomposition (ACD) for explaining DNN predictions through hierarchical interpretations. ACD produces a hierarchical clustering of input features for a given DNN prediction, showing the contribution of each cluster to the final outcome. This hierarchy is designed to highlight feature clusters that the DNN identified as predictive. Through examples from the Stanford Sentiment Treebank and ImageNet, we demonstrate that ACD successfully diagnoses incorrect predictions and detects dataset biases. Human experiments show that ACD helps users discern the more accurate of two DNNs and increases trust in a DNN's outputs. Additionally, ACD's hierarchies are generally resistant to adversarial perturbations, suggesting they capture essential input characteristics while ignoring irrelevant noise.",
    "Imagine if you could wave a magic wand and transform the sound of a piano into a violin, all while keeping the same melody, rhythm, and volume. That's exactly what we're diving into with our work on musical timbre transfer. Instead of just changing the appearance of a photo using style transfer, we're shifting the \"color\" of sounds between different instruments.\n\nIntroducing TimbreTron: a groundbreaking method that takes the concept of style transfer from the visual world and brings it to music. By using a time-frequency representation of an audio signal, TimbreTron allows us to manipulate the timbre, or unique quality of sound, of an instrument. The real magic happens when TimbreTron teams up with a conditional WaveNet synthesizer to produce high-quality waveforms, ensuring the transformed sound is crisp and clear.\n\nWe've discovered that the Constant Q Transform (CQT) is the secret ingredient, making the system exceptionally good at handling pitch variations — crucial for both single-note and full-chord music samples. Through extensive human testing, we've found that TimbreTron effectively swaps the timbre while keeping everything else intact. Whether it's a solo or a symphony, TimbreTron manages to keep the soul of the music alive.",
    "We delve into the realm of word-level language modeling and explore the potential of merging short-term representations based on hidden states with medium-term representations embedded in the dynamic weights of a language model. Our research builds on recent experiments with language models that have adaptable weights, transforming the language modeling challenge into an online learning-to-learn framework. In this framework, a meta-learner is trained through gradient descent to perpetually refine the weights of the language model.",
    "GANs are strong generative models that can understand the structure of natural images. We use this ability for manifold regularization by estimating the Laplacian norm using a simple Monte Carlo method with the GAN. By adding this to the feature-matching GAN from Improved GAN, we get state-of-the-art results for GAN-based semi-supervised learning on the CIFAR-10 dataset, with a much easier method than other competing techniques.",
    "We uncover a fascinating class of over-parameterized deep neural networks that utilize standard activation functions and the cross-entropy loss. Remarkably, these networks are free from troublesome local valleys. No matter where you start in the parameter space, there's always a continuous path where the cross-entropy loss consistently decreases, approaching zero. This groundbreaking discovery means that these networks are devoid of sub-optimal strict local minima.",
    "Visual Question Answering (VQA) models have historically struggled with counting objects in natural images. We have identified that the use of soft attention in these models is a fundamental issue contributing to this difficulty. To address this problem, we propose a neural network component that enhances the robustness of counting using object proposals. Our experiments on a toy task demonstrate the effectiveness of this component, achieving state-of-the-art accuracy in the number category of the VQA v2 dataset without sacrificing performance in other categories. Remarkably, our single model outperforms ensemble models. Additionally, on a challenging balanced pair metric, our proposed component improves counting accuracy by 6.6% over a strong baseline.",
    "A significant hurdle in the exploration of generative adversarial networks is the instability encountered during training. In response, this paper introduces an innovative weight normalization method known as spectral normalization, designed to stabilize the discriminator's training process. This new technique is both computationally efficient and straightforward to integrate into current frameworks. Upon evaluating spectral normalization using the CIFAR10, STL-10, and ILSVRC2012 datasets, we have empirically validated that spectrally normalized GANs (SN-GANs) can produce images of comparable or superior quality to those created using earlier training stabilization methods.",
    "Embedding graph nodes into a vector space enables the application of machine learning techniques to tasks such as predicting node classes. However, research on node embedding algorithms remains less developed compared to natural language processing, largely due to the diverse nature of graphs. In this study, we evaluate the performance of various node embedding algorithms in relation to graph centrality measures, which describe the characteristics of different graphs. Through systematic experiments involving four node embedding algorithms, multiple centrality measures, and six datasets, our results provide insights into the properties of these algorithms. These findings can serve as a foundation for further research in the field.",
    "We present a new dataset designed to assess models' ability to understand and utilize the structure of logical expressions through an entailment prediction task. To this end, we evaluate several commonly used architectures in sequence-processing, along with a novel model class—PossibleWorldNets—which determine entailment by performing a \"convolution over possible worlds.\" Our results indicate that convolutional networks have an unsuitable inductive bias for this type of problem compared to LSTM RNNs. Furthermore, tree-structured neural networks surpass LSTM RNNs due to their superior capability to leverage logical syntax, while PossibleWorldNets achieve the best performance across all benchmarks.",
    "Neural network pruning can reduce parameter counts by over 90%, decreasing storage needs and boosting inference performance without losing accuracy. However, pruned architectures are hard to train from scratch. We discover that standard pruning reveals subnetworks with initial weights that enable effective training. This leads to the \"lottery ticket hypothesis\": dense, randomly-initialized networks contain subnetworks (\"winning tickets\") that, when trained alone, achieve comparable test accuracy to the original network in a similar time. Winning tickets owe their success to advantageous initial weights. We provide an algorithm to identify these tickets and present experiments showing their importance. We consistently find winning tickets less than 10-20% the size of standard architectures (MNIST, CIFAR10). Beyond this size, these tickets train faster and achieve higher test accuracy than the original network.",
    "We describe the singular values of the linear transformation linked to a standard 2D multi-channel convolutional layer, allowing for their efficient calculation. This understanding also gives rise to an algorithm for projecting a convolutional layer onto an operator-norm ball. We demonstrate that this serves as an effective regularizer; for instance, it reduces the test error of a deep residual network with batch normalization on CIFAR-10 from 6.2% to 5.3%.",
    "Despite their remarkable empirical successes, deep and locally connected nonlinear networks like deep convolutional neural networks (DCNNs) still pose significant theoretical challenges. In this paper, we introduce an innovative theoretical framework for such networks with ReLU nonlinearity. This framework captures data distribution precisely, encourages disentangled representations, and aligns seamlessly with regularization methods such as Batch Norm. Built on a teacher-student architecture, the framework expands the student's forward and backward propagation into the teacher's computational graph. Crucially, it avoids impractical assumptions like Gaussian inputs or activation independence. Our framework promises to shed light on critical practical issues in deep network theory, including overfitting, generalization, and the nature of disentangled representations.",
    "We introduce Neural Program Search, an advanced algorithm designed to generate programs from natural language descriptions and a limited set of input/output examples. This algorithm synergizes techniques from Deep Learning and Program Synthesis by devising a robust domain-specific language (DSL) and implementing an efficient search algorithm, guided by a Seq2Tree model. To assess the effectiveness of our approach, we also provide a semi-synthetic dataset comprising descriptions, test examples, and corresponding programs. Our findings demonstrate that our algorithm significantly surpasses the performance of a sequence-to-sequence model with an attention mechanism baseline.",
    "Most state-of-the-art neural machine translation systems, regardless of their architectural differences (e.g., recurrence, convolutional), share a crucial feature: Attention. However, current attention methods are primarily token-based and overlook phrasal alignments, which are essential for phrase-based statistical machine translation. This paper introduces novel phrase-based attention methods that model n-grams of tokens as attention entities. By integrating these phrase-based attentions into the Transformer network, our approach achieves improvements of 1.3 BLEU for English-to-German and 0.5 BLEU for German-to-English translation tasks on WMT newstest2014, using WMT'16 training data.",
    "Imagine a groundbreaking approach to understanding how edits work! We're unveiling a creative solution that learns to capture the essence of changes made to text or code. By cleverly pairing a \"neural editor\" with an \"edit encoder,\" our innovative models grasp the key details of an edit and apply similar changes to new content. After experimenting with natural language and code modifications, we've seen exciting results—our neural networks are picking up on the structural and meaningful aspects of edits. We believe this intriguing challenge and dataset will spark curiosity and further research in the community.",
    "We propose a kernel learning method based on a Fourier-analytic characterization of invariant kernels. Our method iteratively refines the SVM margin through feature maps and provides rigorous guarantees for optimality and generalization. Interpreted as online equilibrium dynamics in a two-player min-max game, our approach demonstrates scalability and consistent improvements on synthetic and real-world datasets, outperforming related random features methods.",
    "This paper introduces variational continual learning (VCL), an easy yet flexible method for continual learning. It combines online variational inference (VI) with recent improvements in Monte Carlo VI for neural networks. VCL can effectively train both deep discriminative and deep generative models in complex settings where tasks change or new tasks appear over time. Tests show that VCL performs better than the latest continual learning methods on various tasks and prevents forgetting old information automatically.",
    "This report serves multiple purposes. Firstly, it investigates the reproducibility of the 2018 paper, \"On the Regularization of Wasserstein GANs.\" Secondly, it emphasizes the replication of five key experimental aspects from the original paper: learning speed, stability, robustness against hyperparameters, estimating the Wasserstein distance, and various sampling methods. Lastly, the report assesses which parts of the study can be reproduced and the resource costs involved. All reproduction source code is publicly available.",
    "In this paper, we introduce a novel technique for extracting features from program execution logs. Initially, we automatically identify complex patterns within a program's behavior graph. These patterns are then embedded into a continuous space using a trained autoencoder. We assess the effectiveness of the proposed features through a real-world malicious software detection task. Additionally, we discover that the embedding space reveals interpretable structures within the pattern components.",
    "Introducing a groundbreaking neural probabilistic model built on the pioneering principles of variational autoencoders! This versatile model can ingeniously condition itself on any chosen subset of observed features, then spectacularly generate the remaining features all at once, like magic. Whether dealing with real-valued or categorical features, this model handles them seamlessly. Propelled by the power of stochastic variational Bayes, its training is both sophisticated and robust. Our experimental results, spanning synthetic data, feature imputation, and image inpainting, highlight the model's remarkable effectiveness and the vibrant diversity of its generated samples.",
    "Variational Autoencoders (VAEs) were initially inspired by Kingma & Welling (2014) as probabilistic generative frameworks utilizing approximate Bayesian inference. The concept of $\\beta$-VAEs introduced by Higgins et al. (2017) diverges from this original interpretation and broadens the application spectrum of VAEs to various fields such as representation learning, clustering, and lossy data compression. This is achieved through an objective function that enables users to balance the information content (\"bit rate\") of the latent representation against the distortion of the reconstructed data (Alemi et al., 2018). In this study, we re-evaluate this rate/distortion trade-off in the scope of hierarchical VAEs, which incorporate multiple layers of latent variables. We recognize a general category of inference models allowing the rate to be divided into contributions from each layer, enabling individual tuning. We establish theoretical limits on the efficacy of downstream tasks based on the rates of individual layers and confirm these theoretical insights through extensive experiments. Our findings offer useful recommendations for practitioners on selecting the optimal rate-space region for specific applications.",
    "Exploring the intricate landscapes of adversarial examples reveals much about the resilience of deep neural networks (DNNs) against subtle perturbations. In a groundbreaking study, Ma et al. (ICLR 2018) introduced the concept of local intrinsic dimensionality (LID) within layer-wise hidden representations of DNNs to delve into these adversarial realms. Demonstrating prowess, LID effectively mapped the subspaces influenced by various attack strategies, such as the Carlini and Wagner (C&W) attack and the fast gradient sign attack.\n\nVenturing further, this paper presents novel experimental insights using the MNIST and CIFAR-10 datasets, exposing some limitations of LID in portraying adversarial subspaces. These explorations focus on two previously uncharted territories in LID analysis: (i) oblivious attacks intertwined with varying confidence levels of adversarial examples and (ii) the nuances of black-box transfer attacks.\n\nRegarding (i), our findings reveal that the efficacy of LID is remarkably sensitive to the attack’s confidence parameter. Surprisingly, when adversarial examples with mixed confidence levels are analyzed, LID underperforms. In the case of (ii), we observe that adversarial examples generated from alternative DNN models elude effective characterization by LID.\n\nCollectively, these discoveries underscore significant constraints in LID's ability to map out adversarial subspaces, suggesting that its application might be more limited than previously anticipated.",
    "Generative adversarial networks (GANs) are a buzzword in the generative modeling field, renowned for their ability to create strikingly realistic samples. Despite their potential, they present a notorious challenge: they’re a bear to train. The regular approach has been to tweak the GAN objective in various innovative ways, but surprisingly little attention has been paid to optimizing the adversarial training itself. Our research reinterprets GAN optimization problems through the lens of variational inequalities. By diving into mathematical programming, we debunk some myths about the challenges of saddle point optimization. We suggest enhancing GAN training by adapting methods from the variational inequalities domain. Specifically, we bring averaging, extrapolation, and a less resource-intensive adaptation called extrapolation from the past, to both SGD and Adam optimization techniques.",
    "Neural message passing algorithms for semi-supervised classification on graphs have recently revolutionized the field. Yet, their limitation lies in considering only a small, difficult-to-expand neighborhood for node classification. Our breakthrough research addresses this constraint by leveraging the relationship between graph convolutional networks (GCNs) and PageRank to create a superior propagation scheme based on personalized PageRank. This innovative approach forms the basis of our simple yet powerful model, personalized propagation of neural predictions (PPNP), and its efficient variant, APPNP. Remarkably, our model boasts training times that are on par with or faster than existing models, with a comparable or even lower number of parameters. By harnessing a significantly larger, adjustable neighborhood for classification, our model seamlessly integrates with any neural network. Extensive studies demonstrate that our model consistently outperforms recently proposed methods for semi-supervised classification, establishing a new benchmark in the field. Our implementation is readily accessible online.",
    "In our research, we uncover the phenomenon of obfuscated gradients, a type of gradient masking, which can lead to a misleading perception of security in defense mechanisms against adversarial examples. Defenses that create obfuscated gradients may seem effective against iterative optimization-based attacks at first glance, but further investigation reveals that these defenses can be bypassed. We document the characteristic behaviors of defenses that exhibit this phenomenon and classify three types of obfuscated gradients. For each type, we develop specific attack techniques to overcome the defense. In a case study focused on non-certified white-box-secure defenses presented at the ICLR 2018 conference, we observe that obfuscated gradients are prevalent, with 7 out of 9 defenses relying on this effect. Our newly devised attack strategies successfully defeat 6 defenses entirely and partially undermine 1, all within the original threat models proposed in the respective papers.",
    "Techniques for learning node representations in graphs are pivotal in network analysis as they facilitate various downstream tasks. We introduce Graph2Gauss, an efficient method for learning flexible node embeddings on large-scale (attributed) graphs, demonstrating robust performance in tasks such as link prediction and node classification. Unlike most methods that represent nodes as point vectors in a low-dimensional space, we represent each node as a Gaussian distribution, thus capturing representation uncertainty. Additionally, we present an unsupervised method suitable for inductive learning scenarios across diverse graph types: plain/attributed and directed/undirected. By integrating both network structure and node attributes, we can generalize to unseen nodes without requiring extra training. To learn embeddings, we use a personalized ranking approach concerning node distances that takes advantage of the inherent ordering imposed by the network structure. Our experiments on real-world networks highlight the superior performance of our method, surpassing state-of-the-art network embedding techniques across various tasks. Moreover, modeling uncertainty provides additional insights by allowing us to estimate neighborhood diversity and identify the inherent latent dimensionality of a graph.",
    "Convolutional Neural Networks (CNNs) have become the preferred approach for solving problems involving 2D planar images. However, recent advancements have highlighted the need for models capable of analyzing spherical images. Such applications include omnidirectional vision for drones, robots, and autonomous vehicles, as well as molecular regression issues and global weather and climate modeling. A straightforward application of convolutional networks to a planar projection of a spherical signal is likely to fail due to space-varying distortions introduced by the projection, which render translational weight sharing ineffective.\n\nIn this paper, we introduce essential components for constructing spherical CNNs. We propose a definition for spherical cross-correlation that is both expressive and rotation-equivariant. This spherical correlation adheres to a generalized Fourier theorem, enabling efficient computation using a generalized (non-commutative) Fast Fourier Transform (FFT) algorithm. We demonstrate the computational efficiency, numerical accuracy, and effectiveness of spherical CNNs in 3D model recognition and atomization energy regression tasks.",
    "This paper demonstrates the direct application of natural language processing (NLP) techniques to classification problems in cheminformatics. By examining the standard textual representation of compounds known as SMILES, the connection between these seemingly distinct fields is established. The study focuses on predicting compound activity against a target protein, a critical component of computer-aided drug design. The experiments reveal that this approach not only surpasses the performance of traditional handcrafted representations but also provides clear structural insights into decision-making processes.",
    "Integrating Computer Vision and Deep Learning into agriculture seeks to boost harvest quality and farmer productivity. Postharvest, the sorting of fruits and vegetables influences both the export market and quality assessments. Apples, specifically, are vulnerable to numerous defects during and after harvesting. This paper investigates whether advanced computer vision and deep learning techniques, like YOLOv3 (Redmon & Farhadi, 2018), can assist farmers in identifying defect-free apples from those with imperfections, thereby improving post-harvest handling.",
    "We offer two straightforward techniques to decrease the number of parameters and speed up the training process of large Long Short-Term Memory (LSTM) networks: the first technique involves \"matrix factorization by design,\" which breaks down the LSTM matrix into the product of two smaller matrices, and the second technique entails partitioning the LSTM matrix along with its inputs and states into independent groups. Both methods enable us to train large LSTM networks much faster, achieving nearly state-of-the-art perplexity while utilizing significantly fewer parameters.",
    "Cutting-edge models for deep reading comprehension are primarily led by recurrent neural networks (RNNs). Due to their inherently sequential processing ability, RNNs are well-suited for interpreting language. However, this sequential nature restricts parallel processing within a single instance, often creating a significant bottleneck when deploying these models in situations where low latency is critical. This limitation is especially pronounced when dealing with longer texts.\n\nIn this paper, we introduce a convolutional architecture as an alternative approach to these recurrent architectures. By replacing the recurrent units with straightforward dilated convolutional units, we are able to achieve performance levels on par with the state-of-the-art across two question answering benchmarks. Moreover, this convolutional approach yields substantial improvements in speed, offering up to two orders of magnitude faster processing times for question answering tasks.",
    "In this study, we examine the reinstatement mechanism proposed by Ritter et al. (2018) and identify two types of neurons that develop in the agent's working memory (an epLSTM cell) when it undergoes episodic meta-RL training on an episodic form of the Harlow visual fixation task. The Abstract neurons store knowledge common across various tasks, while the Episodic neurons hold information pertinent to the task of a particular episode.",
    "The concept of the rate-distortion-perception function (RDPF), introduced by Blau and Michaeli in 2019, has become an invaluable framework for analyzing the balance between realism and distortion in lossy compression. While the rate-distortion function is well-understood, the existence of encoders and decoders that can match the rates predicted by the RDPF has remained a mystery. Leveraging findings from Li and El Gamal in 2018, we demonstrate that it is indeed possible to achieve the RDPF using stochastic, variable-length codes. Moreover, we establish that for these types of codes, the RDPF sets a lower bound on the achievable rate, pushing the boundaries of what we know about efficient data compression and perception.",
    "Hey there! In this paper, we introduce Neural Phrase-based Machine Translation (NPMT). Our approach specifically models phrase structures in output sequences using Sleep-WAke Networks (SWAN), a cool new segmentation-based sequence modeling method. To get around SWAN's strict alignment rules, we added a new layer that allows for (soft) local reordering of input sequences. Unlike other neural machine translation (NMT) methods, NPMT skips the attention-based decoding and instead outputs phrases in a straightforward order, which makes decoding faster. Our tests show that NPMT knocks it out of the park on IWSLT 2014 German-English/English-German and IWSLT 2015 English-Vietnamese translation tasks, beating strong NMT benchmarks. Plus, we noticed our method produces meaningful phrases in the output languages.",
    "It is well-established that minor adversarial perturbations can cause classification errors in deep neural networks (DNNs). This paper argues that sparse representations of input data are essential for mitigating such attacks. For linear classifiers, we demonstrate that a sparsifying front end is provably effective against $\\ell_{\\infty}$-bounded attacks, reducing output distortion induced by the attack by a factor of approximately $K / N$, where $N$ represents the data dimension and $K$ denotes the sparsity level. We extend this approach to DNNs by proposing a \"locally linear\" model that serves as a theoretical basis for developing both attacks and defenses. Experimental results on the MNIST dataset validate the effectiveness of the proposed sparsifying front end.",
    "Get ready for a game-changer in deep reinforcement learning! We're excited to introduce Supervised Policy Update (SPU), a revolutionary sample-efficient methodology. Starting with data from the current policy, SPU tackles a constrained optimization problem in the non-parameterized proximal policy space. Through supervised regression, it then transforms this optimal non-parameterized policy into a parameterized policy, generating new samples. This versatile method works wonders in both discrete and continuous action spaces, and skillfully handles various proximity constraints for the non-parameterized optimization problem. Imagine addressing Natural Policy Gradient, Trust Region Policy Optimization (NPG/TRPO), and Proximal Policy Optimization (PPO) problems effortlessly with SPU. Plus, SPU’s implementation is far simpler than TRPO. Our extensive experiments reveal SPU's superior sample efficiency, outperforming TRPO in Mujoco's simulated robotic tasks and PPO in challenging Atari video game tasks. Get ready to take your deep reinforcement learning to the next level with SPU!",
    "Introducing Moving Symbols, a parameterized synthetic dataset crafted for objective analysis of video prediction networks. By utilizing various controlled variations of the dataset, we expose limitations in a state-of-the-art approach and recommend a more semantically meaningful performance metric to enhance experimental clarity. Our dataset offers standard test cases that will enable the research community to better understand and refine the learned representations of these networks. Access the code at https://github.com/rszeto/moving-symbols."
  ]
}