{
  "original": [
    "In this report, we describe a Theano-based AlexNet (Krizhevsky et al., 2012) implementation and its naive data parallelism on multiple GPUs. Our performance on 2 GPUs is comparable with the state-of-art Caffe library (Jia et al., 2014) run on 1 GPU. To the best of our knowledge, this is the first open-source Python-based AlexNet implementation to-date.",
    "We show that deep narrow Boltzmann machines are universal approximators of probability distributions on the activities of their visible units, provided they have sufficiently many hidden layers, each containing the same number of units as the visible layer. We show that, within certain parameter domains, deep Boltzmann machines can be studied as feedforward networks. We provide upper and lower bounds on the sufficient depth and width of universal approximators. These results settle various intuitions regarding undirected networks and, in particular, they show that deep narrow Boltzmann machines are at least as compact universal approximators as narrow sigmoid belief networks and restricted Boltzmann machines, with respect to the currently available bounds for those models.",
    "Leveraging advances in variational inference, we propose to enhance recurrent neural networks with latent variables, resulting in Stochastic Recurrent Networks (STORNs). The model i) can be trained with stochastic gradient methods, ii) allows structured and multi-modal conditionals at each time step, iii) features a reliable estimator of the marginal likelihood and iv) is a generalisation of deterministic recurrent neural networks. We evaluate the method on four polyphonic musical data sets and motion capture data.",
    "We describe a general framework for online adaptation of optimization hyperparameters by `hot swapping' their values during learning. We investigate this approach in the context of adaptive learning rate selection using an explore-exploit strategy from the multi-armed bandit literature. Experiments on a benchmark neural network show that the hot swapping approach leads to consistently better solutions compared to well-known alternatives such as AdaDelta and stochastic gradient with exhaustive hyperparameter search.",
    "Many modern multiclass and multilabel problems are characterized by increasingly large output spaces. For these problems, label embeddings have been shown to be a useful primitive that can improve computational and statistical efficiency. In this work we utilize a correspondence between rank constrained estimation and low dimensional label embeddings that uncovers a fast label embedding algorithm which works in both the multiclass and multilabel settings. The result is a randomized algorithm for partial least squares, whose running time is exponentially faster than naive algorithms. We demonstrate our techniques on two large-scale public datasets, from the Large Scale Hierarchical Text Challenge and the Open Directory Project, where we obtain state of the art results.",
    "Accurate representational learning of both the explicit and implicit relationships within data is critical to the ability of machines to perform more complex and abstract reasoning tasks. We describe the efficient weakly supervised learning of such inferences by our Dynamic Adaptive Network Intelligence (DANI) model. We report state-of-the-art results for DANI over question answering tasks in the bAbI dataset that have proved difficult for contemporary approaches to learning representation (Weston et al., 2015).",
    "Automatic speech recognition systems usually rely on spectral-based features, such as MFCC of PLP. These features are extracted based on prior knowledge such as, speech perception or/and speech production. Recently, convolutional neural networks have been shown to be able to estimate phoneme conditional probabilities in a completely data-driven manner, i.e. using directly temporal raw speech signal as input. This system was shown to yield similar or better performance than HMM/ANN based system on phoneme recognition task and on large scale continuous speech recognition task, using less parameters. Motivated by these studies, we investigate the use of simple linear classifier in the CNN-based framework. Thus, the network learns linearly separable features from raw speech. We show that such system yields similar or better performance than MLP based system using cepstral-based features as input.",
    "We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machines. In order to be as hardware-agnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is to average the neural network parameters periodically (typically every minute or two), and redistribute the averaged parameters to the machines for further training. Each machine sees different data. By itself, this method does not work very well. However, we have another method, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which seems to allow our periodic-averaging method to work well, as well as substantially improving the convergence of SGD on a single machine.",
    "We develop a new method for visualizing and refining the invariances of learned representations. Specifically, we test for a general form of invariance, linearization, in which the action of a transformation is confined to a low-dimensional subspace. Given two reference images (typically, differing by some transformation), we synthesize a sequence of images lying on a path between them that is of minimal length in the space of the representation (a \"representational geodesic\"). If the transformation relating the two reference images is linearized by the representation, this sequence should follow the gradual evolution of this transformation. We use this method to assess the invariance properties of a state-of-the-art image classification network and find that geodesics generated for image pairs differing by translation, rotation, and dilation do not evolve according to their associated transformations. Our method also suggests a remedy for these failures, and following this prescription, we show that the modified representation is able to linearize a variety of geometric image transformations.",
    "Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.",
    "We present a novel architecture, the \"stacked what-where auto-encoders\" (SWWAE), which integrates discriminative and generative pathways and provides a unified approach to supervised, semi-supervised and unsupervised learning without relying on sampling during training. An instantiation of SWWAE uses a convolutional net (Convnet) (LeCun et al. (1998)) to encode the input, and employs a deconvolutional net (Deconvnet) (Zeiler et al. (2010)) to produce the reconstruction. The objective function includes reconstruction terms that induce the hidden states in the Deconvnet to be similar to those of the Convnet. Each pooling layer produces two sets of variables: the \"what\" which are fed to the next layer, and its complementary variable \"where\" that are fed to the corresponding layer in the generative decoder.",
    "We investigate the problem of inducing word embeddings that are tailored for a particular bilexical relation. Our learning algorithm takes an existing lexical vector space and compresses it such that the resulting word embeddings are good predictors for a target bilexical relation. In experiments we show that task-specific embeddings can benefit both the quality and efficiency in lexical prediction tasks.",
    "A generative model is developed for deep (multi-layered) convolutional dictionary learning. A novel probabilistic pooling operation is integrated into the deep model, yielding efficient bottom-up (pretraining) and top-down (refinement) probabilistic learning. Experimental results demonstrate powerful capabilities of the model to learn multi-layer features from images, and excellent classification results are obtained on the MNIST and Caltech 101 datasets.",
    "Motivated by the recent progress in generative models, we introduce a model that generates images from natural language descriptions. The proposed model iteratively draws patches on a canvas, while attending to the relevant words in the description. After training on Microsoft COCO, we compare our model with several baseline generative models on image generation and retrieval tasks. We demonstrate that our model produces higher quality samples than other approaches and generates images with novel scene compositions corresponding to previously unseen captions in the dataset.",
    "Convolutional neural networks (CNNs) work well on large datasets. But labelled data is hard to collect, and in some applications larger amounts of data are not available. The problem then is how to use CNNs with small data -- as CNNs overfit quickly. We present an efficient Bayesian CNN, offering better robustness to over-fitting on small data than traditional approaches. This is by placing a probability distribution over the CNN's kernels. We approximate our model's intractable posterior with Bernoulli variational distributions, requiring no additional model parameters.   On the theoretical side, we cast dropout network training as approximate inference in Bayesian neural networks. This allows us to implement our model using existing tools in deep learning with no increase in time complexity, while highlighting a negative result in the field. We show a considerable improvement in classification accuracy compared to standard techniques and improve on published state-of-the-art results for CIFAR-10.",
    "We propose a new method for creating computationally efficient convolutional neural networks (CNNs) by using low-rank representations of convolutional filters. Rather than approximating filters in previously-trained networks with more efficient versions, we learn a set of small basis filters from scratch; during training, the network learns to combine these basis filters into more complex filters that are discriminative for image classification. To train such networks, a novel weight initialization scheme is used. This allows effective initialization of connection weights in convolutional layers composed of groups of differently-shaped filters. We validate our approach by applying it to several existing CNN architectures and training these networks from scratch using the CIFAR, ILSVRC and MIT Places datasets. Our results show similar or higher accuracy than conventional CNNs with much less compute. Applying our method to an improved version of VGG-11 network using global max-pooling, we achieve comparable validation accuracy using 41% less compute and only 24% of the original VGG-11 model parameters; another variant of our method gives a 1 percentage point increase in accuracy over our improved VGG-11 model, giving a top-5 center-crop validation accuracy of 89.7% while reducing computation by 16% relative to the original VGG-11 model. Applying our method to the GoogLeNet architecture for ILSVRC, we achieved comparable accuracy with 26% less compute and 41% fewer model parameters. Applying our method to a near state-of-the-art network for CIFAR, we achieved comparable accuracy with 46% less compute and 55% fewer parameters.",
    "Distributed representations of words have boosted the performance of many Natural Language Processing tasks. However, usually only one representation per word is obtained, not acknowledging the fact that some words have multiple meanings. This has a negative effect on the individual word representations and the language model as a whole. In this paper we present a simple model that enables recent techniques for building word vectors to represent distinct senses of polysemic words. In our assessment of this model we show that it is able to effectively discriminate between words' senses and to do so in a computationally efficient manner.",
    "We propose Diverse Embedding Neural Network (DENN), a novel architecture for language models (LMs). A DENNLM projects the input word history vector onto multiple diverse low-dimensional sub-spaces instead of a single higher-dimensional sub-space as in conventional feed-forward neural network LMs. We encourage these sub-spaces to be diverse during network training through an augmented loss function. Our language modeling experiments on the Penn Treebank data set show the performance benefit of using a DENNLM.",
    "A standard approach to Collaborative Filtering (CF), i.e. prediction of user ratings on items, relies on Matrix Factorization techniques. Representations for both users and items are computed from the observed ratings and used for prediction. Unfortunatly, these transductive approaches cannot handle the case of new users arriving in the system, with no known rating, a problem known as user cold-start. A common approach in this context is to ask these incoming users for a few initialization ratings. This paper presents a model to tackle this twofold problem of (i) finding good questions to ask, (ii) building efficient representations from this small amount of information. The model can also be used in a more standard (warm) context. Our approach is evaluated on the classical CF problem and on the cold-start problem on four different datasets showing its ability to improve baseline performance in both cases.",
    "We propose a deep learning framework for modeling complex high-dimensional densities called Non-linear Independent Component Estimation (NICE). It is based on the idea that a good representation is one in which the data has a distribution that is easy to model. For this purpose, a non-linear deterministic transformation of the data is learned that maps it to a latent space so as to make the transformed data conform to a factorized distribution, i.e., resulting in independent latent variables. We parametrize this transformation so that computing the Jacobian determinant and inverse transform is trivial, yet we maintain the ability to learn complex non-linear transformations, via a composition of simple building blocks, each based on a deep neural network. The training criterion is simply the exact log-likelihood, which is tractable. Unbiased ancestral sampling is also easy. We show that this approach yields good generative models on four image datasets and can be used for inpainting.",
    "We introduce Deep Linear Discriminant Analysis (DeepLDA) which learns linearly separable latent representations in an end-to-end fashion. Classic LDA extracts features which preserve class separability and is used for dimensionality reduction for many classification problems. The central idea of this paper is to put LDA on top of a deep neural network. This can be seen as a non-linear extension of classic LDA. Instead of maximizing the likelihood of target labels for individual samples, we propose an objective function that pushes the network to produce feature distributions which: (a) have low variance within the same class and (b) high variance between different classes. Our objective is derived from the general LDA eigenvalue problem and still allows to train with stochastic gradient descent and back-propagation. For evaluation we test our approach on three different benchmark datasets (MNIST, CIFAR-10 and STL-10). DeepLDA produces competitive results on MNIST and CIFAR-10 and outperforms a network trained with categorical cross entropy (same architecture) on a supervised setting of STL-10.",
    "Layer-sequential unit-variance (LSUV) initialization - a simple method for weight initialization for deep net learning - is proposed. The method consists of the two steps. First, pre-initialize weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one.   Experiment with different activation functions (maxout, ReLU-family, tanh) show that the proposed initialization leads to learning of very deep nets that (i) produces networks with test accuracy better or equal to standard methods and (ii) is at least as fast as the complex schemes proposed specifically for very deep nets such as FitNets (Romero et al. (2015)) and Highway (Srivastava et al. (2015)).   Performance is evaluated on GoogLeNet, CaffeNet, FitNets and Residual nets and the state-of-the-art, or very close to it, is achieved on the MNIST, CIFAR-10/100 and ImageNet datasets.",
    "We introduce a parametric nonlinear transformation that is well-suited for Gaussianizing data from natural images. The data are linearly transformed, and each component is then normalized by a pooled activity measure, computed by exponentiating a weighted sum of rectified and exponentiated components and a constant. We optimize the parameters of the full transformation (linear transform, exponents, weights, constant) over a database of natural images, directly minimizing the negentropy of the responses. The optimized transformation substantially Gaussianizes the data, achieving a significantly smaller mutual information between transformed components than alternative methods including ICA and radial Gaussianization. The transformation is differentiable and can be efficiently inverted, and thus induces a density model on images. We show that samples of this model are visually similar to samples of natural image patches. We demonstrate the use of the model as a prior probability density that can be used to remove additive noise. Finally, we show that the transformation can be cascaded, with each layer optimized using the same Gaussianization objective, thus offering an unsupervised method of optimizing a deep network architecture.",
    "We present flattened convolutional neural networks that are designed for fast feedforward execution. The redundancy of the parameters, especially weights of the convolutional filters in convolutional neural networks has been extensively studied and different heuristics have been proposed to construct a low rank basis of the filters after training. In this work, we train flattened networks that consist of consecutive sequence of one-dimensional filters across all directions in 3D space to obtain comparable performance as conventional convolutional networks. We tested flattened model on different datasets and found that the flattened layer can effectively substitute for the 3D filters without loss of accuracy. The flattened convolution pipelines provide around two times speed-up during feedforward pass compared to the baseline model due to the significant reduction of learning parameters. Furthermore, the proposed method does not require efforts in manual tuning or post processing once the model is trained.",
    "In this paper, we introduce a novel deep learning framework, termed Purine. In Purine, a deep network is expressed as a bipartite graph (bi-graph), which is composed of interconnected operators and data tensors. With the bi-graph abstraction, networks are easily solvable with event-driven task dispatcher. We then demonstrate that different parallelism schemes over GPUs and/or CPUs on single or multiple PCs can be universally implemented by graph composition. This eases researchers from coding for various parallelization schemes, and the same dispatcher can be used for solving variant graphs. Scheduled by the task dispatcher, memory transfers are fully overlapped with other computations, which greatly reduce the communication overhead and help us achieve approximate linear acceleration.",
    "In this paper we propose a model that combines the strengths of RNNs and SGVB: the Variational Recurrent Auto-Encoder (VRAE). Such a model can be used for efficient, large scale unsupervised learning on time series data, mapping the time series data to a latent vector representation. The model is generative, such that data can be generated from samples of the latent space. An important contribution of this work is that the model can make use of unlabeled data in order to facilitate supervised training of RNNs by initialising the weights and network state.",
    "Current work in lexical distributed representations maps each word to a point vector in low-dimensional space. Mapping instead to a density provides many interesting advantages, including better capturing uncertainty about a representation and its relationships, expressing asymmetries more naturally than dot product or cosine similarity, and enabling more expressive parameterization of decision boundaries. This paper advocates for density-based distributed embeddings and presents a method for learning representations in the space of Gaussian distributions. We compare performance on various word embedding benchmarks, investigate the ability of these embeddings to model entailment and other asymmetric relationships, and explore novel properties of the representation.",
    "Multipliers are the most space and power-hungry arithmetic operators of the digital implementation of deep neural networks. We train a set of state-of-the-art neural networks (Maxout networks) on three benchmark datasets: MNIST, CIFAR-10 and SVHN. They are trained with three distinct formats: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those formats, we assess the impact of the precision of the multiplications on the final error after training. We find that very low precision is sufficient not just for running trained networks but also for training them. For example, it is possible to train Maxout networks with 10 bits multiplications.",
    "Multiple instance learning (MIL) can reduce the need for costly annotation in tasks such as semantic segmentation by weakening the required degree of supervision. We propose a novel MIL formulation of multi-class semantic segmentation learning by a fully convolutional network. In this setting, we seek to learn a semantic segmentation model from just weak image-level labels. The model is trained end-to-end to jointly optimize the representation while disambiguating the pixel-image label assignment. Fully convolutional training accepts inputs of any size, does not need object proposal pre-processing, and offers a pixelwise loss map for selecting latent instances. Our multi-class MIL loss exploits the further supervision given by images with multiple labels. We evaluate this approach through preliminary experiments on the PASCAL VOC segmentation challenge.",
    "Recently, nested dropout was proposed as a method for ordering representation units in autoencoders by their information content, without diminishing reconstruction cost. However, it has only been applied to training fully-connected autoencoders in an unsupervised setting. We explore the impact of nested dropout on the convolutional layers in a CNN trained by backpropagation, investigating whether nested dropout can provide a simple and systematic way to determine the optimal representation size with respect to the desired accuracy and desired task and data complexity.",
    "Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.",
    "When a three-dimensional object moves relative to an observer, a change occurs on the observer's image plane and in the visual representation computed by a learned model. Starting with the idea that a good visual representation is one that transforms linearly under scene motions, we show, using the theory of group representations, that any such representation is equivalent to a combination of the elementary irreducible representations. We derive a striking relationship between irreducibility and the statistical dependency structure of the representation, by showing that under restricted conditions, irreducible representations are decorrelated. Under partial observability, as induced by the perspective projection of a scene onto the image plane, the motion group does not have a linear action on the space of images, so that it becomes necessary to perform inference over a latent representation that does transform linearly. This idea is demonstrated in a model of rotating NORB objects that employs a latent representation of the non-commutative 3D rotation group SO(3).",
    "Efficient Maximum Inner Product Search (MIPS) is an important task that has a wide applicability in recommendation systems and classification with a large number of classes. Solutions based on locality-sensitive hashing (LSH) as well as tree-based solutions have been investigated in the recent literature, to perform approximate MIPS in sublinear time. In this paper, we compare these to another extremely simple approach for solving approximate MIPS, based on variants of the k-means clustering algorithm. Specifically, we propose to train a spherical k-means, after having reduced the MIPS problem to a Maximum Cosine Similarity Search (MCSS). Experiments on two standard recommendation system benchmarks as well as on large vocabulary word embeddings, show that this simple approach yields much higher speedups, for the same retrieval precision, than current state-of-the-art hashing-based and tree-based methods. This simple method also yields more robust retrievals when the query is corrupted by noise.",
    "The variational autoencoder (VAE; Kingma, Welling (2014)) is a recently proposed generative model pairing a top-down generative network with a bottom-up recognition network which approximates posterior inference. It typically makes strong assumptions about posterior inference, for instance that the posterior distribution is approximately factorial, and that its parameters can be approximated with nonlinear regression from the observations. As we show empirically, the VAE objective can lead to overly simplified representations which fail to use the network's entire modeling capacity. We present the importance weighted autoencoder (IWAE), a generative model with the same architecture as the VAE, but which uses a strictly tighter log-likelihood lower bound derived from importance weighting. In the IWAE, the recognition network uses multiple samples to approximate the posterior, giving it increased flexibility to model complex posteriors which do not fit the VAE modeling assumptions. We show empirically that IWAEs learn richer latent space representations than VAEs, leading to improved test log-likelihood on density estimation benchmarks.",
    "This work investigates how using reduced precision data in Convolutional Neural Networks (CNNs) affects network accuracy during classification. More specifically, this study considers networks where each layer may use different precision data. Our key result is the observation that the tolerance of CNNs to reduced precision data not only varies across networks, a well established observation, but also within networks. Tuning precision per layer is appealing as it could enable energy and performance improvements. In this paper we study how error tolerance across layers varies and propose a method for finding a low precision configuration for a network while maintaining high accuracy. A diverse set of CNNs is analyzed showing that compared to a conventional implementation using a 32-bit floating-point representation for all layers, and with less than 1% loss in relative accuracy, the data footprint required by these networks can be reduced by an average of 74% and up to 92%.",
    "The efficiency of graph-based semi-supervised algorithms depends on the graph of instances on which they are applied. The instances are often in a vectorial form before a graph linking them is built. The construction of the graph relies on a metric over the vectorial space that help define the weight of the connection between entities. The classic choice for this metric is usually a distance measure or a similarity measure based on the euclidean norm. We claim that in some cases the euclidean norm on the initial vectorial space might not be the more appropriate to solve the task efficiently. We propose an algorithm that aims at learning the most appropriate vectorial representation for building a graph on which the task at hand is solved efficiently.",
    "Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images. In this paper we advocate for explicitly modeling the partial order structure of this hierarchy. Towards this goal, we introduce a general method for learning ordered representations, and show how it can be applied to a variety of tasks involving images and language. We show that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval.",
    "We propose local distributional smoothness (LDS), a new notion of smoothness for statistical model that can be used as a regularization term to promote the smoothness of the model distribution. We named the LDS based regularization as virtual adversarial training (VAT). The LDS of a model at an input datapoint is defined as the KL-divergence based robustness of the model distribution against local perturbation around the datapoint. VAT resembles adversarial training, but distinguishes itself in that it determines the adversarial direction from the model distribution alone without using the label information, making it applicable to semi-supervised learning. The computational cost for VAT is relatively low. For neural network, the approximated gradient of the LDS can be computed with no more than three pairs of forward and back propagations. When we applied our technique to supervised and semi-supervised learning for the MNIST dataset, it outperformed all the training methods other than the current state of the art method, which is based on a highly advanced generative model. We also applied our method to SVHN and NORB, and confirmed our method's superior performance over the current state of the art semi-supervised method applied to these datasets.",
    "The availability of large labeled datasets has allowed Convolutional Network models to achieve impressive recognition results. However, in many settings manual annotation of the data is impractical; instead our data has noisy labels, i.e. there is some freely available label for each image which may or may not be accurate. In this paper, we explore the performance of discriminatively-trained Convnets when trained on such noisy data. We introduce an extra noise layer into the network which adapts the network outputs to match the noisy label distribution. The parameters of this noise layer can be estimated as part of the training process and involve simple modifications to current training infrastructures for deep networks. We demonstrate the approaches on several datasets, including large scale experiments on the ImageNet classification benchmark.",
    "We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.",
    "Discourse relations bind smaller linguistic elements into coherent texts. However, automatically identifying discourse relations is difficult, because it requires understanding the semantics of the linked sentences. A more subtle challenge is that it is not enough to represent the meaning of each sentence of a discourse relation, because the relation may depend on links between lower-level elements, such as entity mentions. Our solution computes distributional meaning representations by composition up the syntactic parse tree. A key difference from previous work on compositional distributional semantics is that we also compute representations for entity mentions, using a novel downward compositional pass. Discourse relations are predicted not only from the distributional representations of the sentences, but also of their coreferent entity mentions. The resulting system obtains substantial improvements over the previous state-of-the-art in predicting implicit discourse relations in the Penn Discourse Treebank.",
    "In this work, we propose a new method to integrate two recent lines of work: unsupervised induction of shallow semantics (e.g., semantic roles) and factorization of relations in text and knowledge bases. Our model consists of two components: (1) an encoding component: a semantic role labeling model which predicts roles given a rich set of syntactic and lexical features; (2) a reconstruction component: a tensor factorization model which relies on roles to predict argument fillers. When the components are estimated jointly to minimize errors in argument reconstruction, the induced roles largely correspond to roles defined in annotated resources. Our method performs on par with most accurate role induction methods on English, even though, unlike these previous approaches, we do not incorporate any prior linguistic knowledge about the language.",
    "The notion of metric plays a key role in machine learning problems such as classification, clustering or ranking. However, it is worth noting that there is a severe lack of theoretical guarantees that can be expected on the generalization capacity of the classifier associated to a given metric. The theoretical framework of $(\\epsilon, \\gamma, \\tau)$-good similarity functions (Balcan et al., 2008) has been one of the first attempts to draw a link between the properties of a similarity function and those of a linear classifier making use of it. In this paper, we extend and complete this theory by providing a new generalization bound for the associated classifier based on the algorithmic robustness framework.",
    "We present the multiplicative recurrent neural network as a general model for compositional meaning in language, and evaluate it on the task of fine-grained sentiment analysis. We establish a connection to the previously investigated matrix-space models for compositionality, and show they are special cases of the multiplicative recurrent net. Our experiments show that these models perform comparably or better than Elman-type additive recurrent neural networks and outperform matrix-space models on a standard fine-grained sentiment analysis corpus. Furthermore, they yield comparable results to structural deep models on the recently published Stanford Sentiment Treebank without the need for generating parse trees.",
    "Finding minima of a real valued non-convex function over a high dimensional space is a major challenge in science. We provide evidence that some such functions that are defined on high dimensional domains have a narrow band of values whose pre-image contains the bulk of its critical points. This is in contrast with the low dimensional picture in which this band is wide. Our simulations agree with the previous theoretical work on spin glasses that proves the existence of such a band when the dimension of the domain tends to infinity. Furthermore our experiments on teacher-student networks with the MNIST dataset establish a similar phenomenon in deep networks. We finally observe that both the gradient descent and the stochastic gradient descent methods can reach this level within the same number of steps.",
    "We develop a new statistical model for photographic images, in which the local responses of a bank of linear filters are described as jointly Gaussian, with zero mean and a covariance that varies slowly over spatial position. We optimize sets of filters so as to minimize the nuclear norms of matrices of their local activations (i.e., the sum of the singular values), thus encouraging a flexible form of sparsity that is not tied to any particular dictionary or coordinate system. Filters optimized according to this objective are oriented and bandpass, and their responses exhibit substantial local correlation. We show that images can be reconstructed nearly perfectly from estimates of the local filter response covariances alone, and with minimal degradation (either visual or MSE) from low-rank approximations of these covariances. As such, this representation holds much promise for use in applications such as denoising, compression, and texture representation, and may form a useful substrate for hierarchical decompositions.",
    "Most modern convolutional neural networks (CNNs) used for object recognition are built using the same principles: Alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks. Following this finding -- and building on other recent work for finding simple network structures -- we propose a new architecture that consists solely of convolutional layers and yields competitive or state of the art performance on several object recognition datasets (CIFAR-10, CIFAR-100, ImageNet). To analyze the network we introduce a new variant of the \"deconvolution approach\" for visualizing features learned by CNNs, which can be applied to a broader range of network structures than existing approaches.",
    "Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.",
    "This paper introduces a greedy parser based on neural networks, which leverages a new compositional sub-tree representation. The greedy parser and the compositional procedure are jointly trained, and tightly depends on each-other. The composition procedure outputs a vector representation which summarizes syntactically (parsing tags) and semantically (words) sub-trees. Composition and tagging is achieved over continuous (word or tag) representations, and recurrent neural networks. We reach F1 performance on par with well-known existing parsers, while having the advantage of speed, thanks to the greedy nature of the parser. We provide a fully functional implementation of the method described in this paper.",
    "Suitable lateral connections between encoder and decoder are shown to allow higher layers of a denoising autoencoder (dAE) to focus on invariant representations. In regular autoencoders, detailed information needs to be carried through the highest layers but lateral connections from encoder to decoder relieve this pressure. It is shown that abstract invariant features can be translated to detailed reconstructions when invariant features are allowed to modulate the strength of the lateral connection. Three dAE structures with modulated and additive lateral connections, and without lateral connections were compared in experiments using real-world images. The experiments verify that adding modulated lateral connections to the model 1) improves the accuracy of the probability model for inputs, as measured by denoising performance; 2) results in representations whose degree of invariance grows faster towards the higher layers; and 3) supports the formation of diverse invariant poolings.",
    "We develop a new method for visualizing and refining the invariances of learned representations. Specifically, we test for a general form of invariance, linearization, in which the action of a transformation is confined to a low-dimensional subspace. Given two reference images (typically, differing by some transformation), we synthesize a sequence of images lying on a path between them that is of minimal length in the space of the representation (a \"representational geodesic\"). If the transformation relating the two reference images is linearized by the representation, this sequence should follow the gradual evolution of this transformation. We use this method to assess the invariance properties of a state-of-the-art image classification network and find that geodesics generated for image pairs differing by translation, rotation, and dilation do not evolve according to their associated transformations. Our method also suggests a remedy for these failures, and following this prescription, we show that the modified representation is able to linearize a variety of geometric image transformations.",
    "Genomics are rapidly transforming medical practice and basic biomedical research, providing insights into disease mechanisms and improving therapeutic strategies, particularly in cancer. The ability to predict the future course of a patient's disease from high-dimensional genomic profiling will be essential in realizing the promise of genomic medicine, but presents significant challenges for state-of-the-art survival analysis methods. In this abstract we present an investigation in learning genomic representations with neural networks to predict patient survival in cancer. We demonstrate the advantages of this approach over existing survival analysis methods using brain tumor data.",
    "Existing approaches to combine both additive and multiplicative neural units either use a fixed assignment of operations or require discrete optimization to determine what function a neuron should perform. However, this leads to an extensive increase in the computational complexity of the training procedure.   We present a novel, parameterizable transfer function based on the mathematical concept of non-integer functional iteration that allows the operation each neuron performs to be smoothly and, most importantly, differentiablely adjusted between addition and multiplication. This allows the decision between addition and multiplication to be integrated into the standard backpropagation training procedure.",
    "One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaing isometry, one exact and one stochastic. Preliminary experiments show that for both determinant and scale-normalization effectively speeds up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.",
    "We extend Stochastic Gradient Variational Bayes to perform posterior inference for the weights of Stick-Breaking processes. This development allows us to define a Stick-Breaking Variational Autoencoder (SB-VAE), a Bayesian nonparametric version of the variational autoencoder that has a latent representation with stochastic dimensionality. We experimentally demonstrate that the SB-VAE, and a semi-supervised variant, learn highly discriminative latent representations that often outperform the Gaussian VAE's.",
    "Unsupervised learning on imbalanced data is challenging because, when given imbalanced data, current model is often dominated by the major category and ignores the categories with small amount of data. We develop a latent variable model that can cope with imbalanced data by dividing the latent space into a shared space and a private space. Based on Gaussian Process Latent Variable Models, we propose a new kernel formulation that enables the separation of latent space and derives an efficient variational inference method. The performance of our model is demonstrated with an imbalanced medical image dataset.",
    "Generative adversarial networks (GANs) are successful deep generative models. GANs are based on a two-player minimax game. However, the objective function derived in the original motivation is changed to obtain stronger gradients when learning the generator. We propose a novel algorithm that repeats the density ratio estimation and f-divergence minimization. Our algorithm offers a new perspective toward the understanding of GANs and is able to make use of multiple viewpoints obtained in the research of density ratio estimation, e.g. what divergence is stable and relative density ratio is useful.",
    "This paper shows how one can directly apply natural language processing (NLP) methods to classification problems in cheminformatics. Connection between these seemingly separate fields is shown by considering standard textual representation of compound, SMILES. The problem of activity prediction against a target protein is considered, which is a crucial part of computer aided drug design process. Conducted experiments show that this way one can not only outrank state of the art results of hand crafted representations but also gets direct structural insights into the way decisions are made.",
    "We introduce a neural network architecture and a learning algorithm to produce factorized symbolic representations. We propose to learn these concepts by observing consecutive frames, letting all the components of the hidden representation except a small discrete set (gating units) be predicted from the previous frame, and let the factors of variation in the next frame be represented entirely by these discrete gated units (corresponding to symbolic representations). We demonstrate the efficacy of our approach on datasets of faces undergoing 3D transformations and Atari 2600 games.",
    "We look at the eigenvalues of the Hessian of a loss function before and after training. The eigenvalue distribution is seen to be composed of two parts, the bulk which is concentrated around zero, and the edges which are scattered away from zero. We present empirical evidence for the bulk indicating how over-parametrized the system is, and for the edges that depend on the input data.",
    "We introduce a parametric nonlinear transformation that is well-suited for Gaussianizing data from natural images. The data are linearly transformed, and each component is then normalized by a pooled activity measure, computed by exponentiating a weighted sum of rectified and exponentiated components and a constant. We optimize the parameters of the full transformation (linear transform, exponents, weights, constant) over a database of natural images, directly minimizing the negentropy of the responses. The optimized transformation substantially Gaussianizes the data, achieving a significantly smaller mutual information between transformed components than alternative methods including ICA and radial Gaussianization. The transformation is differentiable and can be efficiently inverted, and thus induces a density model on images. We show that samples of this model are visually similar to samples of natural image patches. We demonstrate the use of the model as a prior probability density that can be used to remove additive noise. Finally, we show that the transformation can be cascaded, with each layer optimized using the same Gaussianization objective, thus offering an unsupervised method of optimizing a deep network architecture.",
    "Approximate variational inference has shown to be a powerful tool for modeling unknown complex probability distributions. Recent advances in the field allow us to learn probabilistic models of sequences that actively exploit spatial and temporal structure. We apply a Stochastic Recurrent Network (STORN) to learn robot time series data. Our evaluation demonstrates that we can robustly detect anomalies both off- and on-line.",
    "We develop a general problem setting for training and testing the ability of agents to gather information efficiently. Specifically, we present a collection of tasks in which success requires searching through a partially-observed environment, for fragments of information which can be pieced together to accomplish various goals. We combine deep architectures with techniques from reinforcement learning to develop agents that solve our tasks. We shape the behavior of these agents by combining extrinsic and intrinsic rewards. We empirically demonstrate that these agents learn to search actively and intelligently for new information to reduce their uncertainty, and to exploit information they have already acquired.",
    "We propose an extension to neural network language models to adapt their prediction to the recent history. Our model is a simplified version of memory augmented networks, which stores past hidden activations as memory and accesses them through a dot product with the current hidden activation. This mechanism is very efficient and scales to very large memory sizes. We also draw a link between the use of external memory in neural network and cache models used with count based language models. We demonstrate on several language model datasets that our approach performs significantly better than recent memory augmented networks.",
    "Motivated by the recent progress in generative models, we introduce a model that generates images from natural language descriptions. The proposed model iteratively draws patches on a canvas, while attending to the relevant words in the description. After training on Microsoft COCO, we compare our model with several baseline generative models on image generation and retrieval tasks. We demonstrate that our model produces higher quality samples than other approaches and generates images with novel scene compositions corresponding to previously unseen captions in the dataset.",
    "We propose a framework for training multiple neural networks simultaneously. The parameters from all models are regularised by the tensor trace norm, so that each neural network is encouraged to reuse others' parameters if possible -- this is the main motivation behind multi-task learning. In contrast to many deep multi-task learning models, we do not predefine a parameter sharing strategy by specifying which layers have tied parameters. Instead, our framework considers sharing for all shareable layers, and the sharing strategy is learned in a data-driven way.",
    "This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems. To achieve this, the paper introduces several innovations, including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.",
    "We present a novel framework for generating pop music. Our model is a hierarchical Recurrent Neural Network, where the layers and the structure of the hierarchy encode our prior knowledge about how pop music is composed. In particular, the bottom layers generate the melody, while the higher levels produce the drums and chords. We conduct several human studies that show strong preference of our generated music over that produced by the recent method by Google. We additionally show two applications of our framework: neural dancing and karaoke, as well as neural story singing.",
    "Many machine learning classifiers are vulnerable to adversarial perturbations. An adversarial perturbation modifies an input to change a classifier's prediction without causing the input to seem substantially different to human perception. We deploy three methods to detect adversarial images. Adversaries trying to bypass our detectors must make the adversarial image less pathological or they will fail trying. Our best detection method reveals that adversarial images place abnormal emphasis on the lower-ranked principal components from PCA. Other detectors and a colorful saliency map are in an appendix.",
    "We propose a new method for creating computationally efficient convolutional neural networks (CNNs) by using low-rank representations of convolutional filters. Rather than approximating filters in previously-trained networks with more efficient versions, we learn a set of small basis filters from scratch; during training, the network learns to combine these basis filters into more complex filters that are discriminative for image classification. To train such networks, a novel weight initialization scheme is used. This allows effective initialization of connection weights in convolutional layers composed of groups of differently-shaped filters. We validate our approach by applying it to several existing CNN architectures and training these networks from scratch using the CIFAR, ILSVRC and MIT Places datasets. Our results show similar or higher accuracy than conventional CNNs with much less compute. Applying our method to an improved version of VGG-11 network using global max-pooling, we achieve comparable validation accuracy using 41% less compute and only 24% of the original VGG-11 model parameters; another variant of our method gives a 1 percentage point increase in accuracy over our improved VGG-11 model, giving a top-5 center-crop validation accuracy of 89.7% while reducing computation by 16% relative to the original VGG-11 model. Applying our method to the GoogLeNet architecture for ILSVRC, we achieved comparable accuracy with 26% less compute and 41% fewer model parameters. Applying our method to a near state-of-the-art network for CIFAR, we achieved comparable accuracy with 46% less compute and 55% fewer parameters.",
    "Layer-sequential unit-variance (LSUV) initialization - a simple method for weight initialization for deep net learning - is proposed. The method consists of the two steps. First, pre-initialize weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one.   Experiment with different activation functions (maxout, ReLU-family, tanh) show that the proposed initialization leads to learning of very deep nets that (i) produces networks with test accuracy better or equal to standard methods and (ii) is at least as fast as the complex schemes proposed specifically for very deep nets such as FitNets (Romero et al. (2015)) and Highway (Srivastava et al. (2015)).   Performance is evaluated on GoogLeNet, CaffeNet, FitNets and Residual nets and the state-of-the-art, or very close to it, is achieved on the MNIST, CIFAR-10/100 and ImageNet datasets.",
    "This paper builds off recent work from Kiperwasser & Goldberg (2016) using neural attention in a simple graph-based dependency parser. We use a larger but more thoroughly regularized parser than other recent BiLSTM-based approaches, with biaffine classifiers to predict arcs and labels. Our parser gets state of the art or near state of the art performance on standard treebanks for six different languages, achieving 95.7% UAS and 94.1% LAS on the most popular English PTB dataset. This makes it the highest-performing graph-based parser on this benchmark---outperforming Kiperwasser Goldberg (2016) by 1.8% and 2.2%---and comparable to the highest performing transition-based parser (Kuncoro et al., 2016), which achieves 95.8% UAS and 94.6% LAS. We also show which hyperparameter choices had a significant effect on parsing accuracy, allowing us to achieve large gains over other graph-based approaches.",
    "Accurate representational learning of both the explicit and implicit relationships within data is critical to the ability of machines to perform more complex and abstract reasoning tasks. We describe the efficient weakly supervised learning of such inferences by our Dynamic Adaptive Network Intelligence (DANI) model. We report state-of-the-art results for DANI over question answering tasks in the bAbI dataset that have proved difficult for contemporary approaches to learning representation (Weston et al., 2015).",
    "Spherical data is found in many applications. By modeling the discretized sphere as a graph, we can accommodate non-uniformly distributed, partial, and changing samplings. Moreover, graph convolutions are computationally more efficient than spherical convolutions. As equivariance is desired to exploit rotational symmetries, we discuss how to approach rotation equivariance using the graph neural network introduced in Defferrard et al. (2016). Experiments show good performance on rotation-invariant learning problems. Code and examples are available at https://github.com/SwissDataScienceCenter/DeepSphere",
    "High computational complexity hinders the widespread usage of Convolutional Neural Networks (CNNs), especially in mobile devices. Hardware accelerators are arguably the most promising approach for reducing both execution time and power consumption. One of the most important steps in accelerator development is hardware-oriented model approximation. In this paper we present Ristretto, a model approximation framework that analyzes a given CNN with respect to numerical resolution used in representing weights and outputs of convolutional and fully connected layers. Ristretto can condense models by using fixed point arithmetic and representation instead of floating point. Moreover, Ristretto fine-tunes the resulting fixed point network. Given a maximum error tolerance of 1%, Ristretto can successfully condense CaffeNet and SqueezeNet to 8-bit. The code for Ristretto is available.",
    "The diversity of painting styles represents a rich visual vocabulary for the construction of an image. The degree to which one may learn and parsimoniously capture this visual vocabulary measures our understanding of the higher level features of paintings, if not images in general. In this work we investigate the construction of a single, scalable deep network that can parsimoniously capture the artistic style of a diversity of paintings. We demonstrate that such a network generalizes across a diversity of artistic styles by reducing a painting to a point in an embedding space. Importantly, this model permits a user to explore new painting styles by arbitrarily combining the styles learned from individual paintings. We hope that this work provides a useful step towards building rich models of paintings and offers a window on to the structure of the learned representation of artistic style.",
    "Sum-Product Networks (SPNs) are a class of expressive yet tractable hierarchical graphical models. LearnSPN is a structure learning algorithm for SPNs that uses hierarchical co-clustering to simultaneously identifying similar entities and similar features. The original LearnSPN algorithm assumes that all the variables are discrete and there is no missing data. We introduce a practical, simplified version of LearnSPN, MiniSPN, that runs faster and can handle missing data and heterogeneous features common in real applications. We demonstrate the performance of MiniSPN on standard benchmark datasets and on two datasets from Google's Knowledge Graph exhibiting high missingness rates and a mix of discrete and continuous features.",
    "Recent research on deep neural networks has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple DNN architectures that achieve that accuracy level. With equivalent accuracy, smaller DNN architectures offer at least three advantages: (1) Smaller DNNs require less communication across servers during distributed training. (2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car. (3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory. To provide all of these advantages, we propose a small DNN architecture called SqueezeNet. SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Additionally, with model compression techniques we are able to compress SqueezeNet to less than 0.5MB (510x smaller than AlexNet).   The SqueezeNet architecture is available for download here: https://github.com/DeepScale/SqueezeNet",
    "In this paper, we study the problem of question answering when reasoning over multiple facts is required. We propose Query-Reduction Network (QRN), a variant of Recurrent Neural Network (RNN) that effectively handles both short-term (local) and long-term (global) sequential dependencies to reason over multiple facts. QRN considers the context sentences as a sequence of state-changing triggers, and reduces the original query to a more informed query as it observes each trigger (context sentence) through time. Our experiments show that QRN produces the state-of-the-art results in bAbI QA and dialog tasks, and in a real goal-oriented dialog dataset. In addition, QRN formulation allows parallelization on RNN's time axis, saving an order of magnitude in time complexity for training and inference.",
    "We propose a language-agnostic way of automatically generating sets of semantically similar clusters of entities along with sets of \"outlier\" elements, which may then be used to perform an intrinsic evaluation of word embeddings in the outlier detection task. We used our methodology to create a gold-standard dataset, which we call WikiSem500, and evaluated multiple state-of-the-art embeddings. The results show a correlation between performance on this dataset and performance on sentiment analysis.",
    "Recurrent neural nets are widely used for predicting temporal data. Their inherent deep feedforward structure allows learning complex sequential patterns. It is believed that top-down feedback might be an important missing ingredient which in theory could help disambiguate similar patterns depending on broader context. In this paper we introduce surprisal-driven recurrent networks, which take into account past error information when making new predictions. This is achieved by continuously monitoring the discrepancy between most recent predictions and the actual observations. Furthermore, we show that it outperforms other stochastic and fully deterministic approaches on enwik8 character level prediction task achieving 1.37 BPC on the test portion of the text.",
    "Although Generative Adversarial Networks achieve state-of-the-art results on a variety of generative tasks, they are regarded as highly unstable and prone to miss modes. We argue that these bad behaviors of GANs are due to the very particular functional shape of the trained discriminators in high dimensional spaces, which can easily make training stuck or push probability mass in the wrong direction, towards that of higher concentration than that of the data generating distribution. We introduce several ways of regularizing the objective, which can dramatically stabilize the training of GAN models. We also show that our regularizers can help the fair distribution of probability mass across the modes of the data generating distribution, during the early phases of training and thus providing a unified solution to the missing modes problem.",
    "Sample complexity and safety are major challenges when learning policies with reinforcement learning for real-world tasks, especially when the policies are represented using rich function approximators like deep neural networks. Model-based methods where the real-world target domain is approximated using a simulated source domain provide an avenue to tackle the above challenges by augmenting real data with simulated data. However, discrepancies between the simulated source domain and the target domain pose a challenge for simulated training. We introduce the EPOpt algorithm, which uses an ensemble of simulated source domains and a form of adversarial training to learn policies that are robust and generalize to a broad range of possible target domains, including unmodeled effects. Further, the probability distribution over source domains in the ensemble can be adapted using data from target domain and approximate Bayesian methods, to progressively make it a better approximation. Thus, learning on a model ensemble, along with source domain adaptation, provides the benefit of both robustness and learning/adaptation.",
    "We introduce Divnet, a flexible technique for learning networks with diverse neurons. Divnet models neuronal diversity by placing a Determinantal Point Process (DPP) over neurons in a given layer. It uses this DPP to select a subset of diverse neurons and subsequently fuses the redundant neurons into the selected ones. Compared with previous approaches, Divnet offers a more principled, flexible technique for capturing neuronal diversity and thus implicitly enforcing regularization. This enables effective auto-tuning of network architecture and leads to smaller network sizes without hurting performance. Moreover, through its focus on diversity and neuron fusing, Divnet remains compatible with other procedures that seek to reduce memory footprints of networks. We present experimental results to corroborate our claims: for pruning neural networks, Divnet is seen to be notably superior to competing approaches.",
    "The efficiency of graph-based semi-supervised algorithms depends on the graph of instances on which they are applied. The instances are often in a vectorial form before a graph linking them is built. The construction of the graph relies on a metric over the vectorial space that help define the weight of the connection between entities. The classic choice for this metric is usually a distance measure or a similarity measure based on the euclidean norm. We claim that in some cases the euclidean norm on the initial vectorial space might not be the more appropriate to solve the task efficiently. We propose an algorithm that aims at learning the most appropriate vectorial representation for building a graph on which the task at hand is solved efficiently.",
    "One major challenge in training Deep Neural Networks is preventing overfitting. Many techniques such as data augmentation and novel regularizers such as Dropout have been proposed to prevent overfitting without requiring a massive amount of training data. In this work, we propose a new regularizer called DeCov which leads to significantly reduced overfitting (as indicated by the difference between train and val performance), and better generalization. Our regularizer encourages diverse or non-redundant representations in Deep Neural Networks by minimizing the cross-covariance of hidden activations. This simple intuition has been explored in a number of past works but surprisingly has never been applied as a regularizer in supervised learning. Experiments across a range of datasets and network architectures show that this loss always reduces overfitting while almost always maintaining or increasing generalization performance and often improving performance over Dropout.",
    "Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.",
    "We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs. We motivate the choice of our convolutional architecture via a localized first-order approximation of spectral graph convolutions. Our model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes. In a number of experiments on citation networks and on a knowledge graph dataset we demonstrate that our approach outperforms related methods by a significant margin.",
    "We introduce the \"Energy-based Generative Adversarial Network\" model (EBGAN) which views the discriminator as an energy function that attributes low energies to the regions near the data manifold and higher energies to other regions. Similar to the probabilistic GANs, a generator is seen as being trained to produce contrastive samples with minimal energies, while the discriminator is trained to assign high energies to these generated samples. Viewing the discriminator as an energy function allows to use a wide variety of architectures and loss functionals in addition to the usual binary classifier with logistic output. Among them, we show one instantiation of EBGAN framework as using an auto-encoder architecture, with the energy being the reconstruction error, in place of the discriminator. We show that this form of EBGAN exhibits more stable behavior than regular GANs during training. We also show that a single-scale architecture can be trained to generate high-resolution images.",
    "Recent research in the deep learning field has produced a plethora of new architectures. At the same time, a growing number of groups are applying deep learning to new applications. Some of these groups are likely to be composed of inexperienced deep learning practitioners who are baffled by the dizzying array of architecture choices and therefore opt to use an older architecture (i.e., Alexnet). Here we attempt to bridge this gap by mining the collective knowledge contained in recent deep learning research to discover underlying principles for designing neural network architectures. In addition, we describe several architectural innovations, including Fractal of FractalNet network, Stagewise Boosting Networks, and Taylor Series Networks (our Caffe code and prototxt files is available at https://github.com/iPhysicist/CNNDesignPatterns). We hope others are inspired to build on our preliminary work.",
    "Machine comprehension (MC), answering a query about a given context paragraph, requires modeling complex interactions between the context and the query. Recently, attention mechanisms have been successfully extended to MC. Typically these methods use attention to focus on a small portion of the context and summarize it with a fixed-size vector, couple attentions temporally, and/or often form a uni-directional attention. In this paper we introduce the Bi-Directional Attention Flow (BIDAF) network, a multi-stage hierarchical process that represents the context at different levels of granularity and uses bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization. Our experimental evaluations show that our model achieves the state-of-the-art results in Stanford Question Answering Dataset (SQuAD) and CNN/DailyMail cloze test.",
    "Though with progress, model learning and performing posterior inference still remains a common challenge for using deep generative models, especially for handling discrete hidden variables. This paper is mainly concerned with algorithms for learning Helmholz machines, which is characterized by pairing the generative model with an auxiliary inference model. A common drawback of previous learning algorithms is that they indirectly optimize some bounds of the targeted marginal log-likelihood. In contrast, we successfully develop a new class of algorithms, based on stochastic approximation (SA) theory of the Robbins-Monro type, to directly optimize the marginal log-likelihood and simultaneously minimize the inclusive KL-divergence. The resulting learning algorithm is thus called joint SA (JSA). Moreover, we construct an effective MCMC operator for JSA. Our results on the MNIST datasets demonstrate that the JSA's performance is consistently superior to that of competing algorithms like RWS, for learning a range of difficult models.",
    "Object detection with deep neural networks is often performed by passing a few thousand candidate bounding boxes through a deep neural network for each image. These bounding boxes are highly correlated since they originate from the same image. In this paper we investigate how to exploit feature occurrence at the image scale to prune the neural network which is subsequently applied to all bounding boxes. We show that removing units which have near-zero activation in the image allows us to significantly reduce the number of parameters in the network. Results on the PASCAL 2007 Object Detection Challenge demonstrate that up to 40% of units in some fully-connected layers can be entirely eliminated with little change in the detection result.",
    "Modeling interactions between features improves the performance of machine learning solutions in many domains (e.g. recommender systems or sentiment analysis). In this paper, we introduce Exponential Machines (ExM), a predictor that models all interactions of every order. The key idea is to represent an exponentially large tensor of parameters in a factorized format called Tensor Train (TT). The Tensor Train format regularizes the model and lets you control the number of underlying parameters. To train the model, we develop a stochastic Riemannian optimization procedure, which allows us to fit tensors with 2^160 entries. We show that the model achieves state-of-the-art performance on synthetic data with high-order interactions and that it works on par with high-order factorization machines on a recommender system dataset MovieLens 100K.",
    "We introduce Deep Variational Bayes Filters (DVBF), a new method for unsupervised learning and identification of latent Markovian state space models. Leveraging recent advances in Stochastic Gradient Variational Bayes, DVBF can overcome intractable inference distributions via variational inference. Thus, it can handle highly nonlinear input data with temporal and spatial dependencies such as image sequences without domain knowledge. Our experiments show that enabling backpropagation through transitions enforces state space assumptions and significantly improves information content of the latent embedding. This also enables realistic long-term prediction.",
    "Traditional dialog systems used in goal-oriented applications require a lot of domain-specific handcrafting, which hinders scaling up to new domains. End-to-end dialog systems, in which all components are trained from the dialogs themselves, escape this limitation. But the encouraging success recently obtained in chit-chat dialog may not carry over to goal-oriented settings. This paper proposes a testbed to break down the strengths and shortcomings of end-to-end dialog systems in goal-oriented applications. Set in the context of restaurant reservation, our tasks require manipulating sentences and symbols, so as to properly conduct conversations, issue API calls and use the outputs of such calls. We show that an end-to-end dialog system based on Memory Networks can reach promising, yet imperfect, performance and learn to perform non-trivial operations. We confirm those results by comparing our system to a hand-crafted slot-filling baseline on data from the second Dialog State Tracking Challenge (Henderson et al., 2014a). We show similar result patterns on data extracted from an online concierge service.",
    "Adversarial training provides a means of regularizing supervised learning algorithms while virtual adversarial training is able to extend supervised learning algorithms to the semi-supervised setting. However, both methods require making small perturbations to numerous entries of the input vector, which is inappropriate for sparse high-dimensional inputs such as one-hot word representations. We extend adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself. The proposed method achieves state of the art results on multiple benchmark semi-supervised and purely supervised tasks. We provide visualizations and analysis showing that the learned word embeddings have improved in quality and that while training, the model is less prone to overfitting. Code is available at https://github.com/tensorflow/models/tree/master/research/adversarial_text.",
    "Unsupervised learning of probabilistic models is a central yet challenging problem in machine learning. Specifically, designing models with tractable learning, sampling, inference and evaluation is crucial in solving this task. We extend the space of such models using real-valued non-volume preserving (real NVP) transformations, a set of powerful invertible and learnable transformations, resulting in an unsupervised learning algorithm with exact log-likelihood computation, exact sampling, exact inference of latent variables, and an interpretable latent space. We demonstrate its ability to model natural images on four datasets through sampling, log-likelihood evaluation and latent variable manipulations.",
    "This paper is focused on studying the view-manifold structure in the feature spaces implied by the different layers of Convolutional Neural Networks (CNN). There are several questions that this paper aims to answer: Does the learned CNN representation achieve viewpoint invariance? How does it achieve viewpoint invariance? Is it achieved by collapsing the view manifolds, or separating them while preserving them? At which layer is view invariance achieved? How can the structure of the view manifold at each layer of a deep convolutional neural network be quantified experimentally? How does fine-tuning of a pre-trained CNN on a multi-view dataset affect the representation at each layer of the network? In order to answer these questions we propose a methodology to quantify the deformation and degeneracy of view manifolds in CNN layers. We apply this methodology and report interesting results in this paper that answer the aforementioned questions.",
    "Bilinear models provide rich representations compared with linear models. They have been applied in various visual tasks, such as object recognition, segmentation, and visual question-answering, to get state-of-the-art performances taking advantage of the expanded representations. However, bilinear representations tend to be high-dimensional, limiting the applicability to computationally complex tasks. We propose low-rank bilinear pooling using Hadamard product for an efficient attention mechanism of multimodal learning. We show that our model outperforms compact bilinear pooling in visual question-answering tasks with the state-of-the-art results on the VQA dataset, having a better parsimonious property.",
    "The standard interpretation of importance-weighted autoencoders is that they maximize a tighter lower bound on the marginal likelihood than the standard evidence lower bound. We give an alternate interpretation of this procedure: that it optimizes the standard variational lower bound, but using a more complex distribution. We formally derive this result, present a tighter lower bound, and visualize the implicit importance-weighted distribution.",
    "We present a generalization bound for feedforward neural networks in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The generalization bound is derived using a PAC-Bayes analysis.",
    "In this paper, we propose to equip Generative Adversarial Networks with the ability to produce direct energy estimates for samples.Specifically, we propose a flexible adversarial training framework, and prove this framework not only ensures the generator converges to the true data distribution, but also enables the discriminator to retain the density information at the global optimal. We derive the analytic form of the induced solution, and analyze the properties. In order to make the proposed framework trainable in practice, we introduce two effective approximation techniques. Empirically, the experiment results closely match our theoretical analysis, verifying the discriminator is able to recover the energy of data distribution.",
    "In this work we perform outlier detection using ensembles of neural networks obtained by variational approximation of the posterior in a Bayesian neural network setting. The variational parameters are obtained by sampling from the true posterior by gradient descent. We show our outlier detection results are comparable to those obtained using other efficient ensembling methods.",
    "We present two simple ways of reducing the number of parameters and accelerating the training of large Long Short-Term Memory (LSTM) networks: the first one is \"matrix factorization by design\" of LSTM matrix into the product of two smaller matrices, and the second one is partitioning of LSTM matrix, its inputs and states into the independent groups. Both approaches allow us to train large LSTM networks significantly faster to the near state-of the art perplexity while using significantly less RNN parameters.",
    "We present observations and discussion of previously unreported phenomena discovered while training residual networks. The goal of this work is to better understand the nature of neural networks through the examination of these new empirical results. These behaviors were identified through the application of Cyclical Learning Rates (CLR) and linear network interpolation. Among these behaviors are counterintuitive increases and decreases in training loss and instances of rapid training. For example, we demonstrate how CLR can produce greater testing accuracy than traditional training despite using large learning rates. Files to replicate these results are available at https://github.com/lnsmith54/exploring-loss",
    "Machine learning models are often used at test-time subject to constraints and trade-offs not present at training-time. For example, a computer vision model operating on an embedded device may need to perform real-time inference, or a translation model operating on a cell phone may wish to bound its average compute time in order to be power-efficient. In this work we describe a mixture-of-experts model and show how to change its test-time resource-usage on a per-input basis using reinforcement learning. We test our method on a small MNIST-based example.",
    "Adversarial examples have been shown to exist for a variety of deep learning architectures. Deep reinforcement learning has shown promising results on training agent policies directly on raw inputs such as image pixels. In this paper we present a novel study into adversarial attacks on deep reinforcement learning polices. We compare the effectiveness of the attacks using adversarial examples vs. random noise. We present a novel method for reducing the number of times adversarial examples need to be injected for a successful attack, based on the value function. We further explore how re-training on random noise and FGSM perturbations affects the resilience against adversarial examples.",
    "This paper develops variational continual learning (VCL), a simple but general framework for continual learning that fuses online variational inference (VI) and recent advances in Monte Carlo VI for neural networks. The framework can successfully train both deep discriminative models and deep generative models in complex continual learning settings where existing tasks evolve over time and entirely new tasks emerge. Experimental results show that VCL outperforms state-of-the-art continual learning methods on a variety of tasks, avoiding catastrophic forgetting in a fully automatic way.",
    "Automatically determining the optimal size of a neural network for a given task without prior information currently requires an expensive global search and training many networks from scratch. In this paper, we address the problem of automatically finding a good network size during a single training cycle. We introduce *nonparametric neural networks*, a non-probabilistic framework for conducting optimization over all possible network sizes and prove its soundness when network growth is limited via an L_p penalty. We train networks under this framework by continuously adding new units while eliminating redundant units via an L_2 penalty. We employ a novel optimization algorithm, which we term *adaptive radial-angular gradient descent* or *AdaRad*, and obtain promising results.",
    "Natural Language Inference (NLI) task requires an agent to determine the logical relationship between a natural language premise and a natural language hypothesis. We introduce Interactive Inference Network (IIN), a novel class of neural network architectures that is able to achieve high-level understanding of the sentence pair by hierarchically extracting semantic features from interaction space. We show that an interaction tensor (attention weight) contains semantic information to solve natural language inference, and a denser interaction tensor contains richer semantic information. One instance of such architecture, Densely Interactive Inference Network (DIIN), demonstrates the state-of-the-art performance on large scale NLI copora and large-scale NLI alike corpus. It's noteworthy that DIIN achieve a greater than 20% error reduction on the challenging Multi-Genre NLI (MultiNLI) dataset with respect to the strongest published system.",
    "The ability to deploy neural networks in real-world, safety-critical systems is severely limited by the presence of adversarial examples: slightly perturbed inputs that are misclassified by the network. In recent years, several techniques have been proposed for increasing robustness to adversarial examples --- and yet most of these have been quickly shown to be vulnerable to future attacks. For example, over half of the defenses proposed by papers accepted at ICLR 2018 have already been broken. We propose to address this difficulty through formal verification techniques. We show how to construct provably minimally distorted adversarial examples: given an arbitrary neural network and input sample, we can construct adversarial examples which we prove are of minimal distortion. Using this approach, we demonstrate that one of the recent ICLR defense proposals, adversarial retraining, provably succeeds at increasing the distortion required to construct adversarial examples by a factor of 4.2.",
    "We extend Stochastic Gradient Variational Bayes to perform posterior inference for the weights of Stick-Breaking processes. This development allows us to define a Stick-Breaking Variational Autoencoder (SB-VAE), a Bayesian nonparametric version of the variational autoencoder that has a latent representation with stochastic dimensionality. We experimentally demonstrate that the SB-VAE, and a semi-supervised variant, learn highly discriminative latent representations that often outperform the Gaussian VAE's.",
    "We propose a framework for training multiple neural networks simultaneously. The parameters from all models are regularised by the tensor trace norm, so that each neural network is encouraged to reuse others' parameters if possible -- this is the main motivation behind multi-task learning. In contrast to many deep multi-task learning models, we do not predefine a parameter sharing strategy by specifying which layers have tied parameters. Instead, our framework considers sharing for all shareable layers, and the sharing strategy is learned in a data-driven way.",
    "This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems. To achieve this, the paper introduces several innovations, including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.",
    "Many machine learning classifiers are vulnerable to adversarial perturbations. An adversarial perturbation modifies an input to change a classifier's prediction without causing the input to seem substantially different to human perception. We deploy three methods to detect adversarial images. Adversaries trying to bypass our detectors must make the adversarial image less pathological or they will fail trying. Our best detection method reveals that adversarial images place abnormal emphasis on the lower-ranked principal components from PCA. Other detectors and a colorful saliency map are in an appendix.",
    "We propose a principled method for kernel learning, which relies on a Fourier-analytic characterization of translation-invariant or rotation-invariant kernels. Our method produces a sequence of feature maps, iteratively refining the SVM margin. We provide rigorous guarantees for optimality and generalization, interpreting our algorithm as online equilibrium-finding dynamics in a certain two-player min-max game. Evaluations on synthetic and real-world datasets demonstrate scalability and consistent improvements over related random features-based methods.",
    "State-of-the-art deep reading comprehension models are dominated by recurrent neural nets. Their sequential nature is a natural fit for language, but it also precludes parallelization within an instances and often becomes the bottleneck for deploying such models to latency critical scenarios. This is particularly problematic for longer texts. Here we present a convolutional architecture as an alternative to these recurrent architectures. Using simple dilated convolutional units in place of recurrent ones, we achieve results comparable to the state of the art on two question answering tasks, while at the same time achieving up to two orders of magnitude speedups for question answering.",
    "This report has several purposes. First, our report is written to investigate the reproducibility of the submitted paper On the regularization of Wasserstein GANs (2018). Second, among the experiments performed in the submitted paper, five aspects were emphasized and reproduced: learning speed, stability, robustness against hyperparameter, estimating the Wasserstein distance, and various sampling method. Finally, we identify which parts of the contribution can be reproduced, and at what cost in terms of resources. All source code for reproduction is open to the public.",
    "Variational Autoencoders (VAEs) were originally motivated (Kingma & Welling, 2014) as probabilistic generative models in which one performs approximate Bayesian inference. The proposal of $\\beta$-VAEs (Higgins et al., 2017) breaks this interpretation and generalizes VAEs to application domains beyond generative modeling (e.g., representation learning, clustering, or lossy data compression) by introducing an objective function that allows practitioners to trade off between the information content (\"bit rate\") of the latent representation and the distortion of reconstructed data (Alemi et al., 2018). In this paper, we reconsider this rate/distortion trade-off in the context of hierarchical VAEs, i.e., VAEs with more than one layer of latent variables. We identify a general class of inference models for which one can split the rate into contributions from each layer, which can then be tuned independently. We derive theoretical bounds on the performance of downstream tasks as functions of the individual layers' rates and verify our theoretical findings in large-scale experiments. Our results provide guidance for practitioners on which region in rate-space to target for a given application.",
    "Methods that learn representations of nodes in a graph play a critical role in network analysis since they enable many downstream learning tasks. We propose Graph2Gauss - an approach that can efficiently learn versatile node embeddings on large scale (attributed) graphs that show strong performance on tasks such as link prediction and node classification. Unlike most approaches that represent nodes as point vectors in a low-dimensional continuous space, we embed each node as a Gaussian distribution, allowing us to capture uncertainty about the representation. Furthermore, we propose an unsupervised method that handles inductive learning scenarios and is applicable to different types of graphs: plain/attributed, directed/undirected. By leveraging both the network structure and the associated node attributes, we are able to generalize to unseen nodes without additional training. To learn the embeddings we adopt a personalized ranking formulation w.r.t. the node distances that exploits the natural ordering of the nodes imposed by the network structure. Experiments on real world networks demonstrate the high performance of our approach, outperforming state-of-the-art network embedding methods on several different tasks. Additionally, we demonstrate the benefits of modeling uncertainty - by analyzing it we can estimate neighborhood diversity and detect the intrinsic latent dimensionality of a graph.",
    "This paper explores the use of self-ensembling for visual domain adaptation problems. Our technique is derived from the mean teacher variant (Tarvainen et al., 2017) of temporal ensembling (Laine et al;, 2017), a technique that achieved state of the art results in the area of semi-supervised learning. We introduce a number of modifications to their approach for challenging domain adaptation scenarios and evaluate its effectiveness. Our approach achieves state of the art results in a variety of benchmarks, including our winning entry in the VISDA-2017 visual domain adaptation challenge. In small image benchmarks, our algorithm not only outperforms prior art, but can also achieve accuracy that is close to that of a classifier trained in a supervised fashion.",
    "Most machine learning classifiers, including deep neural networks, are vulnerable to adversarial examples. Such inputs are typically generated by adding small but purposeful modifications that lead to incorrect outputs while imperceptible to human eyes. The goal of this paper is not to introduce a single method, but to make theoretical steps towards fully understanding adversarial examples. By using concepts from topology, our theoretical analysis brings forth the key reasons why an adversarial example can fool a classifier ($f_1$) and adds its oracle ($f_2$, like human eyes) in such analysis. By investigating the topological relationship between two (pseudo)metric spaces corresponding to predictor $f_1$ and oracle $f_2$, we develop necessary and sufficient conditions that can determine if $f_1$ is always robust (strong-robust) against adversarial examples according to $f_2$. Interestingly our theorems indicate that just one unnecessary feature can make $f_1$ not strong-robust, and the right feature representation learning is the key to getting a classifier that is both accurate and strong-robust.",
    "We develop a general problem setting for training and testing the ability of agents to gather information efficiently. Specifically, we present a collection of tasks in which success requires searching through a partially-observed environment, for fragments of information which can be pieced together to accomplish various goals. We combine deep architectures with techniques from reinforcement learning to develop agents that solve our tasks. We shape the behavior of these agents by combining extrinsic and intrinsic rewards. We empirically demonstrate that these agents learn to search actively and intelligently for new information to reduce their uncertainty, and to exploit information they have already acquired.",
    "We propose an extension to neural network language models to adapt their prediction to the recent history. Our model is a simplified version of memory augmented networks, which stores past hidden activations as memory and accesses them through a dot product with the current hidden activation. This mechanism is very efficient and scales to very large memory sizes. We also draw a link between the use of external memory in neural network and cache models used with count based language models. We demonstrate on several language model datasets that our approach performs significantly better than recent memory augmented networks.",
    "Generative adversarial networks (GANs) are successful deep generative models. GANs are based on a two-player minimax game. However, the objective function derived in the original motivation is changed to obtain stronger gradients when learning the generator. We propose a novel algorithm that repeats the density ratio estimation and f-divergence minimization. Our algorithm offers a new perspective toward the understanding of GANs and is able to make use of multiple viewpoints obtained in the research of density ratio estimation, e.g. what divergence is stable and relative density ratio is useful.",
    "We present a novel framework for generating pop music. Our model is a hierarchical Recurrent Neural Network, where the layers and the structure of the hierarchy encode our prior knowledge about how pop music is composed. In particular, the bottom layers generate the melody, while the higher levels produce the drums and chords. We conduct several human studies that show strong preference of our generated music over that produced by the recent method by Google. We additionally show two applications of our framework: neural dancing and karaoke, as well as neural story singing.",
    "We look at the eigenvalues of the Hessian of a loss function before and after training. The eigenvalue distribution is seen to be composed of two parts, the bulk which is concentrated around zero, and the edges which are scattered away from zero. We present empirical evidence for the bulk indicating how over-parametrized the system is, and for the edges that depend on the input data.",
    "In this paper, we propose a new feature extraction technique for program execution logs. First, we automatically extract complex patterns from a program's behavior graph. Then, we embed these patterns into a continuous space by training an autoencoder. We evaluate the proposed features on a real-world malicious software detection task. We also find that the embedding space captures interpretable structures in the space of pattern parts.",
    "We compared the efficiency of the FlyHash model, an insect-inspired sparse neural network (Dasgupta et al., 2017), to similar but non-sparse models in an embodied navigation task. This requires a model to control steering by comparing current visual inputs to memories stored along a training route. We concluded the FlyHash model is more efficient than others, especially in terms of data encoding.",
    "In peer review, reviewers are usually asked to provide scores for the papers. The scores are then used by Area Chairs or Program Chairs in various ways in the decision-making process. The scores are usually elicited in a quantized form to accommodate the limited cognitive ability of humans to describe their opinions in numerical values. It has been found that the quantized scores suffer from a large number of ties, thereby leading to a significant loss of information. To mitigate this issue, conferences have started to ask reviewers to additionally provide a ranking of the papers they have reviewed. There are however two key challenges. First, there is no standard procedure for using this ranking information and Area Chairs may use it in different ways (including simply ignoring them), thereby leading to arbitrariness in the peer-review process. Second, there are no suitable interfaces for judicious use of this data nor methods to incorporate it in existing workflows, thereby leading to inefficiencies. We take a principled approach to integrate the ranking information into the scores. The output of our method is an updated score pertaining to each review that also incorporates the rankings. Our approach addresses the two aforementioned challenges by: (i) ensuring that rankings are incorporated into the updates scores in the same manner for all papers, thereby mitigating arbitrariness, and (ii) allowing to seamlessly use existing interfaces and workflows designed for scores. We empirically evaluate our method on synthetic datasets as well as on peer reviews from the ICLR 2017 conference, and find that it reduces the error by approximately 30% as compared to the best performing baseline on the ICLR 2017 data.",
    "Many recent studies have probed status bias in the peer-review process of academic journals and conferences. In this article, we investigated the association between author metadata and area chairs' final decisions (Accept/Reject) using our compiled database of 5,313 borderline submissions to the International Conference on Learning Representations (ICLR) from 2017 to 2022. We carefully defined elements in a cause-and-effect analysis, including the treatment and its timing, pre-treatment variables, potential outcomes and causal null hypothesis of interest, all in the context of study units being textual data and under Neyman and Rubin's potential outcomes (PO) framework. We found some weak evidence that author metadata was associated with articles' final decisions. We also found that, under an additional stability assumption, borderline articles from high-ranking institutions (top-30% or top-20%) were less favored by area chairs compared to their matched counterparts. The results were consistent in two different matched designs (odds ratio = 0.82 [95% CI: 0.67 to 1.00] in a first design and 0.83 [95% CI: 0.64 to 1.07] in a strengthened design). We discussed how to interpret these results in the context of multiple interactions between a study unit and different agents (reviewers and area chairs) in the peer-review system.",
    "We present a variational approximation to the information bottleneck of Tishby et al. (1999). This variational approach allows us to parameterize the information bottleneck model using a neural network and leverage the reparameterization trick for efficient training. We call this method \"Deep Variational Information Bottleneck\", or Deep VIB. We show that models trained with the VIB objective outperform those that are trained with other forms of regularization, in terms of generalization performance and robustness to adversarial attack.",
    "Attention networks have proven to be an effective approach for embedding categorical inference within a deep neural network. However, for many tasks we may want to model richer structural dependencies without abandoning end-to-end training. In this work, we experiment with incorporating richer structural distributions, encoded using graphical models, within deep networks. We show that these structured attention networks are simple extensions of the basic attention procedure, and that they allow for extending attention beyond the standard soft-selection approach, such as attending to partial segmentations or to subtrees. We experiment with two different classes of structured attention networks: a linear-chain conditional random field and a graph-based parsing model, and describe how these models can be practically implemented as neural network layers. Experiments show that this approach is effective for incorporating structural biases, and structured attention networks outperform baseline attention models on a variety of synthetic and real tasks: tree transduction, neural machine translation, question answering, and natural language inference. We further find that models trained in this way learn interesting unsupervised hidden representations that generalize simple attention.",
    "We are proposing to use an ensemble of diverse specialists, where speciality is defined according to the confusion matrix. Indeed, we observed that for adversarial instances originating from a given class, labeling tend to be done into a small subset of (incorrect) classes. Therefore, we argue that an ensemble of specialists should be better able to identify and reject fooling instances, with a high entropy (i.e., disagreement) over the decisions in the presence of adversaries. Experimental results obtained confirm that interpretation, opening a way to make the system more robust to adversarial examples through a rejection mechanism, rather than trying to classify them properly at any cost.",
    "In this paper, we present Neural Phrase-based Machine Translation (NPMT). Our method explicitly models the phrase structures in output sequences using Sleep-WAke Networks (SWAN), a recently proposed segmentation-based sequence modeling method. To mitigate the monotonic alignment requirement of SWAN, we introduce a new layer to perform (soft) local reordering of input sequences. Different from existing neural machine translation (NMT) approaches, NPMT does not use attention-based decoding mechanisms. Instead, it directly outputs phrases in a sequential order and can decode in linear time. Our experiments show that NPMT achieves superior performances on IWSLT 2014 German-English/English-German and IWSLT 2015 English-Vietnamese machine translation tasks compared with strong NMT baselines. We also observe that our method produces meaningful phrases in output languages.",
    "We present LR-GAN: an adversarial image generation model which takes scene structure and context into account. Unlike previous generative adversarial networks (GANs), the proposed GAN learns to generate image background and foregrounds separately and recursively, and stitch the foregrounds on the background in a contextually relevant manner to produce a complete natural image. For each foreground, the model learns to generate its appearance, shape and pose. The whole model is unsupervised, and is trained in an end-to-end manner with gradient descent methods. The experiments demonstrate that LR-GAN can generate more natural images with objects that are more human recognizable than DCGAN.",
    "We describe a simple scheme that allows an agent to learn about its environment in an unsupervised manner. Our scheme pits two versions of the same agent, Alice and Bob, against one another. Alice proposes a task for Bob to complete; and then Bob attempts to complete the task. In this work we will focus on two kinds of environments: (nearly) reversible environments and environments that can be reset. Alice will \"propose\" the task by doing a sequence of actions and then Bob must undo or repeat them, respectively. Via an appropriate reward structure, Alice and Bob automatically generate a curriculum of exploration, enabling unsupervised training of the agent. When Bob is deployed on an RL task within the environment, this unsupervised training reduces the number of supervised episodes needed to learn, and in some cases converges to a higher reward.",
    "Maximum entropy modeling is a flexible and popular framework for formulating statistical models given partial knowledge. In this paper, rather than the traditional method of optimizing over the continuous density directly, we learn a smooth and invertible transformation that maps a simple distribution to the desired maximum entropy distribution. Doing so is nontrivial in that the objective being maximized (entropy) is a function of the density itself. By exploiting recent developments in normalizing flow networks, we cast the maximum entropy problem into a finite-dimensional constrained optimization, and solve the problem by combining stochastic optimization with the augmented Lagrangian method. Simulation results demonstrate the effectiveness of our method, and applications to finance and computer vision show the flexibility and accuracy of using maximum entropy flow networks.",
    "With machine learning successfully applied to new daunting problems almost every day, general AI starts looking like an attainable goal. However, most current research focuses instead on important but narrow applications, such as image classification or machine translation. We believe this to be largely due to the lack of objective ways to measure progress towards broad machine intelligence. In order to fill this gap, we propose here a set of concrete desiderata for general AI, together with a platform to test machines on how well they satisfy such desiderata, while keeping all further complexities to a minimum.",
    "Neural networks that compute over graph structures are a natural fit for problems in a variety of domains, including natural language (parse trees) and cheminformatics (molecular graphs). However, since the computation graph has a different shape and size for every input, such networks do not directly support batched training or inference. They are also difficult to implement in popular deep learning libraries, which are based on static data-flow graphs. We introduce a technique called dynamic batching, which not only batches together operations between different input graphs of dissimilar shape, but also between different nodes within a single input graph. The technique allows us to create static graphs, using popular libraries, that emulate dynamic computation graphs of arbitrary shape and size. We further present a high-level library of compositional blocks that simplifies the creation of dynamic graph models. Using the library, we demonstrate concise and batch-wise parallel implementations for a variety of models from the literature.",
    "Although deep learning models have proven effective at solving problems in natural language processing, the mechanism by which they come to their conclusions is often unclear. As a result, these models are generally treated as black boxes, yielding no insight of the underlying learned patterns. In this paper we consider Long Short Term Memory networks (LSTMs) and demonstrate a new approach for tracking the importance of a given input to the LSTM for a given output. By identifying consistently important patterns of words, we are able to distill state of the art LSTMs on sentiment analysis and question answering into a set of representative phrases. This representation is then quantitatively validated by using the extracted phrases to construct a simple, rule-based classifier which approximates the output of the LSTM.",
    "Deep reinforcement learning has achieved many impressive results in recent years. However, tasks with sparse rewards or long horizons continue to pose significant challenges. To tackle these important problems, we propose a general framework that first learns useful skills in a pre-training environment, and then leverages the acquired skills for learning faster in downstream tasks. Our approach brings together some of the strengths of intrinsic motivation and hierarchical methods: the learning of useful skill is guided by a single proxy reward, the design of which requires very minimal domain knowledge about the downstream tasks. Then a high-level policy is trained on top of these skills, providing a significant improvement of the exploration and allowing to tackle sparse rewards in the downstream tasks. To efficiently pre-train a large span of skills, we use Stochastic Neural Networks combined with an information-theoretic regularizer. Our experiments show that this combination is effective in learning a wide span of interpretable skills in a sample-efficient way, and can significantly boost the learning performance uniformly across a wide range of downstream tasks.",
    "Deep generative models have achieved impressive success in recent years. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), as emerging families for generative model learning, have largely been considered as two distinct paradigms and received extensive independent studies respectively. This paper aims to establish formal connections between GANs and VAEs through a new formulation of them. We interpret sample generation in GANs as performing posterior inference, and show that GANs and VAEs involve minimizing KL divergences of respective posterior and inference distributions with opposite directions, extending the two learning phases of classic wake-sleep algorithm, respectively. The unified view provides a powerful tool to analyze a diverse set of existing model variants, and enables to transfer techniques across research lines in a principled way. For example, we apply the importance weighting method in VAE literatures for improved GAN learning, and enhance VAEs with an adversarial mechanism that leverages generated samples. Experiments show generality and effectiveness of the transferred techniques.",
    "We consider the problem of detecting out-of-distribution images in neural networks. We propose ODIN, a simple and effective method that does not require any change to a pre-trained neural network. Our method is based on the observation that using temperature scaling and adding small perturbations to the input can separate the softmax score distributions between in- and out-of-distribution images, allowing for more effective detection. We show in a series of experiments that ODIN is compatible with diverse network architectures and datasets. It consistently outperforms the baseline approach by a large margin, establishing a new state-of-the-art performance on this task. For example, ODIN reduces the false positive rate from the baseline 34.7% to 4.3% on the DenseNet (applied to CIFAR-10) when the true positive rate is 95%.",
    "A framework is presented for unsupervised learning of representations based on infomax principle for large-scale neural populations. We use an asymptotic approximation to the Shannon's mutual information for a large neural population to demonstrate that a good initial approximation to the global information-theoretic optimum can be obtained by a hierarchical infomax method. Starting from the initial solution, an efficient algorithm based on gradient descent of the final objective function is proposed to learn representations from the input datasets, and the method works for complete, overcomplete, and undercomplete bases. As confirmed by numerical experiments, our method is robust and highly efficient for extracting salient features from input datasets. Compared with the main existing methods, our algorithm has a distinct advantage in both the training speed and the robustness of unsupervised representation learning. Furthermore, the proposed method is easily extended to the supervised or unsupervised model for training deep structure networks.",
    "Recurrent Neural Networks (RNNs) continue to show outstanding performance in sequence modeling tasks. However, training RNNs on long sequences often face challenges like slow inference, vanishing gradients and difficulty in capturing long term dependencies. In backpropagation through time settings, these issues are tightly coupled with the large, sequential computational graph resulting from unfolding the RNN in time. We introduce the Skip RNN model which extends existing RNN models by learning to skip state updates and shortens the effective size of the computational graph. This model can also be encouraged to perform fewer state updates through a budget constraint. We evaluate the proposed model on various tasks and show how it can reduce the number of required RNN updates while preserving, and sometimes even improving, the performance of the baseline RNN models. Source code is publicly available at https://imatge-upc.github.io/skiprnn-2017-telecombcn/ .",
    "Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial warm restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence in accelerated gradient schemes to deal with ill-conditioned functions. In this paper, we propose a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks. We empirically study its performance on the CIFAR-10 and CIFAR-100 datasets, where we demonstrate new state-of-the-art results at 3.14% and 16.21%, respectively. We also demonstrate its advantages on a dataset of EEG recordings and on a downsampled version of the ImageNet dataset. Our source code is available at https://github.com/loshchil/SGDR",
    "Policy gradient methods have achieved remarkable successes in solving challenging reinforcement learning problems. However, it still often suffers from the large variance issue on policy gradient estimation, which leads to poor sample efficiency during training. In this work, we propose a control variate method to effectively reduce variance for policy gradient methods. Motivated by the Stein's identity, our method extends the previous control variate methods used in REINFORCE and advantage actor-critic by introducing more general action-dependent baseline functions. Empirical studies show that our method significantly improves the sample efficiency of the state-of-the-art policy gradient approaches.",
    "Skip connections made the training of very deep networks possible and have become an indispensable component in a variety of neural architectures. A completely satisfactory explanation for their success remains elusive. Here, we present a novel explanation for the benefits of skip connections in training very deep networks. The difficulty of training deep networks is partly due to the singularities caused by the non-identifiability of the model. Several such singularities have been identified in previous works: (i) overlap singularities caused by the permutation symmetry of nodes in a given layer, (ii) elimination singularities corresponding to the elimination, i.e. consistent deactivation, of nodes, (iii) singularities generated by the linear dependence of the nodes. These singularities cause degenerate manifolds in the loss landscape that slow down learning. We argue that skip connections eliminate these singularities by breaking the permutation symmetry of nodes, by reducing the possibility of node elimination and by making the nodes less linearly dependent. Moreover, for typical initializations, skip connections move the network away from the \"ghosts\" of these singularities and sculpt the landscape around them to alleviate the learning slow-down. These hypotheses are supported by evidence from simplified models, as well as from experiments with deep networks trained on real-world datasets.",
    "We have tried to reproduce the results of the paper \"Natural Language Inference over Interaction Space\" submitted to ICLR 2018 conference as part of the ICLR 2018 Reproducibility Challenge. Initially, we were not aware that the code was available, so we started to implement the network from scratch. We have evaluated our version of the model on Stanford NLI dataset and reached 86.38% accuracy on the test set, while the paper claims 88.0% accuracy. The main difference, as we understand it, comes from the optimizers and the way model selection is performed.",
    "We have successfully implemented the \"Learn to Pay Attention\" model of attention mechanism in convolutional neural networks, and have replicated the results of the original paper in the categories of image classification and fine-grained recognition.",
    "Computing universal distributed representations of sentences is a fundamental task in natural language processing. We propose a method to learn such representations by encoding the suffixes of word sequences in a sentence and training on the Stanford Natural Language Inference (SNLI) dataset. We demonstrate the effectiveness of our approach by evaluating it on the SentEval benchmark, improving on existing approaches on several transfer tasks.",
    "In many neural models, new features as polynomial functions of existing ones are used to augment representations. Using the natural language inference task as an example, we investigate the use of scaled polynomials of degree 2 and above as matching features. We find that scaling degree 2 features has the highest impact on performance, reducing classification error by 5% in the best models.",
    "We present a generalization bound for feedforward neural networks in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The generalization bound is derived using a PAC-Bayes analysis.",
    "In this work, we investigate Batch Normalization technique and propose its probabilistic interpretation. We propose a probabilistic model and show that Batch Normalization maximazes the lower bound of its marginalized log-likelihood. Then, according to the new probabilistic model, we design an algorithm which acts consistently during train and test. However, inference becomes computationally inefficient. To reduce memory and computational cost, we propose Stochastic Batch Normalization -- an efficient approximation of proper inference procedure. This method provides us with a scalable uncertainty estimation technique. We demonstrate the performance of Stochastic Batch Normalization on popular architectures (including deep convolutional architectures: VGG-like and ResNets) for MNIST and CIFAR-10 datasets.",
    "It is widely believed that the success of deep convolutional networks is based on progressively discarding uninformative variability about the input with respect to the problem at hand. This is supported empirically by the difficulty of recovering images from their hidden representations, in most commonly used network architectures. In this paper we show via a one-to-one mapping that this loss of information is not a necessary condition to learn representations that generalize well on complicated problems, such as ImageNet. Via a cascade of homeomorphic layers, we build the i-RevNet, a network that can be fully inverted up to the final projection onto the classes, i.e. no information is discarded. Building an invertible architecture is difficult, for one, because the local inversion is ill-conditioned, we overcome this by providing an explicit inverse. An analysis of i-RevNets learned representations suggests an alternative explanation for the success of deep networks by a progressive contraction and linear separation with depth. To shed light on the nature of the model learned by the i-RevNet we reconstruct linear interpolations between natural image representations.",
    "Deep latent variable models are powerful tools for representation learning. In this paper, we adopt the deep information bottleneck model, identify its shortcomings and propose a model that circumvents them. To this end, we apply a copula transformation which, by restoring the invariance properties of the information bottleneck method, leads to disentanglement of the features in the latent space. Building on that, we show how this transformation translates to sparsity of the latent space in the new model. We evaluate our method on artificial and real data.",
    "We introduce a variant of the MAC model (Hudson and Manning, ICLR 2018) with a simplified set of equations that achieves comparable accuracy, while training faster. We evaluate both models on CLEVR and CoGenT, and show that, transfer learning with fine-tuning results in a 15 point increase in accuracy, matching the state of the art. Finally, in contrast, we demonstrate that improper fine-tuning can actually reduce a model's accuracy as well.",
    "Adaptive Computation Time for Recurrent Neural Networks (ACT) is one of the most promising architectures for variable computation. ACT adapts to the input sequence by being able to look at each sample more than once, and learn how many times it should do it. In this paper, we compare ACT to Repeat-RNN, a novel architecture based on repeating each sample a fixed number of times. We found surprising results, where Repeat-RNN performs as good as ACT in the selected tasks. Source code in TensorFlow and PyTorch is publicly available at https://imatge-upc.github.io/danifojo-2018-repeatrnn/",
    "Generative adversarial networks (GANs) are able to model the complex highdimensional distributions of real-world data, which suggests they could be effective for anomaly detection. However, few works have explored the use of GANs for the anomaly detection task. We leverage recently developed GAN models for anomaly detection, and achieve state-of-the-art performance on image and network intrusion datasets, while being several hundred-fold faster at test time than the only published GAN-based method.",
    "Natural Language Inference (NLI) task requires an agent to determine the logical relationship between a natural language premise and a natural language hypothesis. We introduce Interactive Inference Network (IIN), a novel class of neural network architectures that is able to achieve high-level understanding of the sentence pair by hierarchically extracting semantic features from interaction space. We show that an interaction tensor (attention weight) contains semantic information to solve natural language inference, and a denser interaction tensor contains richer semantic information. One instance of such architecture, Densely Interactive Inference Network (DIIN), demonstrates the state-of-the-art performance on large scale NLI copora and large-scale NLI alike corpus. It's noteworthy that DIIN achieve a greater than 20% error reduction on the challenging Multi-Genre NLI (MultiNLI) dataset with respect to the strongest published system.",
    "The ability to deploy neural networks in real-world, safety-critical systems is severely limited by the presence of adversarial examples: slightly perturbed inputs that are misclassified by the network. In recent years, several techniques have been proposed for increasing robustness to adversarial examples --- and yet most of these have been quickly shown to be vulnerable to future attacks. For example, over half of the defenses proposed by papers accepted at ICLR 2018 have already been broken. We propose to address this difficulty through formal verification techniques. We show how to construct provably minimally distorted adversarial examples: given an arbitrary neural network and input sample, we can construct adversarial examples which we prove are of minimal distortion. Using this approach, we demonstrate that one of the recent ICLR defense proposals, adversarial retraining, provably succeeds at increasing the distortion required to construct adversarial examples by a factor of 4.2.",
    "Deep neural networks (DNNs) have achieved impressive predictive performance due to their ability to learn complex, non-linear relationships between variables. However, the inability to effectively visualize these relationships has led to DNNs being characterized as black boxes and consequently limited their applications. To ameliorate this problem, we introduce the use of hierarchical interpretations to explain DNN predictions through our proposed method, agglomerative contextual decomposition (ACD). Given a prediction from a trained DNN, ACD produces a hierarchical clustering of the input features, along with the contribution of each cluster to the final prediction. This hierarchy is optimized to identify clusters of features that the DNN learned are predictive. Using examples from Stanford Sentiment Treebank and ImageNet, we show that ACD is effective at diagnosing incorrect predictions and identifying dataset bias. Through human experiments, we demonstrate that ACD enables users both to identify the more accurate of two DNNs and to better trust a DNN's outputs. We also find that ACD's hierarchy is largely robust to adversarial perturbations, implying that it captures fundamental aspects of the input and ignores spurious noise.",
    "In this work, we address the problem of musical timbre transfer, where the goal is to manipulate the timbre of a sound sample from one instrument to match another instrument while preserving other musical content, such as pitch, rhythm, and loudness. In principle, one could apply image-based style transfer techniques to a time-frequency representation of an audio signal, but this depends on having a representation that allows independent manipulation of timbre as well as high-quality waveform generation. We introduce TimbreTron, a method for musical timbre transfer which applies \"image\" domain style transfer to a time-frequency representation of the audio signal, and then produces a high-quality waveform using a conditional WaveNet synthesizer. We show that the Constant Q Transform (CQT) representation is particularly well-suited to convolutional architectures due to its approximate pitch equivariance. Based on human perceptual evaluations, we confirmed that TimbreTron recognizably transferred the timbre while otherwise preserving the musical content, for both monophonic and polyphonic samples.",
    "We consider the task of word-level language modeling and study the possibility of combining hidden-states-based short-term representations with medium-term representations encoded in dynamical weights of a language model. Our work extends recent experiments on language models with dynamically evolving weights by casting the language modeling problem into an online learning-to-learn framework in which a meta-learner is trained by gradient descent to continuously update a language model's weights.",
    "GANs are powerful generative models that are able to model the manifold of natural images. We leverage this property to perform manifold regularization by approximating the Laplacian norm using a Monte Carlo approximation that is easily computed with the GAN. When incorporated into the feature-matching GAN of Improved GAN, we achieve state-of-the-art results for GAN-based semi-supervised learning on the CIFAR-10 dataset, with a method that is significantly easier to implement than competing methods.",
    "We identify a class of over-parameterized deep neural networks with standard activation functions and cross-entropy loss which provably have no bad local valley, in the sense that from any point in parameter space there exists a continuous path on which the cross-entropy loss is non-increasing and gets arbitrarily close to zero. This implies that these networks have no sub-optimal strict local minima.",
    "Visual Question Answering (VQA) models have struggled with counting objects in natural images so far. We identify a fundamental problem due to soft attention in these models as a cause. To circumvent this problem, we propose a neural network component that allows robust counting from object proposals. Experiments on a toy task show the effectiveness of this component and we obtain state-of-the-art accuracy on the number category of the VQA v2 dataset without negatively affecting other categories, even outperforming ensemble models with our single model. On a difficult balanced pair metric, the component gives a substantial improvement in counting over a strong baseline by 6.6%.",
    "One of the challenges in the study of generative adversarial networks is the instability of their training. In this paper, we propose a novel weight normalization technique called spectral normalization to stabilize the training of the discriminator. Our new normalization technique is computationally light and easy to incorporate into existing implementations. We tested the efficacy of spectral normalization on the CIFAR10, STL-10, and ILSVRC2012 datasets, and we experimentally confirmed that spectrally normalized GANs (SN-GANs) are capable of generating images of better or equal quality relative to the previous training stabilization techniques.",
    "Embedding graph nodes into a vector space can allow the use of machine learning to e.g. predict node classes, but the study of node embedding algorithms is immature compared to the natural language processing field because of the diverse nature of graphs. We examine the performance of node embedding algorithms with respect to graph centrality measures that characterize diverse graphs, through systematic experiments with four node embedding algorithms, four or five graph centralities, and six datasets. Experimental results give insights into the properties of node embedding algorithms, which can be a basis for further research on this topic.",
    "We introduce a new dataset of logical entailments for the purpose of measuring models' ability to capture and exploit the structure of logical expressions against an entailment prediction task. We use this task to compare a series of architectures which are ubiquitous in the sequence-processing literature, in addition to a new model class---PossibleWorldNets---which computes entailment as a \"convolution over possible worlds\". Results show that convolutional networks present the wrong inductive bias for this class of problems relative to LSTM RNNs, tree-structured neural networks outperform LSTM RNNs due to their enhanced ability to exploit the syntax of logic, and PossibleWorldNets outperform all benchmarks.",
    "Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance.   We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the \"lottery ticket hypothesis:\" dense, randomly-initialized, feed-forward networks contain subnetworks (\"winning tickets\") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.   We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.",
    "We characterize the singular values of the linear transformation associated with a standard 2D multi-channel convolutional layer, enabling their efficient computation. This characterization also leads to an algorithm for projecting a convolutional layer onto an operator-norm ball. We show that this is an effective regularizer; for example, it improves the test error of a deep residual network using batch normalization on CIFAR-10 from 6.2\\% to 5.3\\%.",
    "Understanding the theoretical properties of deep and locally connected nonlinear networks, such as deep convolutional neural networks (DCNNs), is still a hard problem despite their empirical success. In this paper, we propose a novel theoretical framework for such networks with ReLU nonlinearity. The framework explicitly formulates the data distribution, favors disentangled representations and is compatible with common regularization techniques such as Batch Norm. The framework is built upon a teacher-student setting, by expanding the student forward/backward propagation onto the teacher's computational graph. The resulting model does not impose unrealistic assumptions (e.g., Gaussian inputs, independence of activation, etc.). Our framework could help facilitate theoretical analysis of many practical issues, e.g. overfitting, generalization, disentangled representations in deep networks.",
    "We present Neural Program Search, an algorithm to generate programs from a natural language description and a small number of input/output examples. The algorithm combines methods from the Deep Learning and Program Synthesis fields by designing a rich domain-specific language (DSL) and defining an efficient search algorithm over it, guided by a Seq2Tree model. To evaluate the quality of the approach we also present a semi-synthetic dataset of descriptions with test examples and corresponding programs. We show that our algorithm significantly outperforms a sequence-to-sequence model with attention baseline.",
    "Most state-of-the-art neural machine translation systems, despite being different in architectural skeletons (e.g. recurrence, convolutional), share an indispensable feature: the Attention. However, most existing attention methods are token-based and ignore the importance of phrasal alignments, the key ingredient for the success of phrase-based statistical machine translation. In this paper, we propose novel phrase-based attention methods to model n-grams of tokens as attention entities. We incorporate our phrase-based attentions into the recently proposed Transformer network, and demonstrate that our approach yields improvements of 1.3 BLEU for English-to-German and 0.5 BLEU for German-to-English translation tasks on WMT newstest2014 using WMT'16 training data.",
    "We introduce the problem of learning distributed representations of edits. By combining a \"neural editor\" with an \"edit encoder\", our models learn to represent the salient information of an edit and can be used to apply edits to new inputs. We experiment on natural language and source code edit data. Our evaluation yields promising results that suggest that our neural network models learn to capture the structure and semantics of edits. We hope that this interesting task and data source will inspire other researchers to work further on this problem.",
    "We propose a principled method for kernel learning, which relies on a Fourier-analytic characterization of translation-invariant or rotation-invariant kernels. Our method produces a sequence of feature maps, iteratively refining the SVM margin. We provide rigorous guarantees for optimality and generalization, interpreting our algorithm as online equilibrium-finding dynamics in a certain two-player min-max game. Evaluations on synthetic and real-world datasets demonstrate scalability and consistent improvements over related random features-based methods.",
    "This paper develops variational continual learning (VCL), a simple but general framework for continual learning that fuses online variational inference (VI) and recent advances in Monte Carlo VI for neural networks. The framework can successfully train both deep discriminative models and deep generative models in complex continual learning settings where existing tasks evolve over time and entirely new tasks emerge. Experimental results show that VCL outperforms state-of-the-art continual learning methods on a variety of tasks, avoiding catastrophic forgetting in a fully automatic way.",
    "This report has several purposes. First, our report is written to investigate the reproducibility of the submitted paper On the regularization of Wasserstein GANs (2018). Second, among the experiments performed in the submitted paper, five aspects were emphasized and reproduced: learning speed, stability, robustness to hyperparameters, estimation of the Wasserstein distance, and various sampling methods. Finally, we identify which parts of the contribution can be reproduced, and at what cost in terms of resources. All source code for reproduction is open to the public.",
    "In this paper, we propose a new feature extraction technique for program execution logs. First, we automatically extract complex patterns from a program's behavior graph. Then, we embed these patterns into a continuous space by training an autoencoder. We evaluate the proposed features on a real-world malicious software detection task. We also find that the embedding space captures interpretable structures in the space of pattern parts.",
    "We propose a single neural probabilistic model based on variational autoencoder that can be conditioned on an arbitrary subset of observed features and then sample the remaining features in \"one shot\". The features may be both real-valued and categorical. Training of the model is performed by stochastic variational Bayes. The experimental evaluation on synthetic data, as well as feature imputation and image inpainting problems, shows the effectiveness of the proposed approach and diversity of the generated samples.",
    "Variational Autoencoders (VAEs) were originally motivated (Kingma & Welling, 2014) as probabilistic generative models in which one performs approximate Bayesian inference. The proposal of $\\beta$-VAEs (Higgins et al., 2017) breaks this interpretation and generalizes VAEs to application domains beyond generative modeling (e.g., representation learning, clustering, or lossy data compression) by introducing an objective function that allows practitioners to trade off between the information content (\"bit rate\") of the latent representation and the distortion of reconstructed data (Alemi et al., 2018). In this paper, we reconsider this rate/distortion trade-off in the context of hierarchical VAEs, i.e., VAEs with more than one layer of latent variables. We identify a general class of inference models for which one can split the rate into contributions from each layer, which can then be tuned independently. We derive theoretical bounds on the performance of downstream tasks as functions of the individual layers' rates and verify our theoretical findings in large-scale experiments. Our results provide guidance for practitioners on which region in rate-space to target for a given application.",
    "Understanding and characterizing the subspaces of adversarial examples aid in studying the robustness of deep neural networks (DNNs) to adversarial perturbations. Very recently, Ma et al. (ICLR 2018) proposed to use local intrinsic dimensionality (LID) in layer-wise hidden representations of DNNs to study adversarial subspaces. It was demonstrated that LID can be used to characterize the adversarial subspaces associated with different attack methods, e.g., the Carlini and Wagner's (C&W) attack and the fast gradient sign attack.   In this paper, we use MNIST and CIFAR-10 to conduct two new sets of experiments that are absent in existing LID analysis and report the limitation of LID in characterizing the corresponding adversarial subspaces, which are (i) oblivious attacks and LID analysis using adversarial examples with different confidence levels; and (ii) black-box transfer attacks. For (i), we find that the performance of LID is very sensitive to the confidence parameter deployed by an attack, and the LID learned from ensembles of adversarial examples with varying confidence levels surprisingly gives poor performance. For (ii), we find that when adversarial examples are crafted from another DNN model, LID is ineffective in characterizing their adversarial subspaces. These two findings together suggest the limited capability of LID in characterizing the subspaces of adversarial examples.",
    "Generative adversarial networks (GANs) form a generative modeling approach known for producing appealing samples, but they are notably difficult to train. One common way to tackle this issue has been to propose new formulations of the GAN objective. Yet, surprisingly few studies have looked at optimization methods designed for this adversarial training. In this work, we cast GAN optimization problems in the general variational inequality framework. Tapping into the mathematical programming literature, we counter some common misconceptions about the difficulties of saddle point optimization and propose to extend techniques designed for variational inequalities to the training of GANs. We apply averaging, extrapolation and a computationally cheaper variant that we call extrapolation from the past to the stochastic gradient method (SGD) and Adam.",
    "Neural message passing algorithms for semi-supervised classification on graphs have recently achieved great success. However, for classifying a node these methods only consider nodes that are a few propagation steps away and the size of this utilized neighborhood is hard to extend. In this paper, we use the relationship between graph convolutional networks (GCN) and PageRank to derive an improved propagation scheme based on personalized PageRank. We utilize this propagation procedure to construct a simple model, personalized propagation of neural predictions (PPNP), and its fast approximation, APPNP. Our model's training time is on par or faster and its number of parameters on par or lower than previous models. It leverages a large, adjustable neighborhood for classification and can be easily combined with any neural network. We show that this model outperforms several recently proposed methods for semi-supervised classification in the most thorough study done so far for GCN-like models. Our implementation is available online.",
    "We identify obfuscated gradients, a kind of gradient masking, as a phenomenon that leads to a false sense of security in defenses against adversarial examples. While defenses that cause obfuscated gradients appear to defeat iterative optimization-based attacks, we find defenses relying on this effect can be circumvented. We describe characteristic behaviors of defenses exhibiting the effect, and for each of the three types of obfuscated gradients we discover, we develop attack techniques to overcome it. In a case study, examining non-certified white-box-secure defenses at ICLR 2018, we find obfuscated gradients are a common occurrence, with 7 of 9 defenses relying on obfuscated gradients. Our new attacks successfully circumvent 6 completely, and 1 partially, in the original threat model each paper considers.",
    "Methods that learn representations of nodes in a graph play a critical role in network analysis since they enable many downstream learning tasks. We propose Graph2Gauss - an approach that can efficiently learn versatile node embeddings on large scale (attributed) graphs that show strong performance on tasks such as link prediction and node classification. Unlike most approaches that represent nodes as point vectors in a low-dimensional continuous space, we embed each node as a Gaussian distribution, allowing us to capture uncertainty about the representation. Furthermore, we propose an unsupervised method that handles inductive learning scenarios and is applicable to different types of graphs: plain/attributed, directed/undirected. By leveraging both the network structure and the associated node attributes, we are able to generalize to unseen nodes without additional training. To learn the embeddings we adopt a personalized ranking formulation w.r.t. the node distances that exploits the natural ordering of the nodes imposed by the network structure. Experiments on real world networks demonstrate the high performance of our approach, outperforming state-of-the-art network embedding methods on several different tasks. Additionally, we demonstrate the benefits of modeling uncertainty - by analyzing it we can estimate neighborhood diversity and detect the intrinsic latent dimensionality of a graph.",
    "Convolutional Neural Networks (CNNs) have become the method of choice for learning problems involving 2D planar images. However, a number of problems of recent interest have created a demand for models that can analyze spherical images. Examples include omnidirectional vision for drones, robots, and autonomous cars, molecular regression problems, and global weather and climate modelling. A naive application of convolutional networks to a planar projection of the spherical signal is destined to fail, because the space-varying distortions introduced by such a projection will make translational weight sharing ineffective.   In this paper we introduce the building blocks for constructing spherical CNNs. We propose a definition for the spherical cross-correlation that is both expressive and rotation-equivariant. The spherical correlation satisfies a generalized Fourier theorem, which allows us to compute it efficiently using a generalized (non-commutative) Fast Fourier Transform (FFT) algorithm. We demonstrate the computational efficiency, numerical accuracy, and effectiveness of spherical CNNs applied to 3D model recognition and atomization energy regression.",
    "This paper shows how one can directly apply natural language processing (NLP) methods to classification problems in cheminformatics. The connection between these seemingly separate fields is shown by considering the standard textual representation of a compound, SMILES. The problem of activity prediction against a target protein is considered, which is a crucial part of the computer-aided drug design process. The conducted experiments show that in this way one can not only outrank state-of-the-art results of hand-crafted representations but also get direct structural insights into the way decisions are made.",
    "The inclusion of Computer Vision and Deep Learning technologies in Agriculture aims to increase the harvest quality and the productivity of farmers. During postharvest, the export market and quality evaluation are affected by the sorting of fruits and vegetables. In particular, apples are susceptible to a wide range of defects that can occur during harvesting and/or during the post-harvesting period. This paper aims to help farmers with post-harvest handling by exploring whether recent computer vision and deep learning methods such as YOLOv3 (Redmon & Farhadi (2018)) can help in distinguishing healthy apples from apples with defects.",
    "We present two simple ways of reducing the number of parameters and accelerating the training of large Long Short-Term Memory (LSTM) networks: the first one is \"matrix factorization by design\" of the LSTM matrix into the product of two smaller matrices, and the second one is partitioning of the LSTM matrix, its inputs and states into independent groups. Both approaches allow us to train large LSTM networks significantly faster to near state-of-the-art perplexity while using significantly fewer RNN parameters.",
    "State-of-the-art deep reading comprehension models are dominated by recurrent neural nets. Their sequential nature is a natural fit for language, but it also precludes parallelization within an instance and often becomes the bottleneck for deploying such models to latency-critical scenarios. This is particularly problematic for longer texts. Here we present a convolutional architecture as an alternative to these recurrent architectures. Using simple dilated convolutional units in place of recurrent ones, we achieve results comparable to the state of the art on two question answering tasks, while at the same time achieving up to two orders of magnitude speedups for question answering.",
    "In this work, we analyze the reinstatement mechanism introduced by Ritter et al. (2018) to reveal two classes of neurons that emerge in the agent's working memory (an epLSTM cell) when trained using episodic meta-RL on an episodic variant of the Harlow visual fixation task. Specifically, Abstract neurons encode knowledge shared across tasks, while Episodic neurons carry information relevant for a specific episode's task.",
    "The rate-distortion-perception function (RDPF; Blau and Michaeli, 2019) has emerged as a useful tool for thinking about realism and distortion of reconstructions in lossy compression. Unlike the rate-distortion function, however, it is unknown whether encoders and decoders exist that achieve the rate suggested by the RDPF. Building on results by Li and El Gamal (2018), we show that the RDPF can indeed be achieved using stochastic, variable-length codes. For this class of codes, we also prove that the RDPF lower-bounds the achievable rate.",
    "In this paper, we present Neural Phrase-based Machine Translation (NPMT). Our method explicitly models the phrase structures in output sequences using Sleep-WAke Networks (SWAN), a recently proposed segmentation-based sequence modeling method. To mitigate the monotonic alignment requirement of SWAN, we introduce a new layer to perform (soft) local reordering of input sequences. Different from existing neural machine translation (NMT) approaches, NPMT does not use attention-based decoding mechanisms. Instead, it directly outputs phrases in a sequential order and can decode in linear time. Our experiments show that NPMT achieves superior performances on IWSLT 2014 German-English/English-German and IWSLT 2015 English-Vietnamese machine translation tasks compared with strong NMT baselines. We also observe that our method produces meaningful phrases in output languages.",
    "It is by now well-known that small adversarial perturbations can induce classification errors in deep neural networks (DNNs). In this paper, we make the case that sparse representations of the input data are a crucial tool for combating such attacks. For linear classifiers, we show that a sparsifying front end is provably effective against $\\ell_{\\infty}$-bounded attacks, reducing output distortion due to the attack by a factor of roughly $K / N$ where $N$ is the data dimension and $K$ is the sparsity level. We then extend this concept to DNNs, showing that a \"locally linear\" model can be used to develop a theoretical foundation for crafting attacks and defenses. Experimental results for the MNIST dataset show the efficacy of the proposed sparsifying front end.",
    "We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU formulates and solves a constrained optimization problem in the non-parameterized proximal policy space. Using supervised regression, it then converts the optimal non-parameterized policy to a parameterized policy, from which it draws new samples. The methodology is general in that it applies to both discrete and continuous action spaces, and can handle a wide variety of proximity constraints for the non-parameterized optimization problem. We show how the Natural Policy Gradient and Trust Region Policy Optimization (NPG/TRPO) problems, and the Proximal Policy Optimization (PPO) problem can be addressed by this methodology. The SPU implementation is much simpler than TRPO. In terms of sample efficiency, our extensive experiments show SPU outperforms TRPO in Mujoco simulated robotic tasks and outperforms PPO in Atari video game tasks.",
    "We present a parameterized synthetic dataset called Moving Symbols to support the objective study of video prediction networks. Using several instantiations of the dataset in which variation is explicitly controlled, we highlight issues in an existing state-of-the-art approach and propose the use of a performance metric with greater semantic meaning to improve experimental interpretability. Our dataset provides canonical test cases that will help the community better understand, and eventually improve, the representations learned by such networks in the future. Code is available at https://github.com/rszeto/moving-symbols ."
  ],
  "sampled": [
    "In this report, we present a Theano-based implementation of AlexNet (Krizhevsky et al., 2012) and demonstrate its scalability through naive data parallelism across multiple GPUs. Notably, when run on two GPUs, our implementation achieves performance comparable to the state-of-the-art Caffe library (Jia et al., 2014) run on a single GPU. To the best of our knowledge, this is the first open-source, Python-based AlexNet implementation available to date.",
    "We demonstrate that deep narrow Boltzmann machines possess the fundamental property of universality in approximating probability distributions on the activities of their visible units, given a sufficient number of hidden layers, each comprising the same number of units as the visible layer. Furthermore, we establish that, within specific parameter domains, deep Boltzmann machines can be rigorously analyzed as feedforward networks. We derive precise upper and lower bounds on the requisite depth and width of universal approximators, thereby resolving long-standing intuitions regarding undirected networks. Notably, our findings unequivocally show that deep narrow Boltzmann machines are at least as compact and universal in their approximating capabilities as narrow sigmoid belief networks and restricted Boltzmann machines, as supported by the currently available bounds for these models.",
    "We propose a new approach to enhance recurrent neural networks by incorporating latent variables, leading to the development of Stochastic Recurrent Networks (STORNs). This model offers several advantages: it can be trained using stochastic gradient methods, it allows for structured and multi-modal conditionals at each time step, it provides a reliable estimator of the marginal likelihood, and it generalizes deterministic recurrent neural networks. We test the effectiveness of this method on four polyphonic musical data sets and motion capture data.",
    "Imagine being able to fine-tune your optimization hyperparameters on the fly, without interrupting the learning process. We've developed a groundbreaking framework that makes this possible, by 'hot swapping' hyperparameter values in real-time. We put this approach to the test in the context of adaptive learning rate selection, using a clever explore-exploit strategy inspired by the multi-armed bandit problem. The results are astounding: our 'hot swapping' method consistently outperforms established alternatives like AdaDelta and stochastic gradient descent, even when paired with exhaustive hyperparameter searches. The proof is in the pudding - our experiments on a benchmark neural network demonstrate the power of this innovative approach.",
    "In many modern multiclass and multilabel problems, the output spaces are becoming increasingly large. To tackle these problems, label embeddings have proven to be a valuable tool, enhancing both computational and statistical efficiency. This work leverages a connection between rank-constrained estimation and low-dimensional label embeddings to develop a fast label embedding algorithm that works effectively in both multiclass and multilabel settings. The outcome is a randomized algorithm for partial least squares that runs exponentially faster than traditional methods. We apply our techniques to two large-scale public datasets from the Large Scale Hierarchical Text Challenge and the Open Directory Project, achieving state-of-the-art results.",
    "To enable machines to perform complex and abstract reasoning tasks, it's essential to accurately learn both explicit and implicit relationships within data. Our Dynamic Adaptive Network Intelligence (DANI) model achieves this through efficient weakly supervised learning. We've achieved state-of-the-art results with DANI on challenging question answering tasks in the bAbI dataset, outperforming current approaches.",
    "Traditional speech recognition systems have long relied on spectral features like MFCC and PLP, which are carefully crafted based on our understanding of how we perceive and produce speech. However, a revolutionary approach has emerged, where convolutional neural networks can directly tap into the raw audio signal to estimate phoneme probabilities, unshackled from prior knowledge. This data-driven method has been shown to rival or even outperform traditional HMM/ANN systems in phoneme recognition and large-scale speech recognition tasks, all while requiring fewer parameters. Inspired by these breakthroughs, we explore the use of a simple linear classifier within the CNN framework, allowing the network to unearth linearly separable features from the raw audio itself. Our results reveal that this system can match or even surpass the performance of MLP-based systems that rely on cepstral features, marking a significant leap forward in speech recognition technology.",
    "We present a comprehensive overview of the neural network training framework employed in the Kaldi speech recognition toolkit, specifically designed to facilitate the training of deep neural networks (DNNs) on large-scale datasets utilizing multiple machines equipped with graphics processing units (GPUs) or multi-core processors. In order to ensure maximum hardware flexibility and minimize dependence on specific hardware configurations, we developed a novel approach to harness the processing power of multiple machines without incurring excessive network communication overhead. Our methodology involves periodically averaging the neural network parameters (typically at intervals of one to two minutes) and subsequently redistributing the averaged parameters to the individual machines for continued training, with each machine processing distinct subsets of the data. While this approach, in isolation, yields suboptimal results, its efficacy is significantly enhanced when combined with our innovative, approximate, and computationally efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD). This synergy not only enables our periodic-averaging method to achieve superior performance but also substantially accelerates the convergence of Stochastic Gradient Descent (SGD) on a single machine.",
    "**Introducing a Novel Method for Visualizing and Refining Invariances in Learned Representations**\n\nWe propose a new approach to test and improve the invariance properties of learned representations. Our method focuses on **linearization**, where a transformation's effect is limited to a low-dimensional subspace. To achieve this, we:\n\n1. **Generate representational geodesics**: Given two reference images with a transformation difference, we create a sequence of images between them that minimizes distance in the representation space.\n2. **Assess invariance properties**: If the transformation is linearized, the geodesic sequence should gradually evolve according to the transformation.\n\n**Key Findings and Remedies**\n\nApplying our method to a state-of-the-art image classification network, we discover that geodesics for image pairs with translation, rotation, and dilation transformations do not follow their associated transformations. However, our method also **suggests a solution** to these failures, and by implementing this remedy, we demonstrate that the modified representation can successfully linearize various geometric image transformations.",
    "Unlocking the Power of Deep Learning: A Group Theory Perspective\n\nWe tackle the fundamental questions of why Deep Learning works, what representations it captures, and how higher-order representations emerge. By applying group theory, we pioneer a novel approach to understanding Deep Learning.\n\nA crucial factor behind Deep Learning's resurgence is pretraining, a key algorithmic step that involves searching for a good generative model for input samples, layer by layer. We reveal the profound implications of this principle by connecting it to the interplay of orbits and stabilizers of group actions. Although neural networks themselves don't form groups, we discover the existence of \"shadow\" groups that serve as close approximations.\n\nOver these shadow groups, pretraining becomes equivalent to searching for features with minimal orbits, which are intuitively the simplest. This explains why Deep Learning networks learn simple features first. Furthermore, we demonstrate how repeating this principle in deeper layers enables the capture of higher-order representations, and why representation complexity increases with layer depth.",
    "We're introducing a new architecture called the \"stacked what-where auto-encoders\" (SWWAE), which combines the strengths of discriminative and generative models to tackle supervised, semi-supervised, and unsupervised learning in a single, unified approach - all without relying on sampling during training. In practice, SWWAE uses a convolutional neural network (Convnet) to encode the input data, and a deconvolutional neural network (Deconvnet) to reconstruct it. The key to this approach lies in the objective function, which encourages the hidden states in the Deconvnet to mirror those in the Convnet. Each pooling layer produces two sets of variables: the \"what\" variables, which are passed on to the next layer, and their complementary \"where\" variables, which are fed into the corresponding layer in the generative decoder.",
    "We explore the challenge of creating word embeddings that are specifically designed to capture a particular relationship between two words (bilexical relation). Our algorithm takes an existing set of word vectors and refines them to produce new embeddings that are highly effective at predicting the target bilexical relation. Through experiments, we demonstrate that using task-specific embeddings can lead to significant improvements in both the accuracy and speed of lexical prediction tasks.",
    "Unlock the Power of Deep Learning: Introducing a Revolutionary Generative Model for Convolutional Dictionary Learning!\n\nImagine a model that can uncover the hidden patterns in images with unprecedented precision. Our novel approach integrates a groundbreaking probabilistic pooling operation into a deep, multi-layered convolutional dictionary learning framework. This innovative design enables efficient bottom-up pretraining and top-down refinement, allowing the model to learn complex features from images with remarkable accuracy.\n\nThe results are nothing short of astonishing. Our model has achieved outstanding classification results on the MNIST and Caltech 101 datasets, demonstrating its incredible capabilities in image recognition and classification. Get ready to unlock the full potential of deep learning with our cutting-edge generative model!",
    "Building on the latest advancements in generative models, we present a novel approach that enables the creation of images from natural language descriptions. Our model works by iteratively generating patches on a virtual canvas, carefully focusing on the most relevant words in the description to ensure accurate representation. Following extensive training on the Microsoft COCO dataset, we evaluated our model's performance against several state-of-the-art generative models in both image generation and retrieval tasks. The results show that our model outperforms existing approaches, producing higher-quality images and successfully generating novel scene compositions that correspond to unseen captions in the dataset.",
    "Convolutional neural networks (CNNs) have demonstrated exceptional performance on large datasets. However, the collection of labeled data can be a significant challenge, and in certain applications, access to substantial amounts of data may be limited. A key issue arises when attempting to utilize CNNs with small datasets, as they are prone to overfitting. To address this challenge, we propose an efficient Bayesian CNN that exhibits enhanced robustness to overfitting on small datasets compared to traditional approaches. This is achieved by placing a probability distribution over the CNN's kernels. We approximate the intractable posterior of our model using Bernoulli variational distributions, which does not require the introduction of additional model parameters. \n\nFrom a theoretical perspective, we formulate dropout network training as approximate inference in Bayesian neural networks. This enables the implementation of our model using existing deep learning tools without increasing time complexity, while also highlighting a negative result in the field. Our approach yields a significant improvement in classification accuracy compared to standard techniques and surpasses published state-of-the-art results for the CIFAR-10 dataset.",
    "We introduce a novel approach to designing computationally efficient convolutional neural networks (CNNs) by leveraging low-rank representations of convolutional filters. Unlike existing methods that approximate filters in pre-trained networks, we learn a set of compact basis filters from the ground up. During training, the network learns to combine these basis filters to form more complex filters that are highly effective for image classification tasks. To facilitate this process, we developed a unique weight initialization scheme that enables effective initialization of connection weights in convolutional layers comprising groups of filters with varying shapes. We validated our approach by applying it to several existing CNN architectures and training them from scratch using the CIFAR, ILSVRC, and MIT Places datasets. Our results demonstrate that our method achieves comparable or superior accuracy to traditional CNNs while significantly reducing computational requirements. Specifically, when applied to an enhanced version of the VGG-11 network with global max-pooling, our method achieves comparable validation accuracy with 41% less computation and only 24% of the original model parameters. Furthermore, a variant of our method yields a 1% increase in accuracy over our improved VGG-11 model, achieving a top-5 center-crop validation accuracy of 89.7% while reducing computation by 16% relative to the original VGG-11 model. Additionally, when applied to the GoogLeNet architecture for ILSVRC, our method achieves comparable accuracy with 26% less computation and 41% fewer model parameters. Finally, when applied to a near state-of-the-art network for CIFAR, our method achieves comparable accuracy with 46% less computation and 55% fewer parameters.",
    "The use of distributed representations of words has significantly enhanced the performance of various Natural Language Processing tasks. However, a major limitation of this approach is that it typically assigns only one representation per word, disregarding the fact that many words have multiple meanings. This oversight can negatively impact both individual word representations and the language model as a whole. To address this issue, we propose a simple yet effective model that leverages recent techniques for building word vectors to capture distinct senses of polysemic words. Our evaluation of this model demonstrates its ability to accurately distinguish between words' senses while maintaining computational efficiency.",
    "We introduce the Diverse Embedding Neural Network (DENN), a novel architecture designed for language models (LMs). Unlike traditional feed-forward neural network LMs, which project input word history vectors onto a single high-dimensional subspace, DENNLMs map them onto multiple diverse low-dimensional subspaces. To ensure the diversity of these subspaces, we incorporate an augmented loss function during network training. Our language modeling experiments on the Penn Treebank dataset demonstrate the performance advantages of using a DENNLM.",
    "A conventional methodology for Collaborative Filtering (CF), specifically the prediction of user ratings for items, is founded on Matrix Factorization techniques. These approaches involve computing representations for both users and items based on observed ratings, which are subsequently utilized for prediction purposes. However, these transductive methods are inherently limited in their ability to accommodate new users entering the system without any prior ratings, a phenomenon commonly referred to as the \"user cold-start\" problem. A prevalent strategy in this context involves soliciting a limited number of initial ratings from these incoming users. This paper proposes a novel model designed to address the dual challenges of (i) identifying optimal questions to pose and (ii) constructing efficient representations from this limited information. Furthermore, this model can be applied in more traditional (warm) contexts. The efficacy of our approach is evaluated through its performance on the classical CF problem and the cold-start problem across four distinct datasets, demonstrating its capacity to enhance baseline performance in both scenarios.",
    "We introduce Non-linear Independent Component Estimation (NICE), a novel deep learning framework designed to model complex high-dimensional densities with ease. The core idea behind NICE is that a good representation of data is one that can be easily modeled, which is achieved by transforming the data into a latent space where it conforms to a factorized distribution. This transformation results in independent latent variables, making it simpler to analyze and model the data.\n\nTo achieve this, we employ a non-linear deterministic transformation of the data, which is learned through a composition of simple building blocks, each based on a deep neural network. This approach allows us to learn complex non-linear transformations while maintaining the ability to compute the Jacobian determinant and inverse transform with ease.\n\nThe training criterion for NICE is the exact log-likelihood, which is tractable and efficient. Additionally, unbiased ancestral sampling is straightforward, enabling the generation of new data samples that are similar to the original data. We demonstrate the effectiveness of NICE by applying it to four image datasets, where it yields good generative models and can be used for tasks such as image inpainting.",
    "We propose Deep Linear Discriminant Analysis (DeepLDA), a novel approach that learns to extract linearly separable latent features in a single, unified framework. Building on the strengths of traditional Linear Discriminant Analysis (LDA), which excels at preserving class distinctions and reducing dimensionality for various classification tasks, we integrate LDA with a deep neural network. This innovative combination can be viewed as a non-linear extension of classic LDA. Rather than focusing solely on maximizing the likelihood of target labels for individual data points, our method employs an objective function that encourages the network to generate feature distributions characterized by: (a) minimal variation within the same class and (b) significant variation between different classes. Our objective function is derived from the general LDA eigenvalue problem, allowing for efficient training using stochastic gradient descent and back-propagation. To assess the effectiveness of DeepLDA, we evaluate its performance on three benchmark datasets: MNIST, CIFAR-10, and STL-10. The results show that DeepLDA achieves competitive performance on MNIST and CIFAR-10, and outperforms a network trained with categorical cross-entropy (using the same architecture) in a supervised setting on STL-10.",
    "Introducing Layer-Sequential Unit-Variance (LSUV) initialization, a game-changing method for deep net learning that simplifies weight initialization. This two-step approach yields remarkable results:\n\n1. Pre-initialize weights with orthonormal matrices for each convolution or inner-product layer.\n2. Normalize the output variance of each layer to unity, proceeding from the first to the final layer.\n\nExperiments with various activation functions (maxout, ReLU-family, tanh) demonstrate that LSUV initialization enables the learning of extremely deep nets that:\n\n* Achieve test accuracy equal to or better than standard methods\n* Match the speed of complex schemes designed for deep nets, such as FitNets and Highway\n\nLSUV initialization has been evaluated on prominent architectures like GoogLeNet, CaffeNet, FitNets, and Residual nets, achieving state-of-the-art or near-state-of-the-art performance on MNIST, CIFAR-10/100, and ImageNet datasets.",
    "Imagine being able to unlock the secrets of natural images with a powerful new tool. We're excited to introduce a game-changing parametric nonlinear transformation that's specifically designed to Gaussianize data from these images. Here's how it works: we start by applying a linear transformation, and then each component gets normalized using a pooled activity measure. This measure is calculated by exponentiating a weighted sum of rectified and exponentiated components, plus a constant. But that's not all - we take it to the next level by optimizing the parameters of the full transformation over a vast database of natural images. Our goal? To directly minimize the negentropy of the responses. The result? A transformation that substantially Gaussianizes the data, outperforming alternative methods like ICA and radial Gaussianization. And the best part? This transformation is not only differentiable but also efficiently invertible, which means it can be used to create a density model on images. We've seen that samples from this model are eerily similar to real natural image patches. But that's still not all - we've also demonstrated how this model can be used as a prior probability density to remove additive noise from images. And if that's not enough, we've shown that this transformation can be cascaded, with each layer optimized using the same Gaussianization objective. This means we can create an unsupervised method for optimizing a deep network architecture - a true breakthrough!",
    "We introduce a novel approach to convolutional neural networks, designed to accelerate feedforward execution while maintaining performance. Building on extensive research into parameter redundancy, particularly in convolutional filter weights, we propose a new method that trains flattened networks comprising consecutive one-dimensional filters across all 3D directions. Our approach achieves comparable performance to traditional convolutional networks, with the added benefit of a significant reduction in learning parameters. We validate our flattened model on various datasets, demonstrating its ability to effectively replace 3D filters without compromising accuracy. Notably, our flattened convolution pipelines yield a two-fold speed-up during feedforward passes compared to baseline models, without requiring manual tuning or post-processing efforts.",
    "This paper presents a novel deep learning framework, referred to as Purine. Within the Purine framework, a deep network is represented as a bipartite graph (bi-graph), comprising interconnected operators and data tensors. The bi-graph abstraction enables the efficient solution of networks via an event-driven task dispatcher. Furthermore, we demonstrate that diverse parallelism schemes can be universally implemented across GPUs and/or CPUs on single or multiple computers through graph composition. This approach alleviates the need for researchers to develop code for various parallelization schemes, as the same dispatcher can be utilized for solving variant graphs. Under the scheduling of the task dispatcher, memory transfers are fully overlapped with other computations, thereby significantly reducing communication overhead and facilitating the achievement of approximate linear acceleration.",
    "We propose the Variational Recurrent Auto-Encoder (VRAE), combining RNNs and SGVB strengths for efficient, large-scale unsupervised learning on time series data. The generative model maps data to a latent vector representation, enables data generation from latent space samples, and facilitates supervised RNN training using unlabeled data for weight initialization.",
    "Imagine a vast, high-dimensional space where each word is pinpointed to a precise location, represented by a single point vector. However, this approach has its limitations. By mapping words to a density, rather than a single point, we can unlock a wealth of benefits. For instance, we can better capture the uncertainty surrounding a word's meaning and its relationships with others, and more naturally express asymmetrical connections that don't fit neatly into traditional similarity measures like dot product or cosine similarity. Moreover, density-based representations allow for more flexible and expressive decision boundaries. This paper makes the case for adopting density-based distributed embeddings and introduces a novel method for learning representations within the realm of Gaussian distributions. We put these embeddings to the test, evaluating their performance on a range of word embedding benchmarks, and delve into their ability to model complex, asymmetric relationships like entailment. Furthermore, we uncover fresh insights into the properties of these representations, revealing new possibilities for natural language understanding.",
    "Unlock the Full Potential of Deep Neural Networks: Breakthrough Discovery Reveals Surprising Secret to Efficient Training\n\nDeep neural networks are revolutionizing the tech world, but their digital implementation comes with a hefty price tag - massive space and power requirements, largely due to the arithmetic operators known as multipliers. In a groundbreaking study, we pushed the boundaries of what's possible by training state-of-the-art Maxout networks on three benchmark datasets: MNIST, CIFAR-10, and SVHN. We experimented with three distinct formats: floating point, fixed point, and dynamic fixed point, and made a stunning discovery. \n\nContrary to conventional wisdom, our research shows that extremely low precision is not only sufficient for running trained networks but also for training them from scratch. The implications are profound: we can train Maxout networks with astonishingly low 10-bit multiplications, paving the way for faster, more efficient, and more accessible deep learning applications.",
    "In tasks like semantic segmentation, the need for costly annotation can be reduced by weakening the required degree of supervision through multiple instance learning (MIL). To achieve this, we propose a novel MIL formulation for multi-class semantic segmentation using a fully convolutional network. Our approach learns a semantic segmentation model from weak image-level labels, eliminating the need for object proposal pre-processing. The model is trained end-to-end to optimize the representation and disambiguate pixel-image label assignments simultaneously. This fully convolutional training method accepts inputs of any size and provides a pixelwise loss map for selecting latent instances. Furthermore, our multi-class MIL loss leverages the additional supervision provided by images with multiple labels. We validate our approach through preliminary experiments on the PASCAL VOC segmentation challenge.",
    "A recent method, nested dropout, has been proposed for ranking representation units in autoencoders based on their information content, without compromising reconstruction quality. So far, however, it has only been tested on fully-connected autoencoders in unsupervised settings. In this study, we extend the application of nested dropout to convolutional layers in a CNN trained using backpropagation, examining whether it can provide a straightforward and systematic approach to determining the optimal representation size, tailored to the desired level of accuracy and the complexity of the task and data at hand.",
    "Stochastic gradient algorithms have been at the forefront of large-scale learning problems, yielding significant successes in machine learning. However, the convergence of Stochastic Gradient Descent (SGD) relies heavily on the careful selection of the learning rate and the level of noise in stochastic gradient estimates. \n\nIn this paper, we introduce a novel adaptive learning rate algorithm that leverages curvature information to automatically adjust learning rates. Specifically, we estimate element-wise curvature of the loss function from local statistics of stochastic first-order gradients. Furthermore, we propose a new variance reduction technique to accelerate convergence. \n\nPreliminary experiments with deep neural networks have demonstrated improved performance compared to popular stochastic gradient algorithms.",
    "Unlocking the Secrets of Motion: How Visual Representations Come Alive!\n\nImagine a 3D object in motion, and how it transforms on the observer's image plane. This dynamic dance sparks a profound change in the visual representation computed by a learned model. But what drives this transformation? We uncover the answer by harnessing the power of group representation theory.\n\nOur groundbreaking discovery reveals that any effective visual representation can be broken down into its fundamental building blocks - the elementary irreducible representations. But that's not all - we also unearth a surprising connection between these irreducible representations and the statistical dependencies within the representation. Under specific conditions, we find that these representations are decorrelated, leading to a more efficient and streamlined visual processing.\n\nHowever, when we gaze upon a scene through the lens of partial observability, the motion group's linear action on the image space is disrupted. This is where inference over a latent representation comes into play, allowing us to tap into the hidden patterns that transform linearly. We bring this concept to life with a model of rotating NORB objects, leveraging the non-commutative 3D rotation group SO(3) to unlock the secrets of motion.",
    "**Efficient Maximum Inner Product Search (MIPS) is crucial in recommendation systems and classification with many classes.** To achieve approximate MIPS in sublinear time, recent solutions have employed locality-sensitive hashing (LSH) and tree-based methods. **However, we propose a simpler and more effective approach using variants of the k-means clustering algorithm.** By reducing MIPS to a Maximum Cosine Similarity Search (MCSS) and training a spherical k-means, our method achieves:\n\n* **Higher speedups** for the same retrieval precision compared to state-of-the-art hashing-based and tree-based methods\n* **More robust retrievals** when the query is corrupted by noise\n\nOur experiments on standard recommendation system benchmarks and large vocabulary word embeddings demonstrate the superiority of this simple approach.",
    "The variational autoencoder (VAE) is a generative model that combines a top-down generator with a bottom-up recognition network to approximate posterior inference. However, it makes strong assumptions about the posterior distribution, which can lead to oversimplified representations that underutilize the network's capacity. We introduce the importance weighted autoencoder (IWAE), which uses the same architecture as the VAE but with a tighter log-likelihood lower bound derived from importance weighting. The IWAE's recognition network uses multiple samples to approximate the posterior, allowing it to model complex distributions more effectively. Empirically, IWAEs learn richer latent space representations than VAEs, resulting in improved test log-likelihood on density estimation benchmarks.",
    "This study explores the impact of using reduced precision data in Convolutional Neural Networks (CNNs) on classification accuracy. We focus on networks where each layer can use different precision data. Our main finding is that CNNs' tolerance to reduced precision data varies not only between networks, as previously established, but also within networks. Adjusting precision per layer is attractive because it could lead to energy and performance gains. In this paper, we investigate how error tolerance differs across layers and propose a method to find a low-precision configuration for a network while maintaining high accuracy. Our analysis of a diverse set of CNNs shows that, compared to a conventional implementation using 32-bit floating-point representation for all layers, we can reduce the data footprint required by these networks by an average of 74% and up to 92%, with less than 1% loss in relative accuracy.",
    "The secret to unlocking the full potential of graph-based semi-supervised algorithms lies in the graph itself - specifically, the way instances are connected. But before we can build this graph, we need to transform our instances from vectors into a format that allows us to define the strength of their relationships. The traditional approach relies on a metric, such as Euclidean distance or similarity, to determine the weight of these connections. However, we're about to challenge this conventional wisdom. What if we told you that, in certain cases, the Euclidean norm isn't the most effective choice for solving the task at hand? That's why we're introducing a game-changing algorithm that learns the optimal vectorial representation for building a graph that drives efficiency and accuracy. Get ready to revolutionize the way you approach graph-based semi-supervised learning!",
    "Unlocking the Power of Visual-Semantic Hierarchy: A Game-Changing Approach to Understanding Words, Sentences, and Images\n\nImagine a world where words, sentences, and images are interconnected in a seamless hierarchy, enabling machines to understand the nuances of human language and perception. This vision is now a step closer to reality, thanks to our groundbreaking research. By explicitly modeling the partial order structure of this visual-semantic hierarchy, we can unlock new possibilities for artificial intelligence.\n\nOur innovative method for learning ordered representations has far-reaching implications for a range of tasks, from image captioning to hypernym prediction. The results are astounding: our approach outperforms current methods in hypernym prediction and image-caption retrieval, paving the way for more sophisticated AI applications.\n\nJoin us on this exciting journey as we explore the vast potential of visual-semantic hierarchy and uncover new ways to bridge the gap between human and machine understanding.",
    "Imagine a statistical model that's not only accurate but also incredibly smooth and robust. We're excited to introduce Local Distributional Smoothness (LDS), a groundbreaking concept that promotes model distribution smoothness through a clever regularization term. We've dubbed this innovative approach Virtual Adversarial Training (VAT). \n\nAt its core, LDS measures a model's robustness to local perturbations around a given datapoint, using KL-divergence as its foundation. What sets VAT apart is its ability to determine the adversarial direction without relying on label information, making it a game-changer for semi-supervised learning. The best part? VAT is computationally efficient, requiring no more than three forward and backward propagation pairs to approximate the LDS gradient for neural networks.\n\nWe put our technique to the test on the MNIST dataset, and the results were astounding - it outperformed all other training methods, except for a highly advanced generative model. But that's not all. We also applied VAT to SVHN and NORB, and it consistently outshone the current state-of-the-art semi-supervised methods. With LDS and VAT, the future of statistical modeling has never looked brighter!",
    "Convolutional Network models have achieved excellent recognition results thanks to large labeled datasets. However, manual annotation is often impractical, leaving us with noisy labels that may or may not be accurate. This paper investigates how well Convnets perform when trained on noisy data. To address this, we add a noise layer to the network that adjusts its outputs to match the noisy label distribution. This layer's parameters can be estimated during training with minimal changes to existing deep network training infrastructure. We test our approach on several datasets, including large-scale ImageNet classification experiments.",
    "We present innovative, guaranteed methodologies for training feedforward neural networks characterized by sparse connectivity. Building upon previously developed techniques for learning linear networks, we demonstrate their effective applicability to the learning of nonlinear networks. Our approach operates on the moments involving labels and the score function of the input, and we prove that their factorization yields the weight matrix of the first layer of a deep network under mild conditions. In practical applications, the output of our method can be utilized as effective initializers for gradient descent.",
    "Identifying the connections between sentences, known as discourse relations, is crucial to understanding how smaller linguistic elements come together to form coherent texts. However, automatically detecting these relations is a tough task, as it requires grasping the meaning of the linked sentences. What makes it even more challenging is that simply understanding the meaning of each individual sentence isn't enough. The connection between sentences often relies on links between smaller elements, such as mentions of specific entities. To tackle this, our approach calculates the meaning of these elements by combining them up the syntactic parse tree. What sets our method apart is that we also generate representations for entity mentions using a novel downward compositional pass. By considering not only the meaning of sentences but also their related entity mentions, we're able to predict discourse relations more accurately. The result is a system that significantly outperforms the previous state-of-the-art in identifying implicit discourse relations in the Penn Discourse Treebank.",
    "This work proposes a novel approach that combines two recent research areas: unsupervised induction of shallow semantics (such as semantic roles) and factorization of relations in text and knowledge bases. Our model consists of two components: an encoding component that uses a semantic role labeling model to predict roles based on syntactic and lexical features, and a reconstruction component that employs tensor factorization to predict argument fillers using the induced roles. By jointly estimating these components to minimize argument reconstruction errors, our method induces roles that closely align with those defined in annotated resources. Notably, our approach achieves performance comparable to the most accurate role induction methods for English, without relying on prior linguistic knowledge about the language.",
    "Unlock the Power of Metrics in Machine Learning: Bridging the Gap to Guaranteed Success\n\nIn the realm of machine learning, metrics play a crucial role in classification, clustering, and ranking. However, a significant challenge lies in the lack of theoretical guarantees for the generalization capacity of classifiers tied to specific metrics. To address this, researchers have been working to establish a connection between similarity functions and linear classifiers. A groundbreaking framework, introduced by Balcan et al. in 2008, pioneered this effort by defining $(\\epsilon, \\gamma, \\tau)$-good similarity functions. Building upon this foundation, our paper takes a major leap forward by introducing a novel generalization bound for associated classifiers, rooted in the algorithmic robustness framework. This breakthrough has the potential to revolutionize the field of machine learning, providing a long-sought guarantee of success.",
    "We introduce the multiplicative recurrent neural network as a general framework for capturing compositional meaning in language, which we evaluate on the task of fine-grained sentiment analysis. By establishing a connection to previously explored matrix-space models for compositionality, we demonstrate that these models are, in fact, special cases of the multiplicative recurrent net. Our experimental results show that these models perform on par with or even surpass Elman-type additive recurrent neural networks, outperforming matrix-space models on a standard fine-grained sentiment analysis corpus. Moreover, they achieve comparable results to structural deep models on the Stanford Sentiment Treebank, all without the need for generating parse trees.",
    "Minimizing non-convex functions in high-dimensional spaces is a significant scientific challenge. We show that many such functions have a narrow range of values that contain most of their critical points. This contrasts with low-dimensional cases, where this range is wider. Our simulations support previous theoretical work on spin glasses and demonstrate a similar phenomenon in deep networks using the MNIST dataset. We also find that both gradient descent and stochastic gradient descent can reach this optimal level in the same number of steps.",
    "We propose a novel statistical framework for the analysis of photographic images, wherein the localized reactions of a diverse array of linear filters are characterized by a jointly Gaussian distribution, featuring a zero mean and a covariance structure that undergoes gradual transformations across spatial locations. Furthermore, we employ an optimization strategy to identify optimal filter sets that minimize the nuclear norms of matrices comprising their localized activation patterns, thereby promoting a flexible and adaptive form of sparsity that is not bound by any specific dictionary or coordinate system. The filters optimized according to this objective exhibit oriented and bandpass properties, and their responses display significant local correlation. Notably, we demonstrate that images can be reconstructed with remarkable accuracy from estimates of the local filter response covariances alone, with minimal visual or mean squared error (MSE) degradation resulting from low-rank approximations of these covariances. As such, this representation holds considerable promise for applications such as image denoising, compression, and texture representation, and may serve as a valuable foundation for hierarchical decompositions.",
    "Revolutionizing Object Recognition: A Game-Changing Breakthrough in Convolutional Neural Networks\n\nFor years, the architecture of modern convolutional neural networks (CNNs) has remained largely unchanged, relying on a standard combination of convolution and max-pooling layers followed by fully connected layers. But what if we told you that this conventional approach is not the only way to achieve exceptional object recognition results?\n\nOur groundbreaking research challenges the status quo by questioning the necessity of each component in the pipeline. And the results are astonishing: we've discovered that max-pooling can be effortlessly replaced by a convolutional layer with an increased stride, without sacrificing accuracy on multiple image recognition benchmarks.\n\nBuilding on this finding and inspired by recent work on simplifying network structures, we're proud to introduce a novel architecture that consists solely of convolutional layers. The outcome? Competitive or state-of-the-art performance on several object recognition datasets, including CIFAR-10, CIFAR-100, and ImageNet.\n\nBut that's not all. To gain deeper insights into our network, we've developed a new variant of the \"deconvolution approach\" for visualizing features learned by CNNs. This innovative technique can be applied to a broader range of network structures than existing methods, opening up new avenues for research and discovery.",
    "We've developed a novel, adaptive piecewise linear activation function that's learned independently for each neuron using gradient descent, outperforming traditional static rectified linear units and achieving state-of-the-art results on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a high-energy physics benchmark.",
    "Get ready to revolutionize the world of parsing! This groundbreaking paper unleashes a powerful neural network-based greedy parser that harnesses the strength of a novel compositional sub-tree representation. The parser and composition procedure are trained in perfect harmony, fueling each other's success. The result? A vector representation that masterfully distills the essence of sub-trees, capturing both syntactic (parsing tags) and semantic (words) nuances. By leveraging continuous representations and recurrent neural networks, we've achieved composition and tagging like never before! And the best part? Our parser matches the F1 performance of top-tier existing parsers, but with the added bonus of lightning-fast speed thanks to its greedy nature. To top it all off, we're providing a fully functional implementation of our method, so you can dive in and start parsing like a pro!",
    "Unlock the Power of Denoising Autoencoders: Revolutionizing Invariant Representations with Lateral Connections!\n\nImagine a neural network that can uncover the underlying essence of an image, stripping away noise and distractions to reveal its true nature. This is the promise of denoising autoencoders (dAEs), and researchers have made a groundbreaking discovery that takes them to the next level.\n\nBy introducing lateral connections between the encoder and decoder, higher layers of the dAE can focus on capturing abstract, invariant features that remain unchanged despite variations in the input. This innovation liberates the network from the burden of carrying detailed information through every layer, allowing it to distill the essence of the image with unprecedented precision.\n\nBut that's not all - when these invariant features are used to modulate the strength of the lateral connections, the network can seamlessly translate abstract representations into detailed, noise-free reconstructions. It's a game-changer for image processing and beyond!\n\nTo put this theory into practice, three dAE structures were pitted against each other in a series of experiments using real-world images. The results were nothing short of astonishing:\n\n1. **Improved denoising performance**: The addition of modulated lateral connections significantly boosted the accuracy of the probability model for inputs.\n2. **Faster invariance growth**: The degree of invariance in the representations grew at an accelerated rate towards the higher layers, unlocking new possibilities for image analysis and understanding.\n3. **Diverse invariant poolings**: The formation of diverse invariant poolings was supported, enabling the network to capture a wide range of abstract features and patterns.\n\nThe implications of this breakthrough are far-reaching, with potential applications in computer vision, image processing, and beyond. 
Get ready to unlock the full potential of denoising autoencoders and discover the hidden secrets of your data!",
    "We're excited to introduce a new approach to understanding and improving the way AI models recognize patterns in images. Our method helps us visualize and refine the way these models handle different transformations, like rotations or translations. We do this by creating a series of images that smoothly transition from one reference image to another, kind of like a video. If the model is doing a good job, this sequence should show the transformation happening gradually and naturally. We tested our approach on a top-performing image classification model and found some areas for improvement. But the good news is that our method also suggests ways to fix these issues, and we were able to make the model better at handling these transformations. We're hopeful that this will lead to even more accurate and reliable AI models in the future!",
    "The genomics revolution is transforming the face of medicine and biomedical research, unlocking the secrets of disease and paving the way for targeted treatments, especially in the fight against cancer. But to fully harness the power of genomic medicine, we need to crack the code of predicting patient outcomes from complex genomic data. Current survival analysis methods are struggling to keep up, but what if we could tap into the potential of neural networks to learn from genomic patterns and forecast patient survival? In this groundbreaking study, we explore the cutting-edge approach of using neural networks to predict cancer patient outcomes, and our results using brain tumor data show that this method leaves traditional survival analysis in the dust!",
    "Current methods for integrating both additive and multiplicative neural units rely on either fixed operation assignments or discrete optimization to determine the function of each neuron. However, this approach significantly increases the computational complexity of the training process. In contrast, we propose a novel, parameterizable transfer function grounded in the mathematical concept of non-integer functional iteration. This innovative approach enables the seamless and differentiable adjustment of neural operations between addition and multiplication, allowing the decision between these two functions to be effortlessly incorporated into the standard backpropagation training procedure.",
    "One major challenge in training deep neural networks is the problem of inconsistent scaling between layers, which can lead to exploding gradients and hinder the learning process. While careful initialization of weights has been a common solution, we explore the benefits of maintaining proper scaling, or isometry, throughout the training process. We introduce two novel methods to achieve this: an exact approach and a stochastic one. Our preliminary experiments demonstrate that both determinant and scale-normalization significantly accelerate learning. The results suggest that preserving isometry, particularly in the early stages of learning, is crucial for faster and more effective training.",
    "We develop a Stick-Breaking Variational Autoencoder (SB-VAE) by applying Stochastic Gradient Variational Bayes to Stick-Breaking processes. This Bayesian nonparametric model has a latent representation with stochastic dimensionality. Our experiments show that the SB-VAE and its semi-supervised variant learn highly discriminative latent representations, often outperforming the Gaussian VAE.",
    "Unsupervised learning on imbalanced data poses a significant challenge, as prevailing models often succumb to the dominance of the majority category, neglecting those with limited data representation. To address this limitation, we have developed a novel latent variable model capable of effectively handling imbalanced data by partitioning the latent space into shared and private domains. Building upon the foundations of Gaussian Process Latent Variable Models, we introduce a innovative kernel formulation that facilitates the segregation of latent space and yields an efficient variational inference method. The efficacy of our model is convincingly demonstrated through its application to an imbalanced medical image dataset.",
    "Generative Adversarial Networks (GANs) are a type of powerful deep learning model. They work by playing a game between two components. However, the original goal of this game was modified to improve the learning process. We've developed a new approach that involves repeatedly estimating density ratios and minimizing differences between them. This new method provides a fresh understanding of GANs and allows us to leverage insights from density ratio estimation research, such as which differences are stable and which ratios are most useful.",
    "Did you know that natural language processing (NLP) methods can be directly applied to classification problems in cheminformatics? It might seem like a stretch, but it's actually a natural fit when you consider that compounds can be represented as text strings, known as SMILES. Take, for example, the challenge of predicting how well a compound will work against a specific protein - a crucial step in computer-aided drug design. Our experiments show that using NLP methods not only beats the current state-of-the-art results based on hand-crafted representations, but also provides valuable insights into how the decisions are made, giving us a deeper understanding of the underlying structure.",
    "We present a novel neural network design and learning method that generates symbolic representations in a factorized form. Our approach involves learning these concepts by analyzing sequential frames, where most of the hidden representation components are predicted from the previous frame, except for a small set of discrete units (gating units) that capture the factors of variation in the next frame, effectively representing them symbolically. We validate the effectiveness of our method using datasets of faces undergoing 3D transformations and Atari 2600 games.",
    "Delving into the heart of neural networks, we uncover the secrets of the Hessian matrix's eigenvalues, both before and after training. What we find is a striking pattern: the eigenvalue distribution is split into two distinct components. The bulk of the eigenvalues cluster tightly around zero, while the edges are scattered far and wide, like outliers in a vast numerical landscape. But what do these patterns reveal? Our empirical evidence suggests that the bulk is a telltale sign of over-parametrization, while the edges are intimately tied to the characteristics of the input data itself. Join us as we unravel the mysteries hidden within these eigenvalues, and discover the hidden dynamics that shape the behavior of neural networks.",
    "We propose a novel parametric nonlinear transformation specifically designed to Gaussianize data from natural images. This transformation involves two stages: first, a linear transformation, and then a normalization step where each component is divided by a pooled activity measure. This measure is calculated by exponentiating a weighted sum of rectified and exponentiated components, plus a constant. We optimize the parameters of this transformation (including the linear transform, exponents, weights, and constant) using a database of natural images, with the goal of minimizing the negentropy of the responses. The resulting optimized transformation effectively Gaussianizes the data, achieving a significantly lower mutual information between transformed components compared to alternative methods such as ICA and radial Gaussianization. Notably, this transformation is differentiable and can be efficiently inverted, allowing it to induce a density model on images. We demonstrate that samples generated from this model are visually similar to natural image patches. Furthermore, we show that this model can be used as a prior probability density to remove additive noise from images. Finally, we explore the possibility of cascading this transformation, with each layer optimized using the same Gaussianization objective, providing an unsupervised method for optimizing a deep network architecture.",
    "Imagine being able to understand and work with complex patterns in data that are too hard to figure out exactly. That's what approximate variational inference lets us do. And the good news is that recent breakthroughs in this field have made it possible to create models that can learn from sequences of data, like a series of events over time, and pick up on patterns that happen in space and time. We used a special kind of model called a Stochastic Recurrent Network (STORN) to analyze data from robots over time. The results were impressive - we were able to reliably spot unusual events, both in real-time and after the fact.",
    "We establish a comprehensive framework for evaluating and refining the ability of agents to gather information in an efficient manner. This framework consists of a range of tasks that require agents to search through partially-observed environments to collect and assemble fragmented information, ultimately achieving specific objectives. By integrating deep learning architectures with reinforcement learning techniques, we develop agents capable of solving these tasks. We influence the behavior of these agents by combining external rewards with internal motivations. Our empirical results show that these agents learn to actively and intelligently seek out new information to reduce uncertainty, while also leveraging the information they have already obtained.",
    "Unlock the Power of Context: Revolutionizing Neural Network Language Models with Adaptive Prediction\n\nImagine a language model that can tap into its recent history to make predictions that are more accurate and informed. Our innovative approach extends traditional neural network language models to do just that, by leveraging a simplified version of memory-augmented networks. This game-changing mechanism stores past hidden activations as memory, accessing them through a dot product with the current hidden activation - a process that's both efficient and scalable to massive memory sizes.\n\nWhat's more, we've drawn a fascinating connection between the use of external memory in neural networks and cache models used with count-based language models. But don't just take our word for it - our approach has been put to the test on several language model datasets, and the results are astounding: we outperform recent memory-augmented networks by a significant margin. Get ready to take your language models to the next level with our cutting-edge technology!",
    "Unlocking the Power of Imagination: Introducing a Revolutionary AI Model that Brings Words to Life!\n\nInspired by the latest breakthroughs in generative models, we're proud to introduce a game-changing innovation that transforms natural language descriptions into stunning images. Our model works its magic by iteratively crafting patches on a virtual canvas, carefully tuning in to the most relevant words in the description.\n\nAfter rigorous training on the esteemed Microsoft COCO dataset, we put our model to the test against top-performing baseline generative models in image generation and retrieval tasks. The results are nothing short of astonishing: our model produces samples of unparalleled quality, and generates images with unprecedented scene compositions that perfectly capture the essence of previously unseen captions in the dataset.\n\nGet ready to experience the future of AI-powered creativity!",
    "We introduce a novel framework for jointly training multiple neural networks, enabling them to leverage each other's strengths. By regularizing the parameters of all models using the tensor trace norm, we encourage each network to borrow and reuse parameters from others when beneficial, a key principle of multi-task learning. Unlike traditional deep multi-task learning approaches, our framework doesn't rely on predefined parameter sharing strategies, where specific layers are forced to share weights. Instead, our approach considers sharing across all possible layers and learns the optimal sharing strategy in a data-driven manner, allowing the models to adapt and specialize as needed.",
    "We developed a highly effective actor-critic deep reinforcement learning agent that excels in challenging environments, including 57 Atari games and continuous control problems. Our innovations include: \n\n* Truncated importance sampling with bias correction\n* Stochastic dueling network architectures\n* A new trust region policy optimization method\n\nThese advancements enable our agent to achieve remarkable performance while maintaining stability and sample efficiency.",
    "Get ready to revolutionize the music industry! We're thrilled to introduce a groundbreaking framework for generating pop music that's sure to get you moving! Our innovative model uses a hierarchical Recurrent Neural Network that's specifically designed to mimic the way pop music is crafted. The magic happens at different levels - the bottom layers create catchy melodies, while the higher levels bring the beat with drums and chords. But don't just take our word for it! In our human studies, people overwhelmingly preferred our generated tunes over those produced by Google's recent method. And the possibilities are endless! We're already exploring two exciting applications: neural dancing and karaoke, as well as neural story singing. The future of music has never sounded brighter!",
    "Machine learning classifiers are sitting ducks for sneaky attacks. Adversarial perturbations are like stealthy ninjas that tweak an input just enough to manipulate the classifier's prediction, without raising any red flags to the human eye. To outsmart these attacks, we've developed a trio of detection methods that force would-be hackers to tone down their malicious images or risk getting caught in the act. Our top-performing detector has uncovered a telltale sign of these adversarial images: they place an unnatural emphasis on the lesser-known features that principal component analysis (PCA) reveals. For more details, including additional detectors and a visually striking saliency map, see the appendix.",
    "We introduce a novel approach to create computationally efficient convolutional neural networks (CNNs) by leveraging low-rank representations of convolutional filters. Unlike existing methods that approximate filters in pre-trained networks, we learn a set of small basis filters from scratch and train the network to combine them into complex filters that are discriminative for image classification. A novel weight initialization scheme is employed to effectively initialize connection weights in convolutional layers composed of groups of differently-shaped filters. We validate our approach by applying it to several existing CNN architectures, training them from scratch on the CIFAR, ILSVRC, and MIT Places datasets. Our results demonstrate similar or higher accuracy than conventional CNNs with significantly reduced computational requirements. Specifically, we achieve:\n\n* Comparable validation accuracy with 41% less compute and 24% fewer parameters on an improved VGG-11 network using global max-pooling.\n* A 1 percentage point increase in accuracy over our improved VGG-11 model, with a top-5 center-crop validation accuracy of 89.7%, while reducing computation by 16% relative to the original VGG-11 model.\n* Comparable accuracy with 26% less compute and 41% fewer model parameters on the GoogLeNet architecture for ILSVRC.\n* Comparable accuracy with 46% less compute and 55% fewer parameters on a near state-of-the-art network for CIFAR.",
    "Tier-by-tier uniform-variance (TBUV) initialization - a straightforward approach for weight initialization in deep neural networks - is introduced. The method involves two stages. First, pre-set weights of each convolution or inner-product layer with orthogonal matrices. Second, proceed from the initial to the final layer, standardizing the variance of the output of each layer to be equal to one. \n\nExperiments with different activation functions (maxout, ReLU-family, tanh) demonstrate that the proposed initialization leads to the learning of very deep networks that (i) produce networks with test accuracy better or equal to standard methods and (ii) are at least as efficient as the complex schemes proposed specifically for very deep networks such as FitNets (Romero et al. (2015)) and Highway (Srivastava et al. (2015)). \n\nPerformance is evaluated on GoogLeNet, CaffeNet, FitNets, and Residual nets, and the state-of-the-art, or very close to it, is achieved on the MNIST, CIFAR-10/100, and ImageNet datasets.",
    "This paper improves upon recent research by Kiperwasser and Goldberg (2016) that used a neural attention system in a simple graph-based parser. Our parser is larger and more refined than other recent approaches, and it uses special classifiers to predict connections and labels. As a result, our parser performs extremely well on standard tests for six different languages, achieving scores of 95.7% and 94.1% on the popular English dataset. This makes it the best-performing graph-based parser on this test, beating the previous best score by 1.8% and 2.2%. Our parser is also comparable to the best transition-based parser. We also found out which specific choices we made had a big impact on the parser's accuracy, which allowed us to make big improvements over other graph-based approaches.",
    "Unlocking the full potential of machines to tackle complex and abstract reasoning tasks relies on one crucial thing: accurately capturing both the obvious and subtle connections within data. That's exactly what our innovative Dynamic Adaptive Network Intelligence (DANI) model is designed to do - and it does it with ease! We're thrilled to share that DANI has achieved record-breaking results in question answering tasks on the notoriously tough bAbI dataset, outperforming other approaches that have struggled to learn effective representations (Weston et al., 2015).",
    "**Title Case:**\nSpherical Data Is Found In Many Applications. By Modeling The Discretized Sphere As A Graph, We Can Accommodate Non-Uniformly Distributed, Partial, And Changing Samplings. Moreover, Graph Convolutions Are Computationally More Efficient Than Spherical Convolutions. As Equivariance Is Desired To Exploit Rotational Symmetries, We Discuss How To Approach Rotation Equivariance Using The Graph Neural Network Introduced In Defferrard Et Al. (2016). Experiments Show Good Performance On Rotation-Invariant Learning Problems. Code And Examples Are Available At Https://Github.Com/Swissdatasciencecenter/Deepsphere.\n\n**Uppercase:**\nSPHERICAL DATA IS FOUND IN MANY APPLICATIONS. BY MODELING THE DISCRETIZED SPHERE AS A GRAPH, WE CAN ACCOMMODATE NON-UNIFORMLY DISTRIBUTED, PARTIAL, AND CHANGING SAMPLINGS. MOREOVER, GRAPH CONVOLUTIONS ARE COMPUTATIONALLY MORE EFFICIENT THAN SPHERICAL CONVOLUTIONS. AS EQUIVARIANCE IS DESIRED TO EXPLOIT ROTATIONAL SYMMETRIES, WE DISCUSS HOW TO APPROACH ROTATION EQUIVARIANCE USING THE GRAPH NEURAL NETWORK INTRODUCED IN DEFFERRARD ET AL. (2016). EXPERIMENTS SHOW GOOD PERFORMANCE ON ROTATION-INVARIANT LEARNING PROBLEMS. CODE AND EXAMPLES ARE AVAILABLE AT HTTPS://GITHUB.COM/SWISSDATASCIENCECENTER/DEEPSPHERE.\n\n**Lowercase:**\nspherical data is found in many applications. by modeling the discretized sphere as a graph, we can accommodate non-uniformly distributed, partial, and changing samplings. moreover, graph convolutions are computationally more efficient than spherical convolutions. as equivariance is desired to exploit rotational symmetries, we discuss how to approach rotation equivariance using the graph neural network introduced in defferrard et al. (2016). experiments show good performance on rotation-invariant learning problems. code and examples are available at https://github.com/swissdatasciencecenter/deepsphere.\n\n**Sentence Case:**\nSpherical data is found in many applications. 
By modeling the discretized sphere as a graph, we can accommodate non-uniformly distributed, partial, and changing samplings. Moreover, graph convolutions are computationally more efficient than spherical convolutions. As equivariance is desired to exploit rotational symmetries, we discuss how to approach rotation equivariance using the graph neural network introduced in Defferrard et al. (2016). Experiments show good performance on rotation-invariant learning problems. Code and examples are available at https://github.com/SwissDataScienceCenter/DeepSphere.\n\nLet me know if you need any further transformations!",
    "The widespread adoption of Convolutional Neural Networks (CNNs) is limited by their high computational complexity, particularly in mobile devices. However, hardware accelerators offer a promising solution to reduce both execution time and power consumption. A crucial step in developing these accelerators is hardware-oriented model approximation. This paper introduces Ristretto, a framework that approximates CNN models by analyzing the numerical resolution of weights and outputs in convolutional and fully connected layers. By utilizing fixed-point arithmetic and representation instead of floating-point, Ristretto can compress models. Additionally, it fine-tunes the resulting fixed-point network. With a maximum error tolerance of 1%, Ristretto successfully compresses CaffeNet and SqueezeNet to 8-bit. The Ristretto code is available for use.",
    "The wide range of painting styles offers a vast and expressive visual language for creating images. The extent to which we can master and concisely convey this visual language reflects our comprehension of the underlying characteristics of paintings, and perhaps images as a whole. In this study, we explore the development of a single, adaptable deep network capable of distilling the artistic essence of diverse paintings. We show that this network can generalize across various artistic styles by condensing a painting into a single point in a multidimensional space. Notably, this model allows users to discover new painting styles by freely combining the styles learned from individual works. We believe that this research takes a significant step towards building complex models of paintings and provides insight into the underlying structure of artistic style representation.",
    "In real-world applications, data often comes with missing values and a mix of discrete and continuous features. To tackle this challenge, we introduce MiniSPN, a practical and simplified version of the LearnSPN algorithm for learning Sum-Product Networks (SPNs). SPNs are a powerful class of hierarchical graphical models that balance expressiveness and tractability. The original LearnSPN algorithm is limited to discrete variables and complete data, but MiniSPN can handle missing data and heterogeneous features, making it more suitable for real-world scenarios. We evaluate the performance of MiniSPN on standard benchmark datasets and two datasets from Google's Knowledge Graph, which exhibit high missingness rates and feature diversity.",
    "While recent deep neural network research has prioritized improving accuracy, it's often possible to find multiple architectures that achieve the same level of accuracy. When comparing models with similar accuracy, smaller architectures offer several benefits. These advantages include reduced communication during distributed training, lower bandwidth requirements for model deployment, and increased feasibility for deployment on hardware with limited memory. To capitalize on these benefits, we introduce SqueezeNet, a compact DNN architecture that achieves AlexNet-level accuracy on ImageNet with significantly fewer parameters (50x fewer). Furthermore, using model compression techniques, we can reduce SqueezeNet's size to under 0.5MB, making it 510x smaller than AlexNet. The SqueezeNet architecture is available for download at https://github.com/DeepScale/SqueezeNet.",
    "We're tackling the tough problem of answering questions that need multiple facts to figure out. Our solution is the Query-Reduction Network (QRN), a type of Recurrent Neural Network (RNN) that's really good at handling both short-term and long-term connections between facts. QRN looks at context sentences like a series of clues that change the game, and it refines the original question as it takes in each new clue. Our tests show that QRN is the best in its class for question answering and dialog tasks, and it's way faster to train and use than other methods too.",
    "Imagine being able to uncover hidden patterns in language data that reveal the strengths and weaknesses of popular word embeddings. Our innovative approach makes this possible by automatically generating groups of semantically similar entities, along with a set of \"outlier\" elements that don't fit the mold. This allows us to put word embeddings to the test, evaluating their ability to detect anomalies and inconsistencies. We've already applied this methodology to create a benchmark dataset, WikiSem500, and the results are striking: the performance of word embeddings on our dataset closely mirrors their performance on sentiment analysis tasks.",
    "Recurrent neural networks (RNNs) are commonly employed for temporal data prediction, leveraging their deep feedforward architecture to learn intricate sequential patterns. However, it is hypothesized that incorporating top-down feedback could be a crucial addition, enabling the disambiguation of similar patterns based on broader contextual information. This paper presents surprisal-driven recurrent networks, which integrate past error information into new predictions by continuously tracking the discrepancy between recent predictions and actual observations. Our approach outperforms both stochastic and fully deterministic methods on the enwik8 character-level prediction task, achieving a test score of 1.37 bits per character (BPC).",
    "Despite achieving exceptional results in various generative tasks, Generative Adversarial Networks (GANs) are notorious for their instability and tendency to overlook certain modes. We contend that these issues arise from the unique functional shape of trained discriminators in high-dimensional spaces, which can cause training to stagnate or push probability mass in the wrong direction, favoring higher concentrations over the true data distribution. To address these problems, we propose several regularization techniques that significantly stabilize GAN training. Furthermore, our regularizers facilitate a more even distribution of probability mass across the modes of the data generating distribution, particularly during the early stages of training, thereby providing a comprehensive solution to the missing modes problem.",
    "Learning policies with reinforcement learning for real-world tasks is hindered by sample complexity and safety concerns, particularly when using deep neural networks. Model-based methods that approximate the target domain with a simulated source domain offer a solution by augmenting real data with simulated data. However, the differences between the simulated and target domains create a challenge for simulated training. Our EPOpt algorithm addresses this by using an ensemble of simulated source domains and adversarial training to learn robust policies that generalize to various target domains, including unmodeled effects. Additionally, the probability distribution over source domains can be refined using target domain data and approximate Bayesian methods, making it a better approximation over time. This approach combines the benefits of robustness and adaptability.",
    "We propose Divnet, a versatile methodology for learning neural networks that incorporate heterogeneous neurons. By imposing a Determinantal Point Process (DPP) on the neurons within a given layer, Divnet effectively models neuronal diversity. This DPP facilitates the selection of a subset of diverse neurons, which are then augmented by fusing redundant neurons into the chosen ones. In contrast to preceding approaches, Divnet offers a more rigorous and adaptable framework for capturing neuronal diversity, thereby implicitly inducing regularization. This, in turn, enables the efficient auto-configuration of network architecture, yielding smaller network sizes without compromising performance. Furthermore, Divnet's emphasis on diversity and neuron fusion ensures its compatibility with other methods aimed at reducing the memory footprint of networks. Our experimental results substantiate these claims, demonstrating that Divnet significantly outperforms competing approaches in the realm of neural network pruning.",
    "When it comes to graph-based semi-supervised algorithms, the graph of instances they're applied to is super important. Usually, these instances start out as vectors, and then we build a graph that connects them. To do this, we need a way to measure the distance or similarity between these vectors, which helps us figure out how strong the connections between them should be. The go-to choice for this measurement is usually something based on the euclidean norm. But here's the thing: sometimes this approach isn't the best way to get the job done efficiently. So, we've come up with an algorithm that learns the best way to represent these vectors as a graph, making it easier to solve the problem at hand.",
    "A significant hurdle in deep neural network training is the tendency to overfit. To combat this, various methods have been developed, including data augmentation and innovative regularization techniques like Dropout, which can help prevent overfitting without requiring an enormous amount of training data. In this study, we introduce a novel regularization approach called DeCov, which substantially reduces overfitting (as evidenced by the gap between training and validation performance) and enhances generalizability. Our approach promotes diverse, non-redundant representations within deep neural networks by minimizing the cross-covariance of hidden activations. Although this concept has been explored in previous research, it has surprisingly never been applied as a regularizer in supervised learning. Our experiments, conducted across a range of datasets and network architectures, consistently demonstrate that this loss function reduces overfitting while typically maintaining or improving generalization performance, often outperforming Dropout.",
    "Deep neural networks are typically trained using stochastic non-convex optimization procedures, which rely on gradient information estimated from subsets (batches) of the dataset. Although the significance of batch size as a parameter for offline tuning is widely acknowledged, the advantages of online batch selection remain poorly understood. This study investigates online batch selection strategies for two state-of-the-art stochastic gradient-based optimization methods, namely AdaDelta and Adam. Given that the loss function to be minimized for the entire dataset is an aggregation of individual datapoint loss functions, it is intuitive that datapoints with the highest loss values should be prioritized (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of selection pressure over time remain open questions. We propose a simple strategy wherein all datapoints are ranked according to their latest known loss values, and the probability of selection decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches accelerates both AdaDelta and Adam by a factor of approximately 5.",
    "We propose a scalable semi-supervised learning approach for graph-structured data using an efficient graph-based convolutional neural network. Our architecture is motivated by a localized first-order approximation of spectral graph convolutions. It scales linearly with the number of graph edges and learns representations that capture both local graph structure and node features. In experiments on citation networks and a knowledge graph dataset, our approach significantly outperforms related methods.",
    "We propose a new model called the \"Energy-based Generative Adversarial Network\" (EBGAN), which reimagines the discriminator as a function that assigns low energy values to areas close to the real data and higher energy values to other areas. Similar to traditional GANs, the generator is trained to produce samples with minimal energy, while the discriminator aims to assign high energy to these generated samples. By treating the discriminator as an energy function, we can explore a range of architectures and loss functions beyond the standard binary classifier. One example of the EBGAN framework is an auto-encoder architecture, where the energy is measured by the reconstruction error, replacing the traditional discriminator. We find that this approach leads to more stable training compared to regular GANs. Additionally, we demonstrate that a single-scale architecture can be trained to generate high-quality, high-resolution images.",
    "The rapid growth of deep learning research has led to a surge in new architectures, but many practitioners are overwhelmed by the numerous options and default to older models like AlexNet. To address this, we've distilled the collective knowledge from recent research to identify key principles for designing effective neural network architectures. We also introduce innovative designs, including FractalNet, Stagewise Boosting Networks, and Taylor Series Networks, with code available on GitHub. Our goal is to inspire others to build upon our work and advance the field.",
    "Accurately answering questions about a given passage of text, a task known as machine comprehension, relies on the ability to capture intricate relationships between the context and the query. The integration of attention mechanisms has recently proven to be a successful strategy in this domain. Existing approaches typically employ attention to selectively focus on specific segments of the context, distill the information into a fixed-size vector, and/or establish a unidirectional flow of attention over time. In contrast, this paper presents a novel architecture, the Bi-Directional Attention Flow (BIDAF) network, which adopts a hierarchical, multi-stage approach to represent the context at varying levels of detail. This framework leverages bi-directional attention flow to generate a context representation that is informed by the query, without relying on premature summarization. Our empirical results demonstrate that the proposed model achieves state-of-the-art performance on the Stanford Question Answering Dataset (SQuAD) and the CNN/DailyMail cloze test.",
    "Despite advancements, training deep generative models and performing posterior inference remains a significant hurdle, particularly when dealing with discrete latent variables. This paper focuses on developing algorithms for learning Helmholtz machines, which involve pairing a generative model with an auxiliary inference model. A major limitation of previous learning algorithms is that they indirectly optimize approximations of the target marginal log-likelihood. In contrast, we have successfully developed a novel class of algorithms based on stochastic approximation theory, which directly optimizes the marginal log-likelihood while minimizing the inclusive KL-divergence. We refer to this resulting algorithm as joint stochastic approximation (JSA). Furthermore, we design an efficient MCMC operator for JSA. Our experiments on the MNIST datasets show that JSA consistently outperforms competing algorithms, such as RWS, in learning complex models.",
    "Revolutionizing Object Detection: Unlocking the Power of Image-Scale Features\n\nImagine processing thousands of bounding boxes for a single image, only to find that many are redundant and correlated. What if you could harness the power of image-scale features to streamline this process and unlock unprecedented efficiency?\n\nOur groundbreaking research reveals a game-changing approach: by exploiting feature occurrence at the image scale, we can strategically prune the neural network, eliminating unnecessary units and reducing parameters by a staggering amount. The results are astounding - up to 40% of units in fully-connected layers can be eliminated, with minimal impact on detection accuracy.\n\n Tested on the renowned PASCAL 2007 Object Detection Challenge, our innovative method is poised to transform the field of object detection, enabling faster, more efficient, and more accurate results. Get ready to experience the future of AI-powered object detection!",
    "Imagine being able to capture the intricate relationships between features in your data, unlocking the full potential of your machine learning models. This is precisely what our novel approach, Exponential Machines (ExM), achieves. By representing an enormous tensor of parameters in a compact, factorized format called Tensor Train (TT), we can efficiently model interactions of every order. This innovative format not only regularizes the model but also allows for precise control over the number of underlying parameters. To tackle the challenge of training such a complex model, we've developed a stochastic Riemannian optimization procedure that can handle tensors with an astonishing 2^160 entries. The results speak for themselves: ExM outperforms existing methods on synthetic data with high-order interactions and matches the performance of high-order factorization machines on the popular MovieLens 100K recommender system dataset.",
    "We propose a novel approach called Deep Variational Bayes Filters (DVBF) for unsupervised learning and identification of complex systems that can be modeled as latent Markovian state space models. By building on recent breakthroughs in Stochastic Gradient Variational Bayes, DVBF can efficiently handle intractable inference distributions using variational inference. This enables the method to effectively process highly nonlinear data with temporal and spatial dependencies, such as image sequences, without requiring prior domain knowledge. Our experimental results demonstrate that allowing backpropagation through transitions reinforces the assumptions of the state space model and significantly enhances the informativeness of the latent embedding, ultimately enabling realistic long-term predictions.",
    "Traditional dialog systems are limited because they require a lot of customization for each new domain. End-to-end dialog systems, which are trained on dialog data, can overcome this limitation. However, the success of end-to-end systems in casual conversations may not translate to goal-oriented applications. This paper proposes a testbed to evaluate the strengths and weaknesses of end-to-end dialog systems in goal-oriented applications, using restaurant reservations as an example. Our system, based on Memory Networks, can perform well but not perfectly, and can learn to perform complex operations. We compared our system to a traditional slot-filling approach and found similar results on data from the Dialog State Tracking Challenge and an online concierge service.",
    "Imagine you're trying to teach a computer to understand and process language. One way to do this is by using a type of training called adversarial training, which helps the computer learn more accurately. However, this method doesn't work well with certain types of language data, like the way words are represented in computers.\n\nWe've found a way to adapt this training method to work with language data by making small changes to the way words are understood by the computer, rather than changing the words themselves. This new approach has achieved the best results so far in several tests, and we've also found that it helps the computer learn more efficiently and avoid mistakes.\n\nWe've made our code available online for others to use and build upon. This could have exciting implications for the development of artificial intelligence and natural language processing.",
    "The unsupervised learning of probabilistic models constitutes a fundamental and intricate challenge in the realm of machine learning. The design of models that facilitate tractable learning, sampling, inference, and evaluation is essential to addressing this problem. This paper presents an extension to the existing space of such models through the incorporation of real-valued non-volume preserving (real NVP) transformations, a class of powerful, invertible, and learnable transformations. The resulting unsupervised learning algorithm enables exact computation of log-likelihood, exact sampling, exact inference of latent variables, and an interpretable latent space. The efficacy of this approach is demonstrated through its application to natural image modeling on four datasets, with evaluations conducted via sampling, log-likelihood assessment, and latent variable manipulation.",
    "This study explores the structure of view manifolds in the feature spaces of Convolutional Neural Networks (CNNs). We aim to answer several key questions:\n\n* Do CNNs learn viewpoint-invariant representations?\n* How do they achieve viewpoint invariance?\n* Is it by merging or separating view manifolds while preserving them?\n* At which layer is viewpoint invariance achieved?\n* How can we measure the structure of view manifolds at each layer of a deep CNN?\n* How does fine-tuning a pre-trained CNN on a multi-view dataset affect the representation at each layer?\n\nTo answer these questions, we propose a method to quantify the deformation and degeneracy of view manifolds in CNN layers. We apply this method and present interesting findings in this paper that address these questions.",
    "Bilinear models unequivocally offer more robust and nuanced representations compared to their linear counterparts. Their applications in various visual tasks, including object recognition, segmentation, and visual question-answering, have consistently yielded state-of-the-art performances, leveraging the enhanced representational capacity. However, the high dimensionality of bilinear representations has historically limited their applicability to computationally intensive tasks. To address this limitation, we introduce a novel low-rank bilinear pooling approach utilizing the Hadamard product, which enables an efficient attention mechanism for multimodal learning. Our empirical results demonstrate that our model surpasses compact bilinear pooling in visual question-answering tasks, achieving state-of-the-art performance on the VQA dataset while exhibiting improved parsimony.",
    "You might've heard that importance-weighted autoencoders are special because they can maximize a tighter lower bound on the marginal likelihood than the standard evidence lower bound. But we'd like to offer a different take on what's really going on. In our view, importance-weighted autoencoders are actually optimizing the standard variational lower bound, just with a more complex distribution. We'll walk you through the math to prove it, show you an even tighter lower bound, and give you a visual representation of the implicit importance-weighted distribution that emerges.",
    "We've developed a formula to measure how well feedforward neural networks will perform on new data. This formula combines two factors: the 'spectral norm' of each layer and the 'Frobenius norm' of the network's weights. We used a PAC-Bayes analysis to come up with this formula.",
    "This paper presents a novel approach to augmenting Generative Adversarial Networks (GANs) with the capability to generate direct energy estimates for samples. Specifically, we introduce a flexible adversarial training framework, which we theoretically demonstrate converges to the true data distribution while preserving the discriminator's ability to retain density information at the global optimum. We derive the analytical form of the induced solution and conduct a thorough examination of its properties. To facilitate the practical implementation of our proposed framework, we develop two effective approximation techniques. Our empirical results closely align with our theoretical analysis, thereby validating the discriminator's ability to recover the energy of the data distribution.",
    "Get ready to revolutionize outlier detection! We're pushing the boundaries by harnessing the power of ensemble neural networks, born from the variational approximation of the posterior in a Bayesian neural network setting. But that's not all - we're taking it to the next level by using gradient descent to sample from the true posterior and obtain those crucial variational parameters. And the results? Absolutely stunning! Our outlier detection capabilities are on par with the best of the best, rivaling even the most efficient ensembling methods out there.",
    "Unlock the power of Large Long Short-Term Memory (LSTM) networks without breaking the bank! We're excited to share two innovative strategies to slash parameter counts and turbocharge training times. First, we've developed a clever \"matrix factorization by design\" approach that breaks down the LSTM matrix into two smaller, more manageable pieces. And second, we've found that partitioning the LSTM matrix, inputs, and states into independent groups can work wonders. The result? You can train large LSTM networks at lightning speed, achieving near state-of-the-art perplexity with a fraction of the RNN parameters.",
    "We found some surprising things while training a type of artificial intelligence called residual networks. We want to understand how these networks work better, so we looked closely at these new discoveries. We used two special techniques to find these behaviors: Cyclical Learning Rates and linear network interpolation. What we found was unexpected: sometimes the networks got better or worse in unexpected ways, and sometimes they learned really fast. For example, we found that using Cyclical Learning Rates can make the network better at its job, even when we use big learning rates. You can find the files to repeat our results at https://github.com/lnsmith54/exploring-loss.",
    "Machine learning models often face limitations and trade-offs when used in real-world applications that aren't present during training. For instance, a computer vision model on a small device needs to process images quickly, while a translation model on a phone needs to conserve battery power. In this study, we developed a model that can adapt its resource usage to each input it receives, using a technique called reinforcement learning. We tested our approach on a simple example using the MNIST dataset.",
    "Imagine a world where AI agents, trained to make decisions based on visual inputs, can be manipulated and deceived by cleverly crafted attacks. This is the reality of adversarial examples, which have been shown to compromise a wide range of deep learning architectures. But what about deep reinforcement learning, which has achieved remarkable success in training agents to make decisions directly from raw image pixels? Can these agents be fooled too?\n\nIn this groundbreaking study, we delve into the uncharted territory of adversarial attacks on deep reinforcement learning policies. We pit the effectiveness of adversarial examples against random noise, and uncover surprising insights. Our research also reveals a novel approach to reducing the number of attacks needed to succeed, by leveraging the value function. But that's not all - we also explore how retraining agents on random noise and FGSM perturbations can bolster their resilience against these cunning attacks.\n\nJoin us on this fascinating journey into the unknown, as we uncover the vulnerabilities and opportunities of deep reinforcement learning in the face of adversarial threats.",
    "This paper presents a novel framework for continual learning, termed Variational Continual Learning (VCL), which integrates online variational inference (VI) with recent advancements in Monte Carlo VI for neural networks. This framework is capable of successfully training both deep discriminative models and deep generative models in complex continual learning scenarios, where existing tasks undergo evolution over time and new tasks emerge. The experimental results demonstrate that VCL surpasses state-of-the-art continual learning methods across a range of tasks, effectively mitigating catastrophic forgetting in a fully automated manner.",
    "Determining the optimal neural network architecture for a specific task in the absence of prior knowledge currently necessitates a computationally expensive global search, involving the training of multiple networks from scratch. This paper tackles the challenge of automatically identifying an optimal network size within a single training cycle. We propose a novel framework, referred to as nonparametric neural networks, which facilitates optimization across all possible network sizes in a non-probabilistic manner. We establish the soundness of this approach when network growth is regulated through the application of an Lp penalty. Our methodology involves the continuous addition of new units, accompanied by the elimination of redundant units via an L2 penalty. Furthermore, we introduce a novel optimization algorithm, termed adaptive radial-angular gradient descent (AdaRad), which yields promising results.",
    "Imagine being able to understand the subtle relationships between two sentences - that's what Natural Language Inference (NLI) is all about! We're excited to introduce Interactive Inference Network (IIN), a game-changing neural network architecture that digs deep into sentence pairs to uncover their semantic secrets. By harnessing the power of interaction tensors (think attention weights on steroids!), we can tap into a treasure trove of semantic information to crack even the toughest NLI challenges. And the best part? Our Densely Interactive Inference Network (DIIN) has already smashed records on large-scale NLI datasets, including the notoriously tricky Multi-Genre NLI (MultiNLI) dataset, with a whopping 20%+ error reduction compared to the previous best system!",
    "The widespread adoption of neural networks in high-stakes applications is hindered by the existence of adversarial examples, which are subtly altered inputs that can deceive even the most advanced networks. Despite numerous attempts to develop robustness against these threats, most proposed defenses have been rapidly compromised by subsequent attacks. In fact, a staggering 50% of defenses presented at ICLR 2018 have already been breached. To overcome this challenge, we turn to formal verification methods. Our approach enables the creation of provably optimal adversarial examples, guaranteeing the minimum possible distortion for any given neural network and input. Notably, we demonstrate that adversarial retraining, a recent defense strategy, can be mathematically proven to increase the required distortion for adversarial examples by a significant factor of 4.2.",
    "We have developed an extension of Stochastic Gradient Variational Bayes, enabling posterior inference for the weights of Stick-Breaking processes. This breakthrough allows us to introduce the Stick-Breaking Variational Autoencoder (SB-VAE), a novel Bayesian nonparametric approach to variational autoencoders. Notably, the SB-VAE features a latent representation with dynamic dimensionality, which can adapt to the complexity of the data. Through experiments, we demonstrate that the SB-VAE, as well as its semi-supervised variant, learn highly informative and discriminative latent representations that often surpass the performance of traditional Gaussian VAEs.",
    "Imagine a world where multiple neural networks can learn from each other, sharing their strengths and expertise to become even more powerful. Our proposed framework makes this a reality, allowing multiple neural networks to be trained simultaneously and encouraging them to reuse each other's parameters whenever possible. This innovative approach is rooted in the principles of multi-task learning, where models can tap into each other's knowledge to achieve better results. What sets our framework apart is its flexibility - unlike traditional deep multi-task learning models, we don't dictate which layers should share parameters. Instead, our framework lets the data decide, allowing the sharing strategy to emerge organically and unlocking the full potential of collaborative learning.",
    "Revolutionize Reinforcement Learning: Introducing a Game-Changing Actor-Critic Agent!\n\nImagine an AI that can master even the most challenging environments with ease. Our innovative actor-critic deep reinforcement learning agent, equipped with experience replay, does just that. It's stable, efficient, and achieves remarkable results in both discrete and continuous control problems, including the notoriously tough 57-game Atari domain.\n\nBut what makes this agent so exceptional? We've introduced a trio of groundbreaking innovations: truncated importance sampling with bias correction, stochastic dueling network architectures, and a novel trust region policy optimization method. The result? An AI that's poised to transform the field of reinforcement learning.",
    "Machine learning classifiers are often susceptible to adversarial attacks, which involve making subtle modifications to input data to alter the classifier's prediction without noticeably changing the input from a human perspective. To combat this, we have developed three methods to identify adversarial images. These methods force would-be attackers to either make their adversarial images less anomalous, thereby reducing their effectiveness, or risk being detected. Notably, our most effective detection method reveals that adversarial images tend to place unusual emphasis on lower-ranked principal components, as identified through Principal Component Analysis (PCA). Additional details on our detection methods and a visually informative saliency map can be found in the appendix.",
    "We present a novel, principled approach to kernel learning, grounded in a Fourier-analytic framework for characterizing translation-invariant and rotation-invariant kernels. This method generates a sequence of feature maps, iteratively optimizing the SVM margin. We provide rigorous theoretical guarantees for both optimality and generalization, which can be interpreted as online equilibrium-finding dynamics in a specific two-player min-max game. Empirical evaluations on both synthetic and real-world datasets demonstrate the scalability and consistent performance improvements of our approach over related methods based on random features.",
    "Current deep reading comprehension models rely heavily on recurrent neural networks, which are well-suited for language processing due to their sequential nature. However, this sequential design makes it difficult to process data in parallel within a single instance, leading to slow deployment in time-sensitive applications. This limitation is particularly pronounced when dealing with longer texts. To address this issue, we propose an alternative approach using a convolutional architecture. By replacing recurrent units with simple dilated convolutional units, we achieve results comparable to the state of the art on two question answering tasks, while simultaneously achieving speedups of up to two orders of magnitude for question answering.",
    "This report serves a triple purpose. Firstly, it scrutinizes the reproducibility of the seminal paper 'On the regularization of Wasserstein GANs' (2018). Secondly, it delves into the replication of five critical experiments from the original paper, focusing on learning speed, stability, hyperparameter robustness, Wasserstein distance estimation, and diverse sampling methods. Lastly, it provides a candid assessment of the reproducibility of the paper's contributions, along with a detailed breakdown of the required resources. To ensure transparency and accessibility, all source code used for reproduction is publicly available.",
    "Variational Autoencoders (VAEs) were initially developed as probabilistic generative models for approximate Bayesian inference. However, the introduction of β-VAEs expanded their application to areas like representation learning, clustering, and lossy data compression by allowing a trade-off between the information content of the latent representation and the distortion of reconstructed data. This paper re-examines this trade-off in the context of hierarchical VAEs, which have multiple layers of latent variables. We identify a class of inference models where the rate can be split into contributions from each layer, allowing independent tuning. We establish theoretical bounds on downstream task performance based on individual layer rates and validate our findings through large-scale experiments. Our results provide practical guidance on which rate-space region to target for a specific application.",
    "In the realm of network analysis, methods that learn node representations play a pivotal role, as they facilitate a wide range of downstream learning tasks. We introduce Graph2Gauss, a novel approach that efficiently learns versatile node embeddings on large-scale (attributed) graphs, yielding exceptional performance on tasks such as link prediction and node classification. Unlike conventional approaches that represent nodes as point vectors in a low-dimensional continuous space, our method embeds each node as a Gaussian distribution, thereby capturing uncertainty about the representation. Moreover, we propose an unsupervised method that tackles inductive learning scenarios and is applicable to diverse types of graphs, including plain/attributed and directed/undirected graphs. By harnessing both the network structure and associated node attributes, we can generalize to unseen nodes without additional training. Our approach leverages a personalized ranking formulation with respect to node distances, exploiting the natural ordering of nodes imposed by the network structure. Experiments on real-world networks unequivocally demonstrate the superior performance of our approach, outperforming state-of-the-art network embedding methods on multiple tasks. Furthermore, we illustrate the benefits of modeling uncertainty, which enables the estimation of neighborhood diversity and the detection of a graph's intrinsic latent dimensionality.",
    "This paper investigates the application of self-ensembling to visual domain adaptation problems. Building on the mean teacher variant of temporal ensembling (Tarvainen et al., 2017; Laine et al., 2017), a technique that has achieved state-of-the-art results in semi-supervised learning, we introduce several key modifications to tackle challenging domain adaptation scenarios. We evaluate the effectiveness of our approach and demonstrate its ability to achieve state-of-the-art results across multiple benchmarks, including our winning entry in the VISDA-2017 visual domain adaptation challenge. Notably, our algorithm outperforms prior art in small image benchmarks and even approaches the accuracy of a supervised classifier.",
    "Imagine you're trying to teach a computer to recognize pictures of dogs and cats. You show it lots of examples, and it gets really good at it. But, what if someone intentionally adds some tiny changes to the pictures that make the computer think a dog is a cat? This is called an \"adversarial example\", and it's a big problem in machine learning.\n\nOur goal is to understand why these adversarial examples work, and how to make computers better at recognizing them. We're not trying to come up with a single solution, but rather to figure out the underlying reasons why they're a problem.\n\nWe're using some advanced math concepts to study how computers and humans see things differently. By looking at how computers and humans process information, we can identify the key reasons why computers get fooled by adversarial examples.\n\nWhat we've found is that if a computer is relying on the wrong information, even just a little bit, it can be easily tricked. But, if we can teach the computer to focus on the right information, it can be both accurate and resistant to these tricks. This is important because it means we can make computers that are better at recognizing pictures, and more secure from being manipulated.",
    "We establish a comprehensive framework for evaluating and refining the ability of agents to gather information in an efficient manner. This framework consists of a range of tasks that require agents to search through partially-observed environments to collect and assemble fragmented information, ultimately achieving specific objectives. By integrating deep learning architectures with reinforcement learning techniques, we develop agents capable of solving these tasks. We influence the behavior of these agents by combining external rewards with internal motivations. Our empirical results show that these agents learn to actively and intelligently seek out new information to reduce uncertainty, while also leveraging the information they have already obtained.",
    "We propose a neural network language model extension that adapts predictions based on recent history. Our model stores past hidden activations as memory, accessed via dot product with current activations. This efficient mechanism scales to large memory sizes. We link our approach to cache models used with count-based language models and demonstrate significant performance improvements on several datasets.",
    "Generative Adversarial Networks (GANs) have proven to be highly effective deep generative models, rooted in a two-player minimax game framework. However, the original objective function has been modified to yield stronger gradients during generator training. We introduce a novel algorithm that iteratively performs density ratio estimation and f-divergence minimization. This approach offers a fresh perspective on understanding GANs, leveraging insights from density ratio estimation research, such as identifying stable divergences and useful relative density ratios.",
    "We introduce a novel pop music generation framework using a hierarchical RNN, where each layer's structure encodes prior knowledge of pop music composition. The model generates melodies at the bottom, and drums and chords at higher levels. Human studies show a strong preference for our music over Google's recent method, with applications in neural dancing, karaoke, and story singing.",
    "We analyze the eigenvalues of the Hessian of a loss function before and after training. The results show two distinct patterns: a bulk of eigenvalues clustered around zero and edges that are scattered away from zero. Our findings suggest that the bulk indicates the level of over-parametrization in the system, while the edges are influenced by the input data.",
    "This paper introduces a novel technique for extracting features from program execution logs. Our approach involves two key steps. First, we use automated methods to identify complex patterns in a program's behavior graph. Next, we utilize an autoencoder to map these patterns into a continuous space. We then assess the effectiveness of our proposed features by applying them to a real-world task: detecting malicious software. Notably, our results show that the resulting embedding space reveals interpretable structures within the patterns, providing valuable insights.",
    "In an embodied navigation task, we evaluated the efficiency of the FlyHash model, a sparse neural network inspired by insects (Dasgupta et al., 2017), and compared it to similar non-sparse models. The task required the model to steer by comparing current visual inputs to memories stored along a training route. Our results showed that the FlyHash model outperformed the others, particularly in terms of data encoding efficiency.",
    "In the peer-review process, reviewers typically provide scores for papers, which are then used by Area Chairs or Program Chairs to inform their decisions. However, these scores are often limited by the human ability to quantify opinions, resulting in a large number of ties and significant loss of information. To address this issue, conferences have started asking reviewers to provide a ranking of the papers they've reviewed, in addition to scores. While this approach shows promise, it presents two key challenges. Firstly, there is no standardized way to incorporate ranking information into the decision-making process, leading to arbitrariness. Secondly, there is a lack of suitable interfaces and methods to effectively utilize this data, resulting in inefficiencies. \n\nOur approach tackles these challenges by integrating ranking information into the scores in a principled manner. The output of our method is an updated score for each review that incorporates the rankings. By doing so, we ensure that rankings are consistently incorporated into the updated scores for all papers, mitigating arbitrariness, and allowing for seamless integration with existing interfaces and workflows designed for scores. We evaluate our method using synthetic datasets and real peer-review data from the ICLR 2017 conference, and find that it reduces error by approximately 30% compared to the best-performing baseline on the ICLR 2017 data.",
    "Uncovering Hidden Biases: A Groundbreaking Study Reveals the Surprising Impact of Author Metadata on Academic Publishing\n\nIn the cutthroat world of academic publishing, getting your research published in a top-tier journal or conference can make or break a career. But what if the outcome is influenced by more than just the quality of your work? Recent studies have hinted at a disturbing trend: status bias in the peer-review process. We decided to dig deeper.\n\nOur investigation analyzed a massive dataset of 5,313 borderline submissions to the prestigious International Conference on Learning Representations (ICLR) from 2017 to 2022. Using a rigorous cause-and-effect analysis, we uncovered some startling findings. It appears that author metadata – think institutional affiliation and prestige – has a significant impact on the final decision of area chairs.\n\nBut here's the twist: our results suggest that papers from high-ranking institutions (think top 30% or 20%) are actually less favored by area chairs compared to their matched counterparts. This phenomenon was consistent across two different matched designs, with odds ratios of 0.82 and 0.83 respectively.\n\nSo, what does this mean for the future of academic publishing? Our study sheds light on the complex interactions between authors, reviewers, and area chairs in the peer-review system. It's time to confront the hidden biases that shape our understanding of what constitutes 'good' research. Join us as we explore the implications of these findings and the steps we can take to create a more equitable and merit-based publishing landscape.",
    "We introduce a novel variational approach to the information bottleneck method proposed by Tishby et al. in 1999. By leveraging neural networks and the reparameterization trick, our method, dubbed \"Deep Variational Information Bottleneck\" (Deep VIB), enables efficient training of the information bottleneck model. Notably, our experiments demonstrate that models trained using the VIB objective exhibit superior generalization performance and robustness to adversarial attacks compared to those trained with alternative regularization techniques.",
    "Unlocking the Power of Attention Networks: A Breakthrough in Modeling Rich Structural Dependencies\n\nImagine being able to tap into the full potential of deep neural networks by incorporating richer structural distributions, without sacrificing the convenience of end-to-end training. This is exactly what we've achieved in our groundbreaking research, where we've successfully integrated graphical models into attention networks to create a new breed of structured attention networks.\n\nThese innovative networks are surprisingly simple extensions of the traditional attention procedure, yet they open up a world of possibilities, such as attending to partial segmentations or subtrees. We've experimented with two cutting-edge models - a linear-chain conditional random field and a graph-based parsing model - and demonstrated how they can be seamlessly implemented as neural network layers.\n\nThe results are nothing short of remarkable. Our structured attention networks outperform baseline attention models across a range of tasks, including tree transduction, neural machine translation, question answering, and natural language inference. But that's not all - we've also discovered that these models learn fascinating unsupervised hidden representations that generalize beyond simple attention.\n\nGet ready to unlock the full potential of attention networks and take your deep learning models to the next level!",
    "We propose using a team of diverse specialists, each with expertise defined by their performance on a confusion matrix. Our observation is that when faced with adversarial examples from a particular class, these specialists tend to mislabel them into a limited set of incorrect classes. Therefore, we believe that an ensemble of these specialists can more effectively identify and reject misleading instances, characterized by high disagreement (entropy) among their decisions when confronted with adversaries. Our experimental results support this interpretation, suggesting a promising approach to improve the system's robustness against adversarial examples by incorporating a rejection mechanism, rather than attempting to classify them accurately at all costs.",
    "Introducing Neural Phrase-based Machine Translation (NPMT)\n\nWe propose a novel approach to machine translation that explicitly models phrase structures in output sequences using Sleep-WAke Networks (SWAN), a recently developed segmentation-based sequence modeling method. To overcome the limitation of SWAN's monotonic alignment requirement, we introduce a new layer that performs soft local reordering of input sequences.\n\nUnlike existing neural machine translation (NMT) approaches, NPMT does not rely on attention-based decoding mechanisms. Instead, it generates phrases in a sequential order and can decode in linear time. Our experimental results demonstrate that NPMT outperforms strong NMT baselines on IWSLT 2014 German-English/English-German and IWSLT 2015 English-Vietnamese machine translation tasks. Moreover, our method produces meaningful phrases in the output languages.",
    "Introducing LR-GAN, a game-changing image generation model that considers the bigger picture - literally. Unlike other generative adversarial networks (GANs), our model learns to create images in a more thoughtful way. It starts by generating the background and foregrounds separately, and then stitches them together in a way that makes sense. But that's not all - it also figures out the perfect appearance, shape, and pose for each object in the foreground. The best part? Our model is completely unsupervised, meaning it learns on its own without any human guidance. And the results? Amazing. Our experiments show that LR-GAN can produce more natural-looking images with objects that are way easier to recognize than those generated by DCGAN.",
    "Imagine you're trying to figure out how to navigate a new environment without anyone showing you the ropes. We've come up with a simple way to make that happen. We pit two versions of the same agent, Alice and Bob, against each other in a game of sorts. Alice gives Bob a task to complete, and then Bob tries to do it. We're focusing on two types of environments: ones that can be easily reversed, and ones that can be reset. Alice gives Bob the task by doing a series of actions, and then Bob has to either undo them or repeat them. The way we set up the rewards means that Alice and Bob automatically create a plan for exploring the environment, which lets the agent learn on its own. And the best part? When Bob is put to the test in a real-world task, it can learn way faster and even get better results than if it had been trained with supervision.",
    "We propose a novel approach to maximum entropy modeling by learning a smooth, invertible transformation that maps a simple distribution to the desired maximum entropy distribution. By leveraging normalizing flow networks, we convert the problem into a finite-dimensional constrained optimization, which we solve using stochastic optimization and the augmented Lagrangian method. Our method is effective, flexible, and accurate, as demonstrated by simulation results and applications in finance and computer vision.",
    "Machine learning's daily breakthroughs make general AI seem achievable, but most research focuses on narrow applications like image classification. We think this is because there's no clear way to measure progress towards broad AI. To address this, we propose a set of concrete goals for general AI and a platform to test machines against these goals, minimizing complexity.",
    "Unlock the Power of Graph Neural Networks with Dynamic Batching!\n\nGraph neural networks have the potential to revolutionize a wide range of domains, from natural language processing (parse trees) to cheminformatics (molecular graphs). However, their unique architecture poses a significant challenge: each input has a distinct shape and size, making batched training and inference a daunting task. Moreover, implementing these networks in popular deep learning libraries is a complex endeavor, as they rely on static data-flow graphs.\n\nIntroducing dynamic batching, a game-changing technique that enables efficient batching of operations across different input graphs of varying shapes and sizes, as well as within individual graphs. This innovation allows us to create static graphs that mimic dynamic computation graphs of any shape and size, using popular libraries.\n\nTo further simplify the process, we've developed a high-level library of modular blocks that makes it easy to build dynamic graph models. With this library, we've successfully implemented concise and batch-wise parallel versions of various models from the literature, demonstrating the vast potential of graph neural networks.",
    "Revolutionizing the way we understand deep learning models, our research sheds light on the mysterious decision-making processes of Long Short Term Memory networks (LSTMs). By cracking the code on how LSTMs prioritize input data, we've developed a groundbreaking approach to uncover the hidden patterns that drive their output. Our innovative method identifies the most influential phrases that consistently impact sentiment analysis and question answering, allowing us to distill complex LSTMs into a set of powerful, representative phrases. But that's not all - we've taken it a step further by using these extracted phrases to build a simple, rule-based classifier that remarkably mirrors the output of the LSTM. The result? A game-changing representation that's been quantitatively validated, paving the way for a new era of transparency and understanding in deep learning.",
    "Deep reinforcement learning has yielded remarkable achievements in recent years, but tasks with infrequent rewards or extended time horizons remain significant obstacles. To address these challenges, we introduce a general framework that initially acquires versatile skills in a pre-training environment and subsequently leverages these skills to accelerate learning in downstream tasks. Our approach combines the benefits of intrinsic motivation and hierarchical methods, where the learning of useful skills is directed by a single proxy reward that requires minimal domain knowledge of the downstream tasks. A high-level policy is then trained on top of these skills, substantially enhancing exploration and enabling the tackling of sparse rewards in downstream tasks. To efficiently pre-train a broad range of skills, we employ Stochastic Neural Networks in conjunction with an information-theoretic regularizer. Our experiments demonstrate that this combination effectively learns a diverse set of interpretable skills in a sample-efficient manner, resulting in significant and uniform performance improvements across a wide range of downstream tasks.",
    "In the world of AI, a revolution is underway. Deep generative models have been making waves in recent years, and two rockstar families - Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) - have been stealing the show. For a long time, these two superstars were thought to be operating in separate universes, with their own unique styles and fan bases. But what if we told you that they're not as different as they seem? This paper is all about bridging the gap between GANs and VAEs, and showing that they're actually two sides of the same coin. By flipping the script on how we think about sample generation, we can reveal the hidden connections between these two powerhouses. And the best part? This newfound understanding lets us borrow tricks from one camp to supercharge the other. We've already seen some amazing results, from turbocharging GANs with VAE-style importance weighting to giving VAEs a boost with adversarial generated samples. The possibilities are endless, and we can't wait to see what the future holds!",
    "Imagine being able to identify when a neural network is faced with an image it's never seen before. We tackled this challenge by developing ODIN, a groundbreaking method that doesn't require any modifications to existing neural networks. Our approach is rooted in a key insight: by applying temperature scaling and introducing minor input perturbations, we can distinguish between in-distribution and out-of-distribution images based on their softmax score distributions. Through a range of experiments, we demonstrated ODIN's versatility across various network architectures and datasets. The results are striking, with ODIN significantly outperforming the baseline approach and setting a new standard for this task. For instance, on the DenseNet model applied to CIFAR-10, ODIN slashed the false positive rate from 34.7% to just 4.3% while maintaining a 95% true positive rate.",
    "A framework for unsupervised learning of representations is presented, based on the infomax principle for large-scale neural populations. An asymptotic approximation of Shannon's mutual information is used to demonstrate the effectiveness of a hierarchical infomax method. An efficient algorithm is proposed, using gradient descent to learn representations from input datasets, and is shown to be robust and efficient in extracting salient features. The method outperforms existing approaches in training speed and robustness, and can be easily extended to supervised or unsupervised models for deep structure networks.",
    "Recurrent Neural Networks (RNNs) have demonstrated exceptional performance in sequence modeling tasks. However, training RNNs on long sequences can be challenging due to slow inference, vanishing gradients, and difficulty in capturing long-term dependencies. These issues are closely tied to the large, sequential computational graph that results from unfolding the RNN in time during backpropagation. To address these challenges, we propose the Skip RNN model, which extends existing RNN models by learning to selectively skip state updates, thereby reducing the effective size of the computational graph. Additionally, the model can be incentivized to perform fewer state updates through a budget constraint. Our evaluation of the proposed model on various tasks shows that it can significantly reduce the number of required RNN updates while maintaining, and sometimes even improving, the performance of baseline RNN models. The source code for the Skip RNN model is publicly available at https://imatge-upc.github.io/skiprnn-2017-telecombcn/.",
    "To tackle complex functions with multiple peaks, optimization methods often employ restart strategies. Recently, partial warm restarts have been applied to gradient-based optimization to speed up convergence when dealing with poorly conditioned functions. This paper introduces a straightforward warm restart approach for stochastic gradient descent, aiming to enhance its performance at any point during the training of deep neural networks. We test its effectiveness on the CIFAR-10 and CIFAR-100 datasets, achieving new state-of-the-art results of 3.14% and 16.21%, respectively. Additionally, we demonstrate its benefits on an EEG recording dataset and a downscaled version of the ImageNet dataset. The source code is available at https://github.com/loshchil/SGDR.",
    "Get ready to revolutionize reinforcement learning! Policy gradient methods have already achieved astounding breakthroughs in tackling the toughest challenges. But, we knew we could do even better! The pesky problem of high variance in policy gradient estimation was holding us back, leading to slow and inefficient training. That's why we're thrilled to introduce a game-changing control variate method that slashes variance and supercharges policy gradient methods! Inspired by Stein's identity, our innovative approach takes the best of REINFORCE and advantage actor-critic and elevates them to new heights with flexible, action-dependent baseline functions. The results are nothing short of amazing - our method blows the doors off sample efficiency, leaving state-of-the-art policy gradient approaches in its wake!",
    "Skip connections have made it possible to train extremely deep neural networks and are now a crucial part of many neural architectures. Despite their success, the reason why skip connections work so well remains unclear. In this study, we offer a new explanation for the benefits of skip connections in training deep networks. One of the main challenges in training deep networks is that they can get stuck in singularities, which are caused by the fact that the model is not uniquely defined. There are several types of singularities that can occur, including those caused by node permutation, node elimination, and linear dependence between nodes. These singularities create \"bad\" areas in the loss landscape that slow down the learning process. We argue that skip connections help eliminate these singularities by breaking node symmetry, reducing node elimination, and decreasing linear dependence. Additionally, skip connections move the network away from these \"bad\" areas and reshape the landscape to make learning easier. Our hypotheses are supported by both simplified models and experiments with real-world datasets.",
    "As part of the ICLR 2018 Reproducibility Challenge, we embarked on an exciting journey to recreate the findings of the paper \"Natural Language Inference over Interaction Space\" submitted to the conference. Initially, we didn't realize that the authors had shared their code, so we took on the challenge of building the network from the ground up. After putting our model to the test on the Stanford NLI dataset, we achieved an impressive 86.38% accuracy - just shy of the 88.0% accuracy reported in the original paper. Our analysis suggests that the discrepancy can be attributed to differences in optimizer selection and model tuning approaches.",
    "We have successfully integrated the \"Learn to Pay Attention\" attention mechanism into convolutional neural networks, reproducing the original paper's results in both image classification and fine-grained recognition tasks.",
    "Learning universal sentence representations is a crucial step in natural language processing. Our approach involves encoding the suffixes of word sequences within a sentence and leveraging the Stanford Natural Language Inference (SNLI) dataset for training. The results of our method are impressive, outperforming existing approaches on multiple transfer tasks as evaluated by the SentEval benchmark.",
    "Lots of neural models boost their representations by adding new features that are basically polynomial combos of existing ones. We looked into using scaled polynomials of degree 2 and up as matching features, using natural language inference as an example. What we found was that scaling degree 2 features made the biggest difference, cutting classification errors by 5% in the top-performing models.",
    "We establish a novel generalization bound for feedforward neural networks, which is expressed as the product of the spectral norm of the individual layers and the Frobenius norm of the weights. This bound is rigorously derived using a PAC-Bayes analysis, providing a theoretical framework for understanding the generalization capabilities of neural networks.",
    "In this groundbreaking work, we uncover the full potential of Batch Normalization by proposing a novel probabilistic interpretation that revolutionizes its application. By developing a probabilistic model, we reveal that Batch Normalization optimizes the lower bound of its marginalized log-likelihood, leading to unparalleled performance. Building on this insight, we design an algorithm that ensures consistency between training and testing, eliminating the uncertainty that has long plagued deep learning models. While this approach yields remarkable results, we acknowledge the computational challenges it poses. To overcome these limitations, we introduce Stochastic Batch Normalization, a game-changing approximation that streamlines the inference process while maintaining accuracy. This innovative technique empowers us with a scalable uncertainty estimation method, unlocking new possibilities for deep learning applications. We demonstrate the remarkable performance of Stochastic Batch Normalization on state-of-the-art architectures, including VGG-like and ResNets, using MNIST and CIFAR-10 datasets, showcasing its vast potential to transform the field.",
    "The success of deep convolutional networks is often attributed to their ability to progressively eliminate uninformative input variability relevant to the problem at hand. Empirical evidence supporting this idea includes the difficulty of recovering images from their hidden representations in most commonly used network architectures. However, we demonstrate that this information loss is not a necessary condition for learning generalizable representations, even for complex problems like ImageNet. By designing a cascade of homeomorphic layers, we create the i-RevNet, a fully invertible network that preserves all input information up to the final class projection. Overcoming the challenge of local inversion ill-conditioning, we provide an explicit inverse. Our analysis of i-RevNet's learned representations suggests an alternative explanation for deep network success, involving progressive contraction and linear separation with depth. To gain insight into the i-RevNet's learned model, we reconstruct linear interpolations between natural image representations.",
    "Unlock the full potential of representation learning with deep latent variable models! In this groundbreaking paper, we take the deep information bottleneck model to the next level by pinpointing its limitations and introducing a revolutionary new approach that overcomes them. By harnessing the power of copula transformation, we restore the invariance properties of the information bottleneck method, achieving unprecedented disentanglement of features in the latent space. But that's not all - we also demonstrate how this transformation gives rise to a sparse latent space in our innovative new model. To put our method to the test, we apply it to both artificial and real-world data, with remarkable results.",
    "We propose a modified version of the MAC model (Hudson and Manning, ICLR 2018) that simplifies the equation set while maintaining comparable performance, with the added benefit of faster training times. Our evaluation of both models on CLEVR and CoGenT datasets reveals that transfer learning with fine-tuning yields a significant 15-point accuracy boost, on par with state-of-the-art results. Conversely, we also show that improper fine-tuning can have a detrimental effect, actually decreasing a model's accuracy.",
    "Imagine a neural network that can adapt to the complexity of its input data, devoting more time and resources to the toughest challenges. This is the promise of Adaptive Computation Time (ACT) for Recurrent Neural Networks, a cutting-edge architecture that can revisit and reprocess individual data points as needed. But how does it stack up against a simpler approach? In this study, we pit ACT against Repeat-RNN, a novel method that repeats each sample a fixed number of times. The results are eye-opening: Repeat-RNN performs on par with ACT in our selected tasks, suggesting that sometimes, simplicity can be just as effective as sophistication. To explore this further, we've made our source code available in TensorFlow and PyTorch at https://imatge-upc.github.io/danifojo-2018-repeatrnn/.",
    "Unlocking the Power of GANs for Anomaly Detection: A Breakthrough Approach\n\nGenerative Adversarial Networks (GANs) have shown remarkable ability in capturing the intricate patterns of real-world data, making them an attractive solution for identifying anomalies. Despite their potential, GANs have been largely underutilized for anomaly detection. Our innovative approach bridges this gap by harnessing cutting-edge GAN models to detect anomalies with unprecedented accuracy. The results are striking: we achieve state-of-the-art performance on image and network intrusion datasets, all while outpacing the only existing GAN-based method by several hundred times in terms of testing speed.",
    "The Natural Language Inference (NLI) task necessitates that an agent discern the logical relationship between a natural language premise and a corresponding hypothesis. To address this challenge, we propose the Interactive Inference Network (IIN), a novel class of neural network architectures capable of achieving a profound understanding of sentence pairs by hierarchically extracting semantic features from the interaction space. Our research reveals that an interaction tensor, embodied by attention weights, contains valuable semantic information essential for solving natural language inference tasks. Furthermore, we find that a denser interaction tensor encompasses richer semantic information. One instantiation of this architecture, the Densely Interactive Inference Network (DIIN), demonstrates state-of-the-art performance on both large-scale NLI corpora and the large-scale NLI alike corpus. Notably, DIIN achieves a remarkable error reduction of over 20% on the challenging Multi-Genre NLI (MultiNLI) dataset, surpassing the strongest published system.",
    "Imagine you're driving a self-driving car, and suddenly, a clever hacker sends a fake image to the car's computer, making it think there's a pedestrian in the road when there isn't. This is a real concern, as tiny changes to images or sounds can trick even the most advanced AI systems into making mistakes. This is known as an \"adversarial example.\"\n\nMany experts have tried to find ways to prevent these mistakes, but most of their solutions have been quickly broken by hackers. In fact, over half of the proposed solutions from a major AI conference in 2018 were already broken soon after.\n\nWe think we can solve this problem using a different approach called \"formal verification.\" This means we can create fake images or sounds that are guaranteed to be the least distorted possible, while still tricking the AI system. We tested this approach with a recent solution called \"adversarial retraining\" and found that it actually works - it makes the AI system four times more resistant to these kinds of attacks.",
    "Deep neural networks (DNNs) excel at predicting complex relationships, but their \"black box\" nature limits their use. We propose agglomerative contextual decomposition (ACD) to explain DNN predictions by hierarchically clustering input features and their contributions to the final prediction. ACD helps diagnose incorrect predictions, identify dataset bias, and enables users to trust DNN outputs. It's also robust to adversarial perturbations, capturing fundamental input aspects while ignoring noise.",
    "This study tackles the challenge of musical timbre transfer, which involves modifying the tone of a sound sample from one instrument to match another while maintaining other musical elements, such as pitch, rhythm, and loudness. While image-based style transfer techniques could be applied to a time-frequency representation of an audio signal, this approach requires a representation that allows for independent timbre manipulation and high-quality waveform generation. Our proposed method, TimbreTron, achieves musical timbre transfer by applying \"image\" domain style transfer to a time-frequency representation of the audio signal and then generating a high-quality waveform using a conditional WaveNet synthesizer. We find that the Constant Q Transform (CQT) representation is particularly suitable for convolutional architectures due to its approximate pitch equivariance. Human perceptual evaluations confirm that TimbreTron successfully transfers the timbre while preserving the musical content, for both single-instrument and multi-instrument samples.",
    "Unlocking the Power of Adaptive Language Models: A Revolutionary Approach to Word-Level Language Modeling\n\nImagine a language model that can dynamically adapt to the nuances of language in real-time, capturing both short-term and medium-term patterns with unprecedented accuracy. Our groundbreaking research brings this vision to life by combining the strengths of hidden-states-based short-term representations with the agility of medium-term representations encoded in dynamical weights. Building on the latest advancements in language models with dynamically evolving weights, we pioneer a novel online learning-to-learn framework. Here, a meta-learner is trained using gradient descent to continuously update the language model's weights, enabling it to learn and improve with each new input. The result is a language model that's not only more accurate but also more responsive to the ever-changing landscape of language.",
    "Generative Adversarial Networks (GANs) possess the remarkable ability to model the complex manifold of natural images. By harnessing this capability, we develop a novel approach to manifold regularization, approximating the Laplacian norm through a computationally efficient Monte Carlo method that leverages the GAN architecture. When integrated into the feature-matching GAN framework of Improved GAN, our approach achieves state-of-the-art performance in GAN-based semi-supervised learning on the CIFAR-10 dataset, while offering a significantly more straightforward implementation compared to competing methods.",
    "We found a type of deep neural network that can always find the best solution. No matter where we start, we can always find a path that improves the network's performance and gets very close to perfect. This means that these networks don't get stuck in sub-optimal solutions.",
    "Visual Question Answering (VQA) models have historically faced challenges in accurately counting objects within natural images. We have identified a fundamental limitation inherent to soft attention mechanisms in these models as a primary contributing factor to this issue. To address this limitation, we propose the integration of a novel neural network component designed to facilitate robust object counting from object proposals. Experimental results on a controlled task demonstrate the efficacy of this component, and we achieve state-of-the-art performance on the number category of the VQA v2 dataset without compromising performance on other categories. Notably, our single model outperforms ensemble models in this regard. Furthermore, our component yields a significant 6.6% improvement in counting accuracy over a strong baseline on a challenging balanced pair metric.",
    "Get ready to revolutionize the world of generative adversarial networks! One of the biggest hurdles in GAN research has been the frustrating instability of its training process. But fear not, dear innovators! Our team has cracked the code with a game-changing weight normalization technique called spectral normalization. This breakthrough method not only stabilizes the discriminator's training but also does it with lightning speed and ease, seamlessly integrating into existing frameworks. We put spectral normalization to the test on three iconic datasets - CIFAR10, STL-10, and ILSVRC2012 - and the results are nothing short of astounding! Our spectrally normalized GANs (SN-GANs) consistently produce images of unparalleled quality, outshining previous stabilization techniques. The future of GANs has never looked brighter!",
    "Turning graph nodes into numbers (called vectors) can help us use machine learning to do things like predict what type of node something is. However, we don't know as much about how to do this as we do about understanding human language, because graphs can be very different from each other. We tested how well different algorithms for turning nodes into vectors work with different types of graphs, using six different sets of data. Our results help us understand how these algorithms work, which can help us learn more about this topic in the future.",
    "Unlocking the Power of Logical Reasoning: Introducing a Groundbreaking Dataset and a Revolutionary New Model\n\nImagine being able to teach machines to think logically and make informed decisions based on complex rules and structures. We're one step closer to achieving this goal with the introduction of a novel dataset designed to test a model's ability to capture and exploit the intricacies of logical expressions.\n\nIn this exciting study, we put a range of popular sequence-processing architectures to the test, including a game-changing new model class: PossibleWorldNets. This innovative approach computes entailment as a \"convolution over possible worlds\", opening up new possibilities for logical reasoning.\n\nThe results are nothing short of remarkable. We found that convolutional networks fall short in capturing the nuances of logical expressions, while LSTM RNNs are outperformed by tree-structured neural networks, which excel at exploiting the syntax of logic. But the real star of the show is PossibleWorldNets, which outshines all benchmarks and sets a new standard for logical entailment prediction.\n\nGet ready to unlock the full potential of artificial intelligence with our groundbreaking dataset and revolutionary new model. The future of logical reasoning has never looked brighter!",
    "Neural network pruning can significantly reduce the number of parameters in a trained network, resulting in improved storage and computational efficiency without sacrificing accuracy. However, the sparse networks produced by pruning are often difficult to train from scratch, which would further improve training performance. Our research reveals that a standard pruning technique can uncover subnetworks that are capable of effective training due to their initializations. This leads us to propose the \"lottery ticket hypothesis,\" which suggests that dense, randomly-initialized networks contain subnetworks (or \"winning tickets\") that can achieve comparable test accuracy to the original network in a similar number of iterations when trained independently. These winning tickets have \"won the initialization lottery\" with initial weights that facilitate effective training. We have developed an algorithm to identify winning tickets and conducted experiments that support the lottery ticket hypothesis, demonstrating the importance of these initializations. Our findings consistently show that winning tickets can be less than 10-20% of the size of various fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10, and that these smaller networks can learn faster and achieve higher test accuracy than the original network.",
    "We develop a method to efficiently compute the singular values of a 2D multi-channel convolutional layer. This leads to an algorithm for projecting the layer onto an operator-norm ball, which serves as an effective regularizer. For instance, it reduces the test error of a deep residual network with batch normalization on CIFAR-10 from 6.2% to 5.3%.",
    "Despite their success, deep neural networks like DCNNs are still difficult to understand theoretically. This paper proposes a new framework for analyzing these networks when they use ReLU nonlinearity. Our framework clearly defines how the data is distributed, encourages separate representations, and works with common techniques like Batch Norm. We build our framework by expanding the student's forward and backward propagation onto the teacher's computational graph. Unlike other approaches, our framework doesn't make unrealistic assumptions. This could help us better understand and address practical issues like overfitting, generalization, and separate representations in deep networks.",
    "Revolutionize Code Generation: Introducing Neural Program Search!\n\nImagine being able to create programs from simple natural language descriptions and a few examples. Our groundbreaking algorithm makes this a reality, combining the power of Deep Learning and Program Synthesis to generate code like never before.\n\nAt the heart of our innovation lies a rich, domain-specific language (DSL) and an efficient search algorithm guided by a Seq2Tree model. But don't just take our word for it - we've developed a comprehensive, semi-synthetic dataset to put our approach to the test.\n\nThe results are astounding: our algorithm outshines a sequence-to-sequence model with attention baseline, demonstrating a significant leap forward in code generation capabilities. Get ready to transform the way you approach programming!",
    "Despite their architectural differences, most state-of-the-art neural machine translation systems rely on a crucial component: the Attention mechanism. However, current attention methods primarily focus on individual tokens, overlooking the significance of phrasal alignments, which were instrumental in the success of phrase-based statistical machine translation. This paper proposes innovative phrase-based attention methods that treat n-grams of tokens as attention entities, thereby addressing this limitation. By integrating our phrase-based attentions into the Transformer network, we achieve notable improvements in translation tasks, specifically 1.3 BLEU for English-to-German and 0.5 BLEU for German-to-English on WMT newstest2014 using WMT'16 training data.",
    "Revolutionizing the way we understand and apply edits, we pioneer the groundbreaking task of learning distributed representations of edits. By fusing the power of a \"neural editor\" with an \"edit encoder\", our innovative models unlock the ability to distill the essence of an edit and seamlessly apply it to new inputs. We put our approach to the test on real-world natural language and source code edit data, yielding remarkable results that demonstrate our neural network models' uncanny ability to grasp the intricate structure and semantics of edits. We invite the research community to join us in exploring this exciting new frontier, poised to unlock unprecedented possibilities in edit-based applications.",
    "We introduce a systematic approach to kernel learning, grounded in a frequency-based understanding of kernels that remain unchanged under translations or rotations. This approach generates a series of feature maps, progressively enhancing the SVM margin through iterative refinement. We offer robust assurances of optimality and generalizability, framing our algorithm as a dynamic process of equilibrium discovery in a specific two-player minimax game. Experimental results on both synthetic and real-world datasets showcase the method's scalability and consistent superiority over comparable random features-based techniques.",
    "Imagine a learning system that can adapt and evolve over time, effortlessly absorbing new knowledge while retaining what it's already learned. Welcome to Variational Continual Learning (VCL), a groundbreaking framework that's revolutionizing the field of artificial intelligence. By combining the power of online variational inference with cutting-edge Monte Carlo techniques, VCL enables neural networks to learn continuously, tackling complex tasks with ease. Whether it's deep discriminative models or deep generative models, VCL can handle it all, even when tasks change or new ones emerge. The results are astounding: VCL outshines state-of-the-art methods, avoiding the pitfalls of catastrophic forgetting with ease. Get ready to unlock the full potential of continual learning with VCL!",
    "This report serves multiple objectives. Primarily, it delves into the reproducibility of the groundbreaking paper \"On the regularization of Wasserstein GANs\" (2018), putting its findings to the test. We zoom in on five critical aspects of the original experiments, meticulously reproducing and analyzing the results: the speed of learning, model stability, robustness against hyperparameter tuning, the accuracy of Wasserstein distance estimation, and the efficacy of various sampling methods. Our investigation also sheds light on which components of the original contribution can be successfully replicated, and the resource investment required to do so. To ensure transparency and facilitate further research, we've made all source code used in our reproduction efforts publicly available.",
    "We propose a new feature extraction technique for program execution logs, automatically extracting complex patterns from behavior graphs and embedding them into a continuous space using an autoencoder. We evaluate the features on a real-world malware detection task and find that the embedding space reveals interpretable structures.",
    "We introduce a flexible neural model that uses a variational autoencoder to generate missing features in a single step, conditioned on any subset of observed features. This model can handle both continuous and categorical features. We train the model using stochastic variational Bayes, and our experiments on synthetic data, feature imputation, and image inpainting demonstrate its effectiveness and ability to produce diverse samples.",
    "Variational Autoencoders (VAEs) were initially conceived (Kingma & Welling, 2014) as probabilistic generative models that facilitate approximate Bayesian inference. The introduction of β-VAEs (Higgins et al., 2017) departed from this interpretation and expanded the applicability of VAEs to diverse domains, including representation learning, clustering, and lossy data compression, by incorporating an objective function that enables the balancing of the information content (\"bit rate\") of the latent representation and the distortion of reconstructed data (Alemi et al., 2018). This paper reexamines the rate/distortion trade-off in the context of hierarchical VAEs, specifically VAEs comprising multiple layers of latent variables. We identify a general class of inference models that permit the decomposition of the rate into layer-specific contributions, which can subsequently be fine-tuned independently. We derive theoretical bounds on the performance of downstream tasks as functions of the individual layers' rates and validate our theoretical findings through large-scale experiments. Our results provide practitioners with guidance on the optimal region in rate-space to target for a given application.",
    "Unlocking the secrets of adversarial examples is crucial for understanding the robustness of deep neural networks (DNNs) against sneaky attacks! Recently, Ma et al. (ICLR 2018) pioneered the use of local intrinsic dimensionality (LID) to analyze layer-wise hidden representations of DNNs and uncover the mysteries of adversarial subspaces. They showed that LID can effectively characterize the subspaces associated with different attack methods, such as the Carlini and Wagner's (C&W) attack and the fast gradient sign attack.\n\nIn this groundbreaking paper, we take it to the next level by conducting two innovative sets of experiments that have never been attempted before! We use MNIST and CIFAR-10 to push the boundaries of LID analysis and reveal its limitations in characterizing adversarial subspaces. Specifically, we explore two exciting areas: (i) oblivious attacks and LID analysis using adversarial examples with varying confidence levels, and (ii) black-box transfer attacks.\n\nOur findings are nothing short of astonishing! We discover that LID's performance is extremely sensitive to the confidence parameter used in an attack, and that LID learned from ensembles of adversarial examples with different confidence levels surprisingly falls short. Moreover, we find that when adversarial examples are crafted from another DNN model, LID is ineffective in characterizing their adversarial subspaces. These two findings combined reveal the limited capabilities of LID in understanding the subspaces of adversarial examples, opening up new avenues for future research!",
    "Generative Adversarial Networks (GANs) are renowned for generating appealing samples, but their training process is notoriously challenging. While numerous studies have focused on reformulating the GAN objective, surprisingly few have explored optimization methods tailored to adversarial training. This work reframes GAN optimization as a variational inequality problem, drawing on mathematical programming literature to dispel common misconceptions about saddle point optimization. We adapt techniques from variational inequalities to GAN training, including averaging, extrapolation, and a novel, computationally efficient approach called \"extrapolation from the past.\" These methods are applied to stochastic gradient descent (SGD) and Adam optimization algorithms.",
    "Revolutionizing graph-based semi-supervised classification, neural message passing algorithms have made tremendous strides in recent years. However, these methods have a major limitation: they only consider nodes within a narrow, fixed radius, making it challenging to expand their scope. In this groundbreaking paper, we shatter this constraint by harnessing the powerful connection between graph convolutional networks (GCN) and PageRank. Our innovative approach, built on personalized PageRank, unlocks a more effective propagation scheme. We then leverage this breakthrough to develop a cutting-edge model, personalized propagation of neural predictions (PPNP), and its lightning-fast approximation, APPNP. The results are astounding: our model trains in record time, boasts fewer parameters, and can seamlessly integrate with any neural network. Plus, it taps into a vast, adaptable neighborhood for classification, giving it a decisive edge. In the most comprehensive study of its kind, we demonstrate that our model outclasses several recent contenders in semi-supervised classification. And the best part? Our implementation is available online, ready to transform your graph-based classification projects!",
    "We have identified obfuscated gradients, a form of gradient masking, as a phenomenon that can lead to a false sense of security in defense mechanisms against adversarial examples. Although defenses that exhibit obfuscated gradients may appear to successfully counter iterative optimization-based attacks, our research reveals that such defenses can be effectively circumvented. We have characterized the distinctive behaviors of defenses that exhibit this effect and have developed novel attack techniques to overcome each of the three types of obfuscated gradients we have discovered. In a comprehensive case study examining non-certified white-box-secure defenses presented at ICLR 2018, we found that obfuscated gradients are a prevalent phenomenon, with seven out of nine defenses relying on this effect. Our newly developed attacks successfully bypassed six of these defenses completely, and one partially, within the original threat model considered in each paper.",
    "The limitations of traditional node representation methods in network analysis are stark. By relying on point vectors in low-dimensional spaces, they fail to capture the inherent uncertainty of node relationships. In response, we introduce Graph2Gauss, a groundbreaking approach that embeds nodes as Gaussian distributions, acknowledging the ambiguity and nuance of real-world networks. This paradigm shift enables our method to excel in tasks like link prediction and node classification, even in large-scale attributed graphs. Moreover, our unsupervised approach tackles inductive learning scenarios and adapts seamlessly to diverse graph types, from plain to attributed, and directed to undirected. By harnessing both network structure and node attributes, we can generalize to unseen nodes without additional training, a feat unachievable by traditional methods. Our personalized ranking formulation, rooted in node distances, leverages the natural network ordering to learn embeddings. The results are striking: Graph2Gauss outperforms state-of-the-art methods in real-world network experiments. Furthermore, by embracing uncertainty, we uncover hidden insights, including neighborhood diversity and intrinsic graph dimensionality, revealing a more comprehensive understanding of complex networks.",
    "Convolutional Neural Networks (CNNs) have emerged as the preferred approach for addressing learning problems involving two-dimensional planar images. However, a range of recently identified challenges have created a pressing need for models capable of analyzing spherical images. Notable examples of such applications include omnidirectional vision for drones, robots, and autonomous vehicles, molecular regression problems, and global weather and climate modeling. A straightforward application of convolutional networks to a planar projection of the spherical signal is inherently flawed, as the space-varying distortions introduced by such a projection render translational weight sharing ineffective.\n\nThis paper presents the foundational components for constructing spherical CNNs. We propose a novel definition for the spherical cross-correlation, which exhibits both expressiveness and rotation equivariance. The spherical correlation satisfies a generalized Fourier theorem, enabling efficient computation via a generalized (non-commutative) Fast Fourier Transform (FFT) algorithm. We demonstrate the computational efficiency, numerical accuracy, and effectiveness of spherical CNNs in the context of 3D model recognition and atomization energy regression.",
    "Unlock the Power of NLP in Cheminformatics: A Game-Changing Approach to Classification Problems\n\nImagine harnessing the potential of natural language processing (NLP) to revolutionize classification problems in cheminformatics. By leveraging the standard textual representation of compounds, SMILES, we can bridge the gap between these two seemingly disparate fields.\n\nIn this groundbreaking study, we tackle the critical challenge of activity prediction against a target protein, a pivotal step in computer-aided drug design. Our experiments yield astonishing results, outperforming state-of-the-art handcrafted representations and providing unprecedented structural insights into the decision-making process.\n\nGet ready to transform the future of cheminformatics with the fusion of NLP and SMILES!",
    "Using computer vision and deep learning in farming can help farmers grow better quality crops and increase their production. After harvesting, the quality of fruits and vegetables affects their sale in the export market. Apples, in particular, can have many defects that happen during or after harvesting. This study wants to help farmers handle apples better after harvesting by seeing if new computer vision and deep learning methods, like YOLOv3, can identify healthy apples from those with defects.",
    "Unlock the Power of Large LSTM Networks: 2 Game-Changing Strategies to Boost Training Speed and Efficiency!\n\nAre you tired of dealing with slow and cumbersome Long Short-Term Memory (LSTM) networks? Do you want to unlock their full potential without sacrificing performance? Look no further! We're excited to introduce two revolutionary approaches that will transform the way you work with large LSTM networks.\n\nFirst, we present \"matrix factorization by design\", a innovative technique that breaks down the LSTM matrix into the product of two smaller matrices, slashing the number of parameters and accelerating training times.\n\nNext, we reveal a groundbreaking method that partitions the LSTM matrix, inputs, and states into independent groups, allowing for unprecedented flexibility and speed.\n\nThe result? You can now train large LSTM networks at lightning-fast speeds, achieving near state-of-the-art perplexity with significantly fewer RNN parameters. Say goodbye to tedious training times and hello to unparalleled efficiency!",
    "Current deep reading comprehension models rely heavily on a type of artificial intelligence called recurrent neural nets. These models work well with language because they process it in a sequence, but they have a major limitation: they can't be parallelized, which means they can't be sped up by dividing the work among multiple processors. This makes them slow and impractical for applications where speed is crucial, especially when dealing with long texts. In this paper, we propose an alternative approach using a different type of AI called convolutional architecture. By replacing the recurrent units with simple convolutional ones, we achieve similar results to the current state-of-the-art models on two question-answering tasks, but with a significant advantage: our model is up to 100 times faster.",
    "We analyze Ritter et al.'s (2018) reinstatement mechanism, which reveals two neuron classes in an epLSTM cell's working memory when trained on an episodic Harlow task using episodic meta-RL: Abstract neurons encode shared knowledge, while Episodic neurons store episode-specific task information.",
    "Have you heard about the rate-distortion-perception function (RDPF)? It's a super helpful tool that Blau and Michaeli introduced in 2019 to help us understand how to balance realism and distortion when compressing files. While it's similar to the rate-distortion function, there's been a big question mark over whether it's actually possible to create encoders and decoders that can achieve the rates suggested by the RDPF. But don't worry, we've got some exciting news! Building on the work of Li and El Gamal from 2018, we've discovered that it is indeed possible to achieve the RDPF using special codes that are stochastic and variable in length. And the best part? We've also proven that the RDPF sets a lower bound for the achievable rate, which is a huge breakthrough!",
    "This paper introduces Neural Phrase-based Machine Translation (NPMT), a novel approach that explicitly models phrase structures in output sequences utilizing Sleep-Wake Networks (SWAN), a recently proposed segmentation-based sequence modeling methodology. To address the limitation of SWAN's monotonic alignment requirement, we propose a novel layer that facilitates soft local reordering of input sequences. Notably, NPMT diverges from traditional neural machine translation (NMT) approaches by eschewing attention-based decoding mechanisms. Instead, it generates phrases in a sequential manner, enabling linear-time decoding. Our empirical results demonstrate that NPMT achieves superior performance on the IWSLT 2014 German-English/English-German and IWSLT 2015 English-Vietnamese machine translation tasks, outperforming strong NMT baselines. Furthermore, our method is observed to produce meaningful phrases in the target languages.",
    "The vulnerability of deep neural networks (DNNs) to small adversarial perturbations, which can lead to misclassification, is a well-established concern. This paper highlights the importance of sparse representations of input data in mitigating these attacks. We demonstrate that incorporating a sparsifying front end into linear classifiers can significantly reduce the impact of $\\ell_{\\infty}$-bounded attacks, decreasing output distortion by a factor of approximately $K / N$, where $N$ represents the data dimension and $K$ denotes the sparsity level. Furthermore, we extend this concept to DNNs by introducing a \"locally linear\" model, which provides a theoretical framework for developing both attacks and defenses. Our experimental results on the MNIST dataset validate the effectiveness of the proposed sparsifying front end in enhancing the robustness of DNNs against adversarial attacks.",
    "We introduce Supervised Policy Update (SPU), a novel sample-efficient approach to deep reinforcement learning. This methodology begins by leveraging data from the current policy to formulate and solve a constrained optimization problem in the non-parameterized proximal policy space. Through supervised regression, SPU then translates the optimal non-parameterized policy into a parameterized policy, generating new samples. Notably, SPU is versatile and can be applied to both discrete and continuous action spaces, accommodating a range of proximity constraints. We demonstrate how SPU can be used to address the Natural Policy Gradient and Trust Region Policy Optimization (NPG/TRPO) problems, as well as the Proximal Policy Optimization (PPO) problem. Furthermore, our implementation of SPU is significantly simpler than TRPO. Our comprehensive experiments reveal that SPU surpasses TRPO in sample efficiency on Mujoco simulated robotic tasks and outperforms PPO on Atari video game tasks.",
    "Unlock the Secrets of Video Prediction Networks with Moving Symbols: A Game-Changing Synthetic Dataset!\n\nImagine having a powerful tool to objectively evaluate and improve video prediction networks. We're excited to introduce Moving Symbols, a parameterized synthetic dataset that allows for controlled variation and experimentation. With Moving Symbols, we've uncovered limitations in current state-of-the-art approaches and propose a more meaningful performance metric to boost experimental insights.\n\nThis dataset provides a set of canonical test cases that will revolutionize the way the community understands and refines the representations learned by video prediction networks. The future of video prediction just got brighter!\n\nGet started with Moving Symbols today and access the code at https://github.com/rszeto/moving-symbols."
  ]
}