{
  "original": [
    "In this report, we describe a Theano-based AlexNet (Krizhevsky et al., 2012) implementation and its naive data parallelism on multiple GPUs. Our performance on 2 GPUs is comparable with the state-of-art Caffe library (Jia et al., 2014) run on 1 GPU. To the best of our knowledge, this is the first open-source Python-based AlexNet implementation to-date.",
    "We show that deep narrow Boltzmann machines are universal approximators of probability distributions on the activities of their visible units, provided they have sufficiently many hidden layers, each containing the same number of units as the visible layer. We show that, within certain parameter domains, deep Boltzmann machines can be studied as feedforward networks. We provide upper and lower bounds on the sufficient depth and width of universal approximators. These results settle various intuitions regarding undirected networks and, in particular, they show that deep narrow Boltzmann machines are at least as compact universal approximators as narrow sigmoid belief networks and restricted Boltzmann machines, with respect to the currently available bounds for those models.",
    "Leveraging advances in variational inference, we propose to enhance recurrent neural networks with latent variables, resulting in Stochastic Recurrent Networks (STORNs). The model i) can be trained with stochastic gradient methods, ii) allows structured and multi-modal conditionals at each time step, iii) features a reliable estimator of the marginal likelihood and iv) is a generalisation of deterministic recurrent neural networks. We evaluate the method on four polyphonic musical data sets and motion capture data.",
    "We describe a general framework for online adaptation of optimization hyperparameters by `hot swapping' their values during learning. We investigate this approach in the context of adaptive learning rate selection using an explore-exploit strategy from the multi-armed bandit literature. Experiments on a benchmark neural network show that the hot swapping approach leads to consistently better solutions compared to well-known alternatives such as AdaDelta and stochastic gradient with exhaustive hyperparameter search.",
    "Many modern multiclass and multilabel problems are characterized by increasingly large output spaces. For these problems, label embeddings have been shown to be a useful primitive that can improve computational and statistical efficiency. In this work we utilize a correspondence between rank constrained estimation and low dimensional label embeddings that uncovers a fast label embedding algorithm which works in both the multiclass and multilabel settings. The result is a randomized algorithm for partial least squares, whose running time is exponentially faster than naive algorithms. We demonstrate our techniques on two large-scale public datasets, from the Large Scale Hierarchical Text Challenge and the Open Directory Project, where we obtain state of the art results.",
    "Accurate representational learning of both the explicit and implicit relationships within data is critical to the ability of machines to perform more complex and abstract reasoning tasks. We describe the efficient weakly supervised learning of such inferences by our Dynamic Adaptive Network Intelligence (DANI) model. We report state-of-the-art results for DANI over question answering tasks in the bAbI dataset that have proved difficult for contemporary approaches to learning representation (Weston et al., 2015).",
    "Automatic speech recognition systems usually rely on spectral-based features, such as MFCC of PLP. These features are extracted based on prior knowledge such as, speech perception or/and speech production. Recently, convolutional neural networks have been shown to be able to estimate phoneme conditional probabilities in a completely data-driven manner, i.e. using directly temporal raw speech signal as input. This system was shown to yield similar or better performance than HMM/ANN based system on phoneme recognition task and on large scale continuous speech recognition task, using less parameters. Motivated by these studies, we investigate the use of simple linear classifier in the CNN-based framework. Thus, the network learns linearly separable features from raw speech. We show that such system yields similar or better performance than MLP based system using cepstral-based features as input.",
    "We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machines. In order to be as hardware-agnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is to average the neural network parameters periodically (typically every minute or two), and redistribute the averaged parameters to the machines for further training. Each machine sees different data. By itself, this method does not work very well. However, we have another method, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which seems to allow our periodic-averaging method to work well, as well as substantially improving the convergence of SGD on a single machine.",
    "We develop a new method for visualizing and refining the invariances of learned representations. Specifically, we test for a general form of invariance, linearization, in which the action of a transformation is confined to a low-dimensional subspace. Given two reference images (typically, differing by some transformation), we synthesize a sequence of images lying on a path between them that is of minimal length in the space of the representation (a \"representational geodesic\"). If the transformation relating the two reference images is linearized by the representation, this sequence should follow the gradual evolution of this transformation. We use this method to assess the invariance properties of a state-of-the-art image classification network and find that geodesics generated for image pairs differing by translation, rotation, and dilation do not evolve according to their associated transformations. Our method also suggests a remedy for these failures, and following this prescription, we show that the modified representation is able to linearize a variety of geometric image transformations.",
    "Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.",
    "We present a novel architecture, the \"stacked what-where auto-encoders\" (SWWAE), which integrates discriminative and generative pathways and provides a unified approach to supervised, semi-supervised and unsupervised learning without relying on sampling during training. An instantiation of SWWAE uses a convolutional net (Convnet) (LeCun et al. (1998)) to encode the input, and employs a deconvolutional net (Deconvnet) (Zeiler et al. (2010)) to produce the reconstruction. The objective function includes reconstruction terms that induce the hidden states in the Deconvnet to be similar to those of the Convnet. Each pooling layer produces two sets of variables: the \"what\" which are fed to the next layer, and its complementary variable \"where\" that are fed to the corresponding layer in the generative decoder.",
    "We investigate the problem of inducing word embeddings that are tailored for a particular bilexical relation. Our learning algorithm takes an existing lexical vector space and compresses it such that the resulting word embeddings are good predictors for a target bilexical relation. In experiments we show that task-specific embeddings can benefit both the quality and efficiency in lexical prediction tasks.",
    "A generative model is developed for deep (multi-layered) convolutional dictionary learning. A novel probabilistic pooling operation is integrated into the deep model, yielding efficient bottom-up (pretraining) and top-down (refinement) probabilistic learning. Experimental results demonstrate powerful capabilities of the model to learn multi-layer features from images, and excellent classification results are obtained on the MNIST and Caltech 101 datasets.",
    "Motivated by the recent progress in generative models, we introduce a model that generates images from natural language descriptions. The proposed model iteratively draws patches on a canvas, while attending to the relevant words in the description. After training on Microsoft COCO, we compare our model with several baseline generative models on image generation and retrieval tasks. We demonstrate that our model produces higher quality samples than other approaches and generates images with novel scene compositions corresponding to previously unseen captions in the dataset.",
    "Convolutional neural networks (CNNs) work well on large datasets. But labelled data is hard to collect, and in some applications larger amounts of data are not available. The problem then is how to use CNNs with small data -- as CNNs overfit quickly. We present an efficient Bayesian CNN, offering better robustness to over-fitting on small data than traditional approaches. This is by placing a probability distribution over the CNN's kernels. We approximate our model's intractable posterior with Bernoulli variational distributions, requiring no additional model parameters.   On the theoretical side, we cast dropout network training as approximate inference in Bayesian neural networks. This allows us to implement our model using existing tools in deep learning with no increase in time complexity, while highlighting a negative result in the field. We show a considerable improvement in classification accuracy compared to standard techniques and improve on published state-of-the-art results for CIFAR-10.",
    "We propose a new method for creating computationally efficient convolutional neural networks (CNNs) by using low-rank representations of convolutional filters. Rather than approximating filters in previously-trained networks with more efficient versions, we learn a set of small basis filters from scratch; during training, the network learns to combine these basis filters into more complex filters that are discriminative for image classification. To train such networks, a novel weight initialization scheme is used. This allows effective initialization of connection weights in convolutional layers composed of groups of differently-shaped filters. We validate our approach by applying it to several existing CNN architectures and training these networks from scratch using the CIFAR, ILSVRC and MIT Places datasets. Our results show similar or higher accuracy than conventional CNNs with much less compute. Applying our method to an improved version of VGG-11 network using global max-pooling, we achieve comparable validation accuracy using 41% less compute and only 24% of the original VGG-11 model parameters; another variant of our method gives a 1 percentage point increase in accuracy over our improved VGG-11 model, giving a top-5 center-crop validation accuracy of 89.7% while reducing computation by 16% relative to the original VGG-11 model. Applying our method to the GoogLeNet architecture for ILSVRC, we achieved comparable accuracy with 26% less compute and 41% fewer model parameters. Applying our method to a near state-of-the-art network for CIFAR, we achieved comparable accuracy with 46% less compute and 55% fewer parameters.",
    "Distributed representations of words have boosted the performance of many Natural Language Processing tasks. However, usually only one representation per word is obtained, not acknowledging the fact that some words have multiple meanings. This has a negative effect on the individual word representations and the language model as a whole. In this paper we present a simple model that enables recent techniques for building word vectors to represent distinct senses of polysemic words. In our assessment of this model we show that it is able to effectively discriminate between words' senses and to do so in a computationally efficient manner.",
    "We propose Diverse Embedding Neural Network (DENN), a novel architecture for language models (LMs). A DENNLM projects the input word history vector onto multiple diverse low-dimensional sub-spaces instead of a single higher-dimensional sub-space as in conventional feed-forward neural network LMs. We encourage these sub-spaces to be diverse during network training through an augmented loss function. Our language modeling experiments on the Penn Treebank data set show the performance benefit of using a DENNLM.",
    "A standard approach to Collaborative Filtering (CF), i.e. prediction of user ratings on items, relies on Matrix Factorization techniques. Representations for both users and items are computed from the observed ratings and used for prediction. Unfortunatly, these transductive approaches cannot handle the case of new users arriving in the system, with no known rating, a problem known as user cold-start. A common approach in this context is to ask these incoming users for a few initialization ratings. This paper presents a model to tackle this twofold problem of (i) finding good questions to ask, (ii) building efficient representations from this small amount of information. The model can also be used in a more standard (warm) context. Our approach is evaluated on the classical CF problem and on the cold-start problem on four different datasets showing its ability to improve baseline performance in both cases.",
    "We propose a deep learning framework for modeling complex high-dimensional densities called Non-linear Independent Component Estimation (NICE). It is based on the idea that a good representation is one in which the data has a distribution that is easy to model. For this purpose, a non-linear deterministic transformation of the data is learned that maps it to a latent space so as to make the transformed data conform to a factorized distribution, i.e., resulting in independent latent variables. We parametrize this transformation so that computing the Jacobian determinant and inverse transform is trivial, yet we maintain the ability to learn complex non-linear transformations, via a composition of simple building blocks, each based on a deep neural network. The training criterion is simply the exact log-likelihood, which is tractable. Unbiased ancestral sampling is also easy. We show that this approach yields good generative models on four image datasets and can be used for inpainting.",
    "We introduce Deep Linear Discriminant Analysis (DeepLDA) which learns linearly separable latent representations in an end-to-end fashion. Classic LDA extracts features which preserve class separability and is used for dimensionality reduction for many classification problems. The central idea of this paper is to put LDA on top of a deep neural network. This can be seen as a non-linear extension of classic LDA. Instead of maximizing the likelihood of target labels for individual samples, we propose an objective function that pushes the network to produce feature distributions which: (a) have low variance within the same class and (b) high variance between different classes. Our objective is derived from the general LDA eigenvalue problem and still allows to train with stochastic gradient descent and back-propagation. For evaluation we test our approach on three different benchmark datasets (MNIST, CIFAR-10 and STL-10). DeepLDA produces competitive results on MNIST and CIFAR-10 and outperforms a network trained with categorical cross entropy (same architecture) on a supervised setting of STL-10.",
    "Layer-sequential unit-variance (LSUV) initialization - a simple method for weight initialization for deep net learning - is proposed. The method consists of the two steps. First, pre-initialize weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one.   Experiment with different activation functions (maxout, ReLU-family, tanh) show that the proposed initialization leads to learning of very deep nets that (i) produces networks with test accuracy better or equal to standard methods and (ii) is at least as fast as the complex schemes proposed specifically for very deep nets such as FitNets (Romero et al. (2015)) and Highway (Srivastava et al. (2015)).   Performance is evaluated on GoogLeNet, CaffeNet, FitNets and Residual nets and the state-of-the-art, or very close to it, is achieved on the MNIST, CIFAR-10/100 and ImageNet datasets.",
    "We introduce a parametric nonlinear transformation that is well-suited for Gaussianizing data from natural images. The data are linearly transformed, and each component is then normalized by a pooled activity measure, computed by exponentiating a weighted sum of rectified and exponentiated components and a constant. We optimize the parameters of the full transformation (linear transform, exponents, weights, constant) over a database of natural images, directly minimizing the negentropy of the responses. The optimized transformation substantially Gaussianizes the data, achieving a significantly smaller mutual information between transformed components than alternative methods including ICA and radial Gaussianization. The transformation is differentiable and can be efficiently inverted, and thus induces a density model on images. We show that samples of this model are visually similar to samples of natural image patches. We demonstrate the use of the model as a prior probability density that can be used to remove additive noise. Finally, we show that the transformation can be cascaded, with each layer optimized using the same Gaussianization objective, thus offering an unsupervised method of optimizing a deep network architecture.",
    "We present flattened convolutional neural networks that are designed for fast feedforward execution. The redundancy of the parameters, especially weights of the convolutional filters in convolutional neural networks has been extensively studied and different heuristics have been proposed to construct a low rank basis of the filters after training. In this work, we train flattened networks that consist of consecutive sequence of one-dimensional filters across all directions in 3D space to obtain comparable performance as conventional convolutional networks. We tested flattened model on different datasets and found that the flattened layer can effectively substitute for the 3D filters without loss of accuracy. The flattened convolution pipelines provide around two times speed-up during feedforward pass compared to the baseline model due to the significant reduction of learning parameters. Furthermore, the proposed method does not require efforts in manual tuning or post processing once the model is trained.",
    "In this paper, we introduce a novel deep learning framework, termed Purine. In Purine, a deep network is expressed as a bipartite graph (bi-graph), which is composed of interconnected operators and data tensors. With the bi-graph abstraction, networks are easily solvable with event-driven task dispatcher. We then demonstrate that different parallelism schemes over GPUs and/or CPUs on single or multiple PCs can be universally implemented by graph composition. This eases researchers from coding for various parallelization schemes, and the same dispatcher can be used for solving variant graphs. Scheduled by the task dispatcher, memory transfers are fully overlapped with other computations, which greatly reduce the communication overhead and help us achieve approximate linear acceleration.",
    "In this paper we propose a model that combines the strengths of RNNs and SGVB: the Variational Recurrent Auto-Encoder (VRAE). Such a model can be used for efficient, large scale unsupervised learning on time series data, mapping the time series data to a latent vector representation. The model is generative, such that data can be generated from samples of the latent space. An important contribution of this work is that the model can make use of unlabeled data in order to facilitate supervised training of RNNs by initialising the weights and network state.",
    "Current work in lexical distributed representations maps each word to a point vector in low-dimensional space. Mapping instead to a density provides many interesting advantages, including better capturing uncertainty about a representation and its relationships, expressing asymmetries more naturally than dot product or cosine similarity, and enabling more expressive parameterization of decision boundaries. This paper advocates for density-based distributed embeddings and presents a method for learning representations in the space of Gaussian distributions. We compare performance on various word embedding benchmarks, investigate the ability of these embeddings to model entailment and other asymmetric relationships, and explore novel properties of the representation.",
    "Multipliers are the most space and power-hungry arithmetic operators of the digital implementation of deep neural networks. We train a set of state-of-the-art neural networks (Maxout networks) on three benchmark datasets: MNIST, CIFAR-10 and SVHN. They are trained with three distinct formats: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those formats, we assess the impact of the precision of the multiplications on the final error after training. We find that very low precision is sufficient not just for running trained networks but also for training them. For example, it is possible to train Maxout networks with 10 bits multiplications.",
    "Multiple instance learning (MIL) can reduce the need for costly annotation in tasks such as semantic segmentation by weakening the required degree of supervision. We propose a novel MIL formulation of multi-class semantic segmentation learning by a fully convolutional network. In this setting, we seek to learn a semantic segmentation model from just weak image-level labels. The model is trained end-to-end to jointly optimize the representation while disambiguating the pixel-image label assignment. Fully convolutional training accepts inputs of any size, does not need object proposal pre-processing, and offers a pixelwise loss map for selecting latent instances. Our multi-class MIL loss exploits the further supervision given by images with multiple labels. We evaluate this approach through preliminary experiments on the PASCAL VOC segmentation challenge.",
    "Recently, nested dropout was proposed as a method for ordering representation units in autoencoders by their information content, without diminishing reconstruction cost. However, it has only been applied to training fully-connected autoencoders in an unsupervised setting. We explore the impact of nested dropout on the convolutional layers in a CNN trained by backpropagation, investigating whether nested dropout can provide a simple and systematic way to determine the optimal representation size with respect to the desired accuracy and desired task and data complexity.",
    "Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.",
    "When a three-dimensional object moves relative to an observer, a change occurs on the observer's image plane and in the visual representation computed by a learned model. Starting with the idea that a good visual representation is one that transforms linearly under scene motions, we show, using the theory of group representations, that any such representation is equivalent to a combination of the elementary irreducible representations. We derive a striking relationship between irreducibility and the statistical dependency structure of the representation, by showing that under restricted conditions, irreducible representations are decorrelated. Under partial observability, as induced by the perspective projection of a scene onto the image plane, the motion group does not have a linear action on the space of images, so that it becomes necessary to perform inference over a latent representation that does transform linearly. This idea is demonstrated in a model of rotating NORB objects that employs a latent representation of the non-commutative 3D rotation group SO(3).",
    "Efficient Maximum Inner Product Search (MIPS) is an important task that has a wide applicability in recommendation systems and classification with a large number of classes. Solutions based on locality-sensitive hashing (LSH) as well as tree-based solutions have been investigated in the recent literature, to perform approximate MIPS in sublinear time. In this paper, we compare these to another extremely simple approach for solving approximate MIPS, based on variants of the k-means clustering algorithm. Specifically, we propose to train a spherical k-means, after having reduced the MIPS problem to a Maximum Cosine Similarity Search (MCSS). Experiments on two standard recommendation system benchmarks as well as on large vocabulary word embeddings, show that this simple approach yields much higher speedups, for the same retrieval precision, than current state-of-the-art hashing-based and tree-based methods. This simple method also yields more robust retrievals when the query is corrupted by noise.",
    "The variational autoencoder (VAE; Kingma, Welling (2014)) is a recently proposed generative model pairing a top-down generative network with a bottom-up recognition network which approximates posterior inference. It typically makes strong assumptions about posterior inference, for instance that the posterior distribution is approximately factorial, and that its parameters can be approximated with nonlinear regression from the observations. As we show empirically, the VAE objective can lead to overly simplified representations which fail to use the network's entire modeling capacity. We present the importance weighted autoencoder (IWAE), a generative model with the same architecture as the VAE, but which uses a strictly tighter log-likelihood lower bound derived from importance weighting. In the IWAE, the recognition network uses multiple samples to approximate the posterior, giving it increased flexibility to model complex posteriors which do not fit the VAE modeling assumptions. We show empirically that IWAEs learn richer latent space representations than VAEs, leading to improved test log-likelihood on density estimation benchmarks.",
    "This work investigates how using reduced precision data in Convolutional Neural Networks (CNNs) affects network accuracy during classification. More specifically, this study considers networks where each layer may use different precision data. Our key result is the observation that the tolerance of CNNs to reduced precision data not only varies across networks, a well established observation, but also within networks. Tuning precision per layer is appealing as it could enable energy and performance improvements. In this paper we study how error tolerance across layers varies and propose a method for finding a low precision configuration for a network while maintaining high accuracy. A diverse set of CNNs is analyzed showing that compared to a conventional implementation using a 32-bit floating-point representation for all layers, and with less than 1% loss in relative accuracy, the data footprint required by these networks can be reduced by an average of 74% and up to 92%.",
    "The efficiency of graph-based semi-supervised algorithms depends on the graph of instances on which they are applied. The instances are often in a vectorial form before a graph linking them is built. The construction of the graph relies on a metric over the vectorial space that help define the weight of the connection between entities. The classic choice for this metric is usually a distance measure or a similarity measure based on the euclidean norm. We claim that in some cases the euclidean norm on the initial vectorial space might not be the more appropriate to solve the task efficiently. We propose an algorithm that aims at learning the most appropriate vectorial representation for building a graph on which the task at hand is solved efficiently.",
    "Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images. In this paper we advocate for explicitly modeling the partial order structure of this hierarchy. Towards this goal, we introduce a general method for learning ordered representations, and show how it can be applied to a variety of tasks involving images and language. We show that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval.",
    "We propose local distributional smoothness (LDS), a new notion of smoothness for statistical model that can be used as a regularization term to promote the smoothness of the model distribution. We named the LDS based regularization as virtual adversarial training (VAT). The LDS of a model at an input datapoint is defined as the KL-divergence based robustness of the model distribution against local perturbation around the datapoint. VAT resembles adversarial training, but distinguishes itself in that it determines the adversarial direction from the model distribution alone without using the label information, making it applicable to semi-supervised learning. The computational cost for VAT is relatively low. For neural network, the approximated gradient of the LDS can be computed with no more than three pairs of forward and back propagations. When we applied our technique to supervised and semi-supervised learning for the MNIST dataset, it outperformed all the training methods other than the current state of the art method, which is based on a highly advanced generative model. We also applied our method to SVHN and NORB, and confirmed our method's superior performance over the current state of the art semi-supervised method applied to these datasets.",
    "The availability of large labeled datasets has allowed Convolutional Network models to achieve impressive recognition results. However, in many settings manual annotation of the data is impractical; instead our data has noisy labels, i.e. there is some freely available label for each image which may or may not be accurate. In this paper, we explore the performance of discriminatively-trained Convnets when trained on such noisy data. We introduce an extra noise layer into the network which adapts the network outputs to match the noisy label distribution. The parameters of this noise layer can be estimated as part of the training process and involve simple modifications to current training infrastructures for deep networks. We demonstrate the approaches on several datasets, including large scale experiments on the ImageNet classification benchmark.",
    "We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.",
    "Discourse relations bind smaller linguistic elements into coherent texts. However, automatically identifying discourse relations is difficult, because it requires understanding the semantics of the linked sentences. A more subtle challenge is that it is not enough to represent the meaning of each sentence of a discourse relation, because the relation may depend on links between lower-level elements, such as entity mentions. Our solution computes distributional meaning representations by composition up the syntactic parse tree. A key difference from previous work on compositional distributional semantics is that we also compute representations for entity mentions, using a novel downward compositional pass. Discourse relations are predicted not only from the distributional representations of the sentences, but also of their coreferent entity mentions. The resulting system obtains substantial improvements over the previous state-of-the-art in predicting implicit discourse relations in the Penn Discourse Treebank.",
    "In this work, we propose a new method to integrate two recent lines of work: unsupervised induction of shallow semantics (e.g., semantic roles) and factorization of relations in text and knowledge bases. Our model consists of two components: (1) an encoding component: a semantic role labeling model which predicts roles given a rich set of syntactic and lexical features; (2) a reconstruction component: a tensor factorization model which relies on roles to predict argument fillers. When the components are estimated jointly to minimize errors in argument reconstruction, the induced roles largely correspond to roles defined in annotated resources. Our method performs on par with most accurate role induction methods on English, even though, unlike these previous approaches, we do not incorporate any prior linguistic knowledge about the language.",
    "The notion of metric plays a key role in machine learning problems such as classification, clustering or ranking. However, it is worth noting that there is a severe lack of theoretical guarantees that can be expected on the generalization capacity of the classifier associated to a given metric. The theoretical framework of $(\\epsilon, \\gamma, \\tau)$-good similarity functions (Balcan et al., 2008) has been one of the first attempts to draw a link between the properties of a similarity function and those of a linear classifier making use of it. In this paper, we extend and complete this theory by providing a new generalization bound for the associated classifier based on the algorithmic robustness framework.",
    "We present the multiplicative recurrent neural network as a general model for compositional meaning in language, and evaluate it on the task of fine-grained sentiment analysis. We establish a connection to the previously investigated matrix-space models for compositionality, and show they are special cases of the multiplicative recurrent net. Our experiments show that these models perform comparably or better than Elman-type additive recurrent neural networks and outperform matrix-space models on a standard fine-grained sentiment analysis corpus. Furthermore, they yield comparable results to structural deep models on the recently published Stanford Sentiment Treebank without the need for generating parse trees.",
    "Finding minima of a real valued non-convex function over a high dimensional space is a major challenge in science. We provide evidence that some such functions that are defined on high dimensional domains have a narrow band of values whose pre-image contains the bulk of its critical points. This is in contrast with the low dimensional picture in which this band is wide. Our simulations agree with the previous theoretical work on spin glasses that proves the existence of such a band when the dimension of the domain tends to infinity. Furthermore our experiments on teacher-student networks with the MNIST dataset establish a similar phenomenon in deep networks. We finally observe that both the gradient descent and the stochastic gradient descent methods can reach this level within the same number of steps.",
    "We develop a new statistical model for photographic images, in which the local responses of a bank of linear filters are described as jointly Gaussian, with zero mean and a covariance that varies slowly over spatial position. We optimize sets of filters so as to minimize the nuclear norms of matrices of their local activations (i.e., the sum of the singular values), thus encouraging a flexible form of sparsity that is not tied to any particular dictionary or coordinate system. Filters optimized according to this objective are oriented and bandpass, and their responses exhibit substantial local correlation. We show that images can be reconstructed nearly perfectly from estimates of the local filter response covariances alone, and with minimal degradation (either visual or MSE) from low-rank approximations of these covariances. As such, this representation holds much promise for use in applications such as denoising, compression, and texture representation, and may form a useful substrate for hierarchical decompositions.",
    "Most modern convolutional neural networks (CNNs) used for object recognition are built using the same principles: Alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks. Following this finding -- and building on other recent work for finding simple network structures -- we propose a new architecture that consists solely of convolutional layers and yields competitive or state of the art performance on several object recognition datasets (CIFAR-10, CIFAR-100, ImageNet). To analyze the network we introduce a new variant of the \"deconvolution approach\" for visualizing features learned by CNNs, which can be applied to a broader range of network structures than existing approaches.",
    "Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.",
    "This paper introduces a greedy parser based on neural networks, which leverages a new compositional sub-tree representation. The greedy parser and the compositional procedure are jointly trained, and tightly depends on each-other. The composition procedure outputs a vector representation which summarizes syntactically (parsing tags) and semantically (words) sub-trees. Composition and tagging is achieved over continuous (word or tag) representations, and recurrent neural networks. We reach F1 performance on par with well-known existing parsers, while having the advantage of speed, thanks to the greedy nature of the parser. We provide a fully functional implementation of the method described in this paper.",
    "Suitable lateral connections between encoder and decoder are shown to allow higher layers of a denoising autoencoder (dAE) to focus on invariant representations. In regular autoencoders, detailed information needs to be carried through the highest layers but lateral connections from encoder to decoder relieve this pressure. It is shown that abstract invariant features can be translated to detailed reconstructions when invariant features are allowed to modulate the strength of the lateral connection. Three dAE structures with modulated and additive lateral connections, and without lateral connections were compared in experiments using real-world images. The experiments verify that adding modulated lateral connections to the model 1) improves the accuracy of the probability model for inputs, as measured by denoising performance; 2) results in representations whose degree of invariance grows faster towards the higher layers; and 3) supports the formation of diverse invariant poolings.",
    "We develop a new method for visualizing and refining the invariances of learned representations. Specifically, we test for a general form of invariance, linearization, in which the action of a transformation is confined to a low-dimensional subspace. Given two reference images (typically, differing by some transformation), we synthesize a sequence of images lying on a path between them that is of minimal length in the space of the representation (a \"representational geodesic\"). If the transformation relating the two reference images is linearized by the representation, this sequence should follow the gradual evolution of this transformation. We use this method to assess the invariance properties of a state-of-the-art image classification network and find that geodesics generated for image pairs differing by translation, rotation, and dilation do not evolve according to their associated transformations. Our method also suggests a remedy for these failures, and following this prescription, we show that the modified representation is able to linearize a variety of geometric image transformations.",
    "Genomics are rapidly transforming medical practice and basic biomedical research, providing insights into disease mechanisms and improving therapeutic strategies, particularly in cancer. The ability to predict the future course of a patient's disease from high-dimensional genomic profiling will be essential in realizing the promise of genomic medicine, but presents significant challenges for state-of-the-art survival analysis methods. In this abstract we present an investigation in learning genomic representations with neural networks to predict patient survival in cancer. We demonstrate the advantages of this approach over existing survival analysis methods using brain tumor data.",
    "Existing approaches to combine both additive and multiplicative neural units either use a fixed assignment of operations or require discrete optimization to determine what function a neuron should perform. However, this leads to an extensive increase in the computational complexity of the training procedure.   We present a novel, parameterizable transfer function based on the mathematical concept of non-integer functional iteration that allows the operation each neuron performs to be smoothly and, most importantly, differentiablely adjusted between addition and multiplication. This allows the decision between addition and multiplication to be integrated into the standard backpropagation training procedure.",
    "One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaing isometry, one exact and one stochastic. Preliminary experiments show that for both determinant and scale-normalization effectively speeds up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.",
    "We extend Stochastic Gradient Variational Bayes to perform posterior inference for the weights of Stick-Breaking processes. This development allows us to define a Stick-Breaking Variational Autoencoder (SB-VAE), a Bayesian nonparametric version of the variational autoencoder that has a latent representation with stochastic dimensionality. We experimentally demonstrate that the SB-VAE, and a semi-supervised variant, learn highly discriminative latent representations that often outperform the Gaussian VAE's.",
    "Unsupervised learning on imbalanced data is challenging because, when given imbalanced data, current model is often dominated by the major category and ignores the categories with small amount of data. We develop a latent variable model that can cope with imbalanced data by dividing the latent space into a shared space and a private space. Based on Gaussian Process Latent Variable Models, we propose a new kernel formulation that enables the separation of latent space and derives an efficient variational inference method. The performance of our model is demonstrated with an imbalanced medical image dataset.",
    "Generative adversarial networks (GANs) are successful deep generative models. GANs are based on a two-player minimax game. However, the objective function derived in the original motivation is changed to obtain stronger gradients when learning the generator. We propose a novel algorithm that repeats the density ratio estimation and f-divergence minimization. Our algorithm offers a new perspective toward the understanding of GANs and is able to make use of multiple viewpoints obtained in the research of density ratio estimation, e.g. what divergence is stable and relative density ratio is useful.",
    "This paper shows how one can directly apply natural language processing (NLP) methods to classification problems in cheminformatics. Connection between these seemingly separate fields is shown by considering standard textual representation of compound, SMILES. The problem of activity prediction against a target protein is considered, which is a crucial part of computer aided drug design process. Conducted experiments show that this way one can not only outrank state of the art results of hand crafted representations but also gets direct structural insights into the way decisions are made.",
    "We introduce a neural network architecture and a learning algorithm to produce factorized symbolic representations. We propose to learn these concepts by observing consecutive frames, letting all the components of the hidden representation except a small discrete set (gating units) be predicted from the previous frame, and let the factors of variation in the next frame be represented entirely by these discrete gated units (corresponding to symbolic representations). We demonstrate the efficacy of our approach on datasets of faces undergoing 3D transformations and Atari 2600 games.",
    "We look at the eigenvalues of the Hessian of a loss function before and after training. The eigenvalue distribution is seen to be composed of two parts, the bulk which is concentrated around zero, and the edges which are scattered away from zero. We present empirical evidence for the bulk indicating how over-parametrized the system is, and for the edges that depend on the input data.",
    "We introduce a parametric nonlinear transformation that is well-suited for Gaussianizing data from natural images. The data are linearly transformed, and each component is then normalized by a pooled activity measure, computed by exponentiating a weighted sum of rectified and exponentiated components and a constant. We optimize the parameters of the full transformation (linear transform, exponents, weights, constant) over a database of natural images, directly minimizing the negentropy of the responses. The optimized transformation substantially Gaussianizes the data, achieving a significantly smaller mutual information between transformed components than alternative methods including ICA and radial Gaussianization. The transformation is differentiable and can be efficiently inverted, and thus induces a density model on images. We show that samples of this model are visually similar to samples of natural image patches. We demonstrate the use of the model as a prior probability density that can be used to remove additive noise. Finally, we show that the transformation can be cascaded, with each layer optimized using the same Gaussianization objective, thus offering an unsupervised method of optimizing a deep network architecture.",
    "Approximate variational inference has shown to be a powerful tool for modeling unknown complex probability distributions. Recent advances in the field allow us to learn probabilistic models of sequences that actively exploit spatial and temporal structure. We apply a Stochastic Recurrent Network (STORN) to learn robot time series data. Our evaluation demonstrates that we can robustly detect anomalies both off- and on-line.",
    "We develop a general problem setting for training and testing the ability of agents to gather information efficiently. Specifically, we present a collection of tasks in which success requires searching through a partially-observed environment, for fragments of information which can be pieced together to accomplish various goals. We combine deep architectures with techniques from reinforcement learning to develop agents that solve our tasks. We shape the behavior of these agents by combining extrinsic and intrinsic rewards. We empirically demonstrate that these agents learn to search actively and intelligently for new information to reduce their uncertainty, and to exploit information they have already acquired.",
    "We propose an extension to neural network language models to adapt their prediction to the recent history. Our model is a simplified version of memory augmented networks, which stores past hidden activations as memory and accesses them through a dot product with the current hidden activation. This mechanism is very efficient and scales to very large memory sizes. We also draw a link between the use of external memory in neural network and cache models used with count based language models. We demonstrate on several language model datasets that our approach performs significantly better than recent memory augmented networks.",
    "Motivated by the recent progress in generative models, we introduce a model that generates images from natural language descriptions. The proposed model iteratively draws patches on a canvas, while attending to the relevant words in the description. After training on Microsoft COCO, we compare our model with several baseline generative models on image generation and retrieval tasks. We demonstrate that our model produces higher quality samples than other approaches and generates images with novel scene compositions corresponding to previously unseen captions in the dataset.",
    "We propose a framework for training multiple neural networks simultaneously. The parameters from all models are regularised by the tensor trace norm, so that each neural network is encouraged to reuse others' parameters if possible -- this is the main motivation behind multi-task learning. In contrast to many deep multi-task learning models, we do not predefine a parameter sharing strategy by specifying which layers have tied parameters. Instead, our framework considers sharing for all shareable layers, and the sharing strategy is learned in a data-driven way.",
    "This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems. To achieve this, the paper introduces several innovations, including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.",
    "We present a novel framework for generating pop music. Our model is a hierarchical Recurrent Neural Network, where the layers and the structure of the hierarchy encode our prior knowledge about how pop music is composed. In particular, the bottom layers generate the melody, while the higher levels produce the drums and chords. We conduct several human studies that show strong preference of our generated music over that produced by the recent method by Google. We additionally show two applications of our framework: neural dancing and karaoke, as well as neural story singing.",
    "Many machine learning classifiers are vulnerable to adversarial perturbations. An adversarial perturbation modifies an input to change a classifier's prediction without causing the input to seem substantially different to human perception. We deploy three methods to detect adversarial images. Adversaries trying to bypass our detectors must make the adversarial image less pathological or they will fail trying. Our best detection method reveals that adversarial images place abnormal emphasis on the lower-ranked principal components from PCA. Other detectors and a colorful saliency map are in an appendix.",
    "We propose a new method for creating computationally efficient convolutional neural networks (CNNs) by using low-rank representations of convolutional filters. Rather than approximating filters in previously-trained networks with more efficient versions, we learn a set of small basis filters from scratch; during training, the network learns to combine these basis filters into more complex filters that are discriminative for image classification. To train such networks, a novel weight initialization scheme is used. This allows effective initialization of connection weights in convolutional layers composed of groups of differently-shaped filters. We validate our approach by applying it to several existing CNN architectures and training these networks from scratch using the CIFAR, ILSVRC and MIT Places datasets. Our results show similar or higher accuracy than conventional CNNs with much less compute. Applying our method to an improved version of VGG-11 network using global max-pooling, we achieve comparable validation accuracy using 41% less compute and only 24% of the original VGG-11 model parameters; another variant of our method gives a 1 percentage point increase in accuracy over our improved VGG-11 model, giving a top-5 center-crop validation accuracy of 89.7% while reducing computation by 16% relative to the original VGG-11 model. Applying our method to the GoogLeNet architecture for ILSVRC, we achieved comparable accuracy with 26% less compute and 41% fewer model parameters. Applying our method to a near state-of-the-art network for CIFAR, we achieved comparable accuracy with 46% less compute and 55% fewer parameters.",
    "Layer-sequential unit-variance (LSUV) initialization - a simple method for weight initialization for deep net learning - is proposed. The method consists of the two steps. First, pre-initialize weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one.   Experiment with different activation functions (maxout, ReLU-family, tanh) show that the proposed initialization leads to learning of very deep nets that (i) produces networks with test accuracy better or equal to standard methods and (ii) is at least as fast as the complex schemes proposed specifically for very deep nets such as FitNets (Romero et al. (2015)) and Highway (Srivastava et al. (2015)).   Performance is evaluated on GoogLeNet, CaffeNet, FitNets and Residual nets and the state-of-the-art, or very close to it, is achieved on the MNIST, CIFAR-10/100 and ImageNet datasets.",
    "This paper builds off recent work from Kiperwasser & Goldberg (2016) using neural attention in a simple graph-based dependency parser. We use a larger but more thoroughly regularized parser than other recent BiLSTM-based approaches, with biaffine classifiers to predict arcs and labels. Our parser gets state of the art or near state of the art performance on standard treebanks for six different languages, achieving 95.7% UAS and 94.1% LAS on the most popular English PTB dataset. This makes it the highest-performing graph-based parser on this benchmark---outperforming Kiperwasser Goldberg (2016) by 1.8% and 2.2%---and comparable to the highest performing transition-based parser (Kuncoro et al., 2016), which achieves 95.8% UAS and 94.6% LAS. We also show which hyperparameter choices had a significant effect on parsing accuracy, allowing us to achieve large gains over other graph-based approaches.",
    "Accurate representational learning of both the explicit and implicit relationships within data is critical to the ability of machines to perform more complex and abstract reasoning tasks. We describe the efficient weakly supervised learning of such inferences by our Dynamic Adaptive Network Intelligence (DANI) model. We report state-of-the-art results for DANI over question answering tasks in the bAbI dataset that have proved difficult for contemporary approaches to learning representation (Weston et al., 2015).",
    "Spherical data is found in many applications. By modeling the discretized sphere as a graph, we can accommodate non-uniformly distributed, partial, and changing samplings. Moreover, graph convolutions are computationally more efficient than spherical convolutions. As equivariance is desired to exploit rotational symmetries, we discuss how to approach rotation equivariance using the graph neural network introduced in Defferrard et al. (2016). Experiments show good performance on rotation-invariant learning problems. Code and examples are available at https://github.com/SwissDataScienceCenter/DeepSphere",
    "High computational complexity hinders the widespread usage of Convolutional Neural Networks (CNNs), especially in mobile devices. Hardware accelerators are arguably the most promising approach for reducing both execution time and power consumption. One of the most important steps in accelerator development is hardware-oriented model approximation. In this paper we present Ristretto, a model approximation framework that analyzes a given CNN with respect to numerical resolution used in representing weights and outputs of convolutional and fully connected layers. Ristretto can condense models by using fixed point arithmetic and representation instead of floating point. Moreover, Ristretto fine-tunes the resulting fixed point network. Given a maximum error tolerance of 1%, Ristretto can successfully condense CaffeNet and SqueezeNet to 8-bit. The code for Ristretto is available.",
    "The diversity of painting styles represents a rich visual vocabulary for the construction of an image. The degree to which one may learn and parsimoniously capture this visual vocabulary measures our understanding of the higher level features of paintings, if not images in general. In this work we investigate the construction of a single, scalable deep network that can parsimoniously capture the artistic style of a diversity of paintings. We demonstrate that such a network generalizes across a diversity of artistic styles by reducing a painting to a point in an embedding space. Importantly, this model permits a user to explore new painting styles by arbitrarily combining the styles learned from individual paintings. We hope that this work provides a useful step towards building rich models of paintings and offers a window on to the structure of the learned representation of artistic style.",
    "Sum-Product Networks (SPNs) are a class of expressive yet tractable hierarchical graphical models. LearnSPN is a structure learning algorithm for SPNs that uses hierarchical co-clustering to simultaneously identifying similar entities and similar features. The original LearnSPN algorithm assumes that all the variables are discrete and there is no missing data. We introduce a practical, simplified version of LearnSPN, MiniSPN, that runs faster and can handle missing data and heterogeneous features common in real applications. We demonstrate the performance of MiniSPN on standard benchmark datasets and on two datasets from Google's Knowledge Graph exhibiting high missingness rates and a mix of discrete and continuous features.",
    "Recent research on deep neural networks has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple DNN architectures that achieve that accuracy level. With equivalent accuracy, smaller DNN architectures offer at least three advantages: (1) Smaller DNNs require less communication across servers during distributed training. (2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car. (3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory. To provide all of these advantages, we propose a small DNN architecture called SqueezeNet. SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Additionally, with model compression techniques we are able to compress SqueezeNet to less than 0.5MB (510x smaller than AlexNet).   The SqueezeNet architecture is available for download here: https://github.com/DeepScale/SqueezeNet",
    "In this paper, we study the problem of question answering when reasoning over multiple facts is required. We propose Query-Reduction Network (QRN), a variant of Recurrent Neural Network (RNN) that effectively handles both short-term (local) and long-term (global) sequential dependencies to reason over multiple facts. QRN considers the context sentences as a sequence of state-changing triggers, and reduces the original query to a more informed query as it observes each trigger (context sentence) through time. Our experiments show that QRN produces the state-of-the-art results in bAbI QA and dialog tasks, and in a real goal-oriented dialog dataset. In addition, QRN formulation allows parallelization on RNN's time axis, saving an order of magnitude in time complexity for training and inference.",
    "We propose a language-agnostic way of automatically generating sets of semantically similar clusters of entities along with sets of \"outlier\" elements, which may then be used to perform an intrinsic evaluation of word embeddings in the outlier detection task. We used our methodology to create a gold-standard dataset, which we call WikiSem500, and evaluated multiple state-of-the-art embeddings. The results show a correlation between performance on this dataset and performance on sentiment analysis.",
    "Recurrent neural nets are widely used for predicting temporal data. Their inherent deep feedforward structure allows learning complex sequential patterns. It is believed that top-down feedback might be an important missing ingredient which in theory could help disambiguate similar patterns depending on broader context. In this paper we introduce surprisal-driven recurrent networks, which take into account past error information when making new predictions. This is achieved by continuously monitoring the discrepancy between most recent predictions and the actual observations. Furthermore, we show that it outperforms other stochastic and fully deterministic approaches on enwik8 character level prediction task achieving 1.37 BPC on the test portion of the text.",
    "Although Generative Adversarial Networks achieve state-of-the-art results on a variety of generative tasks, they are regarded as highly unstable and prone to miss modes. We argue that these bad behaviors of GANs are due to the very particular functional shape of the trained discriminators in high dimensional spaces, which can easily make training stuck or push probability mass in the wrong direction, towards that of higher concentration than that of the data generating distribution. We introduce several ways of regularizing the objective, which can dramatically stabilize the training of GAN models. We also show that our regularizers can help the fair distribution of probability mass across the modes of the data generating distribution, during the early phases of training and thus providing a unified solution to the missing modes problem.",
    "Sample complexity and safety are major challenges when learning policies with reinforcement learning for real-world tasks, especially when the policies are represented using rich function approximators like deep neural networks. Model-based methods where the real-world target domain is approximated using a simulated source domain provide an avenue to tackle the above challenges by augmenting real data with simulated data. However, discrepancies between the simulated source domain and the target domain pose a challenge for simulated training. We introduce the EPOpt algorithm, which uses an ensemble of simulated source domains and a form of adversarial training to learn policies that are robust and generalize to a broad range of possible target domains, including unmodeled effects. Further, the probability distribution over source domains in the ensemble can be adapted using data from target domain and approximate Bayesian methods, to progressively make it a better approximation. Thus, learning on a model ensemble, along with source domain adaptation, provides the benefit of both robustness and learning/adaptation.",
    "We introduce Divnet, a flexible technique for learning networks with diverse neurons. Divnet models neuronal diversity by placing a Determinantal Point Process (DPP) over neurons in a given layer. It uses this DPP to select a subset of diverse neurons and subsequently fuses the redundant neurons into the selected ones. Compared with previous approaches, Divnet offers a more principled, flexible technique for capturing neuronal diversity and thus implicitly enforcing regularization. This enables effective auto-tuning of network architecture and leads to smaller network sizes without hurting performance. Moreover, through its focus on diversity and neuron fusing, Divnet remains compatible with other procedures that seek to reduce memory footprints of networks. We present experimental results to corroborate our claims: for pruning neural networks, Divnet is seen to be notably superior to competing approaches.",
    "The efficiency of graph-based semi-supervised algorithms depends on the graph of instances on which they are applied. The instances are often in a vectorial form before a graph linking them is built. The construction of the graph relies on a metric over the vectorial space that help define the weight of the connection between entities. The classic choice for this metric is usually a distance measure or a similarity measure based on the euclidean norm. We claim that in some cases the euclidean norm on the initial vectorial space might not be the more appropriate to solve the task efficiently. We propose an algorithm that aims at learning the most appropriate vectorial representation for building a graph on which the task at hand is solved efficiently.",
    "One major challenge in training Deep Neural Networks is preventing overfitting. Many techniques such as data augmentation and novel regularizers such as Dropout have been proposed to prevent overfitting without requiring a massive amount of training data. In this work, we propose a new regularizer called DeCov which leads to significantly reduced overfitting (as indicated by the difference between train and val performance), and better generalization. Our regularizer encourages diverse or non-redundant representations in Deep Neural Networks by minimizing the cross-covariance of hidden activations. This simple intuition has been explored in a number of past works but surprisingly has never been applied as a regularizer in supervised learning. Experiments across a range of datasets and network architectures show that this loss always reduces overfitting while almost always maintaining or increasing generalization performance and often improving performance over Dropout.",
    "Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.",
    "We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs. We motivate the choice of our convolutional architecture via a localized first-order approximation of spectral graph convolutions. Our model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes. In a number of experiments on citation networks and on a knowledge graph dataset we demonstrate that our approach outperforms related methods by a significant margin.",
    "We introduce the \"Energy-based Generative Adversarial Network\" model (EBGAN) which views the discriminator as an energy function that attributes low energies to the regions near the data manifold and higher energies to other regions. Similar to the probabilistic GANs, a generator is seen as being trained to produce contrastive samples with minimal energies, while the discriminator is trained to assign high energies to these generated samples. Viewing the discriminator as an energy function allows to use a wide variety of architectures and loss functionals in addition to the usual binary classifier with logistic output. Among them, we show one instantiation of EBGAN framework as using an auto-encoder architecture, with the energy being the reconstruction error, in place of the discriminator. We show that this form of EBGAN exhibits more stable behavior than regular GANs during training. We also show that a single-scale architecture can be trained to generate high-resolution images.",
    "Recent research in the deep learning field has produced a plethora of new architectures. At the same time, a growing number of groups are applying deep learning to new applications. Some of these groups are likely to be composed of inexperienced deep learning practitioners who are baffled by the dizzying array of architecture choices and therefore opt to use an older architecture (i.e., Alexnet). Here we attempt to bridge this gap by mining the collective knowledge contained in recent deep learning research to discover underlying principles for designing neural network architectures. In addition, we describe several architectural innovations, including Fractal of FractalNet network, Stagewise Boosting Networks, and Taylor Series Networks (our Caffe code and prototxt files is available at https://github.com/iPhysicist/CNNDesignPatterns). We hope others are inspired to build on our preliminary work.",
    "Machine comprehension (MC), answering a query about a given context paragraph, requires modeling complex interactions between the context and the query. Recently, attention mechanisms have been successfully extended to MC. Typically these methods use attention to focus on a small portion of the context and summarize it with a fixed-size vector, couple attentions temporally, and/or often form a uni-directional attention. In this paper we introduce the Bi-Directional Attention Flow (BIDAF) network, a multi-stage hierarchical process that represents the context at different levels of granularity and uses bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization. Our experimental evaluations show that our model achieves the state-of-the-art results in Stanford Question Answering Dataset (SQuAD) and CNN/DailyMail cloze test.",
    "Though with progress, model learning and performing posterior inference still remains a common challenge for using deep generative models, especially for handling discrete hidden variables. This paper is mainly concerned with algorithms for learning Helmholz machines, which is characterized by pairing the generative model with an auxiliary inference model. A common drawback of previous learning algorithms is that they indirectly optimize some bounds of the targeted marginal log-likelihood. In contrast, we successfully develop a new class of algorithms, based on stochastic approximation (SA) theory of the Robbins-Monro type, to directly optimize the marginal log-likelihood and simultaneously minimize the inclusive KL-divergence. The resulting learning algorithm is thus called joint SA (JSA). Moreover, we construct an effective MCMC operator for JSA. Our results on the MNIST datasets demonstrate that the JSA's performance is consistently superior to that of competing algorithms like RWS, for learning a range of difficult models.",
    "Object detection with deep neural networks is often performed by passing a few thousand candidate bounding boxes through a deep neural network for each image. These bounding boxes are highly correlated since they originate from the same image. In this paper we investigate how to exploit feature occurrence at the image scale to prune the neural network which is subsequently applied to all bounding boxes. We show that removing units which have near-zero activation in the image allows us to significantly reduce the number of parameters in the network. Results on the PASCAL 2007 Object Detection Challenge demonstrate that up to 40% of units in some fully-connected layers can be entirely eliminated with little change in the detection result.",
    "Modeling interactions between features improves the performance of machine learning solutions in many domains (e.g. recommender systems or sentiment analysis). In this paper, we introduce Exponential Machines (ExM), a predictor that models all interactions of every order. The key idea is to represent an exponentially large tensor of parameters in a factorized format called Tensor Train (TT). The Tensor Train format regularizes the model and lets you control the number of underlying parameters. To train the model, we develop a stochastic Riemannian optimization procedure, which allows us to fit tensors with 2^160 entries. We show that the model achieves state-of-the-art performance on synthetic data with high-order interactions and that it works on par with high-order factorization machines on a recommender system dataset MovieLens 100K.",
    "We introduce Deep Variational Bayes Filters (DVBF), a new method for unsupervised learning and identification of latent Markovian state space models. Leveraging recent advances in Stochastic Gradient Variational Bayes, DVBF can overcome intractable inference distributions via variational inference. Thus, it can handle highly nonlinear input data with temporal and spatial dependencies such as image sequences without domain knowledge. Our experiments show that enabling backpropagation through transitions enforces state space assumptions and significantly improves information content of the latent embedding. This also enables realistic long-term prediction.",
    "Traditional dialog systems used in goal-oriented applications require a lot of domain-specific handcrafting, which hinders scaling up to new domains. End-to-end dialog systems, in which all components are trained from the dialogs themselves, escape this limitation. But the encouraging success recently obtained in chit-chat dialog may not carry over to goal-oriented settings. This paper proposes a testbed to break down the strengths and shortcomings of end-to-end dialog systems in goal-oriented applications. Set in the context of restaurant reservation, our tasks require manipulating sentences and symbols, so as to properly conduct conversations, issue API calls and use the outputs of such calls. We show that an end-to-end dialog system based on Memory Networks can reach promising, yet imperfect, performance and learn to perform non-trivial operations. We confirm those results by comparing our system to a hand-crafted slot-filling baseline on data from the second Dialog State Tracking Challenge (Henderson et al., 2014a). We show similar result patterns on data extracted from an online concierge service.",
    "Adversarial training provides a means of regularizing supervised learning algorithms while virtual adversarial training is able to extend supervised learning algorithms to the semi-supervised setting. However, both methods require making small perturbations to numerous entries of the input vector, which is inappropriate for sparse high-dimensional inputs such as one-hot word representations. We extend adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself. The proposed method achieves state of the art results on multiple benchmark semi-supervised and purely supervised tasks. We provide visualizations and analysis showing that the learned word embeddings have improved in quality and that while training, the model is less prone to overfitting. Code is available at https://github.com/tensorflow/models/tree/master/research/adversarial_text.",
    "Unsupervised learning of probabilistic models is a central yet challenging problem in machine learning. Specifically, designing models with tractable learning, sampling, inference and evaluation is crucial in solving this task. We extend the space of such models using real-valued non-volume preserving (real NVP) transformations, a set of powerful invertible and learnable transformations, resulting in an unsupervised learning algorithm with exact log-likelihood computation, exact sampling, exact inference of latent variables, and an interpretable latent space. We demonstrate its ability to model natural images on four datasets through sampling, log-likelihood evaluation and latent variable manipulations.",
    "This paper is focused on studying the view-manifold structure in the feature spaces implied by the different layers of Convolutional Neural Networks (CNN). There are several questions that this paper aims to answer: Does the learned CNN representation achieve viewpoint invariance? How does it achieve viewpoint invariance? Is it achieved by collapsing the view manifolds, or separating them while preserving them? At which layer is view invariance achieved? How can the structure of the view manifold at each layer of a deep convolutional neural network be quantified experimentally? How does fine-tuning of a pre-trained CNN on a multi-view dataset affect the representation at each layer of the network? In order to answer these questions we propose a methodology to quantify the deformation and degeneracy of view manifolds in CNN layers. We apply this methodology and report interesting results in this paper that answer the aforementioned questions.",
    "Bilinear models provide rich representations compared with linear models. They have been applied in various visual tasks, such as object recognition, segmentation, and visual question-answering, to get state-of-the-art performances taking advantage of the expanded representations. However, bilinear representations tend to be high-dimensional, limiting the applicability to computationally complex tasks. We propose low-rank bilinear pooling using Hadamard product for an efficient attention mechanism of multimodal learning. We show that our model outperforms compact bilinear pooling in visual question-answering tasks with the state-of-the-art results on the VQA dataset, having a better parsimonious property.",
    "The standard interpretation of importance-weighted autoencoders is that they maximize a tighter lower bound on the marginal likelihood than the standard evidence lower bound. We give an alternate interpretation of this procedure: that it optimizes the standard variational lower bound, but using a more complex distribution. We formally derive this result, present a tighter lower bound, and visualize the implicit importance-weighted distribution.",
    "We present a generalization bound for feedforward neural networks in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The generalization bound is derived using a PAC-Bayes analysis.",
    "In this paper, we propose to equip Generative Adversarial Networks with the ability to produce direct energy estimates for samples.Specifically, we propose a flexible adversarial training framework, and prove this framework not only ensures the generator converges to the true data distribution, but also enables the discriminator to retain the density information at the global optimal. We derive the analytic form of the induced solution, and analyze the properties. In order to make the proposed framework trainable in practice, we introduce two effective approximation techniques. Empirically, the experiment results closely match our theoretical analysis, verifying the discriminator is able to recover the energy of data distribution.",
    "In this work we perform outlier detection using ensembles of neural networks obtained by variational approximation of the posterior in a Bayesian neural network setting. The variational parameters are obtained by sampling from the true posterior by gradient descent. We show our outlier detection results are comparable to those obtained using other efficient ensembling methods.",
    "We present two simple ways of reducing the number of parameters and accelerating the training of large Long Short-Term Memory (LSTM) networks: the first one is \"matrix factorization by design\" of LSTM matrix into the product of two smaller matrices, and the second one is partitioning of LSTM matrix, its inputs and states into the independent groups. Both approaches allow us to train large LSTM networks significantly faster to the near state-of the art perplexity while using significantly less RNN parameters.",
    "We present observations and discussion of previously unreported phenomena discovered while training residual networks. The goal of this work is to better understand the nature of neural networks through the examination of these new empirical results. These behaviors were identified through the application of Cyclical Learning Rates (CLR) and linear network interpolation. Among these behaviors are counterintuitive increases and decreases in training loss and instances of rapid training. For example, we demonstrate how CLR can produce greater testing accuracy than traditional training despite using large learning rates. Files to replicate these results are available at https://github.com/lnsmith54/exploring-loss",
    "Machine learning models are often used at test-time subject to constraints and trade-offs not present at training-time. For example, a computer vision model operating on an embedded device may need to perform real-time inference, or a translation model operating on a cell phone may wish to bound its average compute time in order to be power-efficient. In this work we describe a mixture-of-experts model and show how to change its test-time resource-usage on a per-input basis using reinforcement learning. We test our method on a small MNIST-based example.",
    "Adversarial examples have been shown to exist for a variety of deep learning architectures. Deep reinforcement learning has shown promising results on training agent policies directly on raw inputs such as image pixels. In this paper we present a novel study into adversarial attacks on deep reinforcement learning polices. We compare the effectiveness of the attacks using adversarial examples vs. random noise. We present a novel method for reducing the number of times adversarial examples need to be injected for a successful attack, based on the value function. We further explore how re-training on random noise and FGSM perturbations affects the resilience against adversarial examples.",
    "This paper develops variational continual learning (VCL), a simple but general framework for continual learning that fuses online variational inference (VI) and recent advances in Monte Carlo VI for neural networks. The framework can successfully train both deep discriminative models and deep generative models in complex continual learning settings where existing tasks evolve over time and entirely new tasks emerge. Experimental results show that VCL outperforms state-of-the-art continual learning methods on a variety of tasks, avoiding catastrophic forgetting in a fully automatic way.",
    "Automatically determining the optimal size of a neural network for a given task without prior information currently requires an expensive global search and training many networks from scratch. In this paper, we address the problem of automatically finding a good network size during a single training cycle. We introduce *nonparametric neural networks*, a non-probabilistic framework for conducting optimization over all possible network sizes and prove its soundness when network growth is limited via an L_p penalty. We train networks under this framework by continuously adding new units while eliminating redundant units via an L_2 penalty. We employ a novel optimization algorithm, which we term *adaptive radial-angular gradient descent* or *AdaRad*, and obtain promising results.",
    "Natural Language Inference (NLI) task requires an agent to determine the logical relationship between a natural language premise and a natural language hypothesis. We introduce Interactive Inference Network (IIN), a novel class of neural network architectures that is able to achieve high-level understanding of the sentence pair by hierarchically extracting semantic features from interaction space. We show that an interaction tensor (attention weight) contains semantic information to solve natural language inference, and a denser interaction tensor contains richer semantic information. One instance of such architecture, Densely Interactive Inference Network (DIIN), demonstrates the state-of-the-art performance on large scale NLI copora and large-scale NLI alike corpus. It's noteworthy that DIIN achieve a greater than 20% error reduction on the challenging Multi-Genre NLI (MultiNLI) dataset with respect to the strongest published system.",
    "The ability to deploy neural networks in real-world, safety-critical systems is severely limited by the presence of adversarial examples: slightly perturbed inputs that are misclassified by the network. In recent years, several techniques have been proposed for increasing robustness to adversarial examples --- and yet most of these have been quickly shown to be vulnerable to future attacks. For example, over half of the defenses proposed by papers accepted at ICLR 2018 have already been broken. We propose to address this difficulty through formal verification techniques. We show how to construct provably minimally distorted adversarial examples: given an arbitrary neural network and input sample, we can construct adversarial examples which we prove are of minimal distortion. Using this approach, we demonstrate that one of the recent ICLR defense proposals, adversarial retraining, provably succeeds at increasing the distortion required to construct adversarial examples by a factor of 4.2.",
    "We extend Stochastic Gradient Variational Bayes to perform posterior inference for the weights of Stick-Breaking processes. This development allows us to define a Stick-Breaking Variational Autoencoder (SB-VAE), a Bayesian nonparametric version of the variational autoencoder that has a latent representation with stochastic dimensionality. We experimentally demonstrate that the SB-VAE, and a semi-supervised variant, learn highly discriminative latent representations that often outperform the Gaussian VAE's.",
    "We propose a framework for training multiple neural networks simultaneously. The parameters from all models are regularised by the tensor trace norm, so that each neural network is encouraged to reuse others' parameters if possible -- this is the main motivation behind multi-task learning. In contrast to many deep multi-task learning models, we do not predefine a parameter sharing strategy by specifying which layers have tied parameters. Instead, our framework considers sharing for all shareable layers, and the sharing strategy is learned in a data-driven way.",
    "This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems. To achieve this, the paper introduces several innovations, including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.",
    "Many machine learning classifiers are vulnerable to adversarial perturbations. An adversarial perturbation modifies an input to change a classifier's prediction without causing the input to seem substantially different to human perception. We deploy three methods to detect adversarial images. Adversaries trying to bypass our detectors must make the adversarial image less pathological or they will fail trying. Our best detection method reveals that adversarial images place abnormal emphasis on the lower-ranked principal components from PCA. Other detectors and a colorful saliency map are in an appendix.",
    "We propose a principled method for kernel learning, which relies on a Fourier-analytic characterization of translation-invariant or rotation-invariant kernels. Our method produces a sequence of feature maps, iteratively refining the SVM margin. We provide rigorous guarantees for optimality and generalization, interpreting our algorithm as online equilibrium-finding dynamics in a certain two-player min-max game. Evaluations on synthetic and real-world datasets demonstrate scalability and consistent improvements over related random features-based methods.",
    "State-of-the-art deep reading comprehension models are dominated by recurrent neural nets. Their sequential nature is a natural fit for language, but it also precludes parallelization within an instances and often becomes the bottleneck for deploying such models to latency critical scenarios. This is particularly problematic for longer texts. Here we present a convolutional architecture as an alternative to these recurrent architectures. Using simple dilated convolutional units in place of recurrent ones, we achieve results comparable to the state of the art on two question answering tasks, while at the same time achieving up to two orders of magnitude speedups for question answering.",
    "This report has several purposes. First, our report is written to investigate the reproducibility of the submitted paper On the regularization of Wasserstein GANs (2018). Second, among the experiments performed in the submitted paper, five aspects were emphasized and reproduced: learning speed, stability, robustness against hyperparameter, estimating the Wasserstein distance, and various sampling method. Finally, we identify which parts of the contribution can be reproduced, and at what cost in terms of resources. All source code for reproduction is open to the public.",
    "Variational Autoencoders (VAEs) were originally motivated (Kingma & Welling, 2014) as probabilistic generative models in which one performs approximate Bayesian inference. The proposal of $\\beta$-VAEs (Higgins et al., 2017) breaks this interpretation and generalizes VAEs to application domains beyond generative modeling (e.g., representation learning, clustering, or lossy data compression) by introducing an objective function that allows practitioners to trade off between the information content (\"bit rate\") of the latent representation and the distortion of reconstructed data (Alemi et al., 2018). In this paper, we reconsider this rate/distortion trade-off in the context of hierarchical VAEs, i.e., VAEs with more than one layer of latent variables. We identify a general class of inference models for which one can split the rate into contributions from each layer, which can then be tuned independently. We derive theoretical bounds on the performance of downstream tasks as functions of the individual layers' rates and verify our theoretical findings in large-scale experiments. Our results provide guidance for practitioners on which region in rate-space to target for a given application.",
    "Methods that learn representations of nodes in a graph play a critical role in network analysis since they enable many downstream learning tasks. We propose Graph2Gauss - an approach that can efficiently learn versatile node embeddings on large scale (attributed) graphs that show strong performance on tasks such as link prediction and node classification. Unlike most approaches that represent nodes as point vectors in a low-dimensional continuous space, we embed each node as a Gaussian distribution, allowing us to capture uncertainty about the representation. Furthermore, we propose an unsupervised method that handles inductive learning scenarios and is applicable to different types of graphs: plain/attributed, directed/undirected. By leveraging both the network structure and the associated node attributes, we are able to generalize to unseen nodes without additional training. To learn the embeddings we adopt a personalized ranking formulation w.r.t. the node distances that exploits the natural ordering of the nodes imposed by the network structure. Experiments on real world networks demonstrate the high performance of our approach, outperforming state-of-the-art network embedding methods on several different tasks. Additionally, we demonstrate the benefits of modeling uncertainty - by analyzing it we can estimate neighborhood diversity and detect the intrinsic latent dimensionality of a graph.",
    "This paper explores the use of self-ensembling for visual domain adaptation problems. Our technique is derived from the mean teacher variant (Tarvainen et al., 2017) of temporal ensembling (Laine et al;, 2017), a technique that achieved state of the art results in the area of semi-supervised learning. We introduce a number of modifications to their approach for challenging domain adaptation scenarios and evaluate its effectiveness. Our approach achieves state of the art results in a variety of benchmarks, including our winning entry in the VISDA-2017 visual domain adaptation challenge. In small image benchmarks, our algorithm not only outperforms prior art, but can also achieve accuracy that is close to that of a classifier trained in a supervised fashion.",
    "Most machine learning classifiers, including deep neural networks, are vulnerable to adversarial examples. Such inputs are typically generated by adding small but purposeful modifications that lead to incorrect outputs while imperceptible to human eyes. The goal of this paper is not to introduce a single method, but to make theoretical steps towards fully understanding adversarial examples. By using concepts from topology, our theoretical analysis brings forth the key reasons why an adversarial example can fool a classifier ($f_1$) and adds its oracle ($f_2$, like human eyes) in such analysis. By investigating the topological relationship between two (pseudo)metric spaces corresponding to predictor $f_1$ and oracle $f_2$, we develop necessary and sufficient conditions that can determine if $f_1$ is always robust (strong-robust) against adversarial examples according to $f_2$. Interestingly our theorems indicate that just one unnecessary feature can make $f_1$ not strong-robust, and the right feature representation learning is the key to getting a classifier that is both accurate and strong-robust.",
    "We develop a general problem setting for training and testing the ability of agents to gather information efficiently. Specifically, we present a collection of tasks in which success requires searching through a partially-observed environment, for fragments of information which can be pieced together to accomplish various goals. We combine deep architectures with techniques from reinforcement learning to develop agents that solve our tasks. We shape the behavior of these agents by combining extrinsic and intrinsic rewards. We empirically demonstrate that these agents learn to search actively and intelligently for new information to reduce their uncertainty, and to exploit information they have already acquired.",
    "We propose an extension to neural network language models to adapt their prediction to the recent history. Our model is a simplified version of memory augmented networks, which stores past hidden activations as memory and accesses them through a dot product with the current hidden activation. This mechanism is very efficient and scales to very large memory sizes. We also draw a link between the use of external memory in neural network and cache models used with count based language models. We demonstrate on several language model datasets that our approach performs significantly better than recent memory augmented networks.",
    "Generative adversarial networks (GANs) are successful deep generative models. GANs are based on a two-player minimax game. However, the objective function derived in the original motivation is changed to obtain stronger gradients when learning the generator. We propose a novel algorithm that repeats the density ratio estimation and f-divergence minimization. Our algorithm offers a new perspective toward the understanding of GANs and is able to make use of multiple viewpoints obtained in the research of density ratio estimation, e.g. what divergence is stable and relative density ratio is useful.",
    "We present a novel framework for generating pop music. Our model is a hierarchical Recurrent Neural Network, where the layers and the structure of the hierarchy encode our prior knowledge about how pop music is composed. In particular, the bottom layers generate the melody, while the higher levels produce the drums and chords. We conduct several human studies that show strong preference of our generated music over that produced by the recent method by Google. We additionally show two applications of our framework: neural dancing and karaoke, as well as neural story singing.",
    "We look at the eigenvalues of the Hessian of a loss function before and after training. The eigenvalue distribution is seen to be composed of two parts, the bulk which is concentrated around zero, and the edges which are scattered away from zero. We present empirical evidence for the bulk indicating how over-parametrized the system is, and for the edges that depend on the input data.",
    "In this paper, we propose a new feature extraction technique for program execution logs. First, we automatically extract complex patterns from a program's behavior graph. Then, we embed these patterns into a continuous space by training an autoencoder. We evaluate the proposed features on a real-world malicious software detection task. We also find that the embedding space captures interpretable structures in the space of pattern parts.",
    "We compared the efficiency of the FlyHash model, an insect-inspired sparse neural network (Dasgupta et al., 2017), to similar but non-sparse models in an embodied navigation task. This requires a model to control steering by comparing current visual inputs to memories stored along a training route. We concluded the FlyHash model is more efficient than others, especially in terms of data encoding.",
    "In peer review, reviewers are usually asked to provide scores for the papers. The scores are then used by Area Chairs or Program Chairs in various ways in the decision-making process. The scores are usually elicited in a quantized form to accommodate the limited cognitive ability of humans to describe their opinions in numerical values. It has been found that the quantized scores suffer from a large number of ties, thereby leading to a significant loss of information. To mitigate this issue, conferences have started to ask reviewers to additionally provide a ranking of the papers they have reviewed. There are however two key challenges. First, there is no standard procedure for using this ranking information and Area Chairs may use it in different ways (including simply ignoring them), thereby leading to arbitrariness in the peer-review process. Second, there are no suitable interfaces for judicious use of this data nor methods to incorporate it in existing workflows, thereby leading to inefficiencies. We take a principled approach to integrate the ranking information into the scores. The output of our method is an updated score pertaining to each review that also incorporates the rankings. Our approach addresses the two aforementioned challenges by: (i) ensuring that rankings are incorporated into the updates scores in the same manner for all papers, thereby mitigating arbitrariness, and (ii) allowing to seamlessly use existing interfaces and workflows designed for scores. We empirically evaluate our method on synthetic datasets as well as on peer reviews from the ICLR 2017 conference, and find that it reduces the error by approximately 30% as compared to the best performing baseline on the ICLR 2017 data.",
    "Many recent studies have probed status bias in the peer-review process of academic journals and conferences. In this article, we investigated the association between author metadata and area chairs' final decisions (Accept/Reject) using our compiled database of 5,313 borderline submissions to the International Conference on Learning Representations (ICLR) from 2017 to 2022. We carefully defined elements in a cause-and-effect analysis, including the treatment and its timing, pre-treatment variables, potential outcomes and causal null hypothesis of interest, all in the context of study units being textual data and under Neyman and Rubin's potential outcomes (PO) framework. We found some weak evidence that author metadata was associated with articles' final decisions. We also found that, under an additional stability assumption, borderline articles from high-ranking institutions (top-30% or top-20%) were less favored by area chairs compared to their matched counterparts. The results were consistent in two different matched designs (odds ratio = 0.82 [95% CI: 0.67 to 1.00] in a first design and 0.83 [95% CI: 0.64 to 1.07] in a strengthened design). We discussed how to interpret these results in the context of multiple interactions between a study unit and different agents (reviewers and area chairs) in the peer-review system.",
    "We present a variational approximation to the information bottleneck of Tishby et al. (1999). This variational approach allows us to parameterize the information bottleneck model using a neural network and leverage the reparameterization trick for efficient training. We call this method \"Deep Variational Information Bottleneck\", or Deep VIB. We show that models trained with the VIB objective outperform those that are trained with other forms of regularization, in terms of generalization performance and robustness to adversarial attack.",
    "Attention networks have proven to be an effective approach for embedding categorical inference within a deep neural network. However, for many tasks we may want to model richer structural dependencies without abandoning end-to-end training. In this work, we experiment with incorporating richer structural distributions, encoded using graphical models, within deep networks. We show that these structured attention networks are simple extensions of the basic attention procedure, and that they allow for extending attention beyond the standard soft-selection approach, such as attending to partial segmentations or to subtrees. We experiment with two different classes of structured attention networks: a linear-chain conditional random field and a graph-based parsing model, and describe how these models can be practically implemented as neural network layers. Experiments show that this approach is effective for incorporating structural biases, and structured attention networks outperform baseline attention models on a variety of synthetic and real tasks: tree transduction, neural machine translation, question answering, and natural language inference. We further find that models trained in this way learn interesting unsupervised hidden representations that generalize simple attention.",
    "We are proposing to use an ensemble of diverse specialists, where speciality is defined according to the confusion matrix. Indeed, we observed that for adversarial instances originating from a given class, labeling tend to be done into a small subset of (incorrect) classes. Therefore, we argue that an ensemble of specialists should be better able to identify and reject fooling instances, with a high entropy (i.e., disagreement) over the decisions in the presence of adversaries. Experimental results obtained confirm that interpretation, opening a way to make the system more robust to adversarial examples through a rejection mechanism, rather than trying to classify them properly at any cost.",
    "In this paper, we present Neural Phrase-based Machine Translation (NPMT). Our method explicitly models the phrase structures in output sequences using Sleep-WAke Networks (SWAN), a recently proposed segmentation-based sequence modeling method. To mitigate the monotonic alignment requirement of SWAN, we introduce a new layer to perform (soft) local reordering of input sequences. Different from existing neural machine translation (NMT) approaches, NPMT does not use attention-based decoding mechanisms. Instead, it directly outputs phrases in a sequential order and can decode in linear time. Our experiments show that NPMT achieves superior performances on IWSLT 2014 German-English/English-German and IWSLT 2015 English-Vietnamese machine translation tasks compared with strong NMT baselines. We also observe that our method produces meaningful phrases in output languages.",
    "We present LR-GAN: an adversarial image generation model which takes scene structure and context into account. Unlike previous generative adversarial networks (GANs), the proposed GAN learns to generate image background and foregrounds separately and recursively, and stitch the foregrounds on the background in a contextually relevant manner to produce a complete natural image. For each foreground, the model learns to generate its appearance, shape and pose. The whole model is unsupervised, and is trained in an end-to-end manner with gradient descent methods. The experiments demonstrate that LR-GAN can generate more natural images with objects that are more human recognizable than DCGAN.",
    "We describe a simple scheme that allows an agent to learn about its environment in an unsupervised manner. Our scheme pits two versions of the same agent, Alice and Bob, against one another. Alice proposes a task for Bob to complete; and then Bob attempts to complete the task. In this work we will focus on two kinds of environments: (nearly) reversible environments and environments that can be reset. Alice will \"propose\" the task by doing a sequence of actions and then Bob must undo or repeat them, respectively. Via an appropriate reward structure, Alice and Bob automatically generate a curriculum of exploration, enabling unsupervised training of the agent. When Bob is deployed on an RL task within the environment, this unsupervised training reduces the number of supervised episodes needed to learn, and in some cases converges to a higher reward.",
    "Maximum entropy modeling is a flexible and popular framework for formulating statistical models given partial knowledge. In this paper, rather than the traditional method of optimizing over the continuous density directly, we learn a smooth and invertible transformation that maps a simple distribution to the desired maximum entropy distribution. Doing so is nontrivial in that the objective being maximized (entropy) is a function of the density itself. By exploiting recent developments in normalizing flow networks, we cast the maximum entropy problem into a finite-dimensional constrained optimization, and solve the problem by combining stochastic optimization with the augmented Lagrangian method. Simulation results demonstrate the effectiveness of our method, and applications to finance and computer vision show the flexibility and accuracy of using maximum entropy flow networks.",
    "With machine learning successfully applied to new daunting problems almost every day, general AI starts looking like an attainable goal. However, most current research focuses instead on important but narrow applications, such as image classification or machine translation. We believe this to be largely due to the lack of objective ways to measure progress towards broad machine intelligence. In order to fill this gap, we propose here a set of concrete desiderata for general AI, together with a platform to test machines on how well they satisfy such desiderata, while keeping all further complexities to a minimum.",
    "Neural networks that compute over graph structures are a natural fit for problems in a variety of domains, including natural language (parse trees) and cheminformatics (molecular graphs). However, since the computation graph has a different shape and size for every input, such networks do not directly support batched training or inference. They are also difficult to implement in popular deep learning libraries, which are based on static data-flow graphs. We introduce a technique called dynamic batching, which not only batches together operations between different input graphs of dissimilar shape, but also between different nodes within a single input graph. The technique allows us to create static graphs, using popular libraries, that emulate dynamic computation graphs of arbitrary shape and size. We further present a high-level library of compositional blocks that simplifies the creation of dynamic graph models. Using the library, we demonstrate concise and batch-wise parallel implementations for a variety of models from the literature.",
    "Although deep learning models have proven effective at solving problems in natural language processing, the mechanism by which they come to their conclusions is often unclear. As a result, these models are generally treated as black boxes, yielding no insight of the underlying learned patterns. In this paper we consider Long Short Term Memory networks (LSTMs) and demonstrate a new approach for tracking the importance of a given input to the LSTM for a given output. By identifying consistently important patterns of words, we are able to distill state of the art LSTMs on sentiment analysis and question answering into a set of representative phrases. This representation is then quantitatively validated by using the extracted phrases to construct a simple, rule-based classifier which approximates the output of the LSTM.",
    "Deep reinforcement learning has achieved many impressive results in recent years. However, tasks with sparse rewards or long horizons continue to pose significant challenges. To tackle these important problems, we propose a general framework that first learns useful skills in a pre-training environment, and then leverages the acquired skills for learning faster in downstream tasks. Our approach brings together some of the strengths of intrinsic motivation and hierarchical methods: the learning of useful skill is guided by a single proxy reward, the design of which requires very minimal domain knowledge about the downstream tasks. Then a high-level policy is trained on top of these skills, providing a significant improvement of the exploration and allowing to tackle sparse rewards in the downstream tasks. To efficiently pre-train a large span of skills, we use Stochastic Neural Networks combined with an information-theoretic regularizer. Our experiments show that this combination is effective in learning a wide span of interpretable skills in a sample-efficient way, and can significantly boost the learning performance uniformly across a wide range of downstream tasks.",
    "Deep generative models have achieved impressive success in recent years. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), as emerging families for generative model learning, have largely been considered as two distinct paradigms and received extensive independent studies respectively. This paper aims to establish formal connections between GANs and VAEs through a new formulation of them. We interpret sample generation in GANs as performing posterior inference, and show that GANs and VAEs involve minimizing KL divergences of respective posterior and inference distributions with opposite directions, extending the two learning phases of classic wake-sleep algorithm, respectively. The unified view provides a powerful tool to analyze a diverse set of existing model variants, and enables to transfer techniques across research lines in a principled way. For example, we apply the importance weighting method in VAE literatures for improved GAN learning, and enhance VAEs with an adversarial mechanism that leverages generated samples. Experiments show generality and effectiveness of the transferred techniques.",
    "We consider the problem of detecting out-of-distribution images in neural networks. We propose ODIN, a simple and effective method that does not require any change to a pre-trained neural network. Our method is based on the observation that using temperature scaling and adding small perturbations to the input can separate the softmax score distributions between in- and out-of-distribution images, allowing for more effective detection. We show in a series of experiments that ODIN is compatible with diverse network architectures and datasets. It consistently outperforms the baseline approach by a large margin, establishing a new state-of-the-art performance on this task. For example, ODIN reduces the false positive rate from the baseline 34.7% to 4.3% on the DenseNet (applied to CIFAR-10) when the true positive rate is 95%.",
    "A framework is presented for unsupervised learning of representations based on infomax principle for large-scale neural populations. We use an asymptotic approximation to the Shannon's mutual information for a large neural population to demonstrate that a good initial approximation to the global information-theoretic optimum can be obtained by a hierarchical infomax method. Starting from the initial solution, an efficient algorithm based on gradient descent of the final objective function is proposed to learn representations from the input datasets, and the method works for complete, overcomplete, and undercomplete bases. As confirmed by numerical experiments, our method is robust and highly efficient for extracting salient features from input datasets. Compared with the main existing methods, our algorithm has a distinct advantage in both the training speed and the robustness of unsupervised representation learning. Furthermore, the proposed method is easily extended to the supervised or unsupervised model for training deep structure networks.",
    "Recurrent Neural Networks (RNNs) continue to show outstanding performance in sequence modeling tasks. However, training RNNs on long sequences often face challenges like slow inference, vanishing gradients and difficulty in capturing long term dependencies. In backpropagation through time settings, these issues are tightly coupled with the large, sequential computational graph resulting from unfolding the RNN in time. We introduce the Skip RNN model which extends existing RNN models by learning to skip state updates and shortens the effective size of the computational graph. This model can also be encouraged to perform fewer state updates through a budget constraint. We evaluate the proposed model on various tasks and show how it can reduce the number of required RNN updates while preserving, and sometimes even improving, the performance of the baseline RNN models. Source code is publicly available at https://imatge-upc.github.io/skiprnn-2017-telecombcn/ .",
    "Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial warm restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence in accelerated gradient schemes to deal with ill-conditioned functions. In this paper, we propose a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks. We empirically study its performance on the CIFAR-10 and CIFAR-100 datasets, where we demonstrate new state-of-the-art results at 3.14% and 16.21%, respectively. We also demonstrate its advantages on a dataset of EEG recordings and on a downsampled version of the ImageNet dataset. Our source code is available at https://github.com/loshchil/SGDR",
    "Policy gradient methods have achieved remarkable successes in solving challenging reinforcement learning problems. However, it still often suffers from the large variance issue on policy gradient estimation, which leads to poor sample efficiency during training. In this work, we propose a control variate method to effectively reduce variance for policy gradient methods. Motivated by the Stein's identity, our method extends the previous control variate methods used in REINFORCE and advantage actor-critic by introducing more general action-dependent baseline functions. Empirical studies show that our method significantly improves the sample efficiency of the state-of-the-art policy gradient approaches.",
    "Skip connections made the training of very deep networks possible and have become an indispensable component in a variety of neural architectures. A completely satisfactory explanation for their success remains elusive. Here, we present a novel explanation for the benefits of skip connections in training very deep networks. The difficulty of training deep networks is partly due to the singularities caused by the non-identifiability of the model. Several such singularities have been identified in previous works: (i) overlap singularities caused by the permutation symmetry of nodes in a given layer, (ii) elimination singularities corresponding to the elimination, i.e. consistent deactivation, of nodes, (iii) singularities generated by the linear dependence of the nodes. These singularities cause degenerate manifolds in the loss landscape that slow down learning. We argue that skip connections eliminate these singularities by breaking the permutation symmetry of nodes, by reducing the possibility of node elimination and by making the nodes less linearly dependent. Moreover, for typical initializations, skip connections move the network away from the \"ghosts\" of these singularities and sculpt the landscape around them to alleviate the learning slow-down. These hypotheses are supported by evidence from simplified models, as well as from experiments with deep networks trained on real-world datasets.",
    "We have tried to reproduce the results of the paper \"Natural Language Inference over Interaction Space\" submitted to ICLR 2018 conference as part of the ICLR 2018 Reproducibility Challenge. Initially, we were not aware that the code was available, so we started to implement the network from scratch. We have evaluated our version of the model on Stanford NLI dataset and reached 86.38% accuracy on the test set, while the paper claims 88.0% accuracy. The main difference, as we understand it, comes from the optimizers and the way model selection is performed.",
    "We have successfully implemented the \"Learn to Pay Attention\" model of attention mechanism in convolutional neural networks, and have replicated the results of the original paper in the categories of image classification and fine-grained recognition.",
    "Computing universal distributed representations of sentences is a fundamental task in natural language processing. We propose a method to learn such representations by encoding the suffixes of word sequences in a sentence and training on the Stanford Natural Language Inference (SNLI) dataset. We demonstrate the effectiveness of our approach by evaluating it on the SentEval benchmark, improving on existing approaches on several transfer tasks.",
    "In many neural models, new features as polynomial functions of existing ones are used to augment representations. Using the natural language inference task as an example, we investigate the use of scaled polynomials of degree 2 and above as matching features. We find that scaling degree 2 features has the highest impact on performance, reducing classification error by 5% in the best models.",
    "We present a generalization bound for feedforward neural networks in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The generalization bound is derived using a PAC-Bayes analysis.",
    "In this work, we investigate Batch Normalization technique and propose its probabilistic interpretation. We propose a probabilistic model and show that Batch Normalization maximazes the lower bound of its marginalized log-likelihood. Then, according to the new probabilistic model, we design an algorithm which acts consistently during train and test. However, inference becomes computationally inefficient. To reduce memory and computational cost, we propose Stochastic Batch Normalization -- an efficient approximation of proper inference procedure. This method provides us with a scalable uncertainty estimation technique. We demonstrate the performance of Stochastic Batch Normalization on popular architectures (including deep convolutional architectures: VGG-like and ResNets) for MNIST and CIFAR-10 datasets.",
    "It is widely believed that the success of deep convolutional networks is based on progressively discarding uninformative variability about the input with respect to the problem at hand. This is supported empirically by the difficulty of recovering images from their hidden representations, in most commonly used network architectures. In this paper we show via a one-to-one mapping that this loss of information is not a necessary condition to learn representations that generalize well on complicated problems, such as ImageNet. Via a cascade of homeomorphic layers, we build the i-RevNet, a network that can be fully inverted up to the final projection onto the classes, i.e. no information is discarded. Building an invertible architecture is difficult, for one, because the local inversion is ill-conditioned, we overcome this by providing an explicit inverse. An analysis of i-RevNets learned representations suggests an alternative explanation for the success of deep networks by a progressive contraction and linear separation with depth. To shed light on the nature of the model learned by the i-RevNet we reconstruct linear interpolations between natural image representations.",
    "Deep latent variable models are powerful tools for representation learning. In this paper, we adopt the deep information bottleneck model, identify its shortcomings and propose a model that circumvents them. To this end, we apply a copula transformation which, by restoring the invariance properties of the information bottleneck method, leads to disentanglement of the features in the latent space. Building on that, we show how this transformation translates to sparsity of the latent space in the new model. We evaluate our method on artificial and real data.",
    "We introduce a variant of the MAC model (Hudson and Manning, ICLR 2018) with a simplified set of equations that achieves comparable accuracy, while training faster. We evaluate both models on CLEVR and CoGenT, and show that, transfer learning with fine-tuning results in a 15 point increase in accuracy, matching the state of the art. Finally, in contrast, we demonstrate that improper fine-tuning can actually reduce a model's accuracy as well.",
    "Adaptive Computation Time for Recurrent Neural Networks (ACT) is one of the most promising architectures for variable computation. ACT adapts to the input sequence by being able to look at each sample more than once, and learn how many times it should do it. In this paper, we compare ACT to Repeat-RNN, a novel architecture based on repeating each sample a fixed number of times. We found surprising results, where Repeat-RNN performs as good as ACT in the selected tasks. Source code in TensorFlow and PyTorch is publicly available at https://imatge-upc.github.io/danifojo-2018-repeatrnn/",
    "Generative adversarial networks (GANs) are able to model the complex highdimensional distributions of real-world data, which suggests they could be effective for anomaly detection. However, few works have explored the use of GANs for the anomaly detection task. We leverage recently developed GAN models for anomaly detection, and achieve state-of-the-art performance on image and network intrusion datasets, while being several hundred-fold faster at test time than the only published GAN-based method.",
    "Natural Language Inference (NLI) task requires an agent to determine the logical relationship between a natural language premise and a natural language hypothesis. We introduce Interactive Inference Network (IIN), a novel class of neural network architectures that is able to achieve high-level understanding of the sentence pair by hierarchically extracting semantic features from interaction space. We show that an interaction tensor (attention weight) contains semantic information to solve natural language inference, and a denser interaction tensor contains richer semantic information. One instance of such architecture, Densely Interactive Inference Network (DIIN), demonstrates the state-of-the-art performance on large scale NLI copora and large-scale NLI alike corpus. It's noteworthy that DIIN achieve a greater than 20% error reduction on the challenging Multi-Genre NLI (MultiNLI) dataset with respect to the strongest published system.",
    "The ability to deploy neural networks in real-world, safety-critical systems is severely limited by the presence of adversarial examples: slightly perturbed inputs that are misclassified by the network. In recent years, several techniques have been proposed for increasing robustness to adversarial examples --- and yet most of these have been quickly shown to be vulnerable to future attacks. For example, over half of the defenses proposed by papers accepted at ICLR 2018 have already been broken. We propose to address this difficulty through formal verification techniques. We show how to construct provably minimally distorted adversarial examples: given an arbitrary neural network and input sample, we can construct adversarial examples which we prove are of minimal distortion. Using this approach, we demonstrate that one of the recent ICLR defense proposals, adversarial retraining, provably succeeds at increasing the distortion required to construct adversarial examples by a factor of 4.2.",
    "Deep neural networks (DNNs) have achieved impressive predictive performance due to their ability to learn complex, non-linear relationships between variables. However, the inability to effectively visualize these relationships has led to DNNs being characterized as black boxes and consequently limited their applications. To ameliorate this problem, we introduce the use of hierarchical interpretations to explain DNN predictions through our proposed method, agglomerative contextual decomposition (ACD). Given a prediction from a trained DNN, ACD produces a hierarchical clustering of the input features, along with the contribution of each cluster to the final prediction. This hierarchy is optimized to identify clusters of features that the DNN learned are predictive. Using examples from Stanford Sentiment Treebank and ImageNet, we show that ACD is effective at diagnosing incorrect predictions and identifying dataset bias. Through human experiments, we demonstrate that ACD enables users both to identify the more accurate of two DNNs and to better trust a DNN's outputs. We also find that ACD's hierarchy is largely robust to adversarial perturbations, implying that it captures fundamental aspects of the input and ignores spurious noise.",
    "In this work, we address the problem of musical timbre transfer, where the goal is to manipulate the timbre of a sound sample from one instrument to match another instrument while preserving other musical content, such as pitch, rhythm, and loudness. In principle, one could apply image-based style transfer techniques to a time-frequency representation of an audio signal, but this depends on having a representation that allows independent manipulation of timbre as well as high-quality waveform generation. We introduce TimbreTron, a method for musical timbre transfer which applies \"image\" domain style transfer to a time-frequency representation of the audio signal, and then produces a high-quality waveform using a conditional WaveNet synthesizer. We show that the Constant Q Transform (CQT) representation is particularly well-suited to convolutional architectures due to its approximate pitch equivariance. Based on human perceptual evaluations, we confirmed that TimbreTron recognizably transferred the timbre while otherwise preserving the musical content, for both monophonic and polyphonic samples.",
    "We consider the task of word-level language modeling and study the possibility of combining hidden-states-based short-term representations with medium-term representations encoded in dynamical weights of a language model. Our work extends recent experiments on language models with dynamically evolving weights by casting the language modeling problem into an online learning-to-learn framework in which a meta-learner is trained by gradient-descent to continuously update a language model weights.",
    "GANS are powerful generative models that are able to model the manifold of natural images. We leverage this property to perform manifold regularization by approximating the Laplacian norm using a Monte Carlo approximation that is easily computed with the GAN. When incorporated into the feature-matching GAN of Improved GAN, we achieve state-of-the-art results for GAN-based semi-supervised learning on the CIFAR-10 dataset, with a method that is significantly easier to implement than competing methods.",
    "We identify a class of over-parameterized deep neural networks with standard activation functions and cross-entropy loss which provably have no bad local valley, in the sense that from any point in parameter space there exists a continuous path on which the cross-entropy loss is non-increasing and gets arbitrarily close to zero. This implies that these networks have no sub-optimal strict local minima.",
    "Visual Question Answering (VQA) models have struggled with counting objects in natural images so far. We identify a fundamental problem due to soft attention in these models as a cause. To circumvent this problem, we propose a neural network component that allows robust counting from object proposals. Experiments on a toy task show the effectiveness of this component and we obtain state-of-the-art accuracy on the number category of the VQA v2 dataset without negatively affecting other categories, even outperforming ensemble models with our single model. On a difficult balanced pair metric, the component gives a substantial improvement in counting over a strong baseline by 6.6%.",
    "One of the challenges in the study of generative adversarial networks is the instability of its training. In this paper, we propose a novel weight normalization technique called spectral normalization to stabilize the training of the discriminator. Our new normalization technique is computationally light and easy to incorporate into existing implementations. We tested the efficacy of spectral normalization on CIFAR10, STL-10, and ILSVRC2012 dataset, and we experimentally confirmed that spectrally normalized GANs (SN-GANs) is capable of generating images of better or equal quality relative to the previous training stabilization techniques.",
    "Embedding graph nodes into a vector space can allow the use of machine learning to e.g. predict node classes, but the study of node embedding algorithms is immature compared to the natural language processing field because of a diverse nature of graphs. We examine the performance of node embedding algorithms with respect to graph centrality measures that characterize diverse graphs, through systematic experiments with four node embedding algorithms, four or five graph centralities, and six datasets. Experimental results give insights into the properties of node embedding algorithms, which can be a basis for further research on this topic.",
    "We introduce a new dataset of logical entailments for the purpose of measuring models' ability to capture and exploit the structure of logical expressions against an entailment prediction task. We use this task to compare a series of architectures which are ubiquitous in the sequence-processing literature, in addition to a new model class---PossibleWorldNets---which computes entailment as a \"convolution over possible worlds\". Results show that convolutional networks present the wrong inductive bias for this class of problems relative to LSTM RNNs, tree-structured neural networks outperform LSTM RNNs due to their enhanced ability to exploit the syntax of logic, and PossibleWorldNets outperform all benchmarks.",
    "Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance.   We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the \"lottery ticket hypothesis:\" dense, randomly-initialized, feed-forward networks contain subnetworks (\"winning tickets\") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.   We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.",
    "We characterize the singular values of the linear transformation associated with a standard 2D multi-channel convolutional layer, enabling their efficient computation. This characterization also leads to an algorithm for projecting a convolutional layer onto an operator-norm ball. We show that this is an effective regularizer; for example, it improves the test error of a deep residual network using batch normalization on CIFAR-10 from 6.2\\% to 5.3\\%.",
    "Understanding theoretical properties of deep and locally connected nonlinear network, such as deep convolutional neural network (DCNN), is still a hard problem despite its empirical success. In this paper, we propose a novel theoretical framework for such networks with ReLU nonlinearity. The framework explicitly formulates data distribution, favors disentangled representations and is compatible with common regularization techniques such as Batch Norm. The framework is built upon teacher-student setting, by expanding the student forward/backward propagation onto the teacher's computational graph. The resulting model does not impose unrealistic assumptions (e.g., Gaussian inputs, independence of activation, etc). Our framework could help facilitate theoretical analysis of many practical issues, e.g. overfitting, generalization, disentangled representations in deep networks.",
    "We present a Neural Program Search, an algorithm to generate programs from natural language description and a small number of input/output examples. The algorithm combines methods from Deep Learning and Program Synthesis fields by designing rich domain-specific language (DSL) and defining efficient search algorithm guided by a Seq2Tree model on it. To evaluate the quality of the approach we also present a semi-synthetic dataset of descriptions with test examples and corresponding programs. We show that our algorithm significantly outperforms a sequence-to-sequence model with attention baseline.",
    "Most state-of-the-art neural machine translation systems, despite being different in architectural skeletons (e.g. recurrence, convolutional), share an indispensable feature: the Attention. However, most existing attention methods are token-based and ignore the importance of phrasal alignments, the key ingredient for the success of phrase-based statistical machine translation. In this paper, we propose novel phrase-based attention methods to model n-grams of tokens as attention entities. We incorporate our phrase-based attentions into the recently proposed Transformer network, and demonstrate that our approach yields improvements of 1.3 BLEU for English-to-German and 0.5 BLEU for German-to-English translation tasks on WMT newstest2014 using WMT'16 training data.",
    "We introduce the problem of learning distributed representations of edits. By combining a \"neural editor\" with an \"edit encoder\", our models learn to represent the salient information of an edit and can be used to apply edits to new inputs. We experiment on natural language and source code edit data. Our evaluation yields promising results that suggest that our neural network models learn to capture the structure and semantics of edits. We hope that this interesting task and data source will inspire other researchers to work further on this problem.",
    "We propose a principled method for kernel learning, which relies on a Fourier-analytic characterization of translation-invariant or rotation-invariant kernels. Our method produces a sequence of feature maps, iteratively refining the SVM margin. We provide rigorous guarantees for optimality and generalization, interpreting our algorithm as online equilibrium-finding dynamics in a certain two-player min-max game. Evaluations on synthetic and real-world datasets demonstrate scalability and consistent improvements over related random features-based methods.",
    "This paper develops variational continual learning (VCL), a simple but general framework for continual learning that fuses online variational inference (VI) and recent advances in Monte Carlo VI for neural networks. The framework can successfully train both deep discriminative models and deep generative models in complex continual learning settings where existing tasks evolve over time and entirely new tasks emerge. Experimental results show that VCL outperforms state-of-the-art continual learning methods on a variety of tasks, avoiding catastrophic forgetting in a fully automatic way.",
    "This report has several purposes. First, our report is written to investigate the reproducibility of the submitted paper On the regularization of Wasserstein GANs (2018). Second, among the experiments performed in the submitted paper, five aspects were emphasized and reproduced: learning speed, stability, robustness against hyperparameter, estimating the Wasserstein distance, and various sampling method. Finally, we identify which parts of the contribution can be reproduced, and at what cost in terms of resources. All source code for reproduction is open to the public.",
    "In this paper, we propose a new feature extraction technique for program execution logs. First, we automatically extract complex patterns from a program's behavior graph. Then, we embed these patterns into a continuous space by training an autoencoder. We evaluate the proposed features on a real-world malicious software detection task. We also find that the embedding space captures interpretable structures in the space of pattern parts.",
    "We propose a single neural probabilistic model based on variational autoencoder that can be conditioned on an arbitrary subset of observed features and then sample the remaining features in \"one shot\". The features may be both real-valued and categorical. Training of the model is performed by stochastic variational Bayes. The experimental evaluation on synthetic data, as well as feature imputation and image inpainting problems, shows the effectiveness of the proposed approach and diversity of the generated samples.",
    "Variational Autoencoders (VAEs) were originally motivated (Kingma & Welling, 2014) as probabilistic generative models in which one performs approximate Bayesian inference. The proposal of $\\beta$-VAEs (Higgins et al., 2017) breaks this interpretation and generalizes VAEs to application domains beyond generative modeling (e.g., representation learning, clustering, or lossy data compression) by introducing an objective function that allows practitioners to trade off between the information content (\"bit rate\") of the latent representation and the distortion of reconstructed data (Alemi et al., 2018). In this paper, we reconsider this rate/distortion trade-off in the context of hierarchical VAEs, i.e., VAEs with more than one layer of latent variables. We identify a general class of inference models for which one can split the rate into contributions from each layer, which can then be tuned independently. We derive theoretical bounds on the performance of downstream tasks as functions of the individual layers' rates and verify our theoretical findings in large-scale experiments. Our results provide guidance for practitioners on which region in rate-space to target for a given application.",
    "Understanding and characterizing the subspaces of adversarial examples aid in studying the robustness of deep neural networks (DNNs) to adversarial perturbations. Very recently, Ma et al. (ICLR 2018) proposed to use local intrinsic dimensionality (LID) in layer-wise hidden representations of DNNs to study adversarial subspaces. It was demonstrated that LID can be used to characterize the adversarial subspaces associated with different attack methods, e.g., the Carlini and Wagner's (C&W) attack and the fast gradient sign attack.   In this paper, we use MNIST and CIFAR-10 to conduct two new sets of experiments that are absent in existing LID analysis and report the limitation of LID in characterizing the corresponding adversarial subspaces, which are (i) oblivious attacks and LID analysis using adversarial examples with different confidence levels; and (ii) black-box transfer attacks. For (i), we find that the performance of LID is very sensitive to the confidence parameter deployed by an attack, and the LID learned from ensembles of adversarial examples with varying confidence levels surprisingly gives poor performance. For (ii), we find that when adversarial examples are crafted from another DNN model, LID is ineffective in characterizing their adversarial subspaces. These two findings together suggest the limited capability of LID in characterizing the subspaces of adversarial examples.",
    "Generative adversarial networks (GANs) form a generative modeling approach known for producing appealing samples, but they are notably difficult to train. One common way to tackle this issue has been to propose new formulations of the GAN objective. Yet, surprisingly few studies have looked at optimization methods designed for this adversarial training. In this work, we cast GAN optimization problems in the general variational inequality framework. Tapping into the mathematical programming literature, we counter some common misconceptions about the difficulties of saddle point optimization and propose to extend techniques designed for variational inequalities to the training of GANs. We apply averaging, extrapolation and a computationally cheaper variant that we call extrapolation from the past to the stochastic gradient method (SGD) and Adam.",
    "Neural message passing algorithms for semi-supervised classification on graphs have recently achieved great success. However, for classifying a node these methods only consider nodes that are a few propagation steps away and the size of this utilized neighborhood is hard to extend. In this paper, we use the relationship between graph convolutional networks (GCN) and PageRank to derive an improved propagation scheme based on personalized PageRank. We utilize this propagation procedure to construct a simple model, personalized propagation of neural predictions (PPNP), and its fast approximation, APPNP. Our model's training time is on par or faster and its number of parameters on par or lower than previous models. It leverages a large, adjustable neighborhood for classification and can be easily combined with any neural network. We show that this model outperforms several recently proposed methods for semi-supervised classification in the most thorough study done so far for GCN-like models. Our implementation is available online.",
    "We identify obfuscated gradients, a kind of gradient masking, as a phenomenon that leads to a false sense of security in defenses against adversarial examples. While defenses that cause obfuscated gradients appear to defeat iterative optimization-based attacks, we find defenses relying on this effect can be circumvented. We describe characteristic behaviors of defenses exhibiting the effect, and for each of the three types of obfuscated gradients we discover, we develop attack techniques to overcome it. In a case study, examining non-certified white-box-secure defenses at ICLR 2018, we find obfuscated gradients are a common occurrence, with 7 of 9 defenses relying on obfuscated gradients. Our new attacks successfully circumvent 6 completely, and 1 partially, in the original threat model each paper considers.",
    "Methods that learn representations of nodes in a graph play a critical role in network analysis since they enable many downstream learning tasks. We propose Graph2Gauss - an approach that can efficiently learn versatile node embeddings on large scale (attributed) graphs that show strong performance on tasks such as link prediction and node classification. Unlike most approaches that represent nodes as point vectors in a low-dimensional continuous space, we embed each node as a Gaussian distribution, allowing us to capture uncertainty about the representation. Furthermore, we propose an unsupervised method that handles inductive learning scenarios and is applicable to different types of graphs: plain/attributed, directed/undirected. By leveraging both the network structure and the associated node attributes, we are able to generalize to unseen nodes without additional training. To learn the embeddings we adopt a personalized ranking formulation w.r.t. the node distances that exploits the natural ordering of the nodes imposed by the network structure. Experiments on real world networks demonstrate the high performance of our approach, outperforming state-of-the-art network embedding methods on several different tasks. Additionally, we demonstrate the benefits of modeling uncertainty - by analyzing it we can estimate neighborhood diversity and detect the intrinsic latent dimensionality of a graph.",
    "Convolutional Neural Networks (CNNs) have become the method of choice for learning problems involving 2D planar images. However, a number of problems of recent interest have created a demand for models that can analyze spherical images. Examples include omnidirectional vision for drones, robots, and autonomous cars, molecular regression problems, and global weather and climate modelling. A naive application of convolutional networks to a planar projection of the spherical signal is destined to fail, because the space-varying distortions introduced by such a projection will make translational weight sharing ineffective.   In this paper we introduce the building blocks for constructing spherical CNNs. We propose a definition for the spherical cross-correlation that is both expressive and rotation-equivariant. The spherical correlation satisfies a generalized Fourier theorem, which allows us to compute it efficiently using a generalized (non-commutative) Fast Fourier Transform (FFT) algorithm. We demonstrate the computational efficiency, numerical accuracy, and effectiveness of spherical CNNs applied to 3D model recognition and atomization energy regression.",
    "This paper shows how one can directly apply natural language processing (NLP) methods to classification problems in cheminformatics. Connection between these seemingly separate fields is shown by considering standard textual representation of compound, SMILES. The problem of activity prediction against a target protein is considered, which is a crucial part of computer aided drug design process. Conducted experiments show that this way one can not only outrank state of the art results of hand crafted representations but also gets direct structural insights into the way decisions are made.",
    "The inclusion of Computer Vision and Deep Learning technologies in Agriculture aims to increase the harvest quality, and productivity of farmers. During postharvest, the export market and quality evaluation are affected by assorting of fruits and vegetables. In particular, apples are susceptible to a wide range of defects that can occur during harvesting or/and during the post-harvesting period. This paper aims to help farmers with post-harvest handling by exploring if recent computer vision and deep learning methods such as the YOLOv3 (Redmon & Farhadi (2018)) can help in detecting healthy apples from apples with defects.",
    "We present two simple ways of reducing the number of parameters and accelerating the training of large Long Short-Term Memory (LSTM) networks: the first one is \"matrix factorization by design\" of LSTM matrix into the product of two smaller matrices, and the second one is partitioning of LSTM matrix, its inputs and states into the independent groups. Both approaches allow us to train large LSTM networks significantly faster to the near state-of the art perplexity while using significantly less RNN parameters.",
    "State-of-the-art deep reading comprehension models are dominated by recurrent neural nets. Their sequential nature is a natural fit for language, but it also precludes parallelization within an instances and often becomes the bottleneck for deploying such models to latency critical scenarios. This is particularly problematic for longer texts. Here we present a convolutional architecture as an alternative to these recurrent architectures. Using simple dilated convolutional units in place of recurrent ones, we achieve results comparable to the state of the art on two question answering tasks, while at the same time achieving up to two orders of magnitude speedups for question answering.",
    "In this work, we analyze the reinstatement mechanism introduced by Ritter et al. (2018) to reveal two classes of neurons that emerge in the agent's working memory (an epLSTM cell) when trained using episodic meta-RL on an episodic variant of the Harlow visual fixation task. Specifically, Abstract neurons encode knowledge shared across tasks, while Episodic neurons carry information relevant for a specific episode's task.",
    "The rate-distortion-perception function (RDPF; Blau and Michaeli, 2019) has emerged as a useful tool for thinking about realism and distortion of reconstructions in lossy compression. Unlike the rate-distortion function, however, it is unknown whether encoders and decoders exist that achieve the rate suggested by the RDPF. Building on results by Li and El Gamal (2018), we show that the RDPF can indeed be achieved using stochastic, variable-length codes. For this class of codes, we also prove that the RDPF lower-bounds the achievable rate",
    "In this paper, we present Neural Phrase-based Machine Translation (NPMT). Our method explicitly models the phrase structures in output sequences using Sleep-WAke Networks (SWAN), a recently proposed segmentation-based sequence modeling method. To mitigate the monotonic alignment requirement of SWAN, we introduce a new layer to perform (soft) local reordering of input sequences. Different from existing neural machine translation (NMT) approaches, NPMT does not use attention-based decoding mechanisms. Instead, it directly outputs phrases in a sequential order and can decode in linear time. Our experiments show that NPMT achieves superior performances on IWSLT 2014 German-English/English-German and IWSLT 2015 English-Vietnamese machine translation tasks compared with strong NMT baselines. We also observe that our method produces meaningful phrases in output languages.",
    "It is by now well-known that small adversarial perturbations can induce classification errors in deep neural networks (DNNs). In this paper, we make the case that sparse representations of the input data are a crucial tool for combating such attacks. For linear classifiers, we show that a sparsifying front end is provably effective against $\\ell_{\\infty}$-bounded attacks, reducing output distortion due to the attack by a factor of roughly $K / N$ where $N$ is the data dimension and $K$ is the sparsity level. We then extend this concept to DNNs, showing that a \"locally linear\" model can be used to develop a theoretical foundation for crafting attacks and defenses. Experimental results for the MNIST dataset show the efficacy of the proposed sparsifying front end.",
    "We propose a new sample-efficient methodology, called Supervised Policy Update (SPU), for deep reinforcement learning. Starting with data generated by the current policy, SPU formulates and solves a constrained optimization problem in the non-parameterized proximal policy space. Using supervised regression, it then converts the optimal non-parameterized policy to a parameterized policy, from which it draws new samples. The methodology is general in that it applies to both discrete and continuous action spaces, and can handle a wide variety of proximity constraints for the non-parameterized optimization problem. We show how the Natural Policy Gradient and Trust Region Policy Optimization (NPG/TRPO) problems, and the Proximal Policy Optimization (PPO) problem can be addressed by this methodology. The SPU implementation is much simpler than TRPO. In terms of sample efficiency, our extensive experiments show SPU outperforms TRPO in Mujoco simulated robotic tasks and outperforms PPO in Atari video game tasks.",
    "We present a parameterized synthetic dataset called Moving Symbols to support the objective study of video prediction networks. Using several instantiations of the dataset in which variation is explicitly controlled, we highlight issues in an existing state-of-the-art approach and propose the use of a performance metric with greater semantic meaning to improve experimental interpretability. Our dataset provides canonical test cases that will help the community better understand, and eventually improve, the representations learned by such networks in the future. Code is available at https://github.com/rszeto/moving-symbols ."
  ],
  "sampled": [
    "This report discusses our implementation of AlexNet using Theano, running on multiple GPUs with naive data parallelism. Our performance on 2 GPUs matches the Caffe library running on 1 GPU. This is believed to be the first open-source Python-based AlexNet implementation available.",
    "We demonstrate that deep narrow Boltzmann machines can approximate probability distributions on their visible units effectively if they have enough hidden layers, each containing the same number of units as the visible layer. Additionally, within specific parameter ranges, we can analyze deep Boltzmann machines as feedforward networks. We establish both upper and lower limits on the necessary depth and width of these universal approximators. These findings clarify ideas related to undirected networks and demonstrate that deep narrow Boltzmann machines are as efficient as narrow sigmoid belief networks and restricted Boltzmann machines when considering current bounds for those models.",
    "By utilizing improvements in variational inference, we suggest improving recurrent neural networks by incorporating latent variables, creating Stochastic Recurrent Networks (STORNs). This model: i) can be trained using stochastic gradient techniques, ii) enables structured and multi-modal conditions at each time point, iii) includes a dependable estimator of the marginal likelihood, and iv) is an extension of deterministic recurrent neural networks. We assess the effectiveness of this approach using four polyphonic musical datasets and motion capture data.",
    "We introduce a novel framework for real-time adjustment of optimization hyperparameters called \"hot swapping\" during learning. This method, inspired by the explore-exploit strategy from the multi-armed bandit theory, is studied in the context of adaptive learning rate selection. Our experiments on a standard neural network demonstrate that hot swapping yields superior results compared to popular methods like AdaDelta and stochastic gradient with exhaustive hyperparameter search.",
    "Many modern multiclass and multilabel problems have vast output spaces, presenting significant computational challenges. Label embeddings emerge as a pivotal solution to enhance efficiency in such scenarios. Our innovative approach establishes a direct connection between rank-constrained estimation and low-dimensional label embeddings, leading to the development of a swift label embedding algorithm applicable in both multiclass and multilabel contexts. The outcome is a randomized algorithm for partial least squares, boasting exponential speed enhancements compared to traditional approaches. By applying our methodology to extensive public datasets such as the Large Scale Hierarchical Text Challenge and the Open Directory Project, we achieve cutting-edge results, showcasing the effectiveness of our techniques.",
    "Precise depiction of both explicit and implicit relationships within data is vital for enabling machines to carry out advanced and abstract reasoning tasks effectively. In this study, we present the methodology for weakly supervised learning of these inferences through our Dynamic Adaptive Network Intelligence (DANI) model. Our research showcases the cutting-edge performance of the DANI model in addressing question answering tasks within the bAbI dataset, which have posed challenges for current representation learning approaches (Weston et al., 2015).",
    "Spectral-based features like MFCC or PLP are typically relied upon by automatic speech recognition systems. These features are derived from existing knowledge in speech perception and production. However, recent advancements have showcased the effectiveness of convolutional neural networks in estimating phoneme probabilities solely based on raw speech signals. This data-driven approach has proven to be at least as effective, if not more so, than systems based on HMM/ANN with fewer parameters. Inspired by these findings, we explore the integration of a simple linear classifier within the CNN framework. This design allows the network to learn linearly separable features directly from the raw speech data. Our experiments demonstrate that this approach can achieve comparable, if not superior, performance compared to traditional MLP systems utilizing cepstral-based features.",
    "The neural-network training framework used in the Kaldi speech recognition toolkit is designed for training DNNs with large amounts of data on multiple GPU-equipped or multi-core machines. To be hardware-agnostic, the framework employs a method that involves averaging neural network parameters periodically and redistributing them to machines for further training. This method enables the use of multiple machines without generating excessive network traffic. Additionally, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD) is used to improve the convergence of SGD on a single machine and enhance the effectiveness of the periodic-averaging method.",
    "We have created a new way to visualize and improve how learned representations handle invariances. Our method focuses on a particular type of invariance called linearization, where a transformation's effect is limited to a lower-dimensional space. By analyzing two different reference images and creating a sequence of images between them that is the shortest path in the representation space (a \"representational geodesic\"), we can evaluate if the learned representation properly captures the transformation between the images. Our findings show that a cutting-edge image classification network fails to represent translations, rotations, and dilations accurately, but we present a solution that improves the representation's ability to handle a variety of geometric transformations.",
    "Absolutely! Why does Deep Learning truly shine? What wondrous representations does it uncover? And how do those magnificent higher-order representations come into being? These burning questions are meticulously explored through the captivating lens of group theory, shedding new light on the remarkable world of Deep Learning. A key catalyst fueling the excitement in this field is the ingenious concept of {\\em pretraining}, where each layer unveils its treasures step by step by seeking the perfect generative model for the input samples. The profound implications of this seemingly simple idea are unearthed as we delve into the exquisite dance between orbits and stabilizers of group actions. While neural networks themselves may not conform to traditional group structures, the revelation of {\\em shadow} groups offers a dazzling glimpse into their interconnected world.\n\nIn this mesmerizing realm of shadow groups, the pretraining process transforms from a humble initialization ritual into a stirring quest for features with the most elegant orbits. These features, in their pure simplicity, come forth as the foundation stones upon which deep learning networks build their knowledge. It's no wonder then that these networks gravitate towards learning these fundamental features first. As we venture deeper into the layers, the same enchanting principle unfolds, mapping out the intricate landscape of higher-order representations. Witness how the complexity of representations blossoms with each layer, painting a vivid picture of discovery as we journey through the depths of Deep Learning. 🚀🌟",
    "Introducing the dynamic \"stacked what-where auto-encoders\" (SWWAE)! This innovative architecture combines discriminative and generative pathways, offering an all-encompassing strategy for supervised, semi-supervised, and unsupervised learning - all without the need for sampling during training! In the SWWAE framework, a Convolutional Net (Convnet) is used to encode the input, while a Deconvolutional Net (Deconvnet) is employed to generate the reconstruction. The objective function includes reconstruction terms that encourage the hidden states in the Deconvnet to closely resemble those of the Convnet. At each pooling layer, two sets of variables are produced: the \"what\" variables passed to the next layer, and the complementary \"where\" variables directed to the corresponding layer in the generative decoder. Let the SWWAE unleash the power of learning without limits!",
    "We study how to create word embeddings that are specialized for a specific type of word relationship. Our algorithm modifies an existing set of word vectors to make the new embeddings effective for predicting a certain type of word relationship. Our experiments demonstrate that using specialized embeddings can improve both the accuracy and speed of word prediction tasks.",
    "A sophisticated generative model has been devised for deep convolutional dictionary learning in multiple layers. Intriguingly, a novel probabilistic pooling operation has been seamlessly incorporated into the model, resulting in both efficient bottom-up (pretraining) and top-down (refinement) probabilistic learning. Empirical findings showcase the model's remarkable capacity for acquiring multi-layer features from images, leading to exceptional classification outcomes on the renowned MNIST and Caltech 101 datasets.",
    "Motivated by the advancements in generative modeling techniques, we present a novel model designed to create images based on textual descriptions. Our innovative approach involves generating images through an iterative patch-drawing process on a canvas, focusing on pertinent words in the input description. By training our model on the Microsoft COCO dataset, we conduct a comprehensive comparison with various baseline generative models across image generation and retrieval tasks. The results showcase the superior quality of our model's generated samples compared to existing methods, emphasizing its ability to produce unique image compositions aligning with previously unencountered caption descriptions in the dataset.",
    "Convolutional neural networks (CNNs) are effective on large datasets, but collecting labeled data can be difficult, and sometimes there is a lack of a sufficient amount of data for certain applications. The challenge lies in applying CNNs to small datasets because they tend to overfit quickly. Our solution is an efficient Bayesian CNN that provides better protection against overfitting on small datasets compared to traditional methods. This is achieved by incorporating a probability distribution over the CNN's kernels. We use Bernoulli variational distributions to approximate the model's complex posterior, without needing extra model parameters. From a theoretical perspective, we frame dropout network training as a form of inference in Bayesian neural networks. This allows us to implement our model using existing deep learning tools without increasing time complexity, while also uncovering a limitation in the field. Our approach demonstrates a significant enhancement in classification accuracy when compared to standard techniques and surpasses previously published state-of-the-art results for CIFAR-10.",
    "We present a novel approach for developing efficient convolutional neural networks (CNNs) using low-rank representations of convolutional filters. Instead of approximating filters in existing networks, we generate a set of compact basis filters from scratch. Through training, the network learns to combine these basis filters to create more complex and discriminative filters for image classification. Our method involves a unique weight initialization scheme that ensures effective initialization of connection weights in convolutional layers with various filter shapes.\n\nWe have successfully applied our approach to multiple CNN architectures and trained them from the ground up on datasets like CIFAR, ILSVRC, and MIT Places. Our results indicate similar or better accuracy compared to traditional CNNs, while significantly reducing computational requirements. For instance, enhancing the VGG-11 network with global max-pooling resulted in achieving similar validation accuracy with 41% less compute and only 24% of the original model parameters.\n\nMoreover, a refined version of our method improved the top-5 center-crop validation accuracy of the VGG-11 model to 89.7% while decreasing computation by 16%. Applying our technique to the GoogLeNet architecture yielded comparable accuracy with 26% less compute and 41% fewer parameters. Finally, we achieved similar accuracy improvements on a near state-of-the-art network for CIFAR by reducing compute by 46% and parameters by 55%.",
    "Distributed word representations have revolutionized various Natural Language Processing tasks, significantly enhancing performance. However, the conventional approach of obtaining only one representation per word fails to address the complexity of words with multiple meanings. This limitation adversely impacts both individual word representations and the overall language model. In our paper, we introduce a straightforward model that leverages cutting-edge techniques in word vector construction to capture different senses of polysemic words. Through our evaluation, we demonstrate how this model excels at distinguishing between various meanings of words efficiently.",
    "Introducing DENN, the revolutionary Diverse Embedding Neural Network designed to supercharge language models (LMs)! Unlike traditional LMs, a DENNLM switches things up by mapping input word history vectors onto multiple unique low-dimensional spaces, rather than confining them to a single high-dimensional realm. By promoting this diversity in sub-spaces during training using an enriched loss function, DENN opens up new possibilities for language modeling. Exciting experiments on the Penn Treebank dataset reveal the remarkable performance boost that a DENNLM can bring to the table.",
    "In the realm of Collaborative Filtering (CF), a classic method for predicting user ratings on items involves employing Matrix Factorization techniques. By deriving representations for users and items based on existing ratings, predictions can be made. However, these methods face a challenge when new users join without any previous ratings, referred to as the \"user cold-start\" issue. To address this, one common strategy is to request these new users to provide a few initial ratings. This study introduces a novel model that not only addresses this dual challenge of identifying pertinent questions for new users and constructing effective representations from limited data but also extends to traditional scenarios. The effectiveness of our model is demonstrated on multiple datasets, showcasing its ability to enhance performance in both standard and cold-start scenarios.",
    "We introduce NICE, a deep learning framework for modeling complex high-dimensional densities. NICE aims to learn a non-linear deterministic transformation of the data to map it to a latent space, resulting in independent latent variables. We use simple building blocks based on deep neural networks to learn complex transformations easily. The training criterion is the exact log-likelihood, making it tractable, with easy ancestral sampling. Our approach produces high-quality generative models for image datasets and can be used for inpainting.",
    "We are pleased to introduce Deep Linear Discriminant Analysis (DeepLDA), a cutting-edge approach that harnesses deep neural networks to learn separable latent representations in a seamless manner. Unlike traditional LDA techniques that focus on preserving class separability for dimensionality reduction in classification tasks, DeepLDA combines the power of LDA with the depth of neural networks to create a nonlinear extension of this classic method. Our innovative objective function encourages the network to generate feature distributions with low intra-class variance and high inter-class variance, based on the foundational LDA eigenvalue problem. Notably, our approach supports training through stochastic gradient descent and back-propagation, ensuring efficiency and effectiveness. Through rigorous evaluation on popular datasets like MNIST, CIFAR-10, and STL-10, DeepLDA has demonstrated competitive performance on MNIST and CIFAR-10, while surpassing a network trained with categorical cross entropy in a supervised setting on STL-10, showcasing its superior capabilities.",
    "The Layer-sequential unit-variance (LSUV) initialization is a straightforward method proposed for weight initialization in deep neural network learning. The method involves two steps: Firstly, initializing the weights of each convolution or inner-product layer with orthonormal matrices, followed by normalizing the output variance of each layer to one as you progress from the first to the final layer.\n\nExperimental results with various activation functions (such as maxout, ReLU, tanh) demonstrate that this initialization strategy allows for effective learning in deep networks. Specifically, it enables the training of deep networks with test accuracy that matches or surpasses standard methods. Furthermore, it achieves comparable speed to more complex initialization schemes tailored for very deep networks like FitNets and Highway.\n\nThe performance evaluation conducted on networks like GoogLeNet, CaffeNet, FitNets, and Residual nets reveals that LSUV initialization achieves state-of-the-art results or comes very close to it on popular datasets such as MNIST, CIFAR-10/100, and ImageNet.",
    "In this innovative approach, we present a parametric nonlinear transformation designed specifically to Gaussianize data extracted from natural images. The procedure involves linear transformation of the data, followed by normalization of each component through a combined measure of activity. This measure is computed by exponentiating a weighted sum of rectified and exponentiated elements along with a constant term. The transformation parameters, encompassing the linear transform, exponents, weights, and constant, are optimized across a natural image database. The optimization process directly targets the reduction of response negentropy. The resulting optimized transformation effectively Gaussianizes the data, significantly lowering mutual information between transformed components compared to traditional methods such as ICA and radial Gaussianization. Notably, the transformation is differentiable, enabling efficient inversion and establishment of a density model for images. Samples generated from this model closely resemble natural image patches. The model can also serve as a prior probability density for noise removal purposes. Furthermore, cascading the transformation by optimizing each layer with the Gaussianization objective offers a self-learning approach for refining deep network architectures without supervision.",
    "We introduce a new type of convolutional neural networks that are optimized for quick processing. Researchers have studied the redundancy of parameters in convolutional neural networks, specifically the weights of convolutional filters, and have developed various techniques to create a more efficient set of filters after training. In this study, we trained flattened networks using a series of one-dimensional filters in all directions in a 3D space to achieve similar performance to traditional convolutional networks. Testing on various datasets showed that the flattened layer can effectively replace 3D filters without sacrificing accuracy. The flattened convolution pipelines result in approximately twice the speed improvement during processing compared to the standard model due to a significant reduction in learning parameters. Additionally, this method eliminates the need for manual adjustments or post-processing after training.",
    "In this paper, we introduce a new deep learning framework called Purine. Purine represents a deep network as a structure called a bipartite graph (bi-graph), which is made up of connected operators and data tensors. Using this bi-graph concept makes it easy to solve networks with an event-driven task dispatcher. We show that various parallelism methods over GPUs and/or CPUs on one or multiple computers can be implemented universally through graph composition. This simplifies the need for researchers to code different parallelization techniques, and the same dispatcher can handle different types of graphs. By coordinating memory transfers with computations, we can significantly reduce communication delays, leading to faster processing times.",
    "The paper introduces the Variational Recurrent Auto-Encoder (VRAE) model as a fusion of the strengths of RNNs and SGVB. This model enables efficient unsupervised learning on time series data at scale by transforming the data into a latent vector representation. VRAE is a generative model, allowing the generation of data based on latent space samples. A key aspect of this research is the model's capacity to incorporate unlabeled data to assist in the supervised training of RNNs by initializing weights and network state.",
    "This paper promotes the concept of using density-based distributed embeddings by mapping words to Gaussian distributions rather than point vectors. This method offers several advantages, such as capturing uncertainty and asymmetries more effectively, as well as allowing for more expressive decision boundaries. Our study includes performance comparisons on word embedding benchmarks, explores the ability to model entailment and other asymmetric relationships, and delves into unique properties of these representations.",
    "Multipliers are the most resource-intensive arithmetic operators within the digital implementation of deep neural networks. Our research involves the training of cutting-edge neural networks, specifically Maxout networks, on three standardized datasets: MNIST, CIFAR-10, and SVHN. These networks are trained using three distinct numeric formats: floating point, fixed point, and dynamic fixed point. Our evaluation focuses on analyzing the impact of multiplication precision on the final error post-training for each dataset and format. Surprisingly, our findings indicate that significantly lower precision is not only adequate for executing trained networks but also for training them. Notably, our results show the feasibility of training Maxout networks using only 10-bit multiplications.",
    "Innovative approach alert! Our proposed method shakes up the traditional ways of semantic segmentation by utilizing Multiple Instance Learning (MIL) to slash annotation costs. Say goodbye to extensive supervision requirements! Our cutting-edge MIL formulation for multi-class semantic segmentation harnesses the power of a fully convolutional network. With just weak image-level labels, we embark on the journey of training a semantic segmentation model. This model undergoes end-to-end training, optimizing the representation and untangling the intricate pixel-image label connections. Imagine - no need for object proposal pre-processing and the luxury of accepting inputs of any size. The cherry on top? A pixelwise loss map to uncover those hidden instances. Our unique multi-class MIL loss seizes the additional supervision cues provided by images with multiple labels. Watch out, PASCAL VOC segmentation challenge – we're coming for you!",
    "Nested dropout presents an innovative approach to organizing representation units in autoencoders based on information content, all while maintaining reconstruction accuracy. Although initially utilized in unsupervised fully-connected autoencoders, we delve into its potential within the convolutional layers of CNNs trained via backpropagation. Our aim is to uncover if nested dropout can serve as a straightforward method to identify the ideal representation size for achieving desired accuracy in the context of various tasks and data complexities.",
    "This passage discusses the importance of stochastic gradient algorithms in large-scale learning problems and their success in machine learning. The convergence of these algorithms relies on selecting an appropriate learning rate and managing the noise in stochastic gradient estimates. The paper introduces a novel adaptive learning rate algorithm that uses curvature information to adjust learning rates automatically. By estimating the curvature of the loss function from local statistical properties of the stochastic gradients, the algorithm aims to improve convergence speed. Additionally, a variance reduction technique is proposed to enhance convergence rate. Preliminary experiments with deep neural networks show improved performance compared to traditional stochastic gradient algorithms.",
    "Enhance the readability of this text: When a three-dimensional object moves in relation to an observer, changes are observed on the image plane and in the visual output generated by a learned model. Our exploration centers around the concept that a high-quality visual representation will react consistently to shifts in the scene. By drawing on group representation theory, we prove that any such representation can be expressed as a blend of fundamental, irreducible components. We highlight a fascinating link between irreducibility and the statistical structure of the representation, revealing that irreducible components exhibit independence under specific circumstances. In scenarios where observation is incomplete - such as when a scene is projected onto the image plane - the motion group's effect on image space is not straightforward. This scenario necessitates working with a latent representation that transforms linearly to make accurate inferences. This concept is illustrated through a model involving rotating NORB objects, which utilizes a latent representation of the non-commutative 3D rotation group SO(3).",
    "This study compares different approaches for solving the Efficient Maximum Inner Product Search (MIPS) task, which is widely used in recommendation systems and classification. While previous research has explored solutions based on locality-sensitive hashing (LSH) and tree-based methods for approximate MIPS, this paper proposes a simpler approach using variants of the k-means clustering algorithm. By training a spherical k-means model after transforming the problem into a Maximum Cosine Similarity Search (MCSS), the study demonstrates that this straightforward method achieves significantly faster retrieval speeds with the same level of precision compared to existing hashing and tree-based methods. Additionally, the results indicate that this approach provides more robust retrievals in scenarios where the query data is noisy.",
    "A new generative model called the importance weighted autoencoder (IWAE) has been introduced as an alternative to the variational autoencoder (VAE). While both models share a similar architecture, the IWAE offers a strictly tighter log-likelihood lower bound by utilizing importance weighting. By employing multiple samples in the recognition network to approximate the posterior, the IWAE surpasses the VAE's assumptions and exhibits enhanced flexibility in modeling complex posteriors. Empirical evidence demonstrates that IWAEs are capable of capturing more intricate latent space representations than VAEs, resulting in superior test log-likelihood performance in density estimation tasks.",
    "This study explores the impact of using reduced precision data in Convolutional Neural Networks (CNNs) on network accuracy in classification tasks. Specifically, we investigate networks where each layer may utilize different precision data. Our main finding is that the tolerance of CNNs to reduced precision data not only varies across networks as commonly understood, but also within networks. Customizing precision at the layer level is promising as it has the potential to enhance energy efficiency and performance. This paper examines the variation in error tolerance across layers and proposes a method for determining a low precision setup for a network while preserving high accuracy. Through an analysis of a diverse range of CNNs, we demonstrate that compared to a standard setup using a 32-bit floating-point representation for all layers, it is possible to reduce the data footprint of these networks by an average of 74% and up to 92% with less than 1% loss in relative accuracy.",
    "Graph-based semi-supervised algorithms' efficiency depends on the underlying instance graph. Instances are initially represented in vector form before being linked in the graph. The graph construction uses a metric in the vector space to determine connection weights, typically using a distance or similarity measure based on the Euclidean norm. However, the Euclidean norm may not always be the best choice for optimal performance. Our proposed algorithm focuses on learning the most suitable vector representation to create an efficient solving graph for the task.",
    "This paper proposes explicitly modeling a visual-semantic hierarchy encompassing hypernymy, textual entailment, and image captioning as special cases. We introduce a method for learning ordered representations, demonstrating its effectiveness in various image and language tasks. Our approach enhances hypernym prediction and image-caption retrieval performance compared to existing methods.",
    "Introducing local distributional smoothness (LDS), a novel concept enhancing statistical models by infusing smoothness within the model's distribution. Our proposed LDS, termed virtual adversarial training (VAT), advocates a unique regularization approach. Unlike conventional adversarial training, VAT extracts the adversarial direction solely from the model's distribution without relying on label information, widening its application to semi-supervised learning contexts. Formulating the adversarial direction requires minimal computational expense - for instance, in a neural network, calculating the approximated gradient of the LDS demands no more than three pairs of forward and back propagations. Evaluating this methodology on MNIST, SVHN, and NORB datasets demonstrated superior performance over existing training techniques, falling second only to a cutting-edge generative model.",
    "Convolutional Network models have achieved remarkable recognition results with the help of large labeled datasets. However, manual annotation of data is often unfeasible in many scenarios. In our work, we investigate the effectiveness of discriminatively-trained Convnets when utilizing noisy labels, where each image is associated with a potentially inaccurate freely available label. We propose the incorporation of an additional noise layer in the network to align the network outputs with the noisy label distribution. The parameters of this noise layer can be learned during the training process with minor adjustments to existing deep learning frameworks. We present our methodology across various datasets, incorporating large-scale experiments on the ImageNet classification benchmark.",
    "We offer innovative guaranteed methods for training feedforward neural networks with sparse connectivity. By building on prior work for training linear networks, we demonstrate the adaptability of these techniques for training non-linear networks. Our approach focuses on the moments related to the input's score function and label, demonstrating that their factorization reliably produces the weight matrix of the initial layer in a deep network under favorable conditions. In application, our method can serve as efficient initializers for gradient descent.",
    "Transformed text:\n\n\"Connecting smaller linguistic units into cohesive texts through discourse relations poses a fascinating challenge. Understanding the underlying semantics of linked sentences is essential, yet identifying these relations automatically is often complex. An additional layer of intricacy lies in capturing not just the meaning of individual sentences within these relations, but also accounting for the connections between lower-level elements like entity mentions. Our innovative approach involves deriving distributional meaning representations by building them up through the syntactic parse tree. What sets our method apart is the inclusion of entity mentions in the computation process, achieved through a unique downward compositional pass. By considering not only the distributional representations of sentences but also those of their coreferent entity mentions, our system has demonstrated vast improvements over existing models in predicting implicit discourse relations within the Penn Discourse Treebank.\"",
    "In this groundbreaking work, we introduce an innovative approach that combines two cutting-edge research trends: autonomous discovery of surface meanings like semantic roles and breaking down relationships in text and databases. Our cutting-edge model comprises two crucial components: 1) an encoding system: a sophisticated semantic role labeling model that anticipates roles using a vast set of syntactic and lexical features; 2) a reconstruction system: a tensor factorization model that uses roles to forecast argument fillers. By jointly refining these components to minimize mistakes in argument reconstruction, our method generates roles that closely resemble those in well-established resources. Our technique competes head-to-head with the most precise role creation strategies in English, even surpassing them without relying on any predetermined language specifics.",
    "The concept of a metric is crucial in machine learning tasks like classification, clustering, and ranking. Despite the lack of solid theoretical guarantees on a classifier's ability to generalize using a specific metric, the $(\\epsilon, \\gamma, \\tau)$-good similarity function framework (Balcan et al., 2008) tried to establish a connection between similarity function properties and linear classifier performance. This paper expands on this theory by introducing a new generalization bound for the classifier, incorporating the algorithmic robustness framework.",
    "Introducing the multiplicative recurrent neural network – a dynamic model that excels at capturing the essence of language and its nuanced expressions. Our exploration focuses on its application in fine-grained sentiment analysis, highlighting its versatility and prowess. By drawing parallels to matrix-space models and unveiling the unique approach of the multiplicative recurrent net, we demonstrate its superiority over Elman-type networks and matrix-space models in sentiment analysis tasks. Our results showcase the model's impressive capabilities, achieving comparable performance to structural deep models on the esteemed Stanford Sentiment Treebank, all without the complexities of parsing trees.",
    "Searching for the lowest points of a real-valued non-convex function in a vast dimensional space presents a formidable challenge in the realm of science. We unveil compelling evidence suggesting that certain functions, operating within expansive domains, exhibit a slender range of values encompassing the majority of their critical points. This dynamic sharply contrasts with the scenario seen in lower dimensions, where this range is noticeably broad. Our simulations are aligned with existing theoretical frameworks on spin glasses, affirming the emergence of such a distinctive range as the dimensionality approaches infinity. In addition, our experiments involving teacher-student networks utilizing the MNIST dataset vividly demonstrate a comparable trend in deep networks. Notably, both gradient descent and stochastic gradient descent methodologies are capable of achieving this milestone within an equivalent number of steps.",
    "We have developed a new statistical model for photographic images that describes the local responses of a set of linear filters as jointly Gaussian, with zero mean and a slowly varying covariance across the image. By optimizing the filters to reduce the nuclear norms of their local activations, we promote a flexible kind of sparsity that is not dependent on a specific dictionary or coordinate system. These optimized filters are oriented and bandpass, exhibiting significant local correlation in their responses. We demonstrate that images can be almost perfectly reconstructed using only estimates of the local filter response covariances, with minimal degradation in quality compared to low-rank approximations. This approach shows great potential for tasks such as denoising, compression, and texture representation, and could serve as a useful foundation for hierarchical decomposition methods.",
    "Most modern convolutional neural networks (CNNs) used for object recognition typically consist of alternating convolution and max-pooling layers followed by a few fully connected layers. In this study, we reassess the current state of object recognition from small images using convolutional networks, exploring the necessity of individual components in the architecture. Our investigation reveals that max-pooling can be effectively substituted by a convolutional layer with an increased stride without compromising accuracy on various image recognition benchmarks. Leveraging this insight, along with recent advancements in simplifying network structures, we propose a novel architecture composed exclusively of convolutional layers. This architecture demonstrates competitive or even superior performance on multiple object recognition datasets such as CIFAR-10, CIFAR-100, and ImageNet. To scrutinize the network, we introduce an innovative variation of the \"deconvolution approach\" for visualizing features learned by CNNs, extending its applicability to a broader spectrum of network configurations compared to existing methods.",
    "We created a new type of activation function for neural networks called a piecewise linear function. This function is customized for each neuron and learned using gradient descent. By using this adaptive activation function, we were able to enhance the performance of deep neural network models compared to traditional static activation functions like rectified linear units. This improvement led to achieving top-notch results on various datasets, including CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark related to Higgs boson decay modes in high-energy physics.",
    "This study presents a greedy parser constructed on neural networks, employing an innovative compositional sub-tree representation. The greedy parser, along with the compositional process, undergo joint training and are inherently interdependent. The compositional process yields a compact vector representation that encapsulates both syntactic (parsing tags) and semantic (words) sub-trees. Composition and tagging are accomplished through continuous representations (of words or tags) and recurrent neural networks. This approach attains an F1 performance comparable to prominent existing parsers, while offering a notable speed advantage due to the parser's inherently greedy nature. A fully operational implementation of the presented method is made available.",
    "Revised phrasing: Adequate lateral connections have been demonstrated to facilitate higher layers of a denoising autoencoder (dAE) in focusing on invariant representations. Unlike conventional autoencoders, where detailed information must be processed through the top layers, the inclusion of lateral connections from the encoder to the decoder alleviates this requirement. It has been established that abstract invariant features can be transformed into detailed reconstructions by allowing invariant features to modulate the strength of these lateral connections. Experimental comparisons involving three dAE structures with modulated and additive lateral connections, as well as no lateral connections, were conducted using real-world images. The results of these experiments confirm that augmenting modulated lateral connections to the model serves to: 1) enhance the accuracy of the probability model for inputs, as indicated by improved denoising performance; 2) accelerate the progression towards higher layers in the development of invariant representations; and 3) foster the creation of varied invariant poolings.",
    "We introduce a novel approach to visualize and fine-tune the invariances learned by representations. Our method, based on linearization, tests for a specific form of invariance where the effect of a transformation is limited to a lower-dimensional subspace. By synthesizing a sequence of images between two reference images through a representational geodesic, we aim to demonstrate the efficacy of this method. We apply this technique to evaluate the invariance characteristics of a leading image classification network, identifying issues with transformations such as translation, rotation, and dilation. We propose a solution to address these shortcomings, which successfully enables the representation to linearize various geometric image transformations.",
    "Genomics are revolutionizing medical practice and research by offering invaluable insights into disease mechanisms and enhancing therapeutic strategies, especially in cancer. Predicting patient outcomes through complex genomic profiling is crucial for advancing genomic medicine, though it poses challenges for current survival analysis methods. Our study focuses on utilizing neural networks to learn genomic representations for predicting cancer patient survival, showcasing superior performance compared to traditional survival analysis techniques using brain tumor data.",
    "Current methods that integrate both additive and multiplicative neural units rely on predefined operations or necessitate discrete optimizations, resulting in heightened computational complexity during training. In this study, we introduce a new transfer function, grounded in non-integer functional iteration, that can dynamically adjust each neuron's operation from addition to multiplication in a smooth and differentiable manner. This innovative approach seamlessly incorporates the choice between addition and multiplication into the standard backpropagation training process.",
    "One challenge in training deep neural networks stems from inadequate scaling across layers, leading to problems such as exploding gradients. Traditionally, these scaling issues have been mitigated through meticulous scale-conserving initialization. In this study, we explore how preserving scale, or isometry, can be advantageous not only at initialization but throughout training. Introducing two methods to uphold isometry – one exact and one stochastic – our initial experiments reveal that both determinant and scale-normalization techniques can significantly enhance the learning process. Our findings indicate that prioritizing isometry early on and throughout training accelerates the learning process.",
    "We have upgraded Stochastic Gradient Variational Bayes to reveal the mysteries hidden in the weights of Stick-Breaking processes. Our breakthrough leads to the creation of a revolutionary Stick-Breaking Variational Autoencoder (SB-VAE), a Bayesian nonparametric marvel with a latent representation that dynamically shifts in size. Through our experiments, we've proven that the SB-VAE, along with its semi-supervised version, uncover remarkably powerful latent structures that frequently surpass the capabilities of traditional Gaussian VAEs.",
    "Unleash the power of unsupervised learning on imbalanced data with our cutting-edge approach! Conquering the challenge of imbalanced data, our innovative latent variable model breaks new ground by partitioning the latent space into a shared space and a private space. Leveraging the latest advancements in Gaussian Process Latent Variable Models, we introduce a revolutionary kernel formulation that not only segregates the latent space but also unleashes an efficient variational inference method. Witness the exceptional performance of our model as it outshines traditional approaches when applied to an imbalanced medical image dataset.",
    "GANs are deep generative models based on a two-player minimax game. To improve learning of the generator, we introduce an algorithm that repeats density ratio estimation and f-divergence minimization. This approach provides a fresh perspective on GANs and leverages insights from research on density ratio estimation, focusing on stable divergences and useful relative density ratios.",
    "The intersection of natural language processing (NLP) and cheminformatics is highlighted in this paper, discussing the application of NLP methods in classification problems. By analyzing the standard textual representation of compounds known as SMILES, a connection between these two diverse fields is established. The study focuses on predicting activity against a target protein, a key aspect of computer-aided drug design. The experiments conducted demonstrate the capability to surpass the performance of manually crafted representations and provide valuable structural insights into the decision-making process.",
    "We present a neural network design and a training method to generate factorized symbolic representations. Our approach involves learning these concepts through sequential frame observations, where most of the hidden representation components are predicted from the previous frame except for a few discrete gating units. These gating units capture the variations in the subsequent frame, representing symbolic concepts. We showcase the effectiveness of our method on datasets featuring facial transformations in 3D and Atari 2600 games.",
    "We examine the eigenvalues of the Hessian of a loss function both before and after training. The eigenvalue distribution consists of two main components: the bulk, which is mostly centered around zero, and the edges, which are spread out further from zero. Our findings provide empirical support for the bulk representing the degree of over-parameterization in the system, while the edges are influenced by the input data.",
    "We present a parametric nonlinear transformation designed specifically for the Gaussianization of data extracted from natural images. The process involves the linear transformation of the data, followed by normalization of each component using a pooled activity measure. This measure is determined by exponentiating a weighted sum of rectified and exponentiated components along with a constant term. The parameters of the entire transformation (including the linear transform, exponents, weights, and constant) are optimized using a database of natural images by directly minimizing the negentropy of the responses.\n\nThe optimized transformation effectively Gaussianizes the data, resulting in a notable reduction in the mutual information between transformed components compared to other methods like ICA and radial Gaussianization. This transformation, which is differentiable and can be easily inverted, establishes a density model on images. We provide evidence that samples generated by this model closely resemble natural image patches.\n\nWe showcase the utility of this model as a prior probability density that proves valuable in filtering out additive noise. Furthermore, we illustrate that the transformation can be stacked in a cascading manner, with each layer optimized using the Gaussianization objective. This approach offers an unsupervised means of refining a deep network architecture.",
    "Approximate variational inference is a strong method for modeling intricate probability distributions. Recent developments enable the learning of probabilistic models for sequences that make use of spatial and temporal patterns. We utilize a Stochastic Recurrent Network (STORN) to analyze time series data from robots. Our assessment confirms the capability to effectively identify anomalies in real-time as well as offline.",
    "We introduce a universal scenario to evaluate agents' efficiency in gathering information through training and testing. Our framework involves tasks where agents must explore partially observed environments to find information fragments that can be assembled to achieve different objectives. By utilizing deep architectures and reinforcement learning methods, we design agents capable of solving these tasks. We guide the agents' behavior by incorporating both external and internal rewards. Our experiments show that these agents learn to actively and intelligently search for new information to minimize uncertainty and effectively utilize acquired information.",
    "We introduce a new approach to enhance neural network language models by incorporating recent history into their predictions. Our model, inspired by memory augmented networks, stores past hidden activations in memory and retrieves them using a dot product with the current hidden activation. This efficient mechanism can handle large memory sizes effectively. Additionally, we establish a connection between using external memory in neural networks and cache models used in count-based language models. Experimenting with various language model datasets, we show that our method outperforms recent memory augmented networks significantly.",
    "Inspired by the advancements in generative models, we present a novel approach aimed at creating images based on natural language prompts. Our innovative model works by iteratively sketching patches on a canvas while focusing on key terms in the description. By training it on the Microsoft COCO dataset, we scrutinize its performance against various existing generative models across image creation and retrieval assignments. We showcase that our model excels in producing superior quality images compared to conventional methods, offering fresh scene compositions that align with previously unencountered captions in the dataset.",
    "We've come up with a cool way to train several neural networks at the same time! Basically, we use the tensor trace norm to make sure the networks share parameters, which helps them learn from each other. Unlike other models, we don't decide in advance which layers share parameters. Our framework figures out the best sharing strategy based on the data.",
    "This groundbreaking paper introduces an actor-critic deep reinforcement learning agent equipped with experience replay, setting a new standard for stability, efficiency, and exceptional performance across complex environments. Notably, the agent excels in the rigorous Atari domain, comprising 57 games, as well as various continuous control challenges. To achieve this superior performance, the paper pioneers novel strategies such as truncated importance sampling with bias correction, innovative stochastic dueling network architectures, and a cutting-edge trust region policy optimization technique.",
    "We introduce an innovative approach to creating pop music using a hierarchical Recurrent Neural Network. The hierarchy of layers in our model is designed to incorporate our understanding of the composition of pop music. The lower layers focus on generating melodies, while the upper layers create drum patterns and chords. Through human studies, we demonstrate that music generated using our framework is preferred over that produced by Google's recent method. Furthermore, we showcase two applications of our model: neural dancing and karaoke, as well as neural story singing.",
    "Several machine learning classifiers are susceptible to adversarial perturbations, which are alterations made to inputs in order to alter a classifier's output without noticeably changing the input to human observers. In this study, three techniques are utilized to identify adversarial images. Adversaries seeking to evade detection need to reduce the unusual characteristics of the adversarial images, otherwise, their attempts will be unsuccessful. The primary detection method highlights that adversarial images demonstrate an atypical focus on lower-ranked principal components derived from Principal Component Analysis (PCA). Additional detection methods, along with a detailed saliency map, are provided in an appendix.",
    "We introduce an innovative approach to crafting efficient convolutional neural networks (CNNs) - by leveraging low-rank representations of the convolutional filters. Instead of simply approximating pre-existing filters with more efficient versions, we make strides by initiating a new wave - training a collection of small basis filters from the ground up. As the network undergoes training, it learns to amalgamate these base filters into intricate, superior filters that shine in the realm of image classification. Enter our novel weight initialization scheme, paving the way for effective commencement of connection weights within convolutional layers consisting of divergent filter shapes. Our methodology has undergone rigorous validation, spanning various CNN architectures trained from the ground up utilizing CIFAR, ILSVRC, and MIT Places datasets.\n\nThe fruits of our labor reveal results that not only match but often surpass the accuracy of conventional CNNs, all while consuming significantly lesser computational resources. Applying our approach to an enhanced rendition of the VGG-11 network, integrated with global max-pooling, we attain a comparable validation accuracy while cutting compute usage by a substantial 41% and shrinking the model parameters to just a quarter of the original VGG-11's. A refined version of our technique boasts a 1 percentage point accuracy hike over the improved VGG-11, culminating in a top-notch center-crop validation accuracy of 89.7% - all while reducing computation by 16% compared to the original VGG-11 model. Transitioning to the GoogLeNet architecture in ILSVRC, we secure similar accuracy metrics with a sizable 26% reduction in compute resources and a 41% decrease in model parameters. Embarking on a high-performance network tailored for CIFAR, we achieve akin accuracy levels with a substantial 46% drop in computational load and a 55% reduction in parameters.",
    "Introducing Layer-Sequential Unit-Variance (LSUV) Initialization - an innovative approach to weight initialization for deep neural network learning. This method guarantees high performance by pre-initializing weights with orthonormal matrices and normalizing output variance. Experimentation with various activation functions proves that this strategy enables the training of ultra-deep networks with exceptional accuracy and speed. LSUV outperforms standard techniques and is on par with sophisticated methods tailored for very deep nets like FitNets and Highway. Impressive results are observed across popular architectures and datasets, including GoogLeNet, CaffeNet, FitNets, and Residual nets, achieving state-of-the-art or near-state-of-the-art performance on MNIST, CIFAR-10/100, and ImageNet datasets.",
    "This paper improves upon Kiperwasser & Goldberg (2016) by utilizing neural attention in a more sophisticated graph-based dependency parser with biaffine classifiers. Our parser achieves top-tier performance on standard treebanks for six languages, including 95.7% UAS and 94.1% LAS on the English PTB dataset. This makes it the highest-performing parser on this benchmark, surpassing Kiperwasser & Goldberg (2016) by 1.8% and 2.2%. Our parser is also comparable to the best transition-based parser, achieving 95.8% UAS and 94.6% LAS. We identify key hyperparameters that greatly impact parsing accuracy, leading to significant improvements over other graph-based parsers.",
    "It's super important for machines to understand both the obvious and hidden connections in data if they want to tackle tougher and more abstract challenges. Our Dynamic Adaptive Network Intelligence (DANI) model is great at learning these kinds of inferences with just a little bit of supervision. We've seen awesome results using DANI for question answering in the bAbI dataset, which has been a tough nut to crack for other methods (Weston et al., 2015).",
    "Spherical data can be found in many applications and it plays a crucial role in various scenarios. By modeling the discretized sphere as a graph, we are able to handle non-uniformly distributed, partial, and changing samplings with ease. Additionally, graph convolutions offer a computational advantage over spherical convolutions. To further enhance the model's performance by exploiting rotational symmetries, we explore rotation equivariance using the graph neural network proposed by Defferrard et al. (2016). Our experiments demonstrate promising results in addressing rotation-invariant learning tasks. You can access the code and examples on GitHub at https://github.com/SwissDataScienceCenter/DeepSphere.",
    "The widespread use of Convolutional Neural Networks (CNNs) is limited by their high computational complexity, particularly on mobile devices. Hardware accelerators offer a promising solution to decrease both execution time and power consumption. Developing accelerators involves crucial steps in hardware-oriented model approximation. This study introduces Ristretto, a framework that assesses a CNN by examining the numerical resolution employed to represent weights and outputs of convolutional and fully connected layers. Ristretto can simplify models by utilizing fixed-point arithmetic and representation rather than floating point. Additionally, Ristretto refines the resulting fixed-point network. With a maximum error tolerance of 1%, Ristretto effectively simplifies CaffeNet and SqueezeNet to 8-bit. The Ristretto code is accessible for use.",
    "Let's talk about how different painting styles give us a wide range of ways to create images. By understanding and capturing these styles effectively, we can really grasp the essence of paintings and artwork in general. In this study, we're looking into making a deep network that can easily capture a painting's artistic style. We found that this network can work with various painting styles thanks to a cool technique that puts each painting on a map in a certain space. With this model, users can mix and match painting styles to create something new and unique. This research is a step forward in creating detailed models of paintings and understanding the patterns in artistic styles.",
    "Sum-Product Networks (SPNs) are a class of hierarchical graphical models that balance expressiveness and tractability. LearnSPN, an algorithm for SPNs, employs hierarchical co-clustering to detect similar entities and features simultaneously. The original LearnSPN algorithm assumes all variables are discrete without missing data. We present MiniSPN, a simplified version of LearnSPN that offers improved speed and can handle missing data and various types of features common in real-world applications. MiniSPN's performance is showcased on standard benchmark datasets as well as two datasets sourced from Google's Knowledge Graph. These datasets demonstrate high rates of missingness and a mix of discrete and continuous features.",
    "Recent studies in deep neural networks have predominantly been centered around enhancing accuracy levels. However, did you know that within a given accuracy threshold, there could be various DNN architectures that deliver the same results? It's fascinating to note that opting for smaller DNN architectures when aiming for equivalent accuracy can provide an array of benefits. These compact designs not only demand less communication between servers during distributed training but also require minimal bandwidth when transferring models from the cloud to autonomous vehicles. Furthermore, their suitability for deployment on hardware with restricted memory, such as FPGAs, sets them apart.\n\nIntroducing SqueezeNet, a revolutionary small DNN architecture that promises immense advantages. Surpassing AlexNet's accuracy levels on ImageNet, SqueezeNet achieves this feat with a fraction of the parameters - a whopping 50x fewer, to be exact. Even more astonishing is the ability to compress SqueezeNet to less than 0.5MB, making it a mere 510x smaller than AlexNet. Intrigued to explore further? You can access the groundbreaking SqueezeNet architecture for download here: [https://github.com/DeepScale/SqueezeNet](https://github.com/DeepScale/SqueezeNet)",
    "In this paper, we introduce the Query-Reduction Network (QRN), a variant of Recurrent Neural Network (RNN) designed to handle short-term and long-term dependencies for question answering over multiple facts. QRN views context sentences as triggers for state changes, refining the original query with each observed trigger. Our experiments demonstrate QRN's superior performance in bAbI QA, dialog tasks, and real goal-oriented dialog datasets. Moreover, QRN enables parallelization on the RNN's time axis, resulting in a significant reduction in training and inference time complexity.",
    "We present a language-agnostic approach for automated generation of groups of semantically similar clusters of entities, accompanied by sets of \"outlier\" elements. This method enables an intrinsic evaluation of word embeddings for outlier detection purposes. Utilizing this methodology, we curated a standardized dataset named WikiSem500 and assessed various cutting-edge embeddings. The findings reveal a notable correlation between the effectiveness on this dataset and the performance in sentiment analysis tasks.",
    "Recurrent neural networks are commonly employed in forecasting temporal data due to their intricate deep feedforward design that enables them to grasp complex sequential patterns. Researchers suggest that incorporating top-down feedback into these networks could be a crucial aspect in distinguishing similar patterns based on a broader context. In this study, we propose surprisal-driven recurrent networks, which incorporate previous error data into their new predictions. This is done by continually analyzing the variance between the most recent predictions and the real observations. Additionally, we demonstrate that this approach outperforms other stochastic and completely deterministic techniques in the character-level prediction task on the enwik8 dataset, achieving a 1.37 BPC on the test section of the text.",
    "Generative Adversarial Networks, although excelling in generative tasks, are often seen as unstable and prone to missing certain modes. This issue is believed to stem from the unique functional shape of the trained discriminators in high-dimensional spaces, which can hinder training progress or lead the model to assign excessive probability mass in incorrect directions. To address these challenges, we propose various regularization techniques that significantly enhance the stability of GAN training. By leveraging these regularizers, we aim to improve the balanced distribution of probability mass across data distribution modes, consequently offering a comprehensive solution to the problem of missing modes especially in the initial training stages.",
    "Navigating the intricate landscape of sample complexity and safety in teaching policies through reinforcement learning for real-world tasks is no easy feat, especially when employing sophisticated function approximators like deep neural networks. It's a dance between using model-based methods, where a simulated world echoes the real one, to conquer data challenges. But the devil lies in the details as differences between these simulated and real worlds create hurdles during training.\n\nEnter the innovative EPOpt algorithm! This cutting-edge approach embraces an ensemble of simulated worlds and the art of adversarial training to cultivate policies that are resilient and can seamlessly adapt across a diverse array of potential scenarios, even unforeseen ones. By leveraging the power of source domain adaptation and approximate Bayesian techniques, the algorithm fine-tunes itself with real-world information to paint a more accurate representation of its environment.\n\nIn essence, learning through a dynamically evolving model ensemble, coupled with domain adaptation strategies, bestows the advantages of both robustness and profound learning capabilities.",
    "We present Divnet, a new method for training networks with diverse neurons. Divnet captures neuronal diversity by using a technique called Determinantal Point Process (DPP) on neurons in a layer. It selects a diverse subset of neurons using DPP and merges redundant neurons with the selected ones. Divnet is more structured and adaptable compared to existing methods for promoting neuronal diversity, which helps in regularization. This approach allows for efficient optimization of network structure, leading to smaller networks without compromising performance. Additionally, Divnet's focus on diversity and neuron merging is compatible with other techniques that aim to reduce network memory usage. Our experiments confirm that, when pruning neural networks, Divnet outperforms other methods significantly.",
    "Graph-based semi-supervised algorithms work better when they are applied on a graph made up of instances. These instances typically start off as vectors before being connected to form a graph. The way the graph is created depends on a metric in the vector space that determines how strongly the entities are connected. Normally, a common choice for this metric is a distance or similarity measure based on the euclidean norm. However, we believe that sometimes the euclidean norm may not be the best fit for solving the task effectively. Therefore, we introduce an algorithm that focuses on finding the most suitable vector representation to construct a graph, allowing us to efficiently solve the task at hand.",
    "In training Deep Neural Networks, a big challenge we face is keeping them from getting overly fixated on the training data, a problem known as overfitting. Many techniques like tweaking the data and using creative regularizers like Dropout have been suggested to address this issue without needing tons of data. In our study, we introduce a new regularizer called DeCov that helps immensely in curbing overfitting (seen in the gap between training and validation performance) and promoting better overall learning. Our regularizer pushes for varied and unique features in Deep Neural Networks by reducing the similarities in how different hidden layers respond. While this idea isn't entirely new, it's surprising that it hasn't been widely used as a regularizer in teaching these models. Tests on different datasets and model setups consistently show that our approach reduces overfitting and tends to either keep or improve the network's ability to generalize, often outperforming even the popular Dropout technique.",
    "We research online batch selection strategies for AdaDelta and Adam, two leading stochastic gradient-based optimization methods, to boost training efficiency. By ranking datapoints based on loss values and selecting batches accordingly, our approach accelerates both algorithms by approximately 5x on the MNIST dataset.",
    "We are thrilled to present an incredibly innovative and highly scalable approach for semi-supervised learning on graph-structured data! Our method is powered by a cutting-edge variant of convolutional neural networks that operate directly on graphs. Through a localized first-order approximation of spectral graph convolutions, we have carefully crafted a convolutional architecture that showcases efficiency and exceptional performance. \n\nWhat truly sets our model apart is its ability to scale linearly in the number of graph edges while learning hidden layer representations that capture the essence of both local graph structures and key nodes' features. In a series of rigorous experiments on citation networks and a knowledge graph dataset, our approach has not just outperformed but truly excelled beyond related methods by a substantial margin. Your journey to enhanced semi-supervised learning awaits with our groundbreaking solution! 🚀🌟",
    "We present a model called Energy-based Generative Adversarial Network (EBGAN). In this model, the discriminator acts like an energy function that assigns low energies to areas close to the real data and higher energies to other areas. The generator works to create samples with minimal energies, while the discriminator identifies and gives high energies to these generated samples. By treating the discriminator as an energy function, we can use different architectures and loss functions beyond the typical binary classifier with logistic output. For example, in one version of EBGAN, an auto-encoder setup is used with the reconstruction error as the energy rather than a traditional discriminator. This method shows more consistent performance during training compared to standard GANs. Furthermore, we demonstrate that a single-scale architecture can successfully produce high-resolution images.",
    "Recent research in deep learning has led to many new architecture designs. Some groups new to deep learning may feel overwhelmed by the variety of options and end up using older architectures like Alexnet. Our goal is to help bridge this gap by summarizing the key principles from recent deep learning research for designing neural network architectures. We also introduce innovative architectures like Fractal of FractalNet, Stagewise Boosting Networks, and Taylor Series Networks. You can find our Caffe code and prototxt files at https://github.com/iPhysicist/CNNDesignPatterns. We hope our work inspires others to build upon it.",
    "Machine comprehension (MC), which involves answering questions based on a provided context paragraph, requires understanding the complex interactions between the two. Lately, attention mechanisms have been successfully integrated into MC tasks. These approaches typically utilize attention to focus on specific parts of the context, summarize the information with a fixed-size vector, connect attentions over time, and often deploy uni-directional attention. In our study, we present the Bi-Directional Attention Flow (BIDAF) network, a multi-stage hierarchical process that captures context information at various granular levels and leverages bi-directional attention flow to create a query-aware context representation without premature summarization. Our experiments demonstrate that the BIDAF model outperforms existing methods and achieves state-of-the-art performance on the Stanford Question Answering Dataset (SQuAD) and the CNN/DailyMail cloze test.",
    "Despite advances in model learning and posterior inference, mastering deep generative models continues to pose a significant challenge, particularly in dealing with discrete hidden variables. This paper delves into the realm of Helmholz machines, where the generative model is paired with an auxiliary inference model, aiming to tackle this challenge head-on. Previous learning algorithms have often fallen short, as they merely optimize approximations of the desired marginal log-likelihood. In a departure from this approach, we introduce a novel class of algorithms grounded in stochastic approximation (SA) theory of the Robbins-Monro type, enabling direct optimization of the marginal log-likelihood while also minimizing the inclusive KL-divergence. This cutting-edge learning algorithm is aptly named joint SA (JSA). Furthermore, we devise a sophisticated MCMC operator tailored specifically for JSA. Our experiments on the MNIST datasets reveal that JSA consistently outperforms competing algorithms such as RWS in mastering complex models.",
    "Object detection using deep neural networks typically involves passing numerous potential bounding boxes through a neural network for each image. These bounding boxes are closely related as they come from the same image. This study explores utilizing feature patterns at the image level to streamline the neural network used for all bounding boxes. By removing units with minimal activation in the image, we can notably decrease the network's parameter count. Findings from the PASCAL 2007 Object Detection Challenge reveal that around 40% of units in certain fully-connected layers can be removed without significantly affecting the detection outcome.",
    "Modeling interactions between features significantly boosts the performance of machine learning solutions across various fields, such as recommender systems and sentiment analysis! Enter Exponential Machines (ExM), a cutting-edge predictor that captures all interactions of every order. By leveraging the innovative Tensor Train (TT) format to represent a vast tensor of parameters, ExM enhances model regularization and enables fine-tuning of the underlying parameters. Our pioneering stochastic Riemannian optimization approach empowers the training of models with a staggering 2^160 entries. Our results showcase ExM's unparalleled performance on synthetic data with high-order interactions and its exceptional competitiveness on the MovieLens 100K recommender system dataset when compared to high-order factorization machines. Get ready to witness the future of machine learning with Exponential Machines!",
    "Introducing Deep Variational Bayes Filters (DVBF), a novel technique for unsupervised learning and identifying latent Markovian state space models. DVBF utilizes advancements in Stochastic Gradient Variational Bayes to handle complex input data effectively, including sequences like images with temporal and spatial connections, without requiring specific domain expertise. Experimental results demonstrate that allowing backpropagation through transitions enforces state space assumptions, enhances information content in the latent embedding, and facilitates accurate long-term forecasting.",
    "Traditional goal-oriented dialog systems often involve extensive domain-specific manual work, limiting their ability to adapt to new domains. End-to-end dialog systems, however, overcome this constraint by training all components directly from the dialogs. While recent successes in chit-chat dialog are promising, they may not translate well to goal-oriented scenarios. This paper introduces a test platform to evaluate the effectiveness of end-to-end dialog systems in goal-oriented applications, with a focus on restaurant reservation tasks. These tasks involve handling sentences and symbols to engage in conversations, make API calls, and utilize the call outputs. Our study demonstrates that an end-to-end dialog system utilizing Memory Networks shows potential, learning to execute complex tasks with promising yet not flawless performance. We validate these findings by comparing our system to a manual slot-filling baseline using data from the Dialog State Tracking Challenge (Henderson et al., 2014a) and data from an online concierge service, showing consistent results across both datasets.",
    "Adversarial training regularizes supervised learning algorithms, while virtual adversarial training extends them to semi-supervised settings. However, both methods involve perturbing input vectors, which is not ideal for sparse high-dimensional inputs like one-hot word representations. To address this, we propose perturbing word embeddings in a recurrent neural network for text applications. Our method achieves state-of-the-art results on various tasks and improves both the quality of word embeddings and model training by reducing overfitting. Code can be found at https://github.com/tensorflow/models/tree/master/research/adversarial_text.",
    "Unsupervised learning of probabilistic models is a significant challenge in machine learning. Designing models with tractable learning, sampling, inference, and evaluation is crucial for solving this task. By utilizing real-valued non-volume preserving (real NVP) transformations, we expand the space of such models with powerful invertible and learnable transformations. This results in an unsupervised learning algorithm that offers exact log-likelihood computation, sampling, inference of latent variables, and an interpretable latent space. We showcase the model's effectiveness in modeling natural images on four datasets by demonstrating sampling, log-likelihood evaluation, and latent variable manipulations.",
    "This paper seeks to explore the perspective shift in examining the view-manifold structure across the various layers of Convolutional Neural Networks (CNN). It delves into essential queries such as the attainment of viewpoint invariance in the learned CNN representation, the mechanisms through which this invariance is realized, whether it involves collapsing or separating view manifolds, and the specific layer where view invariance is established. Additionally, the paper investigates experimental techniques to quantitatively assess the structure of the view manifold at each CNN layer and evaluates how fine-tuning a pre-trained CNN on a multi-view dataset impacts the representation at different layers of the network. A devised methodology to measure the deformation and degeneracy of view manifolds in various CNN layers is proposed, with the presented findings shedding light on the answers to these critical questions.",
    "Bilinear models offer richer representations than linear models and have been employed in various visual tasks such as object recognition, segmentation, and visual question-answering, achieving state-of-the-art performance by leveraging the enhanced representations. However, bilinear representations are often high-dimensional, which can limit their utility in computationally intensive tasks. To address this, we introduce low-rank bilinear pooling using Hadamard product to create an efficient attention mechanism for multimodal learning. Our approach surpasses compact bilinear pooling in visual question-answering tasks, achieving state-of-the-art results on the VQA dataset and demonstrating superior efficiency.",
    "Importance-weighted autoencoders aim to maximize a more restrictive bound on the marginal likelihood compared to the standard evidence lower bound. Our alternative interpretation is that they optimize the standard variational lower bound by employing a more intricate distribution. We show the formal derivation of this finding, introduce a stricter lower bound, and illustrate the underlying importance-weighted distribution.",
    "We provide a generalization bound for feedforward neural networks based on the spectral norm of the layers and the Frobenius norm of the weights, using PAC-Bayes analysis.",
    "In this paper, we introduce a method to improve Generative Adversarial Networks by providing direct energy estimates for samples. Our proposed flexible adversarial training framework guarantees the generator converges to the true data distribution while enabling the discriminator to retain density information at the global optimum. We derive the analytical solution and analyze its properties. To enhance trainability, we introduce two effective approximation techniques. Empirical results support our theoretical analysis, demonstrating the discriminator's ability to recover the energy of the data distribution.",
    "In this study, we conduct outlier detection by utilizing ensembles of neural networks derived through variational approximation of the posterior within a Bayesian neural network framework. The variational parameters are acquired through sampling from the genuine posterior via gradient descent. We demonstrate that our outlier detection outcomes are analogous to those achieved through other effective ensembling techniques.",
    "We introduce two straightforward methods to decrease parameter count and speed up training for large Long Short-Term Memory (LSTM) networks: one involves breaking down the LSTM matrix into two smaller matrices through \"matrix factorization by design,\" and the other involves partitioning the LSTM matrix, inputs, and states into separate groups. Both techniques enable the faster training of large LSTM networks to near state-of-the-art perplexity levels while requiring fewer RNN parameters.",
    "We found new and surprising results while training neural networks. Our aim is to learn more about how neural networks work by looking into these findings. We discovered these behaviors by using Cyclical Learning Rates (CLR) and linear network interpolation. Some of these behaviors include unexpected changes in training loss and fast training. For instance, we show that CLR can lead to better testing accuracy even with high learning rates. You can access the files to replicate our findings at https://github.com/lnsmith54/exploring-loss",
    "Machine learning models frequently encounter constraints and trade-offs during test-time that were not encountered during training-time. For instance, a computer vision model working on a compact device might have to carry out inference in real-time, while a translation model working on a cell phone could aim to limit its average computation time to enhance power efficiency. In this study, we introduce a mixture-of-experts model and demonstrate how to adjust its test-time resource allocation for each input using reinforcement learning. Our approach is validated on a simple MNIST-based illustration.",
    "This paper explores adversarial attacks on deep reinforcement learning policies, comparing the impact of using adversarial examples versus random noise. We introduce a new method that leverages the value function to minimize the number of adversarial injections needed for a successful attack. Additionally, we examine the effects of re-training on random noise and FGSM perturbations on the resilience against adversarial examples.",
    "This paper introduces variational continual learning (VCL), a framework that combines online variational inference (VI) and recent advances in Monte Carlo VI for neural networks to address continual learning challenges. VCL can train both deep discriminative and generative models in complex settings where tasks evolve over time and new tasks emerge. Experimental results demonstrate VCL's superiority over current continual learning methods across various tasks by preventing catastrophic forgetting automatically.",
    "In this paper, we tackle the challenge of determining the best size for a neural network without expensive trial and error. We introduce a method called nonparametric neural networks, which optimizes network size in a single training session. Our approach limits network growth with an L_p penalty to ensure effectiveness. We expand the network by adding new units and removing unnecessary ones using an L_2 penalty. To optimize this process, we created a new algorithm called adaptive radial-angular gradient descent (AdaRad), which has shown positive outcomes.",
    "The Natural Language Inference (NLI) task involves determining the logical relationship between a natural language premise and hypothesis. Our Interactive Inference Network (IIN) introduces a new type of neural network architecture that achieves a deep understanding of sentence pairs by hierarchically extracting semantic features from their interaction space. We demonstrate that the interaction tensor, with its attention weights, contains crucial semantic information for solving natural language inference tasks, and that a denser tensor captures more complex semantic details. One specific model, the Densely Interactive Inference Network (DIIN), showcases outstanding performance on both large-scale NLI datasets and the challenging Multi-Genre NLI (MultiNLI) dataset, achieving over a 20% error reduction compared to the best existing system.",
    "The capacity to implement neural networks in real-world, safety-critical applications is significantly restricted by the existence of adversarial samples: slightly altered inputs that are incorrectly classified by the network. In recent times, multiple methods have been suggested to enhance resilience against adversarial samples --- however, a majority of these have swiftly been found to be susceptible to future attacks. For instance, more than half of the defenses presented in papers approved at ICLR 2018 have already been compromised. Our solution to this challenge involves utilizing formal verification strategies. We demonstrate the ability to create adversarial samples with provably minimal alterations: taking an arbitrary neural network and input example, we can generate adversarial samples that are guaranteed to have minimal alteration. By adopting this method, we prove that one of the recent ICLR defense techniques, adversarial retraining, effectively increases the level of alteration needed to create adversarial samples by a factor of 4.2.",
    "We adapt Stochastic Gradient Variational Bayes to conduct posterior inference for the weights of Stick-Breaking processes. This innovation enables the creation of a Stick-Breaking Variational Autoencoder (SB-VAE), which is a Bayesian nonparametric iteration of the variational autoencoder featuring a latent representation with variable dimensionality. Our experiments show that the SB-VAE, along with a semi-supervised version, can learn remarkably discriminatory latent representations that frequently surpass those produced by Gaussian VAEs.",
    "We present a framework for concurrently training multiple neural networks. All models' parameters are constrained using the tensor trace norm, promoting parameter reuse among the networks. This approach underscores the concept of multi-task learning. Unlike numerous deep multi-task learning models, we avoid specifying a predefined parameter-sharing strategy by tying parameters in specific layers. Our framework allows for sharing among all suitable layers and enables a data-driven discovery of the sharing strategy.",
    "This paper introduces a deep reinforcement learning agent that combines actor-critic methods with experience replay. The agent is stable, efficient in using samples, and excels in difficult scenarios such as the 57-game Atari domain and various continuous control tasks. The paper proposes new techniques like truncated importance sampling, stochastic dueling network structures, and a trust region policy optimization method to achieve these results.",
    "Machine learning models can be tricked by slight modifications in the input, known as adversarial perturbations, which change the model's prediction without being easily noticeable to humans. We use three approaches to identify these manipulated images. To evade our detection, attackers need to make the adversarial images appear less unusual. Our most effective method shows that these images put a strange focus on less important elements from a statistical analysis technique called Principal Component Analysis (PCA). Additional detection methods and a visual aid can be found in an attached section.",
    "We present a cutting-edge approach to kernel learning that harnesses the power of Fourier analysis to understand translation-invariant or rotation-invariant kernels. Our innovative technique generates a series of feature maps that continuously enhance the SVM margin. By establishing strong theoretical guarantees for optimality and generalization, we view our algorithm as engaging in a dynamic online equilibrium search within a strategic two-player min-max game. Extensive testing on both artificial and actual datasets showcases the scalability and substantial advancements achieved compared to traditional random features-based methods.",
    "State-of-the-art deep reading comprehension models are absolutely crushing it with recurrent neural nets in the driver's seat! While their sequential nature is a perfect match for language, the limitation on parallelization within an instance can sometimes slow things down in critical scenarios. But wait, we've got an exciting solution for you! Introducing a convolutional architecture as an alternative to those recurrent models. By swapping out recurrent units with simple dilated convolutional units, we're achieving mind-blowing results on two question answering tasks. And guess what? We're also zooming past the competition with up to two orders of magnitude speedups for question answering. Let the excitement begin!",
    "This captivating report serves multiple goals. Firstly, it delves into the reproducibility of the groundbreaking paper \"On the regularization of Wasserstein GANs\" (2018). Secondly, we meticulously replicated and emphasized five crucial experiment aspects from the original paper: learning speed, stability, robustness against hyperparameters, estimating the Wasserstein distance, and exploring various sampling methods. Lastly, we offer insights into the reproducibility of the paper's contributions and the resources required, making all source code open to the public for transparency.",
    "Variational Autoencoders (VAEs) were initially introduced by Kingma & Welling in 2014 as probabilistic generative models that involve approximate Bayesian inference. The concept of $\\beta$-VAEs by Higgins et al. in 2017 revolutionized VAEs by extending their applications beyond generative modeling to areas like representation learning, clustering, and lossy data compression. This was made possible by introducing an objective function that empowers users to balance the information content of the latent representation with the fidelity of the reconstructed data, as demonstrated by Alemi et al. in 2018.\n\nIn our study, we revisit this trade-off between information content and reconstruction accuracy in hierarchical VAEs, which comprise multiple layers of latent variables. We unveil a novel class of inference models that allow for the separate tuning of each layer's contribution to the encoding rate, facilitating more nuanced control. By establishing theoretical bounds on the performance of downstream tasks based on the rates of individual layers, we validate our insights through extensive large-scale experiments.\n\nOur findings offer valuable insights for practitioners, guiding them on navigating the rate-distortion landscape to optimize performance in diverse applications.",
    "We introduce Graph2Gauss, a novel method for learning versatile node embeddings on large (attributed) graphs. Unlike traditional approaches that represent nodes as point vectors in a low-dimensional space, we model each node as a Gaussian distribution to capture uncertainty about its representation. Our approach excels in tasks like link prediction and node classification, demonstrating strong performance on various graph types including plain/attributed and directed/undirected graphs. We propose an unsupervised method that supports inductive learning, allowing us to generalize to new nodes without additional training. By leveraging both network structure and node attributes, we achieve state-of-the-art results on real-world networks. Moreover, by modeling uncertainty, we can estimate neighborhood diversity and uncover the latent dimensionality of a graph.",
    "This study investigates using self-ensembling for adapting visual domains. The method is based on the mean teacher variant of temporal ensembling, which has shown excellent results in semi-supervised learning. We make modifications to improve its performance in challenging domain adaptation situations and assess its effectiveness. Our method achieves top performance across different benchmarks, including winning the VISDA-2017 visual domain adaptation challenge. In small image benchmarks, our algorithm surpasses previous methods and approaches the accuracy of supervised classifiers.",
    "Machine learning classifiers, such as deep neural networks, are susceptible to adversarial examples, which are created by making tiny intentional changes to input data to cause incorrect outputs that are undetectable by humans. The objective of this research is not to propose a specific method but to take theoretical strides towards comprehending adversarial examples fully. By leveraging concepts from topology, our theoretical investigation unveils the main factors behind a classifier ($f_1$) being deceived by an adversarial example and involves an oracle ($f_2$, like human perception) in the analysis. Through an exploration of the topological connection between two (pseudo)metric spaces associated with predictor $f_1$ and oracle $f_2$, we establish conditions that are necessary and sufficient to determine if $f_1$ is consistently robust (strong-robust) against adversarial examples based on $f_2$. Fascinatingly, our theorems reveal that the mere presence of an irrelevant feature can render $f_1$ non-strong-robust, underscoring the importance of feature representation learning in attaining a classifier that is both precise and strongly robust.",
    "We set up a problem scenario to evaluate how well agents can gather information efficiently. We introduce tasks where agents need to search through a partially-obscured environment to find key pieces of information to achieve goals. By using deep architectures and reinforcement learning techniques, we create agents that can successfully complete these tasks. We guide agent behavior by providing both external and internal rewards. Our experiments show that these agents learn to actively and intelligently search for new information to decrease uncertainty and make use of acquired information.",
    "We suggest improving neural network language models by adjusting their predictions based on recent history. Our model is a simpler version of memory-enhanced networks, where past hidden activations are stored as memories and accessed using a dot product with the current hidden activation. This method is highly efficient and can work with large memory sizes. We also establish a connection between utilizing external memory in neural networks and cache models used in count-based language models. Our experiments with various language model datasets show that our method outperforms recent memory-enhanced networks by a significant margin.",
    "GANs are effective deep generative models, based on a two-player minimax game. Our novel algorithm repeats density ratio estimation and f-divergence minimization, offering a new perspective on GANs and incorporating insights from research on density ratio estimation, such as stability of divergence and usefulness of relative density ratio.",
    "We introduce an innovative framework for creating pop music using a hierarchical Recurrent Neural Network. The hierarchy's layers are designed to reflect our understanding of pop music composition: lower layers focus on melody generation, while higher levels handle drums and chords. Human studies show a clear preference for our music over Google's approach. Our framework also powers neural dancing and karaoke, along with neural story singing.",
    "We examine the eigenvalues of the Hessian matrix associated with a loss function both pre- and post-training. The distribution of eigenvalues appears to exhibit a distinct duality, comprising a central cluster that converges toward zero, alongside peripheral outliers that diverge significantly from zero. Our findings offer empirical substantiation for the central cluster, revealing insights into the extent of over-parametrization within the system, as well as for the outliers, whose positioning is shown to be contingent upon the input data.",
    "This paper introduces a novel feature extraction method for program execution logs. We automatically extract intricate patterns from a program's behavior graph and embed them into a continuous space using an autoencoder. The proposed features are tested on real-world malicious software detection and reveal interpretable structures within the pattern parts' space.",
    "We tested the FlyHash model, a sparse neural network inspired by insects, against non-sparse models in a navigation task. The task involves using visual inputs to steer by comparing them to stored memories along a training route. Our findings show that the FlyHash model is more efficient than other models, particularly in terms of data encoding.",
    "Engaging reformulation:\n\"In the world of peer review, reviewers are often tasked with scoring papers, which are then pivotal in decision-making by Area Chairs or Program Chairs. Yet, relying solely on these scores can be limiting due to human cognitive abilities. This leads to a common issue of tied scores, where valuable information is lost. To overcome this, conferences are now asking reviewers to also provide rankings of the papers they evaluate. But two challenges arise: the lack of a standardized way to utilize this ranking data and the absence of tools to incorporate it effectively, resulting in a less efficient peer-review process. Our innovative approach integrates rankings with scores, delivering updated scores that address these challenges. This method ensures consistency in how rankings are blended with scores for all papers, minimizing arbitrariness and seamlessly integrating with existing interfaces and workflows. Through empirical evaluation on synthetic and real datasets from ICLR 2017, we demonstrate a significant reduction in error rates, offering a promising solution for enhancing the peer-review process.\"",
    "Numerous recent studies have delved into the intriguing world of status bias within the rigorous peer-review process of academic journals and conferences. This article embarks on an exciting journey to uncover the fascinating link between author metadata and the final decisions (Accept/Reject) made by area chairs. Our quest takes us through a treasure trove of 5,313 borderline submissions to the renowned International Conference on Learning Representations (ICLR) spanning from 2017 to 2022.\n\nWith the precision of a skilled investigator, we meticulously outline the elements of our cause-and-effect analysis. We explore the treatment and its timing, pre-treatment variables, potential outcomes, and the intriguing causal null hypothesis, all within the rich tapestry of textual data and under the esteemed guidance of Neyman and Rubin's potential outcomes (PO) framework.\n\nAmidst our scholarly expedition, we unearth intriguing hints that author metadata may indeed influence the fate of academic articles. Furthermore, our discoveries reveal a captivating revelation – borderline articles hailing from prestigious institutions in the top echelons of academia (top-30% or top-20%) seem to face a less favorable stance from area chairs compared to their equally matched counterparts.\n\nThese remarkable findings stand firm across two distinct matched designs, painting a vivid picture of the landscape where odds ratios of 0.82 [95% CI: 0.67 to 1.00] in the first design and 0.83 [95% CI: 0.64 to 1.07] in a reinforced design eloquently signify our triumph.\n\nAs we unravel the complexities of our results, we delve into the intricate dance between study units and the myriad agents at play within the peer-review system. It is a tale of interactions, revelations, and insights that shed light on the compelling dynamics that shape the scholarly world we inhabit.",
    "We introduce a variational method inspired by Tishby et al.'s information bottleneck (1999). This approach involves using a neural network to parameterize the information bottleneck model and benefits from the reparameterization trick for effective training. Referred to as \"Deep Variational Information Bottleneck\" (Deep VIB), this method demonstrates superior generalization and robustness to adversarial attacks compared to models trained with alternative regularization techniques.",
    "Attention networks have proven to be effective for embedding categorical inference within deep neural networks. However, to model richer structural dependencies without losing end-to-end training, we experiment with incorporating graphical models encoded with richer structural distributions into deep networks. This work introduces structured attention networks as extensions of the basic attention procedure, enabling attention beyond the traditional soft-selection approach. We explore two classes of structured attention networks: linear-chain conditional random fields and graph-based parsing models, discussing their practical implementation as neural network layers. Experiments reveal that this approach effectively incorporates structural biases, with structured attention networks outperforming baseline models across various tasks like tree transduction, neural machine translation, question answering, and natural language inference. Additionally, models trained this way learn unsupervised hidden representations that generalize simple attention mechanisms.",
    "We propose using a group of specialized experts based on the confusion matrix. We noticed that in certain cases, when faced with tricky situations, the labels tend to be assigned to a few wrong classes. This suggests that a team of specialists could do a better job at spotting and rejecting misleading instances by having varying opinions when dealing with adversaries. Our experimental results support this idea by showing that this approach can enhance the system's resilience to tricky examples. Instead of solely focusing on classifying them correctly, we aim to improve the system by rejecting such cases.",
    "In this scholarly article, we introduce the Neural Phrase-based Machine Translation (NPMT) framework. Our approach intricately incorporates phrase structures in the generated sequences by leveraging Sleep-WAke Networks (SWAN), a novel segmentation-based sequence modeling technique. To address the strict requirement of monotonic alignment in SWAN, a new layer is introduced to facilitate (soft) local reordering of input sequences. Differing from conventional neural machine translation (NMT) methods, NPMT eschews attention-based decoding mechanisms. Instead, it directly yields phrases in a systematic manner, allowing for linear-time decoding. Empirical results from our conducted experiments demonstrate that NPMT outperforms existing NMT models on the IWSLT 2014 German-English/English-German and IWSLT 2015 English-Vietnamese machine translation tasks. Additionally, our findings signify that NPMT generates coherent and meaningful phrases in the target languages.",
    "Introducing LR-GAN: a cutting-edge adversarial image generator that goes beyond the norm by incorporating scene structure and context. Unlike its predecessors, this innovative GAN has the ability to intricately craft backgrounds and foregrounds separately and recursively, expertly weaving them together in a way that captures the essence of a natural image. Through its unsupervised training, LR-GAN refines the generation of appearance, shape, and pose for each foreground element with seamless integration. The comprehensive experiments reveal LR-GAN's knack for producing lifelike images with objects that are notably more human recognizable compared to conventional models like DCGAN.",
    "We present an elegant framework in which an agent can autonomously acquire knowledge about its environment. Our framework introduces a dynamic interaction between two autonomous entities, Alice and Bob, as they engage in a collaborative learning process. Specifically, Alice initiates a task for Bob to complete, prompting him to respond by executing the required actions. Our focus in this study lies on two distinct types of environments: those that are nearly reversible, and those amenable to reset. To convey the task, Alice executes a sequence of actions that Bob must either reverse or replicate. Through a carefully designed reward system, Alice and Bob jointly shape a learning curriculum that facilitates unsupervised training of the agent. Leveraging this unsupervised training method in Bob's reinforcement learning applications within the environment, we observe a notable decrease in the required supervised episodes for learning, sometimes resulting in superior reward convergence.",
    "Maximum entropy modeling serves as a versatile and widely embraced approach for constructing statistical models when possessing only partial information. Instead of following the customary path of directly optimizing the continuous density, this study focuses on acquiring a smooth and reversible transformation that aligns a basic distribution with the desired maximum entropy distribution. This task is particularly challenging as the optimization target (entropy) is contingent on the density itself. Leveraging advancements in normalizing flow networks, the researchers outline a strategy for reformulating the maximum entropy conundrum into a finite-dimensional, constrained optimization issue. This is tackled through the amalgamation of stochastic optimization with the augmented Lagrangian method. The efficacy of the proposed method is demonstrated through simulation findings, while real-world applications in finance and computer vision underscore the adaptability and precision of maximum entropy flow networks.",
    "With machine learning breaking new ground by conquering challenging tasks daily, the vision of achieving general AI is coming within reach. Yet, the current research primarily emphasizes specific applications like image classification and machine translation. This trend largely stems from the difficulty in objectively gauging advancements towards comprehensive machine intelligence. To address this issue, we introduce a clear set of objectives for general AI and a testing framework to assess machine performance against these objectives, streamlining the process.",
    "Neural networks that compute over graph structures are suitable for problems in various domains like natural language and cheminformatics. However, these networks do not directly support batched training or inference due to the varied shape and size of the computation graph for each input. Implementing them in popular static data-flow graph-based deep learning libraries is also challenging. We introduce dynamic batching, a technique that enables batching operations across different input graphs and nodes within a single input graph. This technique allows us to create static graphs using popular libraries that mimic dynamic computation graphs of any shape and size. We also present a high-level library of compositional blocks to simplify the creation of dynamic graph models and demonstrate concise batch-wise parallel implementations for various models.",
    "Deep learning models in natural language processing are effective but often operate as black boxes, providing little insight into their decision-making process. This paper introduces a new method for tracking the importance of inputs to Long Short Term Memory networks (LSTMs) in producing outputs. We identify significant word patterns to distill state-of-the-art LSTMs on sentiment analysis and question answering into key phrases. These phrases are then validated quantitatively by constructing a rule-based classifier that closely mimics the LSTM's output.",
    "In recent years, deep reinforcement learning has accomplished impressive feats. However, challenges persist with tasks that offer sparse rewards or have long horizons. To address these issues, we propose a general framework that involves learning useful skills in a pre-training environment before applying them to expedite learning in subsequent tasks. Our method combines aspects of intrinsic motivation and hierarchical methods by guiding the learning of valuable skills through a single proxy reward that demands minimal domain knowledge of the tasks. This leads to the training of a high-level policy on these skills, enhancing exploration and enabling handling of sparse rewards. To efficiently pre-train diverse skills, we utilize Stochastic Neural Networks with an information-theoretic regularizer. Our experiments demonstrate that this approach is effective in learning a broad range of understandable skills in a resource-efficient manner and significantly improves learning performance consistently across various subsequent tasks.",
    "In recent years, deep generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), have shown significant success. While traditionally viewed as separate paradigms, GANs and VAEs have been the focus of extensive research efforts. This paper seeks to establish formal connections between GANs and VAEs by introducing a new formulation. We consider sample generation in GANs as a form of posterior inference and demonstrate that both GANs and VAEs aim to minimize KL divergences in their respective posterior and inference distributions, albeit in opposite directions. This approach extends the two learning phases of the classic wake-sleep algorithm. By adopting this unified perspective, we are able to analyze various model variants effectively and transfer techniques across different research avenues in a systematic manner. For instance, we integrate the importance weighting method from VAE literature to enhance GAN learning, and introduce an adversarial mechanism to VAEs that utilizes generated samples. Through experiments, we demonstrate the general applicability and effectiveness of these transferred techniques.",
    "We address the issue of identifying out-of-distribution images in neural networks using ODIN, a straightforward yet powerful technique that does not necessitate modifications to the pre-trained neural network. By leveraging temperature scaling and introducing minor perturbations to the input, we achieve a clear separation of softmax scores for in-distribution and out-of-distribution images, enhancing detection capabilities significantly. Through various experiments, we demonstrate that ODIN is versatile across different network architectures and datasets, consistently outperforming the baseline method by a significant margin and setting a new performance benchmark. For instance, in the case of the DenseNet model on CIFAR-10, ODIN dramatically decreases the false positive rate from 34.7% to 4.3% while maintaining a 95% true positive rate.",
    "A framework has been introduced for unsupervised learning of representations by leveraging the infomax principle with large-scale neural populations. Through the utilization of an asymptotic approximation of Shannon's mutual information, it has been shown that an effective initial estimation of the global information-theoretic optimum can be achieved by employing a hierarchical infomax approach. Following this initial stage, a proficient algorithm, utilizing gradient descent of the final objective function, has been suggested to acquire representations from input datasets, adapting to complete, overcomplete, and undercomplete bases. Through numerical experiments, it has been evidenced that our method is both robust and highly efficient in extracting prominent features from input datasets. In comparison to prevailing methods, our algorithm stands out due to its noteworthy speed in training and the overall resilience of unsupervised representation learning. Additionally, the proposed method can be readily expanded to encompass supervised or unsupervised models for the training of deep structural networks.",
    "Recurrent Neural Networks (RNNs) are well-known for their excellent performance in sequence modeling tasks. However, training RNNs on long sequences can present challenges such as slow inference speed, vanishing gradients, and difficulty in capturing long term dependencies. In the context of backpropagation through time, these challenges are often linked to the large, sequential computational graph that arises from unrolling the RNN over time. To address these issues, we propose the Skip RNN model, which extends traditional RNNs by learning to skip state updates, thereby reducing the effective size of the computational graph. The model can also be trained to minimize the number of state updates through a budget constraint. Our evaluations on various tasks demonstrate that the Skip RNN model can decrease the required number of RNN updates while maintaining or even enhancing the performance compared to baseline RNN models. For those interested, the source code is available at https://imatge-upc.github.io/skiprnn-2017-telecombcn/.",
    "Restart methods are commonly used in optimization without using gradient information to handle complex functions with multiple peaks. Partial restart methods are becoming popular in optimization with gradient information to speed up convergence in methods that move faster through difficult functions. In this research, we introduce a straightforward restart approach for stochastic gradient descent to enhance its performance during the training of deep neural networks, regardless of when it is stopped. We test its effectiveness on CIFAR-10 and CIFAR-100 datasets and report improved performance of 3.14% and 16.21% accuracy, respectively, setting new records. Additionally, we show its benefits on an EEG dataset and a downsized version of the ImageNet dataset. The code used is accessible at https://github.com/loshchil/SGDR.",
    "Policy gradient methods have demonstrated considerable success in addressing complex reinforcement learning tasks. Nevertheless, they frequently encounter challenges associated with high variance in policy gradient estimation, resulting in suboptimal sample efficiency during the training process. This study introduces a novel control variate approach to mitigate variance issues in policy gradient methods. Drawing inspiration from Stein's identity, our method expands upon existing control variate techniques applied in REINFORCE and advantage actor-critic strategies by incorporating versatile action-dependent baseline functions. Empirical findings illustrate the substantial enhancement in sample efficiency achieved by our approach relative to contemporary policy gradient methodologies.",
    "Skip connections have revolutionized the training of deep neural networks, making it possible to build networks with numerous layers. They have now become a crucial element in a wide range of neural architectures. While the exact reasons behind their effectiveness remain a mystery, we propose a fresh perspective on why skip connections are so beneficial in training very deep networks.\n\nThe challenge of training deep networks stems from the singularities caused by the inherent non-identifiability of the model. These singularities, identified in previous studies, include overlap singularities resulting from the permutation symmetry of nodes, elimination singularities from the deactivation of nodes, and singularities arising from node linear dependence. These singularities create degenerate regions in the loss landscape, hindering the learning process.\n\nWe posit that skip connections mitigate these singularities by disrupting the permutation symmetry of nodes, reducing the likelihood of node deactivation, and decreasing node linear dependence. Additionally, by initializing the network with skip connections, we can steer the network away from these problematic singularities, reshaping the landscape to facilitate smoother learning. This hypothesis is validated through simplified models and experiments with deep networks trained on real-world datasets.",
    "We endeavored to replicate the findings of the research paper \"Natural Language Inference across Interaction Space\" which was presented at the ICLR 2018 conference, forming a part of the ICLR 2018 Reproducibility Challenge. Initially, we embarked on creating our own implementation of the network, only later finding out that the code was readily available. Our version of the model was put to the test on the Stanford NLI dataset, achieving an impressive accuracy of 86.38% on the test set, while the original paper reported 88.0% accuracy. The key disparities seem to stem from discrepancies in optimizers and the methodology employed for model selection.",
    "We have effectively applied the \"Learn to Pay Attention\" attention mechanism model in convolutional neural networks, and have achieved the same outcomes as the original paper in image classification and fine-grained recognition categories.",
    "Learning universal distributed representations of sentences is a crucial objective in the field of natural language processing. Our proposed technique involves encoding the suffixes of word sequences in sentences and leveraging the Stanford Natural Language Inference (SNLI) dataset for training. Through evaluation on the SentEval benchmark, we showcase the efficiency of our method, which outperforms existing approaches across various transfer tasks.",
    "In modern neural models, advanced features are created by leveraging polynomial functions of existing ones to enhance representations. For instance, in our exploration utilizing the natural language inference task, we delved into the effectiveness of incorporating scaled polynomials of degree 2 and higher as matching features. Notably, our results revealed that scaling degree 2 features significantly boosts performance, leading to a remarkable 5% reduction in classification error for the most successful models.",
    "We introduce a generalization bound for feedforward neural networks based on the product of the spectral norm of the layers and the Frobenius norm of the weights. This generalization bound is obtained through a PAC-Bayes analysis.",
    "In our research, we delve into the Batch Normalization technique from a fresh perspective by introducing its probabilistic interpretation. We present a probabilistic model that highlights how Batch Normalization optimizes the lower bound of its marginalized log-likelihood. Our proposed probabilistic framework guides the development of a training algorithm that remains consistent across training and testing phases. Despite the efficiency of our approach, computational challenges arise during inference stages. To address this issue and enhance computational efficiency, we introduce Stochastic Batch Normalization as a practical approximation of the proper inference process. This method not only streamlines memory usage and computational demands but also enables a scalable uncertainty estimation approach. Through rigorous experimentation on well-known architectures such as VGG-like and ResNets for MNIST and CIFAR-10 datasets, we showcase the effectiveness of Stochastic Batch Normalization in enhancing model performance.",
    "Discover a groundbreaking revelation in the realm of deep convolutional networks! Contrary to popular belief, you'll be amazed to learn that losing information isn't the key to their success. Enter the realm of i-RevNet, a revolutionary network that defies traditional norms by retaining all information through a cascade of homeomorphic layers. Unravel the mystery of invertibility and witness the magic of progressive contraction and linear separation. Embark on a journey through natural image representations and shed light on the enigma of i-RevNet's learned model.",
    "In this paper, we explore the effectiveness of deep latent variable models, particularly the deep information bottleneck model. We highlight its limitations and introduce an enhanced model that overcomes these challenges. Our approach involves implementing a copula transformation, which restores the information bottleneck method's key invariance properties. This transformation enables the disentanglement of features within the latent space and promotes sparsity. Through experimentation on artificial and real data, we demonstrate the performance of our proposed method.",
    "We propose a modified version of the MAC model (Hudson and Manning, ICLR 2018) that utilizes simplified equations to maintain high accuracy and faster training speed. Evaluation on CLEVR and CoGenT demonstrates a significant 15-point increase in accuracy through transfer learning with fine-tuning, achieving state-of-the-art performance. Additionally, our study highlights that incorrect fine-tuning can diminish the model's accuracy.",
    "Adaptive Computation Time for Recurrent Neural Networks (ACT) is a promising architecture that can adjust the amount of computation needed for different tasks. ACT can look at each input sample multiple times and learn how many repetitions are necessary. In this study, we compare ACT with Repeat-RNN, a new architecture that repeats each sample a set number of times. Surprisingly, we found that Repeat-RNN performs just as well as ACT in the tasks we tested. You can find the source code in TensorFlow and PyTorch at https://imatge-upc.github.io/danifojo-2018-repeatrnn/",
    "Generative adversarial networks (GANs) have the ability to represent the intricate patterns found in real-world data, making them promising for spotting anomalies. Despite this potential, only a limited number of studies have delved into using GANs for anomaly detection. By utilizing advanced GAN models, we excel in detecting anomalies on image and network intrusion datasets and outperform the only known GAN-based method by a significant margin in terms of speed during testing.",
    "The Natural Language Inference (NLI) task involves determining how two sentences are related to each other. We have developed the Interactive Inference Network (IIN), a new type of neural network that can understand sentence pairs by extracting meaning in a step-by-step way. Our research shows that paying attention to how sentence parts interact helps in understanding the relationship between sentences. This approach, known as the Densely Interactive Inference Network (DIIN), has shown excellent results on large datasets. In fact, DIIN reduces errors by more than 20% compared to the best-known system when tested on the challenging Multi-Genre NLI (MultiNLI) dataset.",
    "The deployment of neural networks in real-world, safety-critical systems faces a significant challenge due to adversarial examples – perturbed inputs that can cause misclassification. Despite numerous techniques aimed at improving robustness, many have quickly succumbed to new attacks. For instance, more than half of the defenses introduced at ICLR 2018 have already been breached. Our solution to this dilemma lies in formal verification methods. We present a method for generating provably minimal adversarial examples – ensuring that the distortions are minimized. By applying this technique, we showcase the effectiveness of adversarial retraining, a recent defense proposal from ICLR, in increasing the distortion required for crafting adversarial examples by a factor of 4.2.",
    "Sure! Here is a reworded version:\n\nDeep neural networks (DNNs) have shown impressive predictive abilities by understanding complex, non-linear connections between variables. However, their lack of transparency has led them to be labeled as black boxes, limiting their applications. To address this issue, we introduce hierarchical interpretations through a method called agglomerative contextual decomposition (ACD). ACD explains DNN predictions by providing a hierarchy of input features and their contributions to the final prediction. This hierarchy helps identify predictive feature clusters learned by the DNN. Our experiments using Stanford Sentiment Treebank and ImageNet datasets demonstrate that ACD can diagnose incorrect predictions and detect dataset biases effectively. Human experiments show that ACD enables users to identify the more accurate DNN among two choices and instills greater confidence in a DNN's outputs. ACD's hierarchy is also shown to be resilient against adversarial perturbations, focusing on essential input aspects while ignoring noise.",
    "In our latest project, we tackle the fascinating challenge of transforming musical timbre. Imagine being able to change the timbre of a sound sample from one instrument to match another, without altering its pitch, rhythm, or volume. This process involves a delicate balance of art and science, inspired by image-based style transfer techniques applied to audio signals.\n\nEnter TimbreTron - our groundbreaking approach to musical timbre transfer. By leveraging cutting-edge technology, we use an \"image\" domain style transfer method on the time-frequency representation of the audio, creating a seamless transition between instruments. The magic continues as we generate a high-fidelity audio waveform using a conditional WaveNet synthesizer.\n\nOur research reveals that the Constant Q Transform (CQT) representation is key to our success, allowing us to maintain pitch consistency throughout the process. Through rigorous human assessments, we have confirmed that TimbreTron achieves remarkable results, effectively transferring timbre while preserving the essence of the original musical composition, whether it's a solo melody or a complex harmony. Join us on this sonic journey where innovation meets musicality!",
    "We're diving into word-level language modeling and exploring how we can blend short-term representations with medium-term ones in a language model. Building on recent experiments using dynamically changing weights in language models, we're framing the language modeling challenge as an online learning adventure. In this setup, a meta-learner gets busy training to fine-tune our language model's weights continually through gradient descent.",
    "Generative Adversarial Networks (GANs) are potent generative models capable of representing the manifold of natural images. This feature is employed to execute manifold regularization through an approximation of the Laplacian norm using a Monte Carlo method that can be readily computed within the GAN framework. By integrating this approach into the feature-matching GAN proposed in Improved GAN, we attain cutting-edge outcomes for semi-supervised learning based on GANs using the CIFAR-10 dataset. Notably, our method offers a notable advantage in terms of implementation simplicity compared to alternative approaches.",
    "We have uncovered a fascinating class of over-parameterized deep neural networks that defy the odds by possessing a unique property - they have no bad local valleys! This means that you can traverse through continuous paths in their parameter space, continuously decreasing the cross-entropy loss until it nearly reaches zero. Consequently, these networks are free from sub-optimal strict local minima. How incredible is that?",
    "Visual Question Answering (VQA) models have STRUGGLED with COUNTING objects in natural images. A FUNDAMENTAL PROBLEM, identified as SOFT ATTENTION in these models, is the culprit. To SOLVE this, we propose a neural network component for ROBUST COUNTING from object proposals. Our EXPERIMENTS on a toy task demonstrate the EFFECTIVENESS of this approach, leading to STATE-OF-THE-ART ACCURACY on the number category of the VQA v2 dataset. Surprisingly, our SINGLE MODEL OUTPERFORMS ensemble models. Our component offers a 6.6% IMPROVEMENT over a STRONG baseline in counting.",
    "A significant challenge in the examination of generative adversarial networks lies in the volatility of their training process. Within this manuscript, we present a groundbreaking weight normalization approach known as spectral normalization designed to enhance the stability of the discriminator's training. This innovative normalization method is lightweight computationally and seamlessly integrates into current frameworks. Through rigorous testing on the CIFAR10, STL-10, and ILSVRC2012 datasets, we verified experimentally that using spectrally normalized GANs (SN-GANs) enables the generation of images that are superior to or on par with those produced by preceding training stabilization methods.",
    "Embedding graph nodes into a vector space enables the utilization of machine learning for tasks such as predicting node classes. However, it is noteworthy that the study of node embedding algorithms is not as well-developed compared to the natural language processing field, largely due to the heterogeneous and complex nature of graphs. In this study, we analyze the efficacy of various node embedding algorithms in relation to graph centrality metrics that capture the diversity of graphs. Through structured experimentation involving four node embedding algorithms, four to five graph centrality metrics, and six distinct datasets, we have obtained empirical insights into the performance of node embedding algorithms. These findings serve as a foundational framework for further exploration and research in this domain.",
    "Introducing a groundbreaking dataset of logical entailments to evaluate models' capacity in capturing and leveraging logical structures for an entailment prediction task. We put various architectures, common in sequence-processing, up against each other, including a novel model class - PossibleWorldNets, which processes entailment through a \"convolution over possible worlds\". Findings reveal that while convolutional networks possess an unsuitable inductive bias for these tasks compared to LSTM RNNs, tree-structured neural networks excel over LSTM RNNs owing to their superior syntax exploitation. Notably, PossibleWorldNets surpass all benchmarks, showcasing exceptional performance.",
    "Neural network pruning methods can significantly reduce the parameter counts of trained networks, which decreases storage requirements and enhances computational performance during inference without compromising accuracy. However, sparse architectures resulting from pruning are often challenging to train initially, which could also enhance training performance. Our research shows that a typical pruning technique naturally reveals subnetworks that were effectively trainable due to their initializations. Based on these findings, we introduce the \"lottery ticket hypothesis,\" suggesting that within dense, randomly-initialized, feed-forward networks are subnetworks (\"winning tickets\") that, when trained independently, achieve test accuracy similar to the original network in a comparable number of iterations. These winning tickets benefit from fortuitous initializations, with connections possessing initial weights conducive to effective training. We propose an algorithm to identify winning tickets and present experimental results that confirm the lottery ticket hypothesis and the significance of these fortunate initializations. Our experiments consistently identify winning tickets that are less than 10-20% the size of various fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Networks corresponding to these winning tickets, larger than the mentioned size range, outperform the original network by learning faster and achieving higher test accuracy.",
    "The singular values of the linear transformation linked with a standard 2D multi-channel convolutional layer are identified to facilitate efficient computation. This identification further paves the way for developing an algorithm to project a convolutional layer onto an operator-norm ball. It has been demonstrated that this serves as a valuable regularizer, as evidenced by its ability to reduce the test error of a deep residual network utilizing batch normalization on CIFAR-10 from 6.2\\% to 5.3\\%.",
    "Despite the empirical success of deep and locally connected nonlinear networks like deep convolutional neural networks (DCNN), understanding their theoretical properties remains a challenging task. In this paper, we introduce a new theoretical framework designed for these networks using ReLU nonlinearity. This framework explicitly defines the data distribution, promotes disentangled representations, and integrates well with popular regularization methods such as Batch Norm. By leveraging a teacher-student approach, we extend the student's forward and backward propagation within the teacher's computational graph. Importantly, our model avoids making unrealistic assumptions, such as Gaussian inputs or activation independence. Our proposed framework offers a means to analyze various practical issues theoretically, including overfitting, generalization, and disentangled representations within deep networks.",
    "We introduce a Neural Program Search algorithm, which creates programs from natural language descriptions and a few input/output examples. This algorithm merges techniques from Deep Learning and Program Synthesis areas, forming a specialized domain-specific language (DSL) and a powerful search algorithm guided by a Seq2Tree model. Additionally, to assess the effectiveness of the approach, we offer a semi-synthetic dataset containing descriptions, test examples, and corresponding programs. Our results demonstrate a substantial improvement over a baseline sequence-to-sequence model with attention.",
    "State-of-the-art neural machine translation systems vary in their structures but typically include the crucial feature of Attention. While many attention methods focus on individual tokens and overlook the significance of phrasal alignments, important for traditional phrase-based machine translation success, our study introduces innovative phrase-based attention techniques. These methods consider n-grams of tokens as attention units, enhancing the Transformer network. Our experiments show that incorporating our phrase-based attention leads to performance boosts of 1.3 BLEU for English-to-German and 0.5 BLEU for German-to-English translations when evaluated on the WMT newstest2014 dataset using WMT'16 training data.",
    "We address the challenge of learning distributed representations of edits using a combination of a \"neural editor\" and an \"edit encoder\". Our models effectively capture key information of edits and can be applied to new inputs. We conduct experiments on natural language and source code edit data, with promising results indicating that our neural network models successfully capture the structure and meaning of edits. We encourage other researchers to explore this task and data source further.",
    "We introduce a principled kernel learning approach based on Fourier-analytic properties of translation-invariant or rotation-invariant kernels. Our method generates a series of feature maps to enhance the SVM margin iteratively. We offer rigorous guarantees on optimality and generalization, viewing our algorithm as a dynamic equilibrium-finding process in a specific two-player min-max game. Experiments on synthetic and real datasets show scalability and consistent enhancements over comparable random features methods.",
    "This paper introduces Variational Continual Learning (VCL), a sophisticated and versatile framework for continual learning that integrates online variational inference (VI) and recent advancements in Monte Carlo VI tailored for neural networks. The framework exhibits the ability to effectively train deep discriminative and deep generative models in intricate continual learning scenarios characterized by evolving existing tasks and emerging new tasks. Empirical findings demonstrate that VCL surpasses existing state-of-the-art continual learning approaches across various tasks, mitigating catastrophic forgetting through fully autonomous means.",
    "This report serves a number of exciting purposes! To begin with, we're on a mission to explore the reproducibility standards of the enlightening paper titled \"On the regularization of Wasserstein GANs (2018)\". We dive deep into five key experiment facets highlighted in the paper: learning speed, stability, resilience to hyperparameters, accurate estimation of Wasserstein distance, and the exploration of diverse sampling methods. Our ultimate goal is to unveil the parts of this research that can be successfully replicated, all while considering the resources involved. The best part? All the source code needed for reproduction is freely available to the public! Let's embark on this fascinating journey together.",
    "Here is the revised text:\n\nThis paper introduces a novel method for extracting features from program execution logs. Initially, intricate patterns are automatically extracted from a program's behavior graph. Subsequently, these patterns are transformed into a continuous space through the training of an autoencoder. The effectiveness of the proposed features is assessed in the context of detecting real-world malicious software. Additionally, it is observed that the embedding space reveals intelligible structures within the pattern components.",
    "We present a novel neural probabilistic model utilizing variational autoencoder architecture. This model offers the flexibility to be conditioned on any chosen subset of observed features, enabling simultaneous sampling of the remaining features. The features can encompass both real-valued and categorical data. The model is trained using stochastic variational Bayes method. Results from experiments on synthetic data, feature imputation, and image inpainting tasks demonstrate the success of our approach in producing diverse and effective samples.",
    "Variational Autoencoders (VAEs) were initially proposed by Kingma & Welling (2014) as probabilistic generative models that enable approximate Bayesian inference. The concept of $\\beta$-VAEs (Higgins et al., 2017) expanded the scope of VAEs beyond generative modeling to encompass various application domains such as representation learning, clustering, and lossy data compression. This extension introduced an objective function that empowers practitioners to balance the trade-off between the information content (bit rate) of the latent representation and the fidelity of the reconstructed data (Alemi et al., 2018). \n\nThis study reexamines the rate/distortion trade-off within the framework of hierarchical VAEs, which involve multiple layers of latent variables. We introduce a broad class of inference models that enable the division of the overall rate into individual layer contributions, providing the flexibility to adjust each layer's rate independently. By establishing theoretical performance bounds for downstream tasks based on the rates of individual layers, we conduct extensive large-scale experiments to validate our theoretical insights. Our findings offer practical guidance for practitioners seeking to optimize the rate-space for specific applications.",
    "Studying the subspaces of adversarial examples is crucial for assessing the resilience of deep neural networks (DNNs) against adversarial perturbations. A recent study by Ma et al. (ICLR 2018) introduced the concept of using the local intrinsic dimensionality (LID) within the hidden layers of DNNs to examine these adversarial subspaces. Their research showcased the applicability of LID in delineating the adversarial subspaces related to various attack techniques, such as the Carlini and Wagner's (C&W) attack and the fast gradient sign attack.\n\nIn this particular study, experimental investigations on MNIST and CIFAR-10 datasets were conducted to delve into new dimensions beyond the traditional LID analysis, highlighting the challenges in using LID to fully characterize the corresponding adversarial subspaces. The study identified two key limitations: (i) the sensitivity of LID performance to the confidence levels set by an attacking algorithm, with ensemble learning on adversarial examples at different confidence levels yielding unexpectedly weak results, and (ii) the inadequacy of LID in delineating adversarial subspaces when the attacks originate from a different DNN model, as noted in the context of black-box transfer attacks.\n\nThese insightful findings collectively underscore the restricted capacity of LID in effectively characterizing the subspaces associated with adversarial examples, shedding light on the overarching significance of further research in this domain.",
    "GANs are known for producing appealing samples but are difficult to train. Current approaches focus on proposing new formulations of the GAN objective, with little attention given to optimization methods for adversarial training. Our work shifts the focus by applying variational inequality framework to GAN optimization problems. Drawing from mathematical programming literature, we challenge misconceptions about saddle point optimization difficulties and introduce techniques tailored for training GANs, such as averaging, extrapolation, and a computationally cheaper variant called extrapolation from the past, to stochastic gradient descent (SGD) and Adam.",
    "Recently, groundbreaking advances have been made in neural message passing algorithms for semi-supervised classification on graphs. Despite their success, these methods have been limited by their consideration of only nodes within a few propagation steps, making it challenging to extend the neighborhood size effectively. In this study, we introduce a novel approach that exploits the connection between graph convolutional networks (GCN) and PageRank to enhance the propagation scheme, utilizing personalized PageRank. This leads to the development of a streamlined model, personalized propagation of neural predictions (PPNP), alongside its efficient approximation, APPNP. Notably, our model offers training times comparable to or faster than existing models, with an equivalent or reduced number of parameters. By enabling a large, adjustable neighborhood for classification, it seamlessly integrates with any neural network. Extensive evaluations demonstrate that our model surpasses several state-of-the-art methods in semi-supervised classification, marking a significant milestone in GCN-like model research. Our implementation is readily accessible online for further exploration.",
    "We have identified disguised gradients, a type of gradient hiding, as a phenomenon that creates a false sense of safety in protections against adversarial instances. Although protections that produce disguised gradients seem to block iterative optimization-driven attacks, we have observed that safeguards based on this feature can still be bypassed. We detail typical behaviors of safeguards displaying this feature and, for each of the three forms of disguised gradients we have identified, we craft offensive strategies to surpass them. In an investigation, focusing on non-certified white-box-secure protections at ICLR 2018, we note that disguised gradients are prevalent, with 7 out of 9 defenses counting on them. Our new strategies effectively bypass 6 defenses completely, and 1 partially, in the initially considered threat model of each paper.",
    "Our groundbreaking approach, Graph2Gauss, revolutionizes network analysis by efficiently learning versatile node embeddings on large-scale graphs. By representing nodes as Gaussian distributions instead of typical point vectors, we capture uncertainty and improve performance in tasks such as link prediction and node classification. This method excels in handling diverse types of graphs and inductive learning scenarios, offering superior generalization to unseen nodes without the need for further training. Through personalized ranking formulation based on node distances, we leverage network structure to achieve outstanding results, surpassing state-of-the-art methods on various tasks. Our experiments on real-world networks highlight the exceptional performance of Graph2Gauss, showcasing its capability to model uncertainty, estimate neighborhood diversity, and reveal the latent dimensionality of graphs.",
    "Convolutional Neural Networks (CNNs) have become the go-to approach for tackling learning tasks concerning 2D flat images. Nevertheless, various contemporary challenges have surfaced, necessitating models capable of analyzing spherical images. Scenarios include all-encompassing vision for drones, robots, and self-driving vehicles, molecular regression dilemmas, and worldwide weather and climate simulations. Simply employing convolutional networks on a flat projection of the spherical signal is doomed to fail due to the varying distortions caused by such a projection, rendering translational weight sharing ineffective. In this manuscript, we lay out the foundational elements for architecting spherical CNNs. We present a formulation for the spherical cross-correlation that is both eloquent and rotation-equivalent. The spherical correlation adheres to a general Fourier principle, empowering us to efficiently compute it using a general (non-commutative) Fast Fourier Transform (FFT) technique. We showcase the computational efficiency, numerical preciseness, and efficiency of spherical CNNs in the realm of 3D model identification and atomization energy regression.",
    "This study highlights the utilization of natural language processing (NLP) techniques for addressing classification challenges in cheminformatics. The integration of these distinct fields is demonstrated through the analysis of the conventional textual representation of chemical compounds, known as Simplified Molecular Input Line Entry System (SMILES). The research focuses on activity prediction against a specific target protein, a pivotal aspect of the computer-aided drug design process. Experimental results indicate that the application of NLP methods not only surpasses current state-of-the-art outcomes achieved through manual feature engineering but also offers valuable structural insights regarding the decision-making process.",
    "Utilizing Computer Vision and Deep Learning technologies in Agriculture is intended to enhance the quality and productivity of harvests for farmers. The sorting of fruits and vegetables plays a crucial role in postharvest processes and market viability. Apples, in particular, are prone to various defects that may arise during harvesting or post-harvest stages. This study seeks to support farmers in post-harvest management by investigating the potential of contemporary computer vision and deep learning approaches, like YOLOv3 (Redmon & Farhadi (2018)), in identifying healthy apples among those with defects.",
    "We introduce two straightforward methods to decrease the number of parameters and speed up the training of extensive Long Short-Term Memory (LSTM) networks. The first method involves \"matrix factorization by design,\" where the LSTM matrix is broken down into the product of two smaller matrices. The second method is the partitioning of the LSTM matrix, its inputs, and states into independent groups. Both techniques enable the training of large LSTM networks much faster, nearing state-of-the-art perplexity levels, while utilizing notably fewer RNN parameters.",
    "Modern deep reading comprehension models are mostly built on recurrent neural networks, which are inherently sequential and well-suited for language processing. However, their lack of parallelization within instances can hinder deployment in time-sensitive scenarios, especially with longer texts. In this study, we propose a convolutional architecture as a viable alternative to traditional recurrent models. By employing straightforward dilated convolutional units instead of recurrent ones, we were able to achieve comparable results to current state-of-the-art models on two question answering tasks. Additionally, we achieved significant speed improvements of up to two orders of magnitude in question answering tasks.",
    "In this study, we examine the reinstatement process described by Ritter et al. (2018) and identify two types of neurons that appear in the working memory of the agent (an epLSTM cell) when it is trained with episodic meta-reinforcement learning on a version of the Harlow visual fixation task that involves episodes. More specifically, Abstract neurons store information that is common to multiple tasks, whereas Episodic neurons contain details specific to a particular task episode.",
    "The rate-distortion-perception function (RDPF), introduced by Blau and Michaeli in 2019, offers a versatile framework for examining realism and distortion in lossy compression. While similar to the rate-distortion function, it remains uncertain whether there are encoding and decoding mechanisms capable of attaining the rates implied by the RDPF. Following the findings of Li and El Gamal in 2018, our research demonstrates that stochastic, variable-length codes can effectively realize the RDPF. Moreover, we establish that for this category of codes, the RDPF serves as a lower-bound on the achievable rate.",
    "In this paper, we introduce Neural Phrase-based Machine Translation (NPMT). Our approach incorporates phrase structures into the output sequences by leveraging Sleep-WAke Networks (SWAN), a segmentation-based sequence modeling method. To address the strict alignment requirement of SWAN, we introduce a new layer that enables (soft) local reordering of input sequences. Unlike traditional neural machine translation (NMT) models, NPMT does not rely on attention-based decoding mechanisms. Instead, it directly generates phrases in a sequential manner, allowing for linear-time decoding. Our experiments demonstrate that NPMT outperforms strong NMT baselines on IWSLT 2014 German-English/English-German and IWSLT 2015 English-Vietnamese translation tasks. Additionally, our method generates coherent phrases in the target languages.",
    "This passage reveals the importance of utilizing sparse representations of input data as a strategic defense against adversarial perturbations that can lead to errors in deep neural networks. The study not only demonstrates the effectiveness of sparsity in linear classifiers against specific attacks but also introduces the concept of a \"locally linear\" model for developing theoretical approaches to both attacks and defenses in deep neural networks. The experimental findings with the MNIST dataset further support the effectiveness of the suggested sparsifying front end.",
    "We introduce a novel method, named Supervised Policy Update (SPU), for deep reinforcement learning that is highly efficient with samples. SPU utilizes data from the current policy to formulate and solve a constrained optimization problem within the non-parameterized proximal policy space. By employing supervised regression, SPU transforms the optimal non-parameterized policy into a parameterized one, allowing for the generation of new samples. This approach is versatile, accommodating both discrete and continuous action spaces and a diverse range of proximity constraints for the non-parameterized optimization. Our research demonstrates how this methodology can effectively tackle Natural Policy Gradient, Trust Region Policy Optimization (NPG/TRPO), and Proximal Policy Optimization (PPO) problems. Notably, the implementation of SPU is notably less complex than TRPO. Through comprehensive experiments, we illustrate that SPU outperforms TRPO in Mujoco simulated robotic tasks and surpasses PPO in Atari video game tasks in terms of sample efficiency.",
    "We introduce a synthetic dataset, Moving Symbols, to analyze video prediction networks objectively. By controlling variations in the dataset, we identify shortcomings in a leading approach and suggest a more meaningful performance metric for better experimental interpretation. Our dataset offers standardized test cases to enhance comprehension and enhance the learned representations of these networks. Access the code at https://github.com/rszeto/moving-symbols."
  ]
}