{
  "original": [
    "In this report, we describe a Theano-based AlexNet (Krizhevsky et al., 2012) implementation and its naive data parallelism on multiple GPUs. Our performance on 2 GPUs is comparable with the state-of-art Caffe library (Jia et al., 2014) run on 1 GPU. To the best of our knowledge, this is the first open-source Python-based AlexNet implementation to-date.",
    "We show that deep narrow Boltzmann machines are universal approximators of probability distributions on the activities of their visible units, provided they have sufficiently many hidden layers, each containing the same number of units as the visible layer. We show that, within certain parameter domains, deep Boltzmann machines can be studied as feedforward networks. We provide upper and lower bounds on the sufficient depth and width of universal approximators. These results settle various intuitions regarding undirected networks and, in particular, they show that deep narrow Boltzmann machines are at least as compact universal approximators as narrow sigmoid belief networks and restricted Boltzmann machines, with respect to the currently available bounds for those models.",
    "Leveraging advances in variational inference, we propose to enhance recurrent neural networks with latent variables, resulting in Stochastic Recurrent Networks (STORNs). The model i) can be trained with stochastic gradient methods, ii) allows structured and multi-modal conditionals at each time step, iii) features a reliable estimator of the marginal likelihood and iv) is a generalisation of deterministic recurrent neural networks. We evaluate the method on four polyphonic musical data sets and motion capture data.",
    "We describe a general framework for online adaptation of optimization hyperparameters by `hot swapping' their values during learning. We investigate this approach in the context of adaptive learning rate selection using an explore-exploit strategy from the multi-armed bandit literature. Experiments on a benchmark neural network show that the hot swapping approach leads to consistently better solutions compared to well-known alternatives such as AdaDelta and stochastic gradient with exhaustive hyperparameter search.",
    "Many modern multiclass and multilabel problems are characterized by increasingly large output spaces. For these problems, label embeddings have been shown to be a useful primitive that can improve computational and statistical efficiency. In this work we utilize a correspondence between rank constrained estimation and low dimensional label embeddings that uncovers a fast label embedding algorithm which works in both the multiclass and multilabel settings. The result is a randomized algorithm for partial least squares, whose running time is exponentially faster than naive algorithms. We demonstrate our techniques on two large-scale public datasets, from the Large Scale Hierarchical Text Challenge and the Open Directory Project, where we obtain state of the art results.",
    "Accurate representational learning of both the explicit and implicit relationships within data is critical to the ability of machines to perform more complex and abstract reasoning tasks. We describe the efficient weakly supervised learning of such inferences by our Dynamic Adaptive Network Intelligence (DANI) model. We report state-of-the-art results for DANI over question answering tasks in the bAbI dataset that have proved difficult for contemporary approaches to learning representation (Weston et al., 2015).",
    "Automatic speech recognition systems usually rely on spectral-based features, such as MFCC of PLP. These features are extracted based on prior knowledge such as, speech perception or/and speech production. Recently, convolutional neural networks have been shown to be able to estimate phoneme conditional probabilities in a completely data-driven manner, i.e. using directly temporal raw speech signal as input. This system was shown to yield similar or better performance than HMM/ANN based system on phoneme recognition task and on large scale continuous speech recognition task, using less parameters. Motivated by these studies, we investigate the use of simple linear classifier in the CNN-based framework. Thus, the network learns linearly separable features from raw speech. We show that such system yields similar or better performance than MLP based system using cepstral-based features as input.",
    "We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machines. In order to be as hardware-agnostic as possible, we needed a way to use multiple machines without generating excessive network traffic. Our method is to average the neural network parameters periodically (typically every minute or two), and redistribute the averaged parameters to the machines for further training. Each machine sees different data. By itself, this method does not work very well. However, we have another method, an approximate and efficient implementation of Natural Gradient for Stochastic Gradient Descent (NG-SGD), which seems to allow our periodic-averaging method to work well, as well as substantially improving the convergence of SGD on a single machine.",
    "We develop a new method for visualizing and refining the invariances of learned representations. Specifically, we test for a general form of invariance, linearization, in which the action of a transformation is confined to a low-dimensional subspace. Given two reference images (typically, differing by some transformation), we synthesize a sequence of images lying on a path between them that is of minimal length in the space of the representation (a \"representational geodesic\"). If the transformation relating the two reference images is linearized by the representation, this sequence should follow the gradual evolution of this transformation. We use this method to assess the invariance properties of a state-of-the-art image classification network and find that geodesics generated for image pairs differing by translation, rotation, and dilation do not evolve according to their associated transformations. Our method also suggests a remedy for these failures, and following this prescription, we show that the modified representation is able to linearize a variety of geometric image transformations.",
    "Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning.   One factor behind the recent resurgence of the subject is a key algorithmic step called {\\em pretraining}: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of {\\em shadow} groups whose elements serve as close approximations.   Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the {\\em simplest}. Which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.",
    "We present a novel architecture, the \"stacked what-where auto-encoders\" (SWWAE), which integrates discriminative and generative pathways and provides a unified approach to supervised, semi-supervised and unsupervised learning without relying on sampling during training. An instantiation of SWWAE uses a convolutional net (Convnet) (LeCun et al. (1998)) to encode the input, and employs a deconvolutional net (Deconvnet) (Zeiler et al. (2010)) to produce the reconstruction. The objective function includes reconstruction terms that induce the hidden states in the Deconvnet to be similar to those of the Convnet. Each pooling layer produces two sets of variables: the \"what\" which are fed to the next layer, and its complementary variable \"where\" that are fed to the corresponding layer in the generative decoder.",
    "We investigate the problem of inducing word embeddings that are tailored for a particular bilexical relation. Our learning algorithm takes an existing lexical vector space and compresses it such that the resulting word embeddings are good predictors for a target bilexical relation. In experiments we show that task-specific embeddings can benefit both the quality and efficiency in lexical prediction tasks.",
    "A generative model is developed for deep (multi-layered) convolutional dictionary learning. A novel probabilistic pooling operation is integrated into the deep model, yielding efficient bottom-up (pretraining) and top-down (refinement) probabilistic learning. Experimental results demonstrate powerful capabilities of the model to learn multi-layer features from images, and excellent classification results are obtained on the MNIST and Caltech 101 datasets.",
    "Motivated by the recent progress in generative models, we introduce a model that generates images from natural language descriptions. The proposed model iteratively draws patches on a canvas, while attending to the relevant words in the description. After training on Microsoft COCO, we compare our model with several baseline generative models on image generation and retrieval tasks. We demonstrate that our model produces higher quality samples than other approaches and generates images with novel scene compositions corresponding to previously unseen captions in the dataset.",
    "Convolutional neural networks (CNNs) work well on large datasets. But labelled data is hard to collect, and in some applications larger amounts of data are not available. The problem then is how to use CNNs with small data -- as CNNs overfit quickly. We present an efficient Bayesian CNN, offering better robustness to over-fitting on small data than traditional approaches. This is by placing a probability distribution over the CNN's kernels. We approximate our model's intractable posterior with Bernoulli variational distributions, requiring no additional model parameters.   On the theoretical side, we cast dropout network training as approximate inference in Bayesian neural networks. This allows us to implement our model using existing tools in deep learning with no increase in time complexity, while highlighting a negative result in the field. We show a considerable improvement in classification accuracy compared to standard techniques and improve on published state-of-the-art results for CIFAR-10.",
    "We propose a new method for creating computationally efficient convolutional neural networks (CNNs) by using low-rank representations of convolutional filters. Rather than approximating filters in previously-trained networks with more efficient versions, we learn a set of small basis filters from scratch; during training, the network learns to combine these basis filters into more complex filters that are discriminative for image classification. To train such networks, a novel weight initialization scheme is used. This allows effective initialization of connection weights in convolutional layers composed of groups of differently-shaped filters. We validate our approach by applying it to several existing CNN architectures and training these networks from scratch using the CIFAR, ILSVRC and MIT Places datasets. Our results show similar or higher accuracy than conventional CNNs with much less compute. Applying our method to an improved version of VGG-11 network using global max-pooling, we achieve comparable validation accuracy using 41% less compute and only 24% of the original VGG-11 model parameters; another variant of our method gives a 1 percentage point increase in accuracy over our improved VGG-11 model, giving a top-5 center-crop validation accuracy of 89.7% while reducing computation by 16% relative to the original VGG-11 model. Applying our method to the GoogLeNet architecture for ILSVRC, we achieved comparable accuracy with 26% less compute and 41% fewer model parameters. Applying our method to a near state-of-the-art network for CIFAR, we achieved comparable accuracy with 46% less compute and 55% fewer parameters.",
    "Distributed representations of words have boosted the performance of many Natural Language Processing tasks. However, usually only one representation per word is obtained, not acknowledging the fact that some words have multiple meanings. This has a negative effect on the individual word representations and the language model as a whole. In this paper we present a simple model that enables recent techniques for building word vectors to represent distinct senses of polysemic words. In our assessment of this model we show that it is able to effectively discriminate between words' senses and to do so in a computationally efficient manner.",
    "We propose Diverse Embedding Neural Network (DENN), a novel architecture for language models (LMs). A DENNLM projects the input word history vector onto multiple diverse low-dimensional sub-spaces instead of a single higher-dimensional sub-space as in conventional feed-forward neural network LMs. We encourage these sub-spaces to be diverse during network training through an augmented loss function. Our language modeling experiments on the Penn Treebank data set show the performance benefit of using a DENNLM.",
    "A standard approach to Collaborative Filtering (CF), i.e. prediction of user ratings on items, relies on Matrix Factorization techniques. Representations for both users and items are computed from the observed ratings and used for prediction. Unfortunatly, these transductive approaches cannot handle the case of new users arriving in the system, with no known rating, a problem known as user cold-start. A common approach in this context is to ask these incoming users for a few initialization ratings. This paper presents a model to tackle this twofold problem of (i) finding good questions to ask, (ii) building efficient representations from this small amount of information. The model can also be used in a more standard (warm) context. Our approach is evaluated on the classical CF problem and on the cold-start problem on four different datasets showing its ability to improve baseline performance in both cases.",
    "We propose a deep learning framework for modeling complex high-dimensional densities called Non-linear Independent Component Estimation (NICE). It is based on the idea that a good representation is one in which the data has a distribution that is easy to model. For this purpose, a non-linear deterministic transformation of the data is learned that maps it to a latent space so as to make the transformed data conform to a factorized distribution, i.e., resulting in independent latent variables. We parametrize this transformation so that computing the Jacobian determinant and inverse transform is trivial, yet we maintain the ability to learn complex non-linear transformations, via a composition of simple building blocks, each based on a deep neural network. The training criterion is simply the exact log-likelihood, which is tractable. Unbiased ancestral sampling is also easy. We show that this approach yields good generative models on four image datasets and can be used for inpainting.",
    "We introduce Deep Linear Discriminant Analysis (DeepLDA) which learns linearly separable latent representations in an end-to-end fashion. Classic LDA extracts features which preserve class separability and is used for dimensionality reduction for many classification problems. The central idea of this paper is to put LDA on top of a deep neural network. This can be seen as a non-linear extension of classic LDA. Instead of maximizing the likelihood of target labels for individual samples, we propose an objective function that pushes the network to produce feature distributions which: (a) have low variance within the same class and (b) high variance between different classes. Our objective is derived from the general LDA eigenvalue problem and still allows to train with stochastic gradient descent and back-propagation. For evaluation we test our approach on three different benchmark datasets (MNIST, CIFAR-10 and STL-10). DeepLDA produces competitive results on MNIST and CIFAR-10 and outperforms a network trained with categorical cross entropy (same architecture) on a supervised setting of STL-10.",
    "Layer-sequential unit-variance (LSUV) initialization - a simple method for weight initialization for deep net learning - is proposed. The method consists of the two steps. First, pre-initialize weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one.   Experiment with different activation functions (maxout, ReLU-family, tanh) show that the proposed initialization leads to learning of very deep nets that (i) produces networks with test accuracy better or equal to standard methods and (ii) is at least as fast as the complex schemes proposed specifically for very deep nets such as FitNets (Romero et al. (2015)) and Highway (Srivastava et al. (2015)).   Performance is evaluated on GoogLeNet, CaffeNet, FitNets and Residual nets and the state-of-the-art, or very close to it, is achieved on the MNIST, CIFAR-10/100 and ImageNet datasets.",
    "We introduce a parametric nonlinear transformation that is well-suited for Gaussianizing data from natural images. The data are linearly transformed, and each component is then normalized by a pooled activity measure, computed by exponentiating a weighted sum of rectified and exponentiated components and a constant. We optimize the parameters of the full transformation (linear transform, exponents, weights, constant) over a database of natural images, directly minimizing the negentropy of the responses. The optimized transformation substantially Gaussianizes the data, achieving a significantly smaller mutual information between transformed components than alternative methods including ICA and radial Gaussianization. The transformation is differentiable and can be efficiently inverted, and thus induces a density model on images. We show that samples of this model are visually similar to samples of natural image patches. We demonstrate the use of the model as a prior probability density that can be used to remove additive noise. Finally, we show that the transformation can be cascaded, with each layer optimized using the same Gaussianization objective, thus offering an unsupervised method of optimizing a deep network architecture.",
    "We present flattened convolutional neural networks that are designed for fast feedforward execution. The redundancy of the parameters, especially weights of the convolutional filters in convolutional neural networks has been extensively studied and different heuristics have been proposed to construct a low rank basis of the filters after training. In this work, we train flattened networks that consist of consecutive sequence of one-dimensional filters across all directions in 3D space to obtain comparable performance as conventional convolutional networks. We tested flattened model on different datasets and found that the flattened layer can effectively substitute for the 3D filters without loss of accuracy. The flattened convolution pipelines provide around two times speed-up during feedforward pass compared to the baseline model due to the significant reduction of learning parameters. Furthermore, the proposed method does not require efforts in manual tuning or post processing once the model is trained.",
    "In this paper, we introduce a novel deep learning framework, termed Purine. In Purine, a deep network is expressed as a bipartite graph (bi-graph), which is composed of interconnected operators and data tensors. With the bi-graph abstraction, networks are easily solvable with event-driven task dispatcher. We then demonstrate that different parallelism schemes over GPUs and/or CPUs on single or multiple PCs can be universally implemented by graph composition. This eases researchers from coding for various parallelization schemes, and the same dispatcher can be used for solving variant graphs. Scheduled by the task dispatcher, memory transfers are fully overlapped with other computations, which greatly reduce the communication overhead and help us achieve approximate linear acceleration.",
    "In this paper we propose a model that combines the strengths of RNNs and SGVB: the Variational Recurrent Auto-Encoder (VRAE). Such a model can be used for efficient, large scale unsupervised learning on time series data, mapping the time series data to a latent vector representation. The model is generative, such that data can be generated from samples of the latent space. An important contribution of this work is that the model can make use of unlabeled data in order to facilitate supervised training of RNNs by initialising the weights and network state.",
    "Current work in lexical distributed representations maps each word to a point vector in low-dimensional space. Mapping instead to a density provides many interesting advantages, including better capturing uncertainty about a representation and its relationships, expressing asymmetries more naturally than dot product or cosine similarity, and enabling more expressive parameterization of decision boundaries. This paper advocates for density-based distributed embeddings and presents a method for learning representations in the space of Gaussian distributions. We compare performance on various word embedding benchmarks, investigate the ability of these embeddings to model entailment and other asymmetric relationships, and explore novel properties of the representation.",
    "Multipliers are the most space and power-hungry arithmetic operators of the digital implementation of deep neural networks. We train a set of state-of-the-art neural networks (Maxout networks) on three benchmark datasets: MNIST, CIFAR-10 and SVHN. They are trained with three distinct formats: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those formats, we assess the impact of the precision of the multiplications on the final error after training. We find that very low precision is sufficient not just for running trained networks but also for training them. For example, it is possible to train Maxout networks with 10 bits multiplications.",
    "Multiple instance learning (MIL) can reduce the need for costly annotation in tasks such as semantic segmentation by weakening the required degree of supervision. We propose a novel MIL formulation of multi-class semantic segmentation learning by a fully convolutional network. In this setting, we seek to learn a semantic segmentation model from just weak image-level labels. The model is trained end-to-end to jointly optimize the representation while disambiguating the pixel-image label assignment. Fully convolutional training accepts inputs of any size, does not need object proposal pre-processing, and offers a pixelwise loss map for selecting latent instances. Our multi-class MIL loss exploits the further supervision given by images with multiple labels. We evaluate this approach through preliminary experiments on the PASCAL VOC segmentation challenge.",
    "Recently, nested dropout was proposed as a method for ordering representation units in autoencoders by their information content, without diminishing reconstruction cost. However, it has only been applied to training fully-connected autoencoders in an unsupervised setting. We explore the impact of nested dropout on the convolutional layers in a CNN trained by backpropagation, investigating whether nested dropout can provide a simple and systematic way to determine the optimal representation size with respect to the desired accuracy and desired task and data complexity.",
    "Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.",
    "When a three-dimensional object moves relative to an observer, a change occurs on the observer's image plane and in the visual representation computed by a learned model. Starting with the idea that a good visual representation is one that transforms linearly under scene motions, we show, using the theory of group representations, that any such representation is equivalent to a combination of the elementary irreducible representations. We derive a striking relationship between irreducibility and the statistical dependency structure of the representation, by showing that under restricted conditions, irreducible representations are decorrelated. Under partial observability, as induced by the perspective projection of a scene onto the image plane, the motion group does not have a linear action on the space of images, so that it becomes necessary to perform inference over a latent representation that does transform linearly. This idea is demonstrated in a model of rotating NORB objects that employs a latent representation of the non-commutative 3D rotation group SO(3).",
    "Efficient Maximum Inner Product Search (MIPS) is an important task that has a wide applicability in recommendation systems and classification with a large number of classes. Solutions based on locality-sensitive hashing (LSH) as well as tree-based solutions have been investigated in the recent literature, to perform approximate MIPS in sublinear time. In this paper, we compare these to another extremely simple approach for solving approximate MIPS, based on variants of the k-means clustering algorithm. Specifically, we propose to train a spherical k-means, after having reduced the MIPS problem to a Maximum Cosine Similarity Search (MCSS). Experiments on two standard recommendation system benchmarks as well as on large vocabulary word embeddings, show that this simple approach yields much higher speedups, for the same retrieval precision, than current state-of-the-art hashing-based and tree-based methods. This simple method also yields more robust retrievals when the query is corrupted by noise.",
    "The variational autoencoder (VAE; Kingma, Welling (2014)) is a recently proposed generative model pairing a top-down generative network with a bottom-up recognition network which approximates posterior inference. It typically makes strong assumptions about posterior inference, for instance that the posterior distribution is approximately factorial, and that its parameters can be approximated with nonlinear regression from the observations. As we show empirically, the VAE objective can lead to overly simplified representations which fail to use the network's entire modeling capacity. We present the importance weighted autoencoder (IWAE), a generative model with the same architecture as the VAE, but which uses a strictly tighter log-likelihood lower bound derived from importance weighting. In the IWAE, the recognition network uses multiple samples to approximate the posterior, giving it increased flexibility to model complex posteriors which do not fit the VAE modeling assumptions. We show empirically that IWAEs learn richer latent space representations than VAEs, leading to improved test log-likelihood on density estimation benchmarks.",
    "This work investigates how using reduced precision data in Convolutional Neural Networks (CNNs) affects network accuracy during classification. More specifically, this study considers networks where each layer may use different precision data. Our key result is the observation that the tolerance of CNNs to reduced precision data not only varies across networks, a well established observation, but also within networks. Tuning precision per layer is appealing as it could enable energy and performance improvements. In this paper we study how error tolerance across layers varies and propose a method for finding a low precision configuration for a network while maintaining high accuracy. A diverse set of CNNs is analyzed showing that compared to a conventional implementation using a 32-bit floating-point representation for all layers, and with less than 1% loss in relative accuracy, the data footprint required by these networks can be reduced by an average of 74% and up to 92%.",
    "The efficiency of graph-based semi-supervised algorithms depends on the graph of instances on which they are applied. The instances are often in a vectorial form before a graph linking them is built. The construction of the graph relies on a metric over the vectorial space that help define the weight of the connection between entities. The classic choice for this metric is usually a distance measure or a similarity measure based on the euclidean norm. We claim that in some cases the euclidean norm on the initial vectorial space might not be the more appropriate to solve the task efficiently. We propose an algorithm that aims at learning the most appropriate vectorial representation for building a graph on which the task at hand is solved efficiently.",
    "Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images. In this paper we advocate for explicitly modeling the partial order structure of this hierarchy. Towards this goal, we introduce a general method for learning ordered representations, and show how it can be applied to a variety of tasks involving images and language. We show that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval.",
    "We propose local distributional smoothness (LDS), a new notion of smoothness for statistical model that can be used as a regularization term to promote the smoothness of the model distribution. We named the LDS based regularization as virtual adversarial training (VAT). The LDS of a model at an input datapoint is defined as the KL-divergence based robustness of the model distribution against local perturbation around the datapoint. VAT resembles adversarial training, but distinguishes itself in that it determines the adversarial direction from the model distribution alone without using the label information, making it applicable to semi-supervised learning. The computational cost for VAT is relatively low. For neural network, the approximated gradient of the LDS can be computed with no more than three pairs of forward and back propagations. When we applied our technique to supervised and semi-supervised learning for the MNIST dataset, it outperformed all the training methods other than the current state of the art method, which is based on a highly advanced generative model. We also applied our method to SVHN and NORB, and confirmed our method's superior performance over the current state of the art semi-supervised method applied to these datasets.",
    "The availability of large labeled datasets has allowed Convolutional Network models to achieve impressive recognition results. However, in many settings manual annotation of the data is impractical; instead our data has noisy labels, i.e. there is some freely available label for each image which may or may not be accurate. In this paper, we explore the performance of discriminatively-trained Convnets when trained on such noisy data. We introduce an extra noise layer into the network which adapts the network outputs to match the noisy label distribution. The parameters of this noise layer can be estimated as part of the training process and involve simple modifications to current training infrastructures for deep networks. We demonstrate the approaches on several datasets, including large scale experiments on the ImageNet classification benchmark.",
    "We provide novel guaranteed approaches for training feedforward neural networks with sparse connectivity. We leverage on the techniques developed previously for learning linear networks and show that they can also be effectively adopted to learn non-linear networks. We operate on the moments involving label and the score function of the input, and show that their factorization provably yields the weight matrix of the first layer of a deep network under mild conditions. In practice, the output of our method can be employed as effective initializers for gradient descent.",
    "Discourse relations bind smaller linguistic elements into coherent texts. However, automatically identifying discourse relations is difficult, because it requires understanding the semantics of the linked sentences. A more subtle challenge is that it is not enough to represent the meaning of each sentence of a discourse relation, because the relation may depend on links between lower-level elements, such as entity mentions. Our solution computes distributional meaning representations by composition up the syntactic parse tree. A key difference from previous work on compositional distributional semantics is that we also compute representations for entity mentions, using a novel downward compositional pass. Discourse relations are predicted not only from the distributional representations of the sentences, but also of their coreferent entity mentions. The resulting system obtains substantial improvements over the previous state-of-the-art in predicting implicit discourse relations in the Penn Discourse Treebank.",
    "In this work, we propose a new method to integrate two recent lines of work: unsupervised induction of shallow semantics (e.g., semantic roles) and factorization of relations in text and knowledge bases. Our model consists of two components: (1) an encoding component: a semantic role labeling model which predicts roles given a rich set of syntactic and lexical features; (2) a reconstruction component: a tensor factorization model which relies on roles to predict argument fillers. When the components are estimated jointly to minimize errors in argument reconstruction, the induced roles largely correspond to roles defined in annotated resources. Our method performs on par with most accurate role induction methods on English, even though, unlike these previous approaches, we do not incorporate any prior linguistic knowledge about the language.",
    "The notion of metric plays a key role in machine learning problems such as classification, clustering or ranking. However, it is worth noting that there is a severe lack of theoretical guarantees that can be expected on the generalization capacity of the classifier associated to a given metric. The theoretical framework of $(\\epsilon, \\gamma, \\tau)$-good similarity functions (Balcan et al., 2008) has been one of the first attempts to draw a link between the properties of a similarity function and those of a linear classifier making use of it. In this paper, we extend and complete this theory by providing a new generalization bound for the associated classifier based on the algorithmic robustness framework.",
    "We present the multiplicative recurrent neural network as a general model for compositional meaning in language, and evaluate it on the task of fine-grained sentiment analysis. We establish a connection to the previously investigated matrix-space models for compositionality, and show they are special cases of the multiplicative recurrent net. Our experiments show that these models perform comparably or better than Elman-type additive recurrent neural networks and outperform matrix-space models on a standard fine-grained sentiment analysis corpus. Furthermore, they yield comparable results to structural deep models on the recently published Stanford Sentiment Treebank without the need for generating parse trees.",
    "Finding minima of a real valued non-convex function over a high dimensional space is a major challenge in science. We provide evidence that some such functions that are defined on high dimensional domains have a narrow band of values whose pre-image contains the bulk of its critical points. This is in contrast with the low dimensional picture in which this band is wide. Our simulations agree with the previous theoretical work on spin glasses that proves the existence of such a band when the dimension of the domain tends to infinity. Furthermore our experiments on teacher-student networks with the MNIST dataset establish a similar phenomenon in deep networks. We finally observe that both the gradient descent and the stochastic gradient descent methods can reach this level within the same number of steps.",
    "We develop a new statistical model for photographic images, in which the local responses of a bank of linear filters are described as jointly Gaussian, with zero mean and a covariance that varies slowly over spatial position. We optimize sets of filters so as to minimize the nuclear norms of matrices of their local activations (i.e., the sum of the singular values), thus encouraging a flexible form of sparsity that is not tied to any particular dictionary or coordinate system. Filters optimized according to this objective are oriented and bandpass, and their responses exhibit substantial local correlation. We show that images can be reconstructed nearly perfectly from estimates of the local filter response covariances alone, and with minimal degradation (either visual or MSE) from low-rank approximations of these covariances. As such, this representation holds much promise for use in applications such as denoising, compression, and texture representation, and may form a useful substrate for hierarchical decompositions.",
    "Most modern convolutional neural networks (CNNs) used for object recognition are built using the same principles: Alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks. Following this finding -- and building on other recent work for finding simple network structures -- we propose a new architecture that consists solely of convolutional layers and yields competitive or state of the art performance on several object recognition datasets (CIFAR-10, CIFAR-100, ImageNet). To analyze the network we introduce a new variant of the \"deconvolution approach\" for visualizing features learned by CNNs, which can be applied to a broader range of network structures than existing approaches.",
    "Artificial neural networks typically have a fixed, non-linear activation function at each neuron. We have designed a novel form of piecewise linear activation function that is learned independently for each neuron using gradient descent. With this adaptive activation function, we are able to improve upon deep neural network architectures composed of static rectified linear units, achieving state-of-the-art performance on CIFAR-10 (7.51%), CIFAR-100 (30.83%), and a benchmark from high-energy physics involving Higgs boson decay modes.",
    "This paper introduces a greedy parser based on neural networks, which leverages a new compositional sub-tree representation. The greedy parser and the compositional procedure are jointly trained, and tightly depends on each-other. The composition procedure outputs a vector representation which summarizes syntactically (parsing tags) and semantically (words) sub-trees. Composition and tagging is achieved over continuous (word or tag) representations, and recurrent neural networks. We reach F1 performance on par with well-known existing parsers, while having the advantage of speed, thanks to the greedy nature of the parser. We provide a fully functional implementation of the method described in this paper.",
    "Suitable lateral connections between encoder and decoder are shown to allow higher layers of a denoising autoencoder (dAE) to focus on invariant representations. In regular autoencoders, detailed information needs to be carried through the highest layers but lateral connections from encoder to decoder relieve this pressure. It is shown that abstract invariant features can be translated to detailed reconstructions when invariant features are allowed to modulate the strength of the lateral connection. Three dAE structures with modulated and additive lateral connections, and without lateral connections were compared in experiments using real-world images. The experiments verify that adding modulated lateral connections to the model 1) improves the accuracy of the probability model for inputs, as measured by denoising performance; 2) results in representations whose degree of invariance grows faster towards the higher layers; and 3) supports the formation of diverse invariant poolings.",
    "We develop a new method for visualizing and refining the invariances of learned representations. Specifically, we test for a general form of invariance, linearization, in which the action of a transformation is confined to a low-dimensional subspace. Given two reference images (typically, differing by some transformation), we synthesize a sequence of images lying on a path between them that is of minimal length in the space of the representation (a \"representational geodesic\"). If the transformation relating the two reference images is linearized by the representation, this sequence should follow the gradual evolution of this transformation. We use this method to assess the invariance properties of a state-of-the-art image classification network and find that geodesics generated for image pairs differing by translation, rotation, and dilation do not evolve according to their associated transformations. Our method also suggests a remedy for these failures, and following this prescription, we show that the modified representation is able to linearize a variety of geometric image transformations.",
    "Genomics are rapidly transforming medical practice and basic biomedical research, providing insights into disease mechanisms and improving therapeutic strategies, particularly in cancer. The ability to predict the future course of a patient's disease from high-dimensional genomic profiling will be essential in realizing the promise of genomic medicine, but presents significant challenges for state-of-the-art survival analysis methods. In this abstract we present an investigation in learning genomic representations with neural networks to predict patient survival in cancer. We demonstrate the advantages of this approach over existing survival analysis methods using brain tumor data.",
    "Existing approaches to combine both additive and multiplicative neural units either use a fixed assignment of operations or require discrete optimization to determine what function a neuron should perform. However, this leads to an extensive increase in the computational complexity of the training procedure.   We present a novel, parameterizable transfer function based on the mathematical concept of non-integer functional iteration that allows the operation each neuron performs to be smoothly and, most importantly, differentiablely adjusted between addition and multiplication. This allows the decision between addition and multiplication to be integrated into the standard backpropagation training procedure.",
    "One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding / gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaing isometry, one exact and one stochastic. Preliminary experiments show that for both determinant and scale-normalization effectively speeds up learning. Results suggest that isometry is important in the beginning of learning, and maintaining it leads to faster learning.",
    "We extend Stochastic Gradient Variational Bayes to perform posterior inference for the weights of Stick-Breaking processes. This development allows us to define a Stick-Breaking Variational Autoencoder (SB-VAE), a Bayesian nonparametric version of the variational autoencoder that has a latent representation with stochastic dimensionality. We experimentally demonstrate that the SB-VAE, and a semi-supervised variant, learn highly discriminative latent representations that often outperform the Gaussian VAE's.",
    "Unsupervised learning on imbalanced data is challenging because, when given imbalanced data, current model is often dominated by the major category and ignores the categories with small amount of data. We develop a latent variable model that can cope with imbalanced data by dividing the latent space into a shared space and a private space. Based on Gaussian Process Latent Variable Models, we propose a new kernel formulation that enables the separation of latent space and derives an efficient variational inference method. The performance of our model is demonstrated with an imbalanced medical image dataset.",
    "Generative adversarial networks (GANs) are successful deep generative models. GANs are based on a two-player minimax game. However, the objective function derived in the original motivation is changed to obtain stronger gradients when learning the generator. We propose a novel algorithm that repeats the density ratio estimation and f-divergence minimization. Our algorithm offers a new perspective toward the understanding of GANs and is able to make use of multiple viewpoints obtained in the research of density ratio estimation, e.g. what divergence is stable and relative density ratio is useful.",
    "This paper shows how one can directly apply natural language processing (NLP) methods to classification problems in cheminformatics. Connection between these seemingly separate fields is shown by considering standard textual representation of compound, SMILES. The problem of activity prediction against a target protein is considered, which is a crucial part of computer aided drug design process. Conducted experiments show that this way one can not only outrank state of the art results of hand crafted representations but also gets direct structural insights into the way decisions are made.",
    "We introduce a neural network architecture and a learning algorithm to produce factorized symbolic representations. We propose to learn these concepts by observing consecutive frames, letting all the components of the hidden representation except a small discrete set (gating units) be predicted from the previous frame, and let the factors of variation in the next frame be represented entirely by these discrete gated units (corresponding to symbolic representations). We demonstrate the efficacy of our approach on datasets of faces undergoing 3D transformations and Atari 2600 games.",
    "We look at the eigenvalues of the Hessian of a loss function before and after training. The eigenvalue distribution is seen to be composed of two parts, the bulk which is concentrated around zero, and the edges which are scattered away from zero. We present empirical evidence for the bulk indicating how over-parametrized the system is, and for the edges that depend on the input data.",
    "We introduce a parametric nonlinear transformation that is well-suited for Gaussianizing data from natural images. The data are linearly transformed, and each component is then normalized by a pooled activity measure, computed by exponentiating a weighted sum of rectified and exponentiated components and a constant. We optimize the parameters of the full transformation (linear transform, exponents, weights, constant) over a database of natural images, directly minimizing the negentropy of the responses. The optimized transformation substantially Gaussianizes the data, achieving a significantly smaller mutual information between transformed components than alternative methods including ICA and radial Gaussianization. The transformation is differentiable and can be efficiently inverted, and thus induces a density model on images. We show that samples of this model are visually similar to samples of natural image patches. We demonstrate the use of the model as a prior probability density that can be used to remove additive noise. Finally, we show that the transformation can be cascaded, with each layer optimized using the same Gaussianization objective, thus offering an unsupervised method of optimizing a deep network architecture.",
    "Approximate variational inference has shown to be a powerful tool for modeling unknown complex probability distributions. Recent advances in the field allow us to learn probabilistic models of sequences that actively exploit spatial and temporal structure. We apply a Stochastic Recurrent Network (STORN) to learn robot time series data. Our evaluation demonstrates that we can robustly detect anomalies both off- and on-line.",
    "We develop a general problem setting for training and testing the ability of agents to gather information efficiently. Specifically, we present a collection of tasks in which success requires searching through a partially-observed environment, for fragments of information which can be pieced together to accomplish various goals. We combine deep architectures with techniques from reinforcement learning to develop agents that solve our tasks. We shape the behavior of these agents by combining extrinsic and intrinsic rewards. We empirically demonstrate that these agents learn to search actively and intelligently for new information to reduce their uncertainty, and to exploit information they have already acquired.",
    "We propose an extension to neural network language models to adapt their prediction to the recent history. Our model is a simplified version of memory augmented networks, which stores past hidden activations as memory and accesses them through a dot product with the current hidden activation. This mechanism is very efficient and scales to very large memory sizes. We also draw a link between the use of external memory in neural network and cache models used with count based language models. We demonstrate on several language model datasets that our approach performs significantly better than recent memory augmented networks.",
    "Motivated by the recent progress in generative models, we introduce a model that generates images from natural language descriptions. The proposed model iteratively draws patches on a canvas, while attending to the relevant words in the description. After training on Microsoft COCO, we compare our model with several baseline generative models on image generation and retrieval tasks. We demonstrate that our model produces higher quality samples than other approaches and generates images with novel scene compositions corresponding to previously unseen captions in the dataset.",
    "We propose a framework for training multiple neural networks simultaneously. The parameters from all models are regularised by the tensor trace norm, so that each neural network is encouraged to reuse others' parameters if possible -- this is the main motivation behind multi-task learning. In contrast to many deep multi-task learning models, we do not predefine a parameter sharing strategy by specifying which layers have tied parameters. Instead, our framework considers sharing for all shareable layers, and the sharing strategy is learned in a data-driven way.",
    "This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems. To achieve this, the paper introduces several innovations, including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.",
    "We present a novel framework for generating pop music. Our model is a hierarchical Recurrent Neural Network, where the layers and the structure of the hierarchy encode our prior knowledge about how pop music is composed. In particular, the bottom layers generate the melody, while the higher levels produce the drums and chords. We conduct several human studies that show strong preference of our generated music over that produced by the recent method by Google. We additionally show two applications of our framework: neural dancing and karaoke, as well as neural story singing.",
    "Many machine learning classifiers are vulnerable to adversarial perturbations. An adversarial perturbation modifies an input to change a classifier's prediction without causing the input to seem substantially different to human perception. We deploy three methods to detect adversarial images. Adversaries trying to bypass our detectors must make the adversarial image less pathological or they will fail trying. Our best detection method reveals that adversarial images place abnormal emphasis on the lower-ranked principal components from PCA. Other detectors and a colorful saliency map are in an appendix.",
    "We propose a new method for creating computationally efficient convolutional neural networks (CNNs) by using low-rank representations of convolutional filters. Rather than approximating filters in previously-trained networks with more efficient versions, we learn a set of small basis filters from scratch; during training, the network learns to combine these basis filters into more complex filters that are discriminative for image classification. To train such networks, a novel weight initialization scheme is used. This allows effective initialization of connection weights in convolutional layers composed of groups of differently-shaped filters. We validate our approach by applying it to several existing CNN architectures and training these networks from scratch using the CIFAR, ILSVRC and MIT Places datasets. Our results show similar or higher accuracy than conventional CNNs with much less compute. Applying our method to an improved version of VGG-11 network using global max-pooling, we achieve comparable validation accuracy using 41% less compute and only 24% of the original VGG-11 model parameters; another variant of our method gives a 1 percentage point increase in accuracy over our improved VGG-11 model, giving a top-5 center-crop validation accuracy of 89.7% while reducing computation by 16% relative to the original VGG-11 model. Applying our method to the GoogLeNet architecture for ILSVRC, we achieved comparable accuracy with 26% less compute and 41% fewer model parameters. Applying our method to a near state-of-the-art network for CIFAR, we achieved comparable accuracy with 46% less compute and 55% fewer parameters.",
    "Layer-sequential unit-variance (LSUV) initialization - a simple method for weight initialization for deep net learning - is proposed. The method consists of the two steps. First, pre-initialize weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one.   Experiment with different activation functions (maxout, ReLU-family, tanh) show that the proposed initialization leads to learning of very deep nets that (i) produces networks with test accuracy better or equal to standard methods and (ii) is at least as fast as the complex schemes proposed specifically for very deep nets such as FitNets (Romero et al. (2015)) and Highway (Srivastava et al. (2015)).   Performance is evaluated on GoogLeNet, CaffeNet, FitNets and Residual nets and the state-of-the-art, or very close to it, is achieved on the MNIST, CIFAR-10/100 and ImageNet datasets.",
    "This paper builds off recent work from Kiperwasser & Goldberg (2016) using neural attention in a simple graph-based dependency parser. We use a larger but more thoroughly regularized parser than other recent BiLSTM-based approaches, with biaffine classifiers to predict arcs and labels. Our parser gets state of the art or near state of the art performance on standard treebanks for six different languages, achieving 95.7% UAS and 94.1% LAS on the most popular English PTB dataset. This makes it the highest-performing graph-based parser on this benchmark---outperforming Kiperwasser Goldberg (2016) by 1.8% and 2.2%---and comparable to the highest performing transition-based parser (Kuncoro et al., 2016), which achieves 95.8% UAS and 94.6% LAS. We also show which hyperparameter choices had a significant effect on parsing accuracy, allowing us to achieve large gains over other graph-based approaches.",
    "Accurate representational learning of both the explicit and implicit relationships within data is critical to the ability of machines to perform more complex and abstract reasoning tasks. We describe the efficient weakly supervised learning of such inferences by our Dynamic Adaptive Network Intelligence (DANI) model. We report state-of-the-art results for DANI over question answering tasks in the bAbI dataset that have proved difficult for contemporary approaches to learning representation (Weston et al., 2015).",
    "Spherical data is found in many applications. By modeling the discretized sphere as a graph, we can accommodate non-uniformly distributed, partial, and changing samplings. Moreover, graph convolutions are computationally more efficient than spherical convolutions. As equivariance is desired to exploit rotational symmetries, we discuss how to approach rotation equivariance using the graph neural network introduced in Defferrard et al. (2016). Experiments show good performance on rotation-invariant learning problems. Code and examples are available at https://github.com/SwissDataScienceCenter/DeepSphere",
    "High computational complexity hinders the widespread usage of Convolutional Neural Networks (CNNs), especially in mobile devices. Hardware accelerators are arguably the most promising approach for reducing both execution time and power consumption. One of the most important steps in accelerator development is hardware-oriented model approximation. In this paper we present Ristretto, a model approximation framework that analyzes a given CNN with respect to numerical resolution used in representing weights and outputs of convolutional and fully connected layers. Ristretto can condense models by using fixed point arithmetic and representation instead of floating point. Moreover, Ristretto fine-tunes the resulting fixed point network. Given a maximum error tolerance of 1%, Ristretto can successfully condense CaffeNet and SqueezeNet to 8-bit. The code for Ristretto is available.",
    "The diversity of painting styles represents a rich visual vocabulary for the construction of an image. The degree to which one may learn and parsimoniously capture this visual vocabulary measures our understanding of the higher level features of paintings, if not images in general. In this work we investigate the construction of a single, scalable deep network that can parsimoniously capture the artistic style of a diversity of paintings. We demonstrate that such a network generalizes across a diversity of artistic styles by reducing a painting to a point in an embedding space. Importantly, this model permits a user to explore new painting styles by arbitrarily combining the styles learned from individual paintings. We hope that this work provides a useful step towards building rich models of paintings and offers a window on to the structure of the learned representation of artistic style.",
    "Sum-Product Networks (SPNs) are a class of expressive yet tractable hierarchical graphical models. LearnSPN is a structure learning algorithm for SPNs that uses hierarchical co-clustering to simultaneously identifying similar entities and similar features. The original LearnSPN algorithm assumes that all the variables are discrete and there is no missing data. We introduce a practical, simplified version of LearnSPN, MiniSPN, that runs faster and can handle missing data and heterogeneous features common in real applications. We demonstrate the performance of MiniSPN on standard benchmark datasets and on two datasets from Google's Knowledge Graph exhibiting high missingness rates and a mix of discrete and continuous features.",
    "Recent research on deep neural networks has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple DNN architectures that achieve that accuracy level. With equivalent accuracy, smaller DNN architectures offer at least three advantages: (1) Smaller DNNs require less communication across servers during distributed training. (2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car. (3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory. To provide all of these advantages, we propose a small DNN architecture called SqueezeNet. SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Additionally, with model compression techniques we are able to compress SqueezeNet to less than 0.5MB (510x smaller than AlexNet).   The SqueezeNet architecture is available for download here: https://github.com/DeepScale/SqueezeNet",
    "In this paper, we study the problem of question answering when reasoning over multiple facts is required. We propose Query-Reduction Network (QRN), a variant of Recurrent Neural Network (RNN) that effectively handles both short-term (local) and long-term (global) sequential dependencies to reason over multiple facts. QRN considers the context sentences as a sequence of state-changing triggers, and reduces the original query to a more informed query as it observes each trigger (context sentence) through time. Our experiments show that QRN produces the state-of-the-art results in bAbI QA and dialog tasks, and in a real goal-oriented dialog dataset. In addition, QRN formulation allows parallelization on RNN's time axis, saving an order of magnitude in time complexity for training and inference.",
    "We propose a language-agnostic way of automatically generating sets of semantically similar clusters of entities along with sets of \"outlier\" elements, which may then be used to perform an intrinsic evaluation of word embeddings in the outlier detection task. We used our methodology to create a gold-standard dataset, which we call WikiSem500, and evaluated multiple state-of-the-art embeddings. The results show a correlation between performance on this dataset and performance on sentiment analysis.",
    "Recurrent neural nets are widely used for predicting temporal data. Their inherent deep feedforward structure allows learning complex sequential patterns. It is believed that top-down feedback might be an important missing ingredient which in theory could help disambiguate similar patterns depending on broader context. In this paper we introduce surprisal-driven recurrent networks, which take into account past error information when making new predictions. This is achieved by continuously monitoring the discrepancy between most recent predictions and the actual observations. Furthermore, we show that it outperforms other stochastic and fully deterministic approaches on enwik8 character level prediction task achieving 1.37 BPC on the test portion of the text.",
    "Although Generative Adversarial Networks achieve state-of-the-art results on a variety of generative tasks, they are regarded as highly unstable and prone to miss modes. We argue that these bad behaviors of GANs are due to the very particular functional shape of the trained discriminators in high dimensional spaces, which can easily make training stuck or push probability mass in the wrong direction, towards that of higher concentration than that of the data generating distribution. We introduce several ways of regularizing the objective, which can dramatically stabilize the training of GAN models. We also show that our regularizers can help the fair distribution of probability mass across the modes of the data generating distribution, during the early phases of training and thus providing a unified solution to the missing modes problem.",
    "Sample complexity and safety are major challenges when learning policies with reinforcement learning for real-world tasks, especially when the policies are represented using rich function approximators like deep neural networks. Model-based methods where the real-world target domain is approximated using a simulated source domain provide an avenue to tackle the above challenges by augmenting real data with simulated data. However, discrepancies between the simulated source domain and the target domain pose a challenge for simulated training. We introduce the EPOpt algorithm, which uses an ensemble of simulated source domains and a form of adversarial training to learn policies that are robust and generalize to a broad range of possible target domains, including unmodeled effects. Further, the probability distribution over source domains in the ensemble can be adapted using data from target domain and approximate Bayesian methods, to progressively make it a better approximation. Thus, learning on a model ensemble, along with source domain adaptation, provides the benefit of both robustness and learning/adaptation.",
    "We introduce Divnet, a flexible technique for learning networks with diverse neurons. Divnet models neuronal diversity by placing a Determinantal Point Process (DPP) over neurons in a given layer. It uses this DPP to select a subset of diverse neurons and subsequently fuses the redundant neurons into the selected ones. Compared with previous approaches, Divnet offers a more principled, flexible technique for capturing neuronal diversity and thus implicitly enforcing regularization. This enables effective auto-tuning of network architecture and leads to smaller network sizes without hurting performance. Moreover, through its focus on diversity and neuron fusing, Divnet remains compatible with other procedures that seek to reduce memory footprints of networks. We present experimental results to corroborate our claims: for pruning neural networks, Divnet is seen to be notably superior to competing approaches.",
    "The efficiency of graph-based semi-supervised algorithms depends on the graph of instances on which they are applied. The instances are often in a vectorial form before a graph linking them is built. The construction of the graph relies on a metric over the vectorial space that help define the weight of the connection between entities. The classic choice for this metric is usually a distance measure or a similarity measure based on the euclidean norm. We claim that in some cases the euclidean norm on the initial vectorial space might not be the more appropriate to solve the task efficiently. We propose an algorithm that aims at learning the most appropriate vectorial representation for building a graph on which the task at hand is solved efficiently.",
    "One major challenge in training Deep Neural Networks is preventing overfitting. Many techniques such as data augmentation and novel regularizers such as Dropout have been proposed to prevent overfitting without requiring a massive amount of training data. In this work, we propose a new regularizer called DeCov which leads to significantly reduced overfitting (as indicated by the difference between train and val performance), and better generalization. Our regularizer encourages diverse or non-redundant representations in Deep Neural Networks by minimizing the cross-covariance of hidden activations. This simple intuition has been explored in a number of past works but surprisingly has never been applied as a regularizer in supervised learning. Experiments across a range of datasets and network architectures show that this loss always reduces overfitting while almost always maintaining or increasing generalization performance and often improving performance over Dropout.",
    "Deep neural networks are commonly trained using stochastic non-convex optimization procedures, which are driven by gradient information estimated on fractions (batches) of the dataset. While it is commonly accepted that batch size is an important parameter for offline tuning, the benefits of online selection of batches remain poorly understood. We investigate online batch selection strategies for two state-of-the-art methods of stochastic gradient-based optimization, AdaDelta and Adam. As the loss function to be minimized for the whole dataset is an aggregation of loss functions of individual datapoints, intuitively, datapoints with the greatest loss should be considered (selected in a batch) more frequently. However, the limitations of this intuition and the proper control of the selection pressure over time are open questions. We propose a simple strategy where all datapoints are ranked w.r.t. their latest known loss value and the probability to be selected decays exponentially as a function of rank. Our experimental results on the MNIST dataset suggest that selecting batches speeds up both AdaDelta and Adam by a factor of about 5.",
    "We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs. We motivate the choice of our convolutional architecture via a localized first-order approximation of spectral graph convolutions. Our model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes. In a number of experiments on citation networks and on a knowledge graph dataset we demonstrate that our approach outperforms related methods by a significant margin.",
    "We introduce the \"Energy-based Generative Adversarial Network\" model (EBGAN) which views the discriminator as an energy function that attributes low energies to the regions near the data manifold and higher energies to other regions. Similar to the probabilistic GANs, a generator is seen as being trained to produce contrastive samples with minimal energies, while the discriminator is trained to assign high energies to these generated samples. Viewing the discriminator as an energy function allows to use a wide variety of architectures and loss functionals in addition to the usual binary classifier with logistic output. Among them, we show one instantiation of EBGAN framework as using an auto-encoder architecture, with the energy being the reconstruction error, in place of the discriminator. We show that this form of EBGAN exhibits more stable behavior than regular GANs during training. We also show that a single-scale architecture can be trained to generate high-resolution images.",
    "Recent research in the deep learning field has produced a plethora of new architectures. At the same time, a growing number of groups are applying deep learning to new applications. Some of these groups are likely to be composed of inexperienced deep learning practitioners who are baffled by the dizzying array of architecture choices and therefore opt to use an older architecture (i.e., Alexnet). Here we attempt to bridge this gap by mining the collective knowledge contained in recent deep learning research to discover underlying principles for designing neural network architectures. In addition, we describe several architectural innovations, including Fractal of FractalNet network, Stagewise Boosting Networks, and Taylor Series Networks (our Caffe code and prototxt files is available at https://github.com/iPhysicist/CNNDesignPatterns). We hope others are inspired to build on our preliminary work.",
    "Machine comprehension (MC), answering a query about a given context paragraph, requires modeling complex interactions between the context and the query. Recently, attention mechanisms have been successfully extended to MC. Typically these methods use attention to focus on a small portion of the context and summarize it with a fixed-size vector, couple attentions temporally, and/or often form a uni-directional attention. In this paper we introduce the Bi-Directional Attention Flow (BIDAF) network, a multi-stage hierarchical process that represents the context at different levels of granularity and uses bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization. Our experimental evaluations show that our model achieves the state-of-the-art results in Stanford Question Answering Dataset (SQuAD) and CNN/DailyMail cloze test.",
    "Though with progress, model learning and performing posterior inference still remains a common challenge for using deep generative models, especially for handling discrete hidden variables. This paper is mainly concerned with algorithms for learning Helmholz machines, which is characterized by pairing the generative model with an auxiliary inference model. A common drawback of previous learning algorithms is that they indirectly optimize some bounds of the targeted marginal log-likelihood. In contrast, we successfully develop a new class of algorithms, based on stochastic approximation (SA) theory of the Robbins-Monro type, to directly optimize the marginal log-likelihood and simultaneously minimize the inclusive KL-divergence. The resulting learning algorithm is thus called joint SA (JSA). Moreover, we construct an effective MCMC operator for JSA. Our results on the MNIST datasets demonstrate that the JSA's performance is consistently superior to that of competing algorithms like RWS, for learning a range of difficult models.",
    "Object detection with deep neural networks is often performed by passing a few thousand candidate bounding boxes through a deep neural network for each image. These bounding boxes are highly correlated since they originate from the same image. In this paper we investigate how to exploit feature occurrence at the image scale to prune the neural network which is subsequently applied to all bounding boxes. We show that removing units which have near-zero activation in the image allows us to significantly reduce the number of parameters in the network. Results on the PASCAL 2007 Object Detection Challenge demonstrate that up to 40% of units in some fully-connected layers can be entirely eliminated with little change in the detection result.",
    "Modeling interactions between features improves the performance of machine learning solutions in many domains (e.g. recommender systems or sentiment analysis). In this paper, we introduce Exponential Machines (ExM), a predictor that models all interactions of every order. The key idea is to represent an exponentially large tensor of parameters in a factorized format called Tensor Train (TT). The Tensor Train format regularizes the model and lets you control the number of underlying parameters. To train the model, we develop a stochastic Riemannian optimization procedure, which allows us to fit tensors with 2^160 entries. We show that the model achieves state-of-the-art performance on synthetic data with high-order interactions and that it works on par with high-order factorization machines on a recommender system dataset MovieLens 100K.",
    "We introduce Deep Variational Bayes Filters (DVBF), a new method for unsupervised learning and identification of latent Markovian state space models. Leveraging recent advances in Stochastic Gradient Variational Bayes, DVBF can overcome intractable inference distributions via variational inference. Thus, it can handle highly nonlinear input data with temporal and spatial dependencies such as image sequences without domain knowledge. Our experiments show that enabling backpropagation through transitions enforces state space assumptions and significantly improves information content of the latent embedding. This also enables realistic long-term prediction.",
    "Traditional dialog systems used in goal-oriented applications require a lot of domain-specific handcrafting, which hinders scaling up to new domains. End-to-end dialog systems, in which all components are trained from the dialogs themselves, escape this limitation. But the encouraging success recently obtained in chit-chat dialog may not carry over to goal-oriented settings. This paper proposes a testbed to break down the strengths and shortcomings of end-to-end dialog systems in goal-oriented applications. Set in the context of restaurant reservation, our tasks require manipulating sentences and symbols, so as to properly conduct conversations, issue API calls and use the outputs of such calls. We show that an end-to-end dialog system based on Memory Networks can reach promising, yet imperfect, performance and learn to perform non-trivial operations. We confirm those results by comparing our system to a hand-crafted slot-filling baseline on data from the second Dialog State Tracking Challenge (Henderson et al., 2014a). We show similar result patterns on data extracted from an online concierge service.",
    "Adversarial training provides a means of regularizing supervised learning algorithms while virtual adversarial training is able to extend supervised learning algorithms to the semi-supervised setting. However, both methods require making small perturbations to numerous entries of the input vector, which is inappropriate for sparse high-dimensional inputs such as one-hot word representations. We extend adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself. The proposed method achieves state of the art results on multiple benchmark semi-supervised and purely supervised tasks. We provide visualizations and analysis showing that the learned word embeddings have improved in quality and that while training, the model is less prone to overfitting. Code is available at https://github.com/tensorflow/models/tree/master/research/adversarial_text.",
    "Unsupervised learning of probabilistic models is a central yet challenging problem in machine learning. Specifically, designing models with tractable learning, sampling, inference and evaluation is crucial in solving this task. We extend the space of such models using real-valued non-volume preserving (real NVP) transformations, a set of powerful invertible and learnable transformations, resulting in an unsupervised learning algorithm with exact log-likelihood computation, exact sampling, exact inference of latent variables, and an interpretable latent space. We demonstrate its ability to model natural images on four datasets through sampling, log-likelihood evaluation and latent variable manipulations.",
    "This paper is focused on studying the view-manifold structure in the feature spaces implied by the different layers of Convolutional Neural Networks (CNN). There are several questions that this paper aims to answer: Does the learned CNN representation achieve viewpoint invariance? How does it achieve viewpoint invariance? Is it achieved by collapsing the view manifolds, or separating them while preserving them? At which layer is view invariance achieved? How can the structure of the view manifold at each layer of a deep convolutional neural network be quantified experimentally? How does fine-tuning of a pre-trained CNN on a multi-view dataset affect the representation at each layer of the network? In order to answer these questions we propose a methodology to quantify the deformation and degeneracy of view manifolds in CNN layers. We apply this methodology and report interesting results in this paper that answer the aforementioned questions.",
    "Bilinear models provide rich representations compared with linear models. They have been applied in various visual tasks, such as object recognition, segmentation, and visual question-answering, to get state-of-the-art performances taking advantage of the expanded representations. However, bilinear representations tend to be high-dimensional, limiting the applicability to computationally complex tasks. We propose low-rank bilinear pooling using Hadamard product for an efficient attention mechanism of multimodal learning. We show that our model outperforms compact bilinear pooling in visual question-answering tasks with the state-of-the-art results on the VQA dataset, having a better parsimonious property.",
    "The standard interpretation of importance-weighted autoencoders is that they maximize a tighter lower bound on the marginal likelihood than the standard evidence lower bound. We give an alternate interpretation of this procedure: that it optimizes the standard variational lower bound, but using a more complex distribution. We formally derive this result, present a tighter lower bound, and visualize the implicit importance-weighted distribution.",
    "We present a generalization bound for feedforward neural networks in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The generalization bound is derived using a PAC-Bayes analysis.",
    "In this paper, we propose to equip Generative Adversarial Networks with the ability to produce direct energy estimates for samples.Specifically, we propose a flexible adversarial training framework, and prove this framework not only ensures the generator converges to the true data distribution, but also enables the discriminator to retain the density information at the global optimal. We derive the analytic form of the induced solution, and analyze the properties. In order to make the proposed framework trainable in practice, we introduce two effective approximation techniques. Empirically, the experiment results closely match our theoretical analysis, verifying the discriminator is able to recover the energy of data distribution.",
    "In this work we perform outlier detection using ensembles of neural networks obtained by variational approximation of the posterior in a Bayesian neural network setting. The variational parameters are obtained by sampling from the true posterior by gradient descent. We show our outlier detection results are comparable to those obtained using other efficient ensembling methods.",
    "We present two simple ways of reducing the number of parameters and accelerating the training of large Long Short-Term Memory (LSTM) networks: the first one is \"matrix factorization by design\" of LSTM matrix into the product of two smaller matrices, and the second one is partitioning of LSTM matrix, its inputs and states into the independent groups. Both approaches allow us to train large LSTM networks significantly faster to the near state-of the art perplexity while using significantly less RNN parameters.",
    "We present observations and discussion of previously unreported phenomena discovered while training residual networks. The goal of this work is to better understand the nature of neural networks through the examination of these new empirical results. These behaviors were identified through the application of Cyclical Learning Rates (CLR) and linear network interpolation. Among these behaviors are counterintuitive increases and decreases in training loss and instances of rapid training. For example, we demonstrate how CLR can produce greater testing accuracy than traditional training despite using large learning rates. Files to replicate these results are available at https://github.com/lnsmith54/exploring-loss",
    "Machine learning models are often used at test-time subject to constraints and trade-offs not present at training-time. For example, a computer vision model operating on an embedded device may need to perform real-time inference, or a translation model operating on a cell phone may wish to bound its average compute time in order to be power-efficient. In this work we describe a mixture-of-experts model and show how to change its test-time resource-usage on a per-input basis using reinforcement learning. We test our method on a small MNIST-based example.",
    "Adversarial examples have been shown to exist for a variety of deep learning architectures. Deep reinforcement learning has shown promising results on training agent policies directly on raw inputs such as image pixels. In this paper we present a novel study into adversarial attacks on deep reinforcement learning polices. We compare the effectiveness of the attacks using adversarial examples vs. random noise. We present a novel method for reducing the number of times adversarial examples need to be injected for a successful attack, based on the value function. We further explore how re-training on random noise and FGSM perturbations affects the resilience against adversarial examples.",
    "This paper develops variational continual learning (VCL), a simple but general framework for continual learning that fuses online variational inference (VI) and recent advances in Monte Carlo VI for neural networks. The framework can successfully train both deep discriminative models and deep generative models in complex continual learning settings where existing tasks evolve over time and entirely new tasks emerge. Experimental results show that VCL outperforms state-of-the-art continual learning methods on a variety of tasks, avoiding catastrophic forgetting in a fully automatic way.",
    "Automatically determining the optimal size of a neural network for a given task without prior information currently requires an expensive global search and training many networks from scratch. In this paper, we address the problem of automatically finding a good network size during a single training cycle. We introduce *nonparametric neural networks*, a non-probabilistic framework for conducting optimization over all possible network sizes and prove its soundness when network growth is limited via an L_p penalty. We train networks under this framework by continuously adding new units while eliminating redundant units via an L_2 penalty. We employ a novel optimization algorithm, which we term *adaptive radial-angular gradient descent* or *AdaRad*, and obtain promising results.",
    "Natural Language Inference (NLI) task requires an agent to determine the logical relationship between a natural language premise and a natural language hypothesis. We introduce Interactive Inference Network (IIN), a novel class of neural network architectures that is able to achieve high-level understanding of the sentence pair by hierarchically extracting semantic features from interaction space. We show that an interaction tensor (attention weight) contains semantic information to solve natural language inference, and a denser interaction tensor contains richer semantic information. One instance of such architecture, Densely Interactive Inference Network (DIIN), demonstrates the state-of-the-art performance on large scale NLI copora and large-scale NLI alike corpus. It's noteworthy that DIIN achieve a greater than 20% error reduction on the challenging Multi-Genre NLI (MultiNLI) dataset with respect to the strongest published system.",
    "The ability to deploy neural networks in real-world, safety-critical systems is severely limited by the presence of adversarial examples: slightly perturbed inputs that are misclassified by the network. In recent years, several techniques have been proposed for increasing robustness to adversarial examples --- and yet most of these have been quickly shown to be vulnerable to future attacks. For example, over half of the defenses proposed by papers accepted at ICLR 2018 have already been broken. We propose to address this difficulty through formal verification techniques. We show how to construct provably minimally distorted adversarial examples: given an arbitrary neural network and input sample, we can construct adversarial examples which we prove are of minimal distortion. Using this approach, we demonstrate that one of the recent ICLR defense proposals, adversarial retraining, provably succeeds at increasing the distortion required to construct adversarial examples by a factor of 4.2.",
    "We extend Stochastic Gradient Variational Bayes to perform posterior inference for the weights of Stick-Breaking processes. This development allows us to define a Stick-Breaking Variational Autoencoder (SB-VAE), a Bayesian nonparametric version of the variational autoencoder that has a latent representation with stochastic dimensionality. We experimentally demonstrate that the SB-VAE, and a semi-supervised variant, learn highly discriminative latent representations that often outperform the Gaussian VAE's.",
    "We propose a framework for training multiple neural networks simultaneously. The parameters from all models are regularised by the tensor trace norm, so that each neural network is encouraged to reuse others' parameters if possible -- this is the main motivation behind multi-task learning. In contrast to many deep multi-task learning models, we do not predefine a parameter sharing strategy by specifying which layers have tied parameters. Instead, our framework considers sharing for all shareable layers, and the sharing strategy is learned in a data-driven way.",
    "This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems. To achieve this, the paper introduces several innovations, including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.",
    "Many machine learning classifiers are vulnerable to adversarial perturbations. An adversarial perturbation modifies an input to change a classifier's prediction without causing the input to seem substantially different to human perception. We deploy three methods to detect adversarial images. Adversaries trying to bypass our detectors must make the adversarial image less pathological or they will fail trying. Our best detection method reveals that adversarial images place abnormal emphasis on the lower-ranked principal components from PCA. Other detectors and a colorful saliency map are in an appendix.",
    "We propose a principled method for kernel learning, which relies on a Fourier-analytic characterization of translation-invariant or rotation-invariant kernels. Our method produces a sequence of feature maps, iteratively refining the SVM margin. We provide rigorous guarantees for optimality and generalization, interpreting our algorithm as online equilibrium-finding dynamics in a certain two-player min-max game. Evaluations on synthetic and real-world datasets demonstrate scalability and consistent improvements over related random features-based methods.",
    "State-of-the-art deep reading comprehension models are dominated by recurrent neural nets. Their sequential nature is a natural fit for language, but it also precludes parallelization within an instances and often becomes the bottleneck for deploying such models to latency critical scenarios. This is particularly problematic for longer texts. Here we present a convolutional architecture as an alternative to these recurrent architectures. Using simple dilated convolutional units in place of recurrent ones, we achieve results comparable to the state of the art on two question answering tasks, while at the same time achieving up to two orders of magnitude speedups for question answering.",
    "This report has several purposes. First, our report is written to investigate the reproducibility of the submitted paper On the regularization of Wasserstein GANs (2018). Second, among the experiments performed in the submitted paper, five aspects were emphasized and reproduced: learning speed, stability, robustness against hyperparameter, estimating the Wasserstein distance, and various sampling method. Finally, we identify which parts of the contribution can be reproduced, and at what cost in terms of resources. All source code for reproduction is open to the public.",
    "Variational Autoencoders (VAEs) were originally motivated (Kingma & Welling, 2014) as probabilistic generative models in which one performs approximate Bayesian inference. The proposal of $\\beta$-VAEs (Higgins et al., 2017) breaks this interpretation and generalizes VAEs to application domains beyond generative modeling (e.g., representation learning, clustering, or lossy data compression) by introducing an objective function that allows practitioners to trade off between the information content (\"bit rate\") of the latent representation and the distortion of reconstructed data (Alemi et al., 2018). In this paper, we reconsider this rate/distortion trade-off in the context of hierarchical VAEs, i.e., VAEs with more than one layer of latent variables. We identify a general class of inference models for which one can split the rate into contributions from each layer, which can then be tuned independently. We derive theoretical bounds on the performance of downstream tasks as functions of the individual layers' rates and verify our theoretical findings in large-scale experiments. Our results provide guidance for practitioners on which region in rate-space to target for a given application.",
    "Methods that learn representations of nodes in a graph play a critical role in network analysis since they enable many downstream learning tasks. We propose Graph2Gauss - an approach that can efficiently learn versatile node embeddings on large scale (attributed) graphs that show strong performance on tasks such as link prediction and node classification. Unlike most approaches that represent nodes as point vectors in a low-dimensional continuous space, we embed each node as a Gaussian distribution, allowing us to capture uncertainty about the representation. Furthermore, we propose an unsupervised method that handles inductive learning scenarios and is applicable to different types of graphs: plain/attributed, directed/undirected. By leveraging both the network structure and the associated node attributes, we are able to generalize to unseen nodes without additional training. To learn the embeddings we adopt a personalized ranking formulation w.r.t. the node distances that exploits the natural ordering of the nodes imposed by the network structure. Experiments on real world networks demonstrate the high performance of our approach, outperforming state-of-the-art network embedding methods on several different tasks. Additionally, we demonstrate the benefits of modeling uncertainty - by analyzing it we can estimate neighborhood diversity and detect the intrinsic latent dimensionality of a graph.",
    "This paper explores the use of self-ensembling for visual domain adaptation problems. Our technique is derived from the mean teacher variant (Tarvainen et al., 2017) of temporal ensembling (Laine et al;, 2017), a technique that achieved state of the art results in the area of semi-supervised learning. We introduce a number of modifications to their approach for challenging domain adaptation scenarios and evaluate its effectiveness. Our approach achieves state of the art results in a variety of benchmarks, including our winning entry in the VISDA-2017 visual domain adaptation challenge. In small image benchmarks, our algorithm not only outperforms prior art, but can also achieve accuracy that is close to that of a classifier trained in a supervised fashion.",
    "Most machine learning classifiers, including deep neural networks, are vulnerable to adversarial examples. Such inputs are typically generated by adding small but purposeful modifications that lead to incorrect outputs while imperceptible to human eyes. The goal of this paper is not to introduce a single method, but to make theoretical steps towards fully understanding adversarial examples. By using concepts from topology, our theoretical analysis brings forth the key reasons why an adversarial example can fool a classifier ($f_1$) and adds its oracle ($f_2$, like human eyes) in such analysis. By investigating the topological relationship between two (pseudo)metric spaces corresponding to predictor $f_1$ and oracle $f_2$, we develop necessary and sufficient conditions that can determine if $f_1$ is always robust (strong-robust) against adversarial examples according to $f_2$. Interestingly our theorems indicate that just one unnecessary feature can make $f_1$ not strong-robust, and the right feature representation learning is the key to getting a classifier that is both accurate and strong-robust.",
    "We develop a general problem setting for training and testing the ability of agents to gather information efficiently. Specifically, we present a collection of tasks in which success requires searching through a partially-observed environment, for fragments of information which can be pieced together to accomplish various goals. We combine deep architectures with techniques from reinforcement learning to develop agents that solve our tasks. We shape the behavior of these agents by combining extrinsic and intrinsic rewards. We empirically demonstrate that these agents learn to search actively and intelligently for new information to reduce their uncertainty, and to exploit information they have already acquired.",
    "We propose an extension to neural network language models to adapt their prediction to the recent history. Our model is a simplified version of memory augmented networks, which stores past hidden activations as memory and accesses them through a dot product with the current hidden activation. This mechanism is very efficient and scales to very large memory sizes. We also draw a link between the use of external memory in neural network and cache models used with count based language models. We demonstrate on several language model datasets that our approach performs significantly better than recent memory augmented networks.",
    "Generative adversarial networks (GANs) are successful deep generative models. GANs are based on a two-player minimax game. However, the objective function derived in the original motivation is changed to obtain stronger gradients when learning the generator. We propose a novel algorithm that repeats the density ratio estimation and f-divergence minimization. Our algorithm offers a new perspective toward the understanding of GANs and is able to make use of multiple viewpoints obtained in the research of density ratio estimation, e.g. what divergence is stable and relative density ratio is useful.",
    "We present a novel framework for generating pop music. Our model is a hierarchical Recurrent Neural Network, where the layers and the structure of the hierarchy encode our prior knowledge about how pop music is composed. In particular, the bottom layers generate the melody, while the higher levels produce the drums and chords. We conduct several human studies that show strong preference of our generated music over that produced by the recent method by Google. We additionally show two applications of our framework: neural dancing and karaoke, as well as neural story singing.",
    "We look at the eigenvalues of the Hessian of a loss function before and after training. The eigenvalue distribution is seen to be composed of two parts, the bulk which is concentrated around zero, and the edges which are scattered away from zero. We present empirical evidence for the bulk indicating how over-parametrized the system is, and for the edges that depend on the input data.",
    "In this paper, we propose a new feature extraction technique for program execution logs. First, we automatically extract complex patterns from a program's behavior graph. Then, we embed these patterns into a continuous space by training an autoencoder. We evaluate the proposed features on a real-world malicious software detection task. We also find that the embedding space captures interpretable structures in the space of pattern parts.",
    "We compared the efficiency of the FlyHash model, an insect-inspired sparse neural network (Dasgupta et al., 2017), to similar but non-sparse models in an embodied navigation task. This requires a model to control steering by comparing current visual inputs to memories stored along a training route. We concluded the FlyHash model is more efficient than others, especially in terms of data encoding.",
    "In peer review, reviewers are usually asked to provide scores for the papers. The scores are then used by Area Chairs or Program Chairs in various ways in the decision-making process. The scores are usually elicited in a quantized form to accommodate the limited cognitive ability of humans to describe their opinions in numerical values. It has been found that the quantized scores suffer from a large number of ties, thereby leading to a significant loss of information. To mitigate this issue, conferences have started to ask reviewers to additionally provide a ranking of the papers they have reviewed. There are however two key challenges. First, there is no standard procedure for using this ranking information and Area Chairs may use it in different ways (including simply ignoring them), thereby leading to arbitrariness in the peer-review process. Second, there are no suitable interfaces for judicious use of this data nor methods to incorporate it in existing workflows, thereby leading to inefficiencies. We take a principled approach to integrate the ranking information into the scores. The output of our method is an updated score pertaining to each review that also incorporates the rankings. Our approach addresses the two aforementioned challenges by: (i) ensuring that rankings are incorporated into the updates scores in the same manner for all papers, thereby mitigating arbitrariness, and (ii) allowing to seamlessly use existing interfaces and workflows designed for scores. We empirically evaluate our method on synthetic datasets as well as on peer reviews from the ICLR 2017 conference, and find that it reduces the error by approximately 30% as compared to the best performing baseline on the ICLR 2017 data.",
    "Many recent studies have probed status bias in the peer-review process of academic journals and conferences. In this article, we investigated the association between author metadata and area chairs' final decisions (Accept/Reject) using our compiled database of 5,313 borderline submissions to the International Conference on Learning Representations (ICLR) from 2017 to 2022. We carefully defined elements in a cause-and-effect analysis, including the treatment and its timing, pre-treatment variables, potential outcomes and causal null hypothesis of interest, all in the context of study units being textual data and under Neyman and Rubin's potential outcomes (PO) framework. We found some weak evidence that author metadata was associated with articles' final decisions. We also found that, under an additional stability assumption, borderline articles from high-ranking institutions (top-30% or top-20%) were less favored by area chairs compared to their matched counterparts. The results were consistent in two different matched designs (odds ratio = 0.82 [95% CI: 0.67 to 1.00] in a first design and 0.83 [95% CI: 0.64 to 1.07] in a strengthened design). We discussed how to interpret these results in the context of multiple interactions between a study unit and different agents (reviewers and area chairs) in the peer-review system.",
    "We present a variational approximation to the information bottleneck of Tishby et al. (1999). This variational approach allows us to parameterize the information bottleneck model using a neural network and leverage the reparameterization trick for efficient training. We call this method \"Deep Variational Information Bottleneck\", or Deep VIB. We show that models trained with the VIB objective outperform those that are trained with other forms of regularization, in terms of generalization performance and robustness to adversarial attack.",
    "Attention networks have proven to be an effective approach for embedding categorical inference within a deep neural network. However, for many tasks we may want to model richer structural dependencies without abandoning end-to-end training. In this work, we experiment with incorporating richer structural distributions, encoded using graphical models, within deep networks. We show that these structured attention networks are simple extensions of the basic attention procedure, and that they allow for extending attention beyond the standard soft-selection approach, such as attending to partial segmentations or to subtrees. We experiment with two different classes of structured attention networks: a linear-chain conditional random field and a graph-based parsing model, and describe how these models can be practically implemented as neural network layers. Experiments show that this approach is effective for incorporating structural biases, and structured attention networks outperform baseline attention models on a variety of synthetic and real tasks: tree transduction, neural machine translation, question answering, and natural language inference. We further find that models trained in this way learn interesting unsupervised hidden representations that generalize simple attention.",
    "We are proposing to use an ensemble of diverse specialists, where speciality is defined according to the confusion matrix. Indeed, we observed that for adversarial instances originating from a given class, labeling tend to be done into a small subset of (incorrect) classes. Therefore, we argue that an ensemble of specialists should be better able to identify and reject fooling instances, with a high entropy (i.e., disagreement) over the decisions in the presence of adversaries. Experimental results obtained confirm that interpretation, opening a way to make the system more robust to adversarial examples through a rejection mechanism, rather than trying to classify them properly at any cost.",
    "In this paper, we present Neural Phrase-based Machine Translation (NPMT). Our method explicitly models the phrase structures in output sequences using Sleep-WAke Networks (SWAN), a recently proposed segmentation-based sequence modeling method. To mitigate the monotonic alignment requirement of SWAN, we introduce a new layer to perform (soft) local reordering of input sequences. Different from existing neural machine translation (NMT) approaches, NPMT does not use attention-based decoding mechanisms. Instead, it directly outputs phrases in a sequential order and can decode in linear time. Our experiments show that NPMT achieves superior performances on IWSLT 2014 German-English/English-German and IWSLT 2015 English-Vietnamese machine translation tasks compared with strong NMT baselines. We also observe that our method produces meaningful phrases in output languages.",
    "We present LR-GAN: an adversarial image generation model which takes scene structure and context into account. Unlike previous generative adversarial networks (GANs), the proposed GAN learns to generate image background and foregrounds separately and recursively, and stitch the foregrounds on the background in a contextually relevant manner to produce a complete natural image. For each foreground, the model learns to generate its appearance, shape and pose. The whole model is unsupervised, and is trained in an end-to-end manner with gradient descent methods. The experiments demonstrate that LR-GAN can generate more natural images with objects that are more human recognizable than DCGAN.",
    "We describe a simple scheme that allows an agent to learn about its environment in an unsupervised manner. Our scheme pits two versions of the same agent, Alice and Bob, against one another. Alice proposes a task for Bob to complete; and then Bob attempts to complete the task. In this work we will focus on two kinds of environments: (nearly) reversible environments and environments that can be reset. Alice will \"propose\" the task by doing a sequence of actions and then Bob must undo or repeat them, respectively. Via an appropriate reward structure, Alice and Bob automatically generate a curriculum of exploration, enabling unsupervised training of the agent. When Bob is deployed on an RL task within the environment, this unsupervised training reduces the number of supervised episodes needed to learn, and in some cases converges to a higher reward.",
    "Maximum entropy modeling is a flexible and popular framework for formulating statistical models given partial knowledge. In this paper, rather than the traditional method of optimizing over the continuous density directly, we learn a smooth and invertible transformation that maps a simple distribution to the desired maximum entropy distribution. Doing so is nontrivial in that the objective being maximized (entropy) is a function of the density itself. By exploiting recent developments in normalizing flow networks, we cast the maximum entropy problem into a finite-dimensional constrained optimization, and solve the problem by combining stochastic optimization with the augmented Lagrangian method. Simulation results demonstrate the effectiveness of our method, and applications to finance and computer vision show the flexibility and accuracy of using maximum entropy flow networks.",
    "With machine learning successfully applied to new daunting problems almost every day, general AI starts looking like an attainable goal. However, most current research focuses instead on important but narrow applications, such as image classification or machine translation. We believe this to be largely due to the lack of objective ways to measure progress towards broad machine intelligence. In order to fill this gap, we propose here a set of concrete desiderata for general AI, together with a platform to test machines on how well they satisfy such desiderata, while keeping all further complexities to a minimum.",
    "Neural networks that compute over graph structures are a natural fit for problems in a variety of domains, including natural language (parse trees) and cheminformatics (molecular graphs). However, since the computation graph has a different shape and size for every input, such networks do not directly support batched training or inference. They are also difficult to implement in popular deep learning libraries, which are based on static data-flow graphs. We introduce a technique called dynamic batching, which not only batches together operations between different input graphs of dissimilar shape, but also between different nodes within a single input graph. The technique allows us to create static graphs, using popular libraries, that emulate dynamic computation graphs of arbitrary shape and size. We further present a high-level library of compositional blocks that simplifies the creation of dynamic graph models. Using the library, we demonstrate concise and batch-wise parallel implementations for a variety of models from the literature.",
    "Although deep learning models have proven effective at solving problems in natural language processing, the mechanism by which they come to their conclusions is often unclear. As a result, these models are generally treated as black boxes, yielding no insight of the underlying learned patterns. In this paper we consider Long Short Term Memory networks (LSTMs) and demonstrate a new approach for tracking the importance of a given input to the LSTM for a given output. By identifying consistently important patterns of words, we are able to distill state of the art LSTMs on sentiment analysis and question answering into a set of representative phrases. This representation is then quantitatively validated by using the extracted phrases to construct a simple, rule-based classifier which approximates the output of the LSTM.",
    "Deep reinforcement learning has achieved many impressive results in recent years. However, tasks with sparse rewards or long horizons continue to pose significant challenges. To tackle these important problems, we propose a general framework that first learns useful skills in a pre-training environment, and then leverages the acquired skills for learning faster in downstream tasks. Our approach brings together some of the strengths of intrinsic motivation and hierarchical methods: the learning of useful skill is guided by a single proxy reward, the design of which requires very minimal domain knowledge about the downstream tasks. Then a high-level policy is trained on top of these skills, providing a significant improvement of the exploration and allowing to tackle sparse rewards in the downstream tasks. To efficiently pre-train a large span of skills, we use Stochastic Neural Networks combined with an information-theoretic regularizer. Our experiments show that this combination is effective in learning a wide span of interpretable skills in a sample-efficient way, and can significantly boost the learning performance uniformly across a wide range of downstream tasks.",
    "Deep generative models have achieved impressive success in recent years. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), as emerging families for generative model learning, have largely been considered as two distinct paradigms and received extensive independent studies respectively. This paper aims to establish formal connections between GANs and VAEs through a new formulation of them. We interpret sample generation in GANs as performing posterior inference, and show that GANs and VAEs involve minimizing KL divergences of respective posterior and inference distributions with opposite directions, extending the two learning phases of classic wake-sleep algorithm, respectively. The unified view provides a powerful tool to analyze a diverse set of existing model variants, and enables to transfer techniques across research lines in a principled way. For example, we apply the importance weighting method in VAE literatures for improved GAN learning, and enhance VAEs with an adversarial mechanism that leverages generated samples. Experiments show generality and effectiveness of the transferred techniques.",
    "We consider the problem of detecting out-of-distribution images in neural networks. We propose ODIN, a simple and effective method that does not require any change to a pre-trained neural network. Our method is based on the observation that using temperature scaling and adding small perturbations to the input can separate the softmax score distributions between in- and out-of-distribution images, allowing for more effective detection. We show in a series of experiments that ODIN is compatible with diverse network architectures and datasets. It consistently outperforms the baseline approach by a large margin, establishing a new state-of-the-art performance on this task. For example, ODIN reduces the false positive rate from the baseline 34.7% to 4.3% on the DenseNet (applied to CIFAR-10) when the true positive rate is 95%.",
    "A framework is presented for unsupervised learning of representations based on infomax principle for large-scale neural populations. We use an asymptotic approximation to the Shannon's mutual information for a large neural population to demonstrate that a good initial approximation to the global information-theoretic optimum can be obtained by a hierarchical infomax method. Starting from the initial solution, an efficient algorithm based on gradient descent of the final objective function is proposed to learn representations from the input datasets, and the method works for complete, overcomplete, and undercomplete bases. As confirmed by numerical experiments, our method is robust and highly efficient for extracting salient features from input datasets. Compared with the main existing methods, our algorithm has a distinct advantage in both the training speed and the robustness of unsupervised representation learning. Furthermore, the proposed method is easily extended to the supervised or unsupervised model for training deep structure networks.",
    "Recurrent Neural Networks (RNNs) continue to show outstanding performance in sequence modeling tasks. However, training RNNs on long sequences often face challenges like slow inference, vanishing gradients and difficulty in capturing long term dependencies. In backpropagation through time settings, these issues are tightly coupled with the large, sequential computational graph resulting from unfolding the RNN in time. We introduce the Skip RNN model which extends existing RNN models by learning to skip state updates and shortens the effective size of the computational graph. This model can also be encouraged to perform fewer state updates through a budget constraint. We evaluate the proposed model on various tasks and show how it can reduce the number of required RNN updates while preserving, and sometimes even improving, the performance of the baseline RNN models. Source code is publicly available at https://imatge-upc.github.io/skiprnn-2017-telecombcn/ .",
    "Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial warm restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence in accelerated gradient schemes to deal with ill-conditioned functions. In this paper, we propose a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks. We empirically study its performance on the CIFAR-10 and CIFAR-100 datasets, where we demonstrate new state-of-the-art results at 3.14% and 16.21%, respectively. We also demonstrate its advantages on a dataset of EEG recordings and on a downsampled version of the ImageNet dataset. Our source code is available at https://github.com/loshchil/SGDR",
    "Policy gradient methods have achieved remarkable successes in solving challenging reinforcement learning problems. However, it still often suffers from the large variance issue on policy gradient estimation, which leads to poor sample efficiency during training. In this work, we propose a control variate method to effectively reduce variance for policy gradient methods. Motivated by the Stein's identity, our method extends the previous control variate methods used in REINFORCE and advantage actor-critic by introducing more general action-dependent baseline functions. Empirical studies show that our method significantly improves the sample efficiency of the state-of-the-art policy gradient approaches.",
    "Skip connections made the training of very deep networks possible and have become an indispensable component in a variety of neural architectures. A completely satisfactory explanation for their success remains elusive. Here, we present a novel explanation for the benefits of skip connections in training very deep networks. The difficulty of training deep networks is partly due to the singularities caused by the non-identifiability of the model. Several such singularities have been identified in previous works: (i) overlap singularities caused by the permutation symmetry of nodes in a given layer, (ii) elimination singularities corresponding to the elimination, i.e. consistent deactivation, of nodes, (iii) singularities generated by the linear dependence of the nodes. These singularities cause degenerate manifolds in the loss landscape that slow down learning. We argue that skip connections eliminate these singularities by breaking the permutation symmetry of nodes, by reducing the possibility of node elimination and by making the nodes less linearly dependent. Moreover, for typical initializations, skip connections move the network away from the \"ghosts\" of these singularities and sculpt the landscape around them to alleviate the learning slow-down. These hypotheses are supported by evidence from simplified models, as well as from experiments with deep networks trained on real-world datasets.",
    "We have tried to reproduce the results of the paper \"Natural Language Inference over Interaction Space\" submitted to ICLR 2018 conference as part of the ICLR 2018 Reproducibility Challenge. Initially, we were not aware that the code was available, so we started to implement the network from scratch. We have evaluated our version of the model on Stanford NLI dataset and reached 86.38% accuracy on the test set, while the paper claims 88.0% accuracy. The main difference, as we understand it, comes from the optimizers and the way model selection is performed.",
    "We have successfully implemented the \"Learn to Pay Attention\" model of attention mechanism in convolutional neural networks, and have replicated the results of the original paper in the categories of image classification and fine-grained recognition.",
    "Computing universal distributed representations of sentences is a fundamental task in natural language processing. We propose a method to learn such representations by encoding the suffixes of word sequences in a sentence and training on the Stanford Natural Language Inference (SNLI) dataset. We demonstrate the effectiveness of our approach by evaluating it on the SentEval benchmark, improving on existing approaches on several transfer tasks.",
    "In many neural models, new features as polynomial functions of existing ones are used to augment representations. Using the natural language inference task as an example, we investigate the use of scaled polynomials of degree 2 and above as matching features. We find that scaling degree 2 features has the highest impact on performance, reducing classification error by 5% in the best models.",
    "We present a generalization bound for feedforward neural networks in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The generalization bound is derived using a PAC-Bayes analysis.",
    "In this work, we investigate Batch Normalization technique and propose its probabilistic interpretation. We propose a probabilistic model and show that Batch Normalization maximazes the lower bound of its marginalized log-likelihood. Then, according to the new probabilistic model, we design an algorithm which acts consistently during train and test. However, inference becomes computationally inefficient. To reduce memory and computational cost, we propose Stochastic Batch Normalization -- an efficient approximation of proper inference procedure. This method provides us with a scalable uncertainty estimation technique. We demonstrate the performance of Stochastic Batch Normalization on popular architectures (including deep convolutional architectures: VGG-like and ResNets) for MNIST and CIFAR-10 datasets.",
    "It is widely believed that the success of deep convolutional networks is based on progressively discarding uninformative variability about the input with respect to the problem at hand. This is supported empirically by the difficulty of recovering images from their hidden representations, in most commonly used network architectures. In this paper we show via a one-to-one mapping that this loss of information is not a necessary condition to learn representations that generalize well on complicated problems, such as ImageNet. Via a cascade of homeomorphic layers, we build the i-RevNet, a network that can be fully inverted up to the final projection onto the classes, i.e. no information is discarded. Building an invertible architecture is difficult, for one, because the local inversion is ill-conditioned, we overcome this by providing an explicit inverse. An analysis of i-RevNets learned representations suggests an alternative explanation for the success of deep networks by a progressive contraction and linear separation with depth. To shed light on the nature of the model learned by the i-RevNet we reconstruct linear interpolations between natural image representations.",
    "Deep latent variable models are powerful tools for representation learning. In this paper, we adopt the deep information bottleneck model, identify its shortcomings and propose a model that circumvents them. To this end, we apply a copula transformation which, by restoring the invariance properties of the information bottleneck method, leads to disentanglement of the features in the latent space. Building on that, we show how this transformation translates to sparsity of the latent space in the new model. We evaluate our method on artificial and real data.",
    "We introduce a variant of the MAC model (Hudson and Manning, ICLR 2018) with a simplified set of equations that achieves comparable accuracy, while training faster. We evaluate both models on CLEVR and CoGenT, and show that, transfer learning with fine-tuning results in a 15 point increase in accuracy, matching the state of the art. Finally, in contrast, we demonstrate that improper fine-tuning can actually reduce a model's accuracy as well.",
    "Adaptive Computation Time for Recurrent Neural Networks (ACT) is one of the most promising architectures for variable computation. ACT adapts to the input sequence by being able to look at each sample more than once, and learn how many times it should do it. In this paper, we compare ACT to Repeat-RNN, a novel architecture based on repeating each sample a fixed number of times. We found surprising results, where Repeat-RNN performs as good as ACT in the selected tasks. Source code in TensorFlow and PyTorch is publicly available at https://imatge-upc.github.io/danifojo-2018-repeatrnn/",
    "Generative adversarial networks (GANs) are able to model the complex highdimensional distributions of real-world data, which suggests they could be effective for anomaly detection. However, few works have explored the use of GANs for the anomaly detection task. We leverage recently developed GAN models for anomaly detection, and achieve state-of-the-art performance on image and network intrusion datasets, while being several hundred-fold faster at test time than the only published GAN-based method.",
    "Natural Language Inference (NLI) task requires an agent to determine the logical relationship between a natural language premise and a natural language hypothesis. We introduce Interactive Inference Network (IIN), a novel class of neural network architectures that is able to achieve high-level understanding of the sentence pair by hierarchically extracting semantic features from interaction space. We show that an interaction tensor (attention weight) contains semantic information to solve natural language inference, and a denser interaction tensor contains richer semantic information. One instance of such architecture, Densely Interactive Inference Network (DIIN), demonstrates the state-of-the-art performance on large scale NLI copora and large-scale NLI alike corpus. It's noteworthy that DIIN achieve a greater than 20% error reduction on the challenging Multi-Genre NLI (MultiNLI) dataset with respect to the strongest published system.",
    "The ability to deploy neural networks in real-world, safety-critical systems is severely limited by the presence of adversarial examples: slightly perturbed inputs that are misclassified by the network. In recent years, several techniques have been proposed for increasing robustness to adversarial examples --- and yet most of these have been quickly shown to be vulnerable to future attacks. For example, over half of the defenses proposed by papers accepted at ICLR 2018 have already been broken. We propose to address this difficulty through formal verification techniques. We show how to construct provably minimally distorted adversarial examples: given an arbitrary neural network and input sample, we can construct adversarial examples which we prove are of minimal distortion. Using this approach, we demonstrate that one of the recent ICLR defense proposals, adversarial retraining, provably succeeds at increasing the distortion required to construct adversarial examples by a factor of 4.2.",
    "Deep neural networks (DNNs) have achieved impressive predictive performance due to their ability to learn complex, non-linear relationships between variables. However, the inability to effectively visualize these relationships has led to DNNs being characterized as black boxes and consequently limited their applications. To ameliorate this problem, we introduce the use of hierarchical interpretations to explain DNN predictions through our proposed method, agglomerative contextual decomposition (ACD). Given a prediction from a trained DNN, ACD produces a hierarchical clustering of the input features, along with the contribution of each cluster to the final prediction. This hierarchy is optimized to identify clusters of features that the DNN learned are predictive. Using examples from Stanford Sentiment Treebank and ImageNet, we show that ACD is effective at diagnosing incorrect predictions and identifying dataset bias. Through human experiments, we demonstrate that ACD enables users both to identify the more accurate of two DNNs and to better trust a DNN's outputs. We also find that ACD's hierarchy is largely robust to adversarial perturbations, implying that it captures fundamental aspects of the input and ignores spurious noise.",
    "In this work, we address the problem of musical timbre transfer, where the goal is to manipulate the timbre of a sound sample from one instrument to match another instrument while preserving other musical content, such as pitch, rhythm, and loudness. In principle, one could apply image-based style transfer techniques to a time-frequency representation of an audio signal, but this depends on having a representation that allows independent manipulation of timbre as well as high-quality waveform generation. We introduce TimbreTron, a method for musical timbre transfer which applies \"image\" domain style transfer to a time-frequency representation of the audio signal, and then produces a high-quality waveform using a conditional WaveNet synthesizer. We show that the Constant Q Transform (CQT) representation is particularly well-suited to convolutional architectures due to its approximate pitch equivariance. Based on human perceptual evaluations, we confirmed that TimbreTron recognizably transferred the timbre while otherwise preserving the musical content, for both monophonic and polyphonic samples.",
    "We consider the task of word-level language modeling and study the possibility of combining hidden-states-based short-term representations with medium-term representations encoded in dynamical weights of a language model. Our work extends recent experiments on language models with dynamically evolving weights by casting the language modeling problem into an online learning-to-learn framework in which a meta-learner is trained by gradient-descent to continuously update a language model weights.",
    "GANS are powerful generative models that are able to model the manifold of natural images. We leverage this property to perform manifold regularization by approximating the Laplacian norm using a Monte Carlo approximation that is easily computed with the GAN. When incorporated into the feature-matching GAN of Improved GAN, we achieve state-of-the-art results for GAN-based semi-supervised learning on the CIFAR-10 dataset, with a method that is significantly easier to implement than competing methods.",
    "We identify a class of over-parameterized deep neural networks with standard activation functions and cross-entropy loss which provably have no bad local valley, in the sense that from any point in parameter space there exists a continuous path on which the cross-entropy loss is non-increasing and gets arbitrarily close to zero. This implies that these networks have no sub-optimal strict local minima.",
    "Visual Question Answering (VQA) models have struggled with counting objects in natural images so far. We identify a fundamental problem due to soft attention in these models as a cause. To circumvent this problem, we propose a neural network component that allows robust counting from object proposals. Experiments on a toy task show the effectiveness of this component and we obtain state-of-the-art accuracy on the number category of the VQA v2 dataset without negatively affecting other categories, even outperforming ensemble models with our single model. On a difficult balanced pair metric, the component gives a substantial improvement in counting over a strong baseline by 6.6%.",
    "One of the challenges in the study of generative adversarial networks is the instability of its training. In this paper, we propose a novel weight normalization technique called spectral normalization to stabilize the training of the discriminator. Our new normalization technique is computationally light and easy to incorporate into existing implementations. We tested the efficacy of spectral normalization on CIFAR10, STL-10, and ILSVRC2012 dataset, and we experimentally confirmed that spectrally normalized GANs (SN-GANs) is capable of generating images of better or equal quality relative to the previous training stabilization techniques.",
    "Embedding graph nodes into a vector space can allow the use of machine learning to e.g. predict node classes, but the study of node embedding algorithms is immature compared to the natural language processing field because of a diverse nature of graphs. We examine the performance of node embedding algorithms with respect to graph centrality measures that characterize diverse graphs, through systematic experiments with four node embedding algorithms, four or five graph centralities, and six datasets. Experimental results give insights into the properties of node embedding algorithms, which can be a basis for further research on this topic.",
    "We introduce a new dataset of logical entailments for the purpose of measuring models' ability to capture and exploit the structure of logical expressions against an entailment prediction task. We use this task to compare a series of architectures which are ubiquitous in the sequence-processing literature, in addition to a new model class---PossibleWorldNets---which computes entailment as a \"convolution over possible worlds\". Results show that convolutional networks present the wrong inductive bias for this class of problems relative to LSTM RNNs, tree-structured neural networks outperform LSTM RNNs due to their enhanced ability to exploit the syntax of logic, and PossibleWorldNets outperform all benchmarks.",
    "Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance.   We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the \"lottery ticket hypothesis:\" dense, randomly-initialized, feed-forward networks contain subnetworks (\"winning tickets\") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.   We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.",
    "We characterize the singular values of the linear transformation associated with a standard 2D multi-channel convolutional layer, enabling their efficient computation. This characterization also leads to an algorithm for projecting a convolutional layer onto an operator-norm ball. We show that this is an effective regularizer; for example, it improves the test error of a deep residual network using batch normalization on CIFAR-10 from 6.2\\% to 5.3\\%.",
    "Understanding theoretical properties of deep and locally connected nonlinear network, such as deep convolutional neural network (DCNN), is still a hard problem despite its empirical success. In this paper, we propose a novel theoretical framework for such networks with ReLU nonlinearity. The framework explicitly formulates data distribution, favors disentangled representations and is compatible with common regularization techniques such as Batch Norm. The framework is built upon teacher-student setting, by expanding the student forward/backward propagation onto the teacher's computational graph. The resulting model does not impose unrealistic assumptions (e.g., Gaussian inputs, independence of activation, etc). Our framework could help facilitate theoretical analysis of many practical issues, e.g. overfitting, generalization, disentangled representations in deep networks.",
    "We present a Neural Program Search, an algorithm to generate programs from natural language description and a small number of input/output examples. The algorithm combines methods from Deep Learning and Program Synthesis fields by designing rich domain-specific language (DSL) and defining efficient search algorithm guided by a Seq2Tree model on it. To evaluate the quality of the approach we also present a semi-synthetic dataset of descriptions with test examples and corresponding programs. We show that our algorithm significantly outperforms a sequence-to-sequence model with attention baseline.",
    "Most state-of-the-art neural machine translation systems, despite being different in architectural skeletons (e.g. recurrence, convolutional), share an indispensable feature: the Attention. However, most existing attention methods are token-based and ignore the importance of phrasal alignments, the key ingredient for the success of phrase-based statistical machine translation. In this paper, we propose novel phrase-based attention methods to model n-grams of tokens as attention entities. We incorporate our phrase-based attentions into the recently proposed Transformer network, and demonstrate that our approach yields improvements of 1.3 BLEU for English-to-German and 0.5 BLEU for German-to-English translation tasks on WMT newstest2014 using WMT'16 training data.",
    "We introduce the problem of learning distributed representations of edits. By combining a \"neural editor\" with an \"edit encoder\", our models learn to represent the salient information of an edit and can be used to apply edits to new inputs. We experiment on natural language and source code edit data. Our evaluation yields promising results that suggest that our neural network models learn to capture the structure and semantics of edits. We hope that this interesting task and data source will inspire other researchers to work further on this problem.",
    "We propose a principled method for kernel learning, which relies on a Fourier-analytic characterization of translation-invariant or rotation-invariant kernels. Our method produces a sequence of feature maps, iteratively refining the SVM margin. We provide rigorous guarantees for optimality and generalization, interpreting our algorithm as online equilibrium-finding dynamics in a certain two-player min-max game. Evaluations on synthetic and real-world datasets demonstrate scalability and consistent improvements over related random features-based methods.",
    "This paper develops variational continual learning (VCL), a simple but general framework for continual learning that fuses online variational inference (VI) and recent advances in Monte Carlo VI for neural networks. The framework can successfully train both deep discriminative models and deep generative models in complex continual learning settings where existing tasks evolve over time and entirely new tasks emerge. Experimental results show that VCL outperforms state-of-the-art continual learning methods on a variety of tasks, avoiding catastrophic forgetting in a fully automatic way.",
    "This report has several purposes. First, our report is written to investigate the reproducibility of the submitted paper On the regularization of Wasserstein GANs (2018). Second, among the experiments performed in the submitted paper, five aspects were emphasized and reproduced: learning speed, stability, robustness against hyperparameter, estimating the Wasserstein distance, and various sampling method. Finally, we identify which parts of the contribution can be reproduced, and at what cost in terms of resources. All source code for reproduction is open to the public.",
    "In this paper, we propose a new feature extraction technique for program execution logs. First, we automatically extract complex patterns from a program's behavior graph. Then, we embed these patterns into a continuous space by training an autoencoder. We evaluate the proposed features on a real-world malicious software detection task. We also find that the embedding space captures interpretable structures in the space of pattern parts.",
    "We propose a single neural probabilistic model based on variational autoencoder that can be conditioned on an arbitrary subset of observed features and then sample the remaining features in \"one shot\". The features may be both real-valued and categorical. Training of the model is performed by stochastic variational Bayes. The experimental evaluation on synthetic data, as well as feature imputation and image inpainting problems, shows the effectiveness of the proposed approach and diversity of the generated samples.",
    "Variational Autoencoders (VAEs) were originally motivated (Kingma & Welling, 2014) as probabilistic generative models in which one performs approximate Bayesian inference. The proposal of $\\beta$-VAEs (Higgins et al., 2017) breaks this interpretation and generalizes VAEs to application domains beyond generative modeling (e.g., representation learning, clustering, or lossy data compression) by introducing an objective function that allows practitioners to trade off between the information content (\"bit rate\") of the latent representation and the distortion of reconstructed data (Alemi et al., 2018). In this paper, we reconsider this rate/distortion trade-off in the context of hierarchical VAEs, i.e., VAEs with more than one layer of latent variables. We identify a general class of inference models for which one can split the rate into contributions from each layer, which can then be tuned independently. We derive theoretical bounds on the performance of downstream tasks as functions of the individual layers' rates and verify our theoretical findings in large-scale experiments. Our results provide guidance for practitioners on which region in rate-space to target for a given application.",
    "Understanding and characterizing the subspaces of adversarial examples aid in studying the robustness of deep neural networks (DNNs) to adversarial perturbations. Very recently, Ma et al. (ICLR 2018) proposed to use local intrinsic dimensionality (LID) in layer-wise hidden representations of DNNs to study adversarial subspaces. It was demonstrated that LID can be used to characterize the adversarial subspaces associated with different attack methods, e.g., the Carlini and Wagner's (C&W) attack and the fast gradient sign attack.   In this paper, we use MNIST and CIFAR-10 to conduct two new sets of experiments that are absent in existing LID analysis and report the limitation of LID in characterizing the corresponding adversarial subspaces, which are (i) oblivious attacks and LID analysis using adversarial examples with different confidence levels; and (ii) black-box transfer attacks. For (i), we find that the performance of LID is very sensitive to the confidence parameter deployed by an attack, and the LID learned from ensembles of adversarial examples with varying confidence levels surprisingly gives poor performance. For (ii), we find that when adversarial examples are crafted from another DNN model, LID is ineffective in characterizing their adversarial subspaces. These two findings together suggest the limited capability of LID in characterizing the subspaces of adversarial examples.",
    "Generative adversarial networks (GANs) form a generative modeling approach known for producing appealing samples, but they are notably difficult to train. One common way to tackle this issue has been to propose new formulations of the GAN objective. Yet, surprisingly few studies have looked at optimization methods designed for this adversarial training. In this work, we cast GAN optimization problems in the general variational inequality framework. Tapping into the mathematical programming literature, we counter some common misconceptions about the difficulties of saddle point optimization and propose to extend techniques designed for variational inequalities to the training of GANs. We apply averaging, extrapolation and a computationally cheaper variant that we call extrapolation from the past to the stochastic gradient method (SGD) and Adam.",
    "Neural message passing algorithms for semi-supervised classification on graphs have recently achieved great success. However, for classifying a node these methods only consider nodes that are a few propagation steps away and the size of this utilized neighborhood is hard to extend. In this paper, we use the relationship between graph convolutional networks (GCN) and PageRank to derive an improved propagation scheme based on personalized PageRank. We utilize this propagation procedure to construct a simple model, personalized propagation of neural predictions (PPNP), and its fast approximation, APPNP. Our model's training time is on par or faster and its number of parameters on par or lower than previous models. It leverages a large, adjustable neighborhood for classification and can be easily combined with any neural network. We show that this model outperforms several recently proposed methods for semi-supervised classification in the most thorough study done so far for GCN-like models. Our implementation is available online.",
    "We identify obfuscated gradients, a kind of gradient masking, as a phenomenon that leads to a false sense of security in defenses against adversarial examples. While defenses that cause obfuscated gradients appear to defeat iterative optimization-based attacks, we find defenses relying on this effect can be circumvented. We describe characteristic behaviors of defenses exhibiting the effect, and for each of the three types of obfuscated gradients we discover, we develop attack techniques to overcome it. In a case study, examining non-certified white-box-secure defenses at ICLR 2018, we find obfuscated gradients are a common occurrence, with 7 of 9 defenses relying on obfuscated gradients. Our new attacks successfully circumvent 6 completely, and 1 partially, in the original threat model each paper considers.",
    "Methods that learn representations of nodes in a graph play a critical role in network analysis since they enable many downstream learning tasks. We propose Graph2Gauss - an approach that can efficiently learn versatile node embeddings on large scale (attributed) graphs that show strong performance on tasks such as link prediction and node classification. Unlike most approaches that represent nodes as point vectors in a low-dimensional continuous space, we embed each node as a Gaussian distribution, allowing us to capture uncertainty about the representation. Furthermore, we propose an unsupervised method that handles inductive learning scenarios and is applicable to different types of graphs: plain/attributed, directed/undirected. By leveraging both the network structure and the associated node attributes, we are able to generalize to unseen nodes without additional training. To learn the embeddings we adopt a personalized ranking formulation w.r.t. the node distances that exploits the natural ordering of the nodes imposed by the network structure. Experiments on real world networks demonstrate the high performance of our approach, outperforming state-of-the-art network embedding methods on several different tasks. Additionally, we demonstrate the benefits of modeling uncertainty - by analyzing it we can estimate neighborhood diversity and detect the intrinsic latent dimensionality of a graph.",
    "Convolutional Neural Networks (CNNs) have become the method of choice for learning problems involving 2D planar images. However, a number of problems of recent interest have created a demand for models that can analyze spherical images. Examples include omnidirectional vision for drones, robots, and autonomous cars, molecular regression problems, and global weather and climate modelling. A naive application of convolutional networks to a planar projection of the spherical signal is destined to fail, because the space-varying distortions introduced by such a projection will make translational weight sharing ineffective.   In this paper we introduce the building blocks for constructing spherical CNNs. We propose a definition for the spherical cross-correlation that is both expressive and rotation-equivariant. The spherical correlation satisfies a generalized Fourier theorem, which allows us to compute it efficiently using a generalized (non-commutative) Fast Fourier Transform (FFT) algorithm. We demonstrate the computational efficiency, numerical accuracy, and effectiveness of spherical CNNs applied to 3D model recognition and atomization energy regression.",
    "This paper shows how one can directly apply natural language processing (NLP) methods to classification problems in cheminformatics. Connection between these seemingly separate fields is shown by considering standard textual representation of compound, SMILES. The problem of activity prediction against a target protein is considered, which is a crucial part of computer aided drug design process. Conducted experiments show that this way one can not only outrank state of the art results of hand crafted representations but also gets direct structural insights into the way decisions are made.",
    "The inclusion of Computer Vision and Deep Learning technologies in Agriculture aims to increase the harvest quality, and productivity of farmers. During postharvest, the export market and quality evaluation are affected by assorting of fruits and vegetables. In particular, apples are susceptible to a wide range of defects that can occur during harvesting or/and during the post-harvesting period. This paper aims to help farmers with post-harvest handling by exploring if recent computer vision and deep learning methods such as the YOLOv3 (Redmon & Farhadi (2018)) can help in detecting healthy apples from apples with defects.",
    "We present two simple ways of reducing the number of parameters and accelerating the training of large Long Short-Term Memory (LSTM) networks: the first one is \"matrix factorization by design\" of LSTM matrix into the product of two smaller matrices, and the second one is partitioning of LSTM matrix, its inputs and states into the independent groups. Both approaches allow us to train large LSTM networks significantly faster to the near state-of the art perplexity while using significantly less RNN parameters.",
    "State-of-the-art deep reading comprehension models are dominated by recurrent neural nets. Their sequential nature is a natural fit for language, but it also precludes parallelization within an instances and often becomes the bottleneck for deploying such models to latency critical scenarios. This is particularly problematic for longer texts. Here we present a convolutional architecture as an alternative to these recurrent architectures. Using simple dilated convolutional units in place of recurrent ones, we achieve results comparable to the state of the art on two question answering tasks, while at the same time achieving up to two orders of magnitude speedups for question answering."
  ],
  "sampled": [
    "Okay, here's a more relatable version of that text:\n\n\"Ever wondered how to get your image recognition project running on multiple GPUs?  We built a version of AlexNet (a powerful image recognition model) using a tool called Theano, and figured out how to split the work across multiple GPUs without too much hassle.  Turns out, running our code on 2 GPUs is about as fast as the popular Caffe library running on a single, super-powered GPU.  Oh, and as far as we know, this is the first time anyone's shared the code for an AlexNet implementation in Python!  \" \n",
    "Imagine you want to teach a computer to understand the patterns in some data – like pictures or music. We can use a special type of artificial intelligence called a \"deep Boltzmann machine\" to do this. \n\nThis work shows that even a \"narrow\" deep Boltzmann machine – one where each layer is relatively small – can theoretically learn any pattern, as long as it has enough layers. We proved this by demonstrating how these machines, under certain conditions, act like simpler networks that pass information in only one direction. \n\nWe even figured out some limits on how \"deep\" and \"wide\" these networks need to be to learn any pattern. Turns out, these slim deep Boltzmann machines are surprisingly efficient learners, on par with other popular models, at least according to the best knowledge we have today! \n",
    "This study introduces Stochastic Recurrent Networks (STORNs), extending recurrent neural networks by incorporating latent variables using variational inference techniques. STORNs offer several advantages: they are trainable via stochastic gradient methods, accommodate structured and multimodal conditional inputs at each time step, provide a reliable estimator of the marginal likelihood, and encompass deterministic recurrent neural networks as a special case. The effectiveness of STORNs is assessed through experiments on four polyphonic musical datasets and motion capture data. \n",
    "This paper presents a novel framework for the dynamic adaptation of optimization hyperparameters during the training process, termed \"hot swapping.\" This approach enables the online modification of hyperparameter values without interrupting the training procedure. The efficacy of hot swapping is investigated in the context of adaptive learning rate selection, employing an explore-exploit strategy borrowed from the multi-armed bandit literature. Empirical evaluations conducted on a benchmark neural network architecture demonstrate that the proposed hot swapping method consistently outperforms established adaptive learning rate algorithms, including AdaDelta, as well as stochastic gradient descent with exhaustive hyperparameter search. This performance advantage is observed across a range of evaluation metrics, indicating the robustness and efficiency of the hot swapping approach for online hyperparameter optimization. \n",
    "Many contemporary multiclass and multilabel classification problems involve high-dimensional output spaces, posing computational and statistical challenges. Label embedding techniques have emerged as a promising approach to address these challenges. This work establishes a novel connection between rank-constrained estimation and low-dimensional label embeddings, leading to the development of a highly efficient label embedding algorithm applicable to both multiclass and multilabel scenarios. Specifically, the proposed algorithm leverages a randomized approach for partial least squares, achieving an exponential speedup in runtime compared to deterministic algorithms.  The empirical effectiveness of this technique is demonstrated on two large-scale benchmark datasets: the Large Scale Hierarchical Text Classification (LSHTC) dataset and the Open Directory Project (ODP) dataset.  In both cases, the proposed method achieves state-of-the-art classification performance, demonstrating its practical utility for handling large-scale multiclass and multilabel problems. \n",
    "Developing artificial intelligence capable of sophisticated reasoning requires models that can effectively learn both explicit and implicit relationships within data. This work introduces Dynamic Adaptive Network Intelligence (DANI), a novel approach for efficiently extracting such intricate relationships from data, even with minimal supervision. DANI's strength lies in its ability to dynamically adapt its internal representations, allowing it to uncover subtle dependencies and patterns that elude traditional methods.  \n\nWe showcase DANI's exceptional performance on challenging question-answering tasks from the bAbI dataset, a benchmark specifically designed to assess reasoning abilities in machine learning models. DANI significantly surpasses existing state-of-the-art approaches on these tasks, including those presented in Weston et al. (2015), demonstrating its capacity for complex reasoning and its potential to advance the field of machine learning towards more intelligent systems. \n",
    "Traditional automatic speech recognition (ASR) systems typically rely on handcrafted spectral features such as Mel-Frequency Cepstral Coefficients (MFCCs) or Perceptual Linear Prediction (PLP) coefficients. These feature extraction techniques are based on prior knowledge of human speech perception and production mechanisms. However, recent advancements in deep learning have demonstrated the efficacy of Convolutional Neural Networks (CNNs) for directly learning discriminative representations from raw speech signals, thereby circumventing the need for explicit feature engineering. Notably, CNN-based acoustic models have been shown to achieve comparable or superior phoneme recognition accuracy compared to conventional Hidden Markov Model/Artificial Neural Network (HMM/ANN) hybrid systems, while utilizing significantly fewer parameters. \n\nMotivated by these findings, this study investigates the feasibility of employing a simplified linear classifier within a CNN-based acoustic modeling framework. This architecture enables the network to learn linearly separable features directly from the raw speech waveform, eliminating the reliance on handcrafted features and potentially capturing complementary information. Through rigorous experimentation, we demonstrate that the proposed system achieves comparable or superior performance to a Multi-Layer Perceptron (MLP) classifier trained on conventional cepstral features. This suggests that CNNs can effectively learn robust and discriminative representations from raw speech, enabling the development of highly accurate and efficient ASR systems without the need for domain-specific feature engineering. \n",
    "This paper details a novel neural network training framework implemented within the Kaldi speech recognition toolkit, specifically designed to handle the massive datasets crucial for robust automatic speech recognition. Recognizing the necessity for efficient large-scale training, our framework exhibits exceptional scalability across diverse hardware configurations, encompassing multi-GPU and multi-core systems.\n\nTo minimize communication overhead often plaguing distributed training, we introduce a parameter averaging scheme. This method periodically averages neural network parameters across all machines, typically at intervals of one to two minutes. Remarkably, this simple averaging, when used in isolation, does not guarantee optimal performance.  \n\nTherefore, we introduce a second, pivotal component: an efficient and scalable approximation of Natural Gradient for Stochastic Gradient Descent (NG-SGD). This novel approach not only enhances the effectiveness of our parameter averaging scheme but also independently accelerates convergence rates, even on single-machine setups. The synergistic combination of these techniques results in a powerful and efficient training framework capable of handling the demands of modern deep learning for speech recognition. \n",
    "This paper introduces a powerful new method for analyzing and refining the invariance properties of learned representations, a critical aspect for robust performance in computer vision tasks.  Our approach centers on detecting \"linearization\", a general form of invariance where a transformation's effect is confined within a low-dimensional subspace of the representation.\n\nWe achieve this by generating \"representational geodesics\". Given two images related by a specific transformation, our method synthesizes a sequence of images along the shortest path connecting them within the representation space. If the representation successfully linearizes the transformation, this geodesic should visually depict a smooth, gradual application of that transformation.\n\nApplying our method to a state-of-the-art image classification network, we uncover a critical weakness:  the network fails to linearize common transformations like translation, rotation, and dilation.  However, our method doesn't stop at diagnosis. We demonstrate how to leverage representational geodesics to guide the refinement of learned representations, leading to a modified network capable of linearizing a variety of geometric transformations.  This ability to both diagnose and rectify representational shortcomings underscores the significance of our method as an invaluable tool for advancing robust and reliable computer vision systems. \n",
    "Unlocking the secrets of deep learning has become one of the most exciting frontiers in artificial intelligence!  We're all curious: Why does it work so well? What kind of magic happens inside those layers?  This work embarks on a quest to uncover the inner workings of deep learning through the elegant lens of group theory, opening up a whole new avenue for understanding this powerful tool.\n\nOne intriguing aspect of deep learning is the concept of \"pretraining,\" where we build up the network layer by layer, teaching it to understand the data in increasing levels of detail.  We delve into the profound implications of pretraining, revealing a fascinating connection to the world of group actions and their orbits.  \n\nThink of it like this: imagine the network is searching for the most fundamental building blocks of the data, those with the simplest structures. We show that pretraining, in a way, guides the network to discover these essential building blocks, represented by minimal orbits in a hidden \"shadow\" group. This elegant mathematical framework unveils why deep learning excels at first finding simple patterns and then gradually building up to more complex representations as we go deeper into the network.  It's a journey of uncovering increasingly intricate and abstract patterns hidden within the data! \n",
    "This paper introduces the Stacked What-Where Auto-encoder (SWWAE), a powerful new architecture that seamlessly integrates discriminative and generative learning pathways. SWWAE provides a single, unified framework for supervised, semi-supervised, and unsupervised learning, eliminating the need for cumbersome sampling techniques during training.\n\nOur SWWAE architecture leverages the strengths of convolutional neural networks (Convnets) for encoding input data and deconvolutional networks (Deconvnets) for reconstruction. By incorporating reconstruction terms into the objective function, we ensure that the hidden representations learned by the Deconvnet closely align with those of the Convnet.\n\nFurthermore, our novel \"what-where\" pooling mechanism further enhances the representational capacity of SWWAE. Each pooling layer generates two distinct sets of features: the \"what\" features are passed to subsequent layers for higher-level abstraction, while the complementary \"where\" features provide spatial information to the corresponding layer in the decoder. This strategy enables SWWAE to capture both global content and precise spatial relationships within the data, resulting in superior performance across a variety of learning paradigms. \n",
    "This study addresses the challenge of learning specialized word embeddings optimized for specific lexical relations. We propose a novel algorithm that takes pre-trained word embeddings as input and compresses them to create low-dimensional representations tailored for predicting a target bilexical relation, such as synonymy or antonymy. Through rigorous empirical evaluation, we demonstrate that these task-specific embeddings yield substantial improvements in both the accuracy and computational efficiency of lexical prediction tasks compared to using general-purpose word embeddings. \n\n\n",
    "Distinguished colleagues, \n\nWe stand at the forefront of a new era in machine learning. Today, I present to you a significant advancement in this domain: a novel generative model for deep convolutional dictionary learning.\n\nThis model distinguishes itself through the integration of a groundbreaking probabilistic pooling operation, seamlessly woven into its multi-layered architecture. This innovation enables a dual learning paradigm: an efficient bottom-up pretraining stage, rapidly discovering salient features, followed by a refined top-down approach for precise model optimization.\n\nRigorous empirical validation has showcased the exceptional capabilities of this model.  We observe a profound ability to extract multi-layered features from image data, leading to demonstrably superior classification performance. Notably, on the MNIST and Caltech 101 datasets, our model has achieved state-of-the-art results, showcasing its remarkable potential for advancing the field.\n\nThank you.\n",
    "This study presents a model for generating images from natural language descriptions, building upon recent advancements in generative modeling. The proposed model employs an iterative approach, synthesizing images by sequentially drawing patches on a canvas while attending to relevant words in the input description. Trained on the Microsoft COCO dataset, the model is evaluated against existing generative models on image generation and retrieval tasks. Results indicate that the model generates higher-quality images compared to baseline approaches and exhibits the ability to compose novel scenes reflecting previously unseen captions within the dataset. \n",
    "Convolutional Neural Networks (CNNs), while achieving remarkable success on large datasets, often struggle with overfitting when limited training data is available. This poses a significant challenge in domains where labeled data is scarce or expensive to acquire. This work addresses this limitation by introducing an efficient Bayesian CNN framework that demonstrates enhanced robustness to overfitting in low-data regimes.\n\nOur approach centers on introducing a probability distribution over the CNN's kernel weights, effectively capturing uncertainty in the model parameters. We utilize Bernoulli variational distributions to approximate the intractable posterior distribution, eliminating the need for additional model parameters.  This strategy provides a principled approach to regularize the network and mitigate overfitting.\n\nFurthermore, we establish a theoretical connection between dropout training, a popular regularization technique, and approximate Bayesian inference. This insight reframes dropout as a form of variational inference, enabling seamless integration of our Bayesian CNN within existing deep learning frameworks without incurring additional computational overhead.  However, this analysis also highlights a previously overlooked limitation in the traditional dropout method.\n\nThrough extensive empirical evaluation on the CIFAR-10 dataset, our proposed Bayesian CNN consistently outperforms standard CNNs and advances the state-of-the-art in low-data scenarios, demonstrating substantial improvements in classification accuracy. This underscores the efficacy and practicality of our method for real-world applications with limited labeled data. \n",
    "This paper introduces a novel method for building computationally efficient Convolutional Neural Networks (CNNs) by representing convolutional filters with low-rank approximations. Unlike previous approaches that rely on pre-trained networks, we learn a compact set of basis filters from scratch. During training, our method dynamically combines these basis filters to form more complex, discriminative filters. \n\nWe validate our approach across various CNN architectures and datasets, including CIFAR, ILSVRC, and MIT Places. Our results consistently demonstrate comparable or superior accuracy to conventional CNNs while significantly reducing computational costs. For instance, our method achieves comparable accuracy on ImageNet using an improved VGG-11 network with 41% less computation and only 24% of the parameters. Notably, we achieve a top-5 validation accuracy of 89.7% on ImageNet while reducing computation by 16% compared to the original VGG-11. Our method consistently reduces computational demands while maintaining or even exceeding the accuracy of conventional CNNs, paving the way for more efficient deep learning models. \n",
    "Distributed word representations have revolutionized Natural Language Processing (NLP), yet they suffer from a critical limitation: the inability to adequately capture the multiple meanings of polysemic words.  This deficiency hinders the accuracy and expressiveness of word embeddings and, consequently, limits the performance of downstream NLP tasks.\n\nThis paper tackles this challenge head-on, introducing a novel and elegant model that empowers existing word embedding techniques to generate distinct representations for different senses of polysemic words. Our approach is not only conceptually sound but also computationally efficient, making it practical for large-scale NLP applications.\n\nThrough rigorous evaluation, we demonstrate that our model excels at disambiguating word senses while maintaining computational efficiency. This breakthrough paves the way for a new generation of word representations capable of capturing the richness and subtlety of human language, ultimately leading to more accurate and sophisticated NLP models. \n",
    "Traditional feed-forward neural network language models (LMs) typically project word history into a single high-dimensional space.  We propose a novel alternative: the Diverse Embedding Neural Network (DENNLM).  DENNLM maps the input word history onto multiple, distinct low-dimensional subspaces. By incorporating a specialized loss function, our training process encourages diversity among these subspaces, enabling the model to capture richer linguistic information. Experiments on the Penn Treebank dataset demonstrate that DENNLM's unique approach leads to substantial improvements in language modeling performance.  \n",
    "Imagine being able to model the complexity of high-dimensional data in a way that is both powerful and elegant.  We introduce NICE (Non-linear Independent Component Estimation), a deep learning framework that achieves just that!  \n\nNICE is built on the idea that a truly good representation transforms data into a simplified form that's easy to model. Our approach learns a special non-linear transformation that maps the input data to a latent space where the components become independent – like magic!  \n\nWe designed NICE with clever parameterizations that make it incredibly efficient.  Computing the necessary transformations and inverses is a breeze, while still allowing us to learn rich, complex relationships within the data using deep neural networks.\n\nBut the best part? NICE keeps things simple and elegant by directly optimizing the exact log-likelihood – no need for complicated approximations.  And generating new samples? Easy – just use ancestral sampling.\n\nThrough experiments on four image datasets, we showcase NICE's remarkable ability to learn powerful generative models.  And to top it off, we demonstrate how NICE can be seamlessly applied to inpainting tasks, showcasing its versatility.   \n\n\n",
    "This paper introduces Deep Linear Discriminant Analysis (DeepLDA), a novel framework that unifies the strengths of classic Linear Discriminant Analysis (LDA) with the representation learning capabilities of deep neural networks.  DeepLDA learns linearly separable latent representations in an end-to-end manner, effectively performing dimensionality reduction and classification jointly.\n\nUnlike traditional LDA, which operates on hand-crafted features, DeepLDA leverages a deep neural network to learn a non-linear transformation of the input data. Instead of directly maximizing the likelihood of target labels, DeepLDA optimizes an objective function derived from the generalized eigenvalue problem in LDA. This objective encourages the network to learn feature representations characterized by:\n\n**(a) Low intra-class variance:** Features from the same class are tightly clustered in the latent space.\n**(b) High inter-class variance:** Features from different classes are well-separated.\n\nOur proposed objective function is compatible with stochastic gradient descent and backpropagation, enabling efficient training of deep architectures.  We evaluate DeepLDA on three benchmark image recognition datasets: MNIST, CIFAR-10, and STL-10. Our method achieves competitive results on MNIST and CIFAR-10. Notably, DeepLDA outperforms a network with the same architecture trained using categorical cross-entropy loss on a supervised STL-10 dataset, demonstrating its superior ability to learn discriminative and compact representations. \n",
    "Initializing the weights of deep neural networks effectively is crucial for successful training. Poor initialization can hinder convergence and lead to suboptimal performance.  This paper introduces Layer-sequential unit-variance (LSUV), a simple yet powerful method for weight initialization that addresses these challenges.\n\nLSUV consists of two straightforward steps:\n\n1. **Orthonormal Pre-Initialization:**  We initialize the weights of each convolutional or fully connected layer using orthonormal matrices. This helps ensure that the initial signal propagation through the network is well-behaved. \n\n2. **Layer-wise Variance Normalization:** Starting from the first layer and proceeding sequentially, we normalize the variance of each layer's output to be equal to one. This normalization step helps prevent vanishing or exploding gradients, which can impede training, especially in very deep networks. \n\nDespite its simplicity, LSUV consistently yields excellent results across a variety of network architectures and activation functions, including maxout, ReLU variants, and tanh.  Our experiments demonstrate that LSUV achieves:\n\n* **Competitive or superior accuracy:** LSUV matches or surpasses the performance of standard initialization methods.\n* **Fast convergence:** LSUV is at least as fast as complex initialization schemes specifically designed for very deep networks, such as FitNets and Highway Networks.\n\nWe validated LSUV on various architectures, including GoogLeNet, CaffeNet, FitNets, and Residual Networks. Our results on benchmark datasets (MNIST, CIFAR-10/100, and ImageNet) demonstrate state-of-the-art or near state-of-the-art performance. LSUV's effectiveness, simplicity, and efficiency make it a compelling choice for weight initialization in deep learning. \n",
    "Imagine being able to transform complex data, like images, into a beautifully simple Gaussian form!  That's exactly what our new method achieves. We introduce an elegant, parametric, and non-linear transformation that excels at Gaussianizing data from natural images.\n\nOur method works in a few key steps. First, we apply a linear transformation to the data. Then, each component is normalized by a clever \"pooled activity measure\" that captures dependencies between different parts of the data.  This measure is calculated by combining rectified and exponentiated components using learned weights and a constant.\n\nTo find the optimal transformation, we train our model on a vast collection of natural images.  We directly minimize the negentropy of the transformed data, effectively pushing it towards a Gaussian distribution.  The results are impressive! Our optimized transformation outperforms alternative methods, such as ICA and radial Gaussianization, by achieving significantly lower mutual information between the transformed components – a clear indicator of successful Gaussianization.\n\nBut it gets even better! This transformation is differentiable and easily invertible, allowing us to build a powerful generative model for images.  Samples from our model exhibit a striking resemblance to natural image patches. We further showcase the versatility of our model by using it as a prior for removing noise from images.  \n\nAs a testament to its elegance and efficiency, our transformation can be seamlessly stacked into multiple layers, forming a deep network.  Each layer is optimized using the same Gaussianization objective, providing an unsupervised approach to learning hierarchical representations. \n",
    "Looking for a way to speed up convolutional neural networks without sacrificing accuracy?  We introduce \"flattened\" convolutional neural networks, specifically designed for lightning-fast feedforward execution.  \n\nOur approach leverages the inherent redundancy in traditional convolutional filters. Instead of using computationally expensive 3D filters, we employ a series of consecutive 1D filters applied across all spatial directions.  Surprisingly, this simplification doesn't compromise accuracy!  \n\nWe validate our approach on various datasets, consistently demonstrating comparable performance to conventional CNNs. But here's the kicker: our flattened convolution pipelines achieve a remarkable 2x speedup during feedforward inference due to a significant reduction in the number of parameters. \n\nAnd the best part? Our method integrates seamlessly into existing training workflows and requires zero manual tuning or post-processing.  It's a win-win for efficiency and ease of use! \n",
    "This paper introduces Purine, a groundbreaking deep learning framework engineered for unparalleled performance and flexibility in parallel computing environments.  Purine reimagines deep network representation using a novel bipartite graph (bi-graph) abstraction.  By representing computations as interconnected operators and data tensors within this graph, Purine unlocks unprecedented opportunities for efficient parallelization.\n\nAt the heart of Purine lies a powerful event-driven task dispatcher. This intelligent dispatcher analyzes the network's bi-graph structure and dynamically schedules tasks across multiple GPUs and/or CPUs, spanning single or multiple PCs.  This flexible design allows for seamless implementation of various parallelism schemes, liberating researchers from the burden of low-level coding for each specific hardware configuration.\n\nBut there's more! Purine's task dispatcher doesn't just parallelize computations; it optimizes data movement as well. By strategically overlapping memory transfers with computations, Purine minimizes communication overhead, a notorious bottleneck in parallel computing.  This meticulous orchestration empowers us to achieve near-linear acceleration, maximizing hardware utilization and slashing training times.\n\nPurine signifies a paradigm shift in deep learning frameworks, offering a potent combination of expressiveness, efficiency, and scalability.  With its elegant bi-graph abstraction and intelligent task dispatching, Purine empowers researchers to push the boundaries of deep learning, effortlessly harnessing the power of parallel computing. \n",
    "This paper introduces the Variational Recurrent Auto-Encoder (VRAE), a novel deep learning model that blends the strengths of Recurrent Neural Networks (RNNs) and Stochastic Gradient Variational Bayes (SGVB).  VRAE provides an effective framework for unsupervised learning on large time series datasets, mapping variable-length sequences to a compact latent space representation. As a generative model, VRAE can synthesize new time series data from samples drawn from this learned latent space. Importantly, VRAE can leverage readily available unlabeled data to enhance supervised RNN training, providing improved initialization for both network weights and hidden states. \n",
    "Imagine embarking on a quest to build the most efficient deep learning machines, machines that are both compact and energy-sipping. One obstacle quickly emerges: those pesky multipliers, the arithmetic workhorses of neural networks, are notorious for their insatiable appetite for space and power. \n\nUndeterred, we set out on an experiment. We gathered a team of cutting-edge Maxout networks, renowned for their performance, and challenged them with three classic image recognition tasks: deciphering handwritten digits in MNIST, classifying objects in CIFAR-10, and tackling house numbers in SVHN.\n\nBut here's the twist: we trained each network multiple times, varying the precision of their multiplication operations. We experimented with the full buffet: luxurious floating-point numbers, disciplined fixed-point numbers, and the innovative dynamic fixed-point numbers.\n\nTo our delight, a remarkable pattern emerged.  These powerful networks didn't require excessive precision to excel. In fact, they achieved impressive accuracy even when their multiplication operations were constrained to a surprisingly low bit width. Astonishingly, we trained Maxout networks using a mere 10-bit multiplications, achieving performance comparable to their high-precision counterparts!\n\nOur findings carry profound implications for the future of deep learning hardware.  By embracing low-precision arithmetic, we can unlock dramatic reductions in size and energy consumption, paving the way for efficient and sustainable deep learning on a massive scale. \n",
    "Semantic segmentation, while powerful, typically demands a lot of expensive, pixel-level annotation. Multiple instance learning (MIL) offers a way to cut down on this annotation burden by allowing models to learn from weaker, image-level labels.  \n\nWe introduce a new MIL approach to semantic segmentation using a fully convolutional network. Our method learns to segment images into different semantic classes, using only image-level labels as supervision. We train the network end-to-end, meaning it simultaneously learns to represent the image content and figure out which pixels belong to the assigned image-level labels.\n\nThe fully convolutional architecture provides several advantages: it can handle images of any size, doesn't need pre-processing steps like object proposals, and outputs a helpful pixel-wise \"loss map\" that guides the model towards identifying relevant image regions. Moreover, our approach effectively utilizes the extra information available in images with multiple labels, further boosting performance.  We demonstrate the promise of our method through experiments on the PASCAL VOC segmentation challenge. \n",
    "Nested dropout, a technique originally used for prioritizing units in unsupervised autoencoders, shows promise for optimizing convolutional neural networks (CNNs). We investigate whether applying nested dropout to convolutional layers during backpropagation training can systematically determine the optimal CNN size for a given task and dataset, balancing accuracy with computational efficiency. \n",
    "Okay, so imagine you're teaching a computer to learn from tons of data.  Stochastic gradient descent (SGD) is like the go-to method, right? But SGD can be a bit finicky – you need to carefully adjust the learning rate (how quickly it learns) and deal with noisy data.\n\nWell, we came up with a cool new trick: an adaptive learning rate algorithm that figures out the best learning rate automatically!  It's like having a self-tuning engine for your learning algorithm.\n\nHere's how it works: we use information about the \"curvature\" of the problem (think of it like the terrain the algorithm is trying to navigate) to guide the learning rate.  We even threw in a clever variance reduction technique to speed things up even more. \n\nWe tested it out on some deep neural networks, and guess what?  Our method outperformed some of the most popular SGD variations! It's like giving your learning algorithm a turbo boost! \n\n\n",
    "This work explores how to build visual representations that simplify the understanding of 3D object motion. We start with the premise that ideal representations should transform linearly under changes in viewpoint.  Using group representation theory, we demonstrate that any such representation can be decomposed into a combination of fundamental, \"irreducible\" representations.  \n\nFurthermore, we establish a connection between these irreducible representations and statistical independence, showing that they tend to be decorrelated.  However, under realistic conditions with partial observability (like a 2D image of a 3D scene), object motion no longer transforms linearly. To overcome this, we propose using latent representations that capture the underlying 3D transformations. We illustrate this concept through a model of rotating objects, where a latent representation based on the 3D rotation group (SO(3)) effectively captures object motion despite the limitations of 2D image data. \n",
    "Imagine searching for a needle in a haystack, but instead of a single needle, you're looking for the one that's most similar to the one you hold. This is the essence of Maximum Inner Product Search (MIPS), a crucial task powering recommendation systems and large-scale classification.\n\nWhile sophisticated techniques like locality-sensitive hashing (LSH) and intricate tree-based methods have been developed for fast approximate MIPS, we propose a surprisingly simple yet powerful alternative: k-means clustering.\n\nOur approach is elegantly straightforward. We first transform the MIPS problem into a Maximum Cosine Similarity Search (MCSS) problem. Then, we unleash the power of spherical k-means clustering, a variant specifically designed for directional data. \n\nThe results are astounding.  Across diverse datasets, from standard recommendation benchmarks to massive word embedding spaces, our k-means based approach dramatically outperforms state-of-the-art LSH and tree-based methods, achieving the same level of accuracy with significantly faster speeds.\n\nBut there's more! Our method exhibits exceptional robustness to noisy queries, gracefully handling imperfections in the search process.  This robustness is critical for real-world applications where data is often incomplete or imprecise. \n\nIn a world dominated by complexity, our work demonstrates that even a simple method like k-means clustering, when applied thoughtfully, can unlock extraordinary efficiency and resilience in tackling fundamental search problems.\n",
    "The variational autoencoder (VAE) is a generative model that combines a top-down generative network with a bottom-up recognition network to approximate posterior inference.  This study examines limitations inherent in the VAE's assumptions regarding posterior inference, specifically the assumptions of approximate posterior factoriality and the use of nonlinear regression for parameter estimation.  Empirical results indicate that these assumptions can lead to overly simplified latent representations, hindering the model's ability to utilize the full capacity of the network.\n\nTo address these limitations, this study introduces the importance weighted autoencoder (IWAE). Sharing the same architecture as the VAE, the IWAE employs a tighter log-likelihood lower bound derived from importance weighting.  This modification enables the recognition network to leverage multiple samples for approximating the posterior distribution, affording greater flexibility in modeling complex posteriors that may deviate from the VAE's assumptions. \n\nEmpirical evaluations demonstrate that IWAEs consistently learn richer latent space representations compared to VAEs. Consequently, IWAEs achieve improved test log-likelihood performance on density estimation benchmarks, highlighting their enhanced representational capacity and potential for generating higher-quality samples. \n",
    "Imagine a world where deep learning models could become incredibly efficient, shrinking their massive data footprints without sacrificing accuracy! That's the tantalizing possibility we explore by delving into the hidden world of reduced precision computing in Convolutional Neural Networks (CNNs). \n\nWe embarked on a quest to understand how using lower-precision numbers (think fewer bits, like using a smaller measuring stick) affects the accuracy of these powerful models.  And guess what we found? Not all layers in a CNN are created equal!  Some layers are remarkably tolerant to reduced precision, while others are more sensitive. \n\nThis discovery opens up exciting new possibilities! By tailoring the precision of each layer, we can optimize for both efficiency and accuracy.  Imagine it like a perfectly balanced recipe where each ingredient is used in just the right amount!\n\nOur research dives deep into this phenomenon, analyzing a diverse range of CNN architectures.  We present a method for finding the optimal low-precision configuration for each network, unlocking impressive results.  Get this – we can reduce the data footprint of these networks by an average of 74%, and in some cases, by a whopping 92%!  And the best part?  We maintain near-identical accuracy, losing less than 1% in relative performance. \n\nThis breakthrough paves the way for deploying powerful deep learning models on resource-constrained devices, bringing the magic of AI to even the smallest gadgets! \n",
    "Imagine you have a network of data points, all connected by invisible threads of similarity.  The strength of these connections – the key to unlocking the hidden patterns within the data – depends on how we measure the distance between these points.  \n\nTraditional graph-based learning often relies on the familiar Euclidean distance, like measuring the straight-line distance between points on a map. However, this one-size-fits-all approach might not always capture the most relevant relationships for a specific task.\n\nWe argue that tailoring the very way we measure distances between data points, effectively learning a new map where relevant similarities are amplified, can significantly enhance the performance of graph-based algorithms.  \n\nOur work introduces an innovative algorithm that discovers the optimal way to represent and connect data points for a given problem.  By learning a task-specific distance metric, we can build more informative graphs, leading to more accurate and insightful results. It's like giving graph-based learning a custom-designed compass, guiding it towards better solutions. \n",
    "Hypernymy, the relationship between a word and its broader category, textual entailment, the art of inferring meaning between sentences, and the captivating challenge of image captioning - these seemingly disparate concepts are interwoven by a single, elegant thread: the visual-semantic hierarchy. \n\nThis paper posits that explicitly modeling the intricate, hierarchical relationships between words, sentences, and images holds the key to unlocking deeper understanding in the realm of visual-semantic representation. \n\nTo this end, we unveil a novel method for learning ordered representations, capturing the essence of these hierarchical structures. Our approach, akin to arranging the pieces of a complex puzzle, reveals the inherent order within visual and textual data.\n\nThe elegance of this framework is reflected in its versatility. We demonstrate its efficacy across a spectrum of tasks, from predicting hypernym relationships to retrieving images guided by natural language descriptions.  In each domain, our ordered representations surpass existing methods, demonstrating the power of embracing the inherent hierarchy of visual and semantic information. \n\n\n",
    "Imagine training a deep learning model that not only excels at its given task but also exhibits remarkable resilience to slight perturbations in the data.  We introduce a novel regularization technique called Virtual Adversarial Training (VAT), inspired by the concept of \"local distributional smoothness\" (LDS). \n\nVAT encourages the model to produce similar outputs for similar inputs, effectively smoothing out its decision boundaries.  This approach draws inspiration from adversarial training but with a key distinction: VAT operates without relying on label information, making it remarkably effective for both supervised and semi-supervised learning scenarios. \n\nHere's how VAT works its magic: it cleverly crafts \"virtual\" adversarial examples by identifying directions in the input space where the model's predictions are most sensitive to small changes. By incorporating these virtual adversaries during training, VAT compels the model to learn more robust and generalized representations.\n\nThe beauty of VAT lies in its simplicity and efficiency.  Calculating the necessary gradients for training requires only a few extra forward and backward passes through the network. \n\nWe showcase VAT's exceptional performance on a series of image classification benchmarks, including MNIST, SVHN, and NORB. Our results demonstrate that VAT consistently surpasses existing semi-supervised learning methods, approaching the performance of state-of-the-art techniques that rely on complex generative models. VAT emerges as a powerful, versatile, and computationally efficient tool for enhancing the robustness and generalization ability of deep learning models across a diverse range of tasks and datasets. \n",
    "While deep learning thrives on large labeled datasets, manual annotation isn't always feasible. This work explores how Convolutional Neural Networks (CNNs) perform when trained on datasets with noisy labels.  We introduce a simple yet effective \"noise layer\" into the CNN architecture that adapts the network's output to better match the noisy label distribution. This layer is trained jointly with the rest of the network, requiring minimal modifications to existing training pipelines. Through experiments on various datasets, including ImageNet, we demonstrate the effectiveness of our approach for training CNNs on noisy labeled data.  \n",
    "Training deep neural networks with many connections is challenging, but what if we could strategically remove connections to make them more efficient? This work introduces new, guaranteed methods for training \"sparse\" feedforward neural networks – networks with fewer connections between layers.\n\nWe build upon techniques used for training simpler, linear networks and adapt them to handle the complexities of non-linear networks. Our approach focuses on analyzing specific mathematical relationships between input data and their corresponding labels. By cleverly factoring these relationships, we can directly calculate the optimal weights for the first layer of a deep network, under certain conditions. \n\nWhile our method primarily targets the first layer, its output serves as an excellent starting point for further training using traditional methods like gradient descent. This initialization strategy can significantly speed up training and lead to better overall performance.  \n\n\n",
    "Imagine language as a grand tapestry, woven together by intricate threads of meaning.  These threads, known as discourse relations, hold the key to understanding how individual sentences combine to form coherent and engaging narratives.\n\nUnraveling the mysteries of these relations is a formidable challenge for machines.  It demands a deep understanding of not just the individual sentences but also the subtle interplay between their underlying elements.  \n\nOur work embarks on this exciting journey, crafting a system that can discern the often-hidden threads of discourse. We build upon the power of distributional semantics, representing words and phrases as points in a vast space of meaning.  \n\nBut we don't stop there.  Our approach delves deeper, composing these representations upwards through the syntactic structure of sentences, mimicking the very process of human comprehension.  \n\nWe introduce a novel twist: a \"downward\" compositional pass that captures the crucial role of entity mentions, those vital threads that connect ideas across sentences. By considering the interplay between these elements, our model gains a richer, more nuanced understanding of discourse.\n\nThe results are remarkable! Our system achieves substantial improvements in predicting implicit discourse relations, surpassing previous state-of-the-art methods on the challenging Penn Discourse Treebank.  It's a leap forward in empowering machines to truly comprehend the intricate tapestry of human language. \n\n\n",
    "Imagine teaching a computer to understand the building blocks of meaning in language – those crucial semantic roles that words play in a sentence.  We've devised a novel approach that combines the power of two cutting-edge techniques: unsupervised semantic role labeling and tensor factorization.\n\nThink of it like this: we have two expert craftsmen working together. The first, our \"encoding component,\" is a skilled role labeler. It examines a sentence's syntactic structure and carefully assigns semantic roles to each word. The second, our \"reconstruction component,\" is a master builder. It uses these roles to predict the missing pieces of a sentence, like reconstructing a magnificent structure from its blueprint.\n\nHere's the key: these two experts don't work in isolation. We train them jointly, ensuring that their efforts harmonize to minimize errors in reconstructing the original sentence.  Astonishingly, this process guides the system to discover semantic roles that closely align with human-annotated linguistic resources.\n\nOur method achieves remarkable results, rivaling the accuracy of state-of-the-art role induction methods on English text.  And here's the truly groundbreaking part – we accomplish this without feeding the system any prior linguistic knowledge about the language! It's like witnessing a machine learn to decipher the language of meaning from scratch. \n",
    "Metrics, which measure the similarity between data points, are fundamental in machine learning tasks like classification and clustering.  However, we lack a deep understanding of how the choice of metric impacts the performance of the resulting classifier on new data.  \n\nPrevious work introduced the concept of \"$(\\epsilon, \\gamma, \\tau)$-good similarity functions\" to bridge this gap, linking a similarity function's properties to its performance in a linear classifier.  \n\nOur work expands on this theory, providing a new theoretical guarantee (a \"generalization bound\") for classifiers based on these similarity functions. We achieve this by leveraging the framework of algorithmic robustness, which analyzes how resilient an algorithm is to small changes in the input data. Our findings offer valuable insights into the relationship between metric selection and classifier performance, paving the way for more principled approaches to metric learning.  \n",
    "Ever tried to teach a computer the subtle art of sarcasm? It's not easy!  Sentiment analysis, especially the fine-grained kind, requires a deep appreciation for the nuances of language. \n\nEnter the multiplicative recurrent neural network, our champion for cracking the code of compositional meaning!  Imagine it as a linguistic detective, meticulously piecing together clues from individual words to unravel the sentiment of an entire sentence.\n\nWe discovered that our brainy network actually encompasses those previously popular matrix-space models – they're like the rookie detectives still learning the ropes. \n\nAnd the results? Let's just say our multiplicative RNN is a natural!  It goes toe-to-toe with those beefy Elman-type RNNs and even outperforms the matrix-space models on a standard sentiment analysis challenge. \n\nBut here's the real showstopper: our model achieves comparable results to those fancy structural deep models on the Stanford Sentiment Treebank – without needing any complicated parse trees! It's like solving a crime without having to dust for fingerprints – pure elegance! \n\n\n",
    "The challenge of finding the minimum values of complex, non-convex functions in high-dimensional spaces is a common problem in many scientific fields.  This work presents evidence suggesting that, contrary to the behavior observed in low dimensions, certain high-dimensional functions exhibit a narrow concentration of critical points within a restricted range of function values.\n\nThis phenomenon aligns with theoretical work on spin glasses, which predicts this narrowing as dimensionality increases. Our simulations support these theoretical findings. Furthermore, experiments using teacher-student networks trained on the MNIST dataset reveal a similar phenomenon in deep learning.  \n\nInterestingly, our results indicate that both gradient descent and stochastic gradient descent optimization algorithms reach this critical region within a comparable number of iterations. \n",
    "Esteemed colleagues,\n\nI present to you today a novel statistical model for photographic images. This model postulates that the responses elicited from a bank of linear filters, applied locally across an image, can be effectively characterized by a jointly Gaussian distribution. This distribution, we propose, possesses a zero mean and a covariance structure that exhibits gradual spatial variation.\n\nOur approach distinguishes itself through a unique optimization strategy. We seek to minimize the nuclear norms of matrices representing local filter activations, thereby promoting a flexible form of sparsity unconstrained by predefined dictionaries or coordinate systems.\n\nFilters subjected to this optimization process exhibit distinct orientation and frequency selectivity, and their responses reveal significant local correlations.  Remarkably, near-perfect image reconstruction is achievable using solely the estimated local filter response covariances.  Moreover, even low-rank approximations of these covariances yield reconstructions with minimal degradation in both visual fidelity and mean squared error.\n\nThese findings highlight the significant potential of our proposed representation for various applications, including image denoising, compression, and texture analysis. Furthermore, it holds promise as a foundation for constructing hierarchical image representations, enabling the extraction of increasingly abstract image features. \n\nThank you.\n",
    "The remarkable success of convolutional neural networks (CNNs) in object recognition is often attributed to their intricate architecture of convolutional, pooling, and fully connected layers. But is this complexity truly necessary?\n\nThis work embarks on a journey to re-examine the fundamental building blocks of CNNs for recognizing objects in small images.  Our exploration leads us to a surprising discovery: the ubiquitous max-pooling layer, often considered indispensable, can be seamlessly replaced by a simpler convolutional layer with an increased stride, without sacrificing accuracy.\n\nThis intriguing finding, coupled with recent advancements in simplifying network structures, inspired us to craft a novel CNN architecture of stark elegance. Our network dispenses with unnecessary complexity, relying solely on convolutional layers to achieve its remarkable performance.\n\nWe put our streamlined architecture to the test, evaluating its capabilities on several benchmark datasets, including CIFAR-10, CIFAR-100, and ImageNet. The results are compelling: our architecture achieves competitive or even state-of-the-art performance, demonstrating that simplicity can indeed beget accuracy.\n\nTo delve deeper into the inner workings of our creation, we introduce a novel variant of the \"deconvolution approach\" for visualizing learned features. This refined method allows us to peer into the network's \"mind,\" unveiling the hierarchical representations it has learned. \n\nOur work challenges conventional wisdom in CNN design, offering a fresh perspective on the trade-off between complexity and performance.  By embracing simplicity and revisiting fundamental assumptions, we unlock new possibilities for building more efficient and interpretable deep learning models. \n",
    "Imagine teaching a computer to understand the grammatical structure of a sentence, not word by word, but by cleverly piecing together meaningful chunks of information.  That's the essence of our novel greedy parser, powered by a unique compositional approach to representing sentence structure.\n\nOur parser doesn't rely on brute-force search but instead makes smart, locally optimal decisions at each step, guided by a clever neural network.  This network learns to represent sub-trees, those building blocks of sentences, using a combination of syntactic (part-of-speech tags) and semantic (word meanings) information.  \n\nHere's the key innovation:  we train the parser and the compositional procedure jointly, so they learn to work in perfect harmony. This tight integration allows the parser to leverage rich, continuous representations of both words and tags, making decisions based on a deeper understanding of the sentence's structure.\n\nAnd the results? Our speedy parser achieves accuracy comparable to state-of-the-art methods on standard parsing benchmarks, all thanks to its efficient greedy approach.  \n\nTo make our method easily accessible, we're sharing a fully functional implementation of our parser.  We believe this tool will empower researchers and developers to explore the exciting world of natural language parsing and unlock new possibilities in language understanding. \n",
    "Denoising autoencoders (dAEs) are powerful tools for learning useful representations of data, but how can we encourage them to discover more abstract and invariant features?  Our work reveals that by introducing carefully designed lateral connections between the encoder and decoder, we can guide dAEs towards this goal.\n\nThink of these lateral connections as shortcuts that allow information to flow directly between different levels of the network.  This bypasses the need for the highest layers to process every detail, freeing them to focus on capturing the most important, invariant features. \n\nBut here's the really clever part: we allow these invariant features to dynamically control the strength of the lateral connections, like a conductor directing an orchestra!  This \"modulation\" enables the network to translate abstract understanding into detailed reconstructions.\n\nWe put three different dAE architectures to the test: one with modulated lateral connections, one with basic additive connections, and one without any shortcuts.  The results were impressive! \n\nOur modulated lateral connections led to:\n\n1. **Improved accuracy:**  The model became better at understanding and reconstructing the input data, as shown by its enhanced denoising performance. \n2. **More invariant representations:** Higher layers in the network learned to focus on the most important, unchanging features of the data.\n3. **Diverse pooling mechanisms:**  The network developed more sophisticated ways of summarizing information, leading to better generalization.  \n\nThis simple yet powerful modification to the dAE architecture paves the way for learning richer, more meaningful representations of data, opening up exciting possibilities in various machine learning applications. \n",
    "Imagine stepping into a world where images can morph and shift seamlessly, guided by the invisible hand of mathematics. This is the world of representational geometry, where we explore how artificial intelligence understands the transformations that shape our visual world.\n\nOur journey begins with a quest for \"invariance\" – the ability of a model to recognize an object despite changes in its appearance, such as translation, rotation, or scaling.  We introduce a novel method for visualizing and refining this invariance, focusing on a powerful concept called \"linearization.\"\n\nPicture two images, linked by a specific transformation, like a photo of a cat gradually rotating.  Our method generates a series of images along the shortest path connecting these two images within the model's internal representation – a \"representational geodesic.\"  \n\nIf the model has truly grasped the essence of rotation, this geodesic should depict a smooth, continuous transformation, like a flipbook animation.\n\nHowever, when we applied our method to a state-of-the-art image recognition network, we stumbled upon a surprising flaw.  The geodesics for simple transformations like translation, rotation, and scaling were anything but smooth!  They were riddled with distortions and jumps, revealing the model's struggle to truly grasp these fundamental transformations.\n\nBut our journey didn't end there.  Our method also hinted at a solution.  By analyzing these \"broken\" geodesics, we identified ways to refine the model's internal representation, guiding it towards true invariance.  And indeed, with these modifications, our model learned to generate beautifully smooth, linearized representations of various geometric transformations.  \n\nThis journey into the heart of representational geometry not only reveals the hidden struggles of even our most advanced models but also illuminates a path towards building more robust and reliable AI systems. \n",
    "This study addresses the challenge of predicting patient survival in cancer using high-dimensional genomic data, a critical task for advancing personalized medicine. While genomic profiling holds immense promise for informing prognosis, existing survival analysis methods struggle to effectively leverage these complex datasets.  We propose a novel approach based on neural networks to learn informative representations of genomic data for accurate survival prediction.  Evaluating our method on brain tumor data, we demonstrate its superior performance compared to established survival analysis techniques.  Our findings underscore the potential of deep learning to unlock the power of genomic information for improving cancer prognosis and treatment strategies. \n",
    "This work addresses the challenge of integrating both additive and multiplicative operations within neural networks. While existing approaches either rely on predefined operation assignments or resort to computationally expensive discrete optimization techniques, we propose a novel solution based on a parameterized transfer function. \n\nOur approach leverages the mathematical concept of non-integer functional iteration to enable smooth and differentiable interpolation between addition and multiplication operations at the neuron level.  This differentiability ensures that the choice between addition and multiplication can be directly incorporated into the standard backpropagation algorithm, facilitating straightforward and efficient training. \n",
    "\"Hey, have you ever noticed how tricky it is to train those deep neural networks?  Like, sometimes the signals between layers get all out of whack?\"\n\n\"Yeah, totally!  It's those pesky scaling problems, causing exploding gradients and stuff.  People usually just try to initialize the weights carefully to avoid it.\"\n\n\"Right? But we were thinking, what if keeping things scaled properly – you know, maintaining \"isometry\" – was important throughout training, not just at the beginning?\"\n\n\"So, we tried a couple of new tricks, one super precise and another more random.  And get this – both of them made the network learn way faster!\"\n\n\"It seems like keeping those signals in check early on is key, and making sure they stay that way helps the network pick things up quicker. Pretty cool, huh?\" \n",
    "This paper introduces the Stick-Breaking Variational Autoencoder (SB-VAE), a novel generative model that extends the traditional variational autoencoder (VAE) to allow for a latent representation with varying dimensionality. By employing a Stick-Breaking process prior and leveraging Stochastic Gradient Variational Bayes inference, the SB-VAE learns more expressive and disentangled representations compared to the standard VAE, leading to improved performance on various tasks.\n\n\n",
    "Imagine a world of data where some stories are told a thousand times, while others barely whisper. This is the challenge of imbalanced data, where standard models often get swept away by the dominant narratives, neglecting the quiet truths hidden within. \n\nWe introduce a new model, a clever architect of latent spaces, that can navigate these uneven landscapes. Our model envisions a shared space where common ground is established, alongside private spaces where unique characteristics can flourish. \n\nDrawing inspiration from the elegance of Gaussian Processes, our model crafts a special kernel, a secret handshake that enables this separation of stories.  With a touch of variational magic, we breathe life into this model, allowing it to uncover hidden patterns even in the faintest whispers of data.  \n\nWe unveil the power of our creation using a challenging medical image dataset, a realm where rare diseases often go unnoticed.  Our model, with its keen eye for detail, uncovers valuable insights, proving that even in the most imbalanced worlds, every story deserves to be heard. \n",
    "Imagine you're learning to paint, but instead of a teacher, you have a partner who's also learning. You show each other your work, critique each other, and try to improve. That's kind of how Generative Adversarial Networks (GANs) work - they're like two AI artists in a friendly competition, pushing each other to create realistic images.\n\nBut sometimes, the feedback between these AI artists gets a bit muddled, making it hard for them to learn effectively.  We thought, \"What if we could give them a clearer way to understand each other's progress?\"\n\nOur solution?  It's like giving the AI artists a shared language and a better set of tools.  We combined two powerful techniques: density ratio estimation (figuring out how similar two sets of images are) and f-divergence minimization (making those sets of images even more alike).  \n\nThe result?  Our new algorithm helps GANs learn more effectively by providing clearer feedback and leveraging insights from years of research on comparing and refining data distributions.  It's like giving our AI artists the best possible environment to collaborate and create! \n",
    "Imagine a chemist and a linguist walk into a lab – it sounds like the start of a bad joke, but this collaboration is about to revolutionize drug discovery!\n\nYou see, chemists have long struggled to predict how well a molecule might work as a drug. It's a bit like trying to decipher a secret code hidden within the molecule's structure.  Meanwhile, linguists have been developing powerful tools to analyze and understand the complexities of human language. \n\nBut what if those molecular structures could be translated into a language that AI could understand? That's exactly what we did! We took SMILES, a standard way of representing molecules as text strings, and unleashed the power of natural language processing (NLP).\n\nThe results were astounding.  By treating molecules as language, our NLP-powered model not only outperformed existing methods in predicting molecular activity but also revealed the hidden logic behind its decisions – like a linguistic detective uncovering the secrets of drug interactions. \n\nThis groundbreaking approach opens up a whole new world of possibilities.  It's as if we've given scientists a new language to communicate with molecules, accelerating the development of life-saving drugs and transforming the future of medicine. \n",
    "This paper introduces a novel neural network architecture that learns to represent complex data using discrete, interpretable factors.  Our approach, inspired by the dynamics of sequential data, predicts future frames using a combination of past information and a small set of discrete \"gating units.\"  These gating units effectively capture distinct factors of variation, providing a symbolic representation of the underlying data dynamics.  We demonstrate the effectiveness of our method on datasets of 3D facial transformations and Atari games, showcasing its ability to learn meaningful and disentangled representations. \n\n\n",
    "Imagine the \"loss landscape\" of a neural network, a complex terrain of peaks and valleys representing different levels of error. Analyzing the curvature of this landscape, specifically the eigenvalues of the loss function's Hessian, provides valuable insights into the network's learning behavior.  \n\nOur work reveals a fascinating dichotomy in the distribution of these eigenvalues: a dense \"bulk\" concentrated around zero and a sparse scattering of \"edge\" eigenvalues further away. \n\nWe present empirical evidence suggesting that the bulk eigenvalues reflect the degree of over-parameterization in the network, while the edge eigenvalues capture information specific to the training data. This understanding of the Hessian's spectral properties offers a new lens for analyzing and optimizing deep learning models. \n",
    "This paper introduces a powerful new parametric nonlinear transformation explicitly designed for Gaussianizing data derived from natural images, a crucial preprocessing step for many computer vision algorithms.  Our method is elegantly simple yet remarkably effective. \n\nWe first apply a learned linear transformation to the data, followed by a novel normalization step. This normalization utilizes a \"pooled activity measure,\" computed by a weighted combination of rectified and exponentiated components, which captures dependencies within the data. \n\nThe parameters of our transformation, including the linear transformation matrix, exponents, weights, and constant, are optimized directly by minimizing the negentropy of the transformed responses. This direct optimization over a vast database of natural images yields a transformation with superior Gaussianization capabilities. \n\nOur method demonstrably surpasses existing techniques like ICA and radial Gaussianization, achieving significantly lower mutual information between transformed components. This reduction indicates a closer approximation to the desired independent Gaussian distribution. \n\nThe benefits extend beyond mere Gaussianization. Our transformation is fully differentiable and efficiently invertible, enabling its use for constructing a compelling generative model for images.  This model, induced by the transformation, produces visually plausible samples that closely resemble natural image patches. \n\nFurthermore, the inherent invertibility allows us to leverage the Gaussianized space for effective noise reduction by employing our transformation as a prior for denoising. \n\nFinally, we demonstrate the remarkable ability to cascade our transformation, creating a deep hierarchical architecture. Each layer is optimized using the same Gaussianization objective, offering a novel and entirely unsupervised approach for constructing deep networks tailored for image data. \n",
    "Figuring out the patterns in complex data, like robot movements, can be tricky.  Approximate variational inference is a powerful tool for this, letting us build models that uncover hidden structures in data.  Recent breakthroughs have made it even better, allowing us to work with data that unfolds over time, like a robot's actions. \n\nWe use a special kind of model called a Stochastic Recurrent Network (STORN) to learn the normal patterns in robot time series data.  This allows us to spot unusual activities – anomalies – both retrospectively and in real time.  Our results show that this approach is robust and effective in identifying deviations from expected behavior. \n",
    "Imagine training an AI agent to be a master detective, piecing together clues scattered throughout a partially observable environment.  That's the challenge we address by introducing a novel framework for testing and training agents to efficiently gather information. \n\nOur framework encompasses a diverse set of tasks where success hinges on the agent's ability to strategically explore its surroundings, uncover hidden information fragments, and assemble them to achieve specific goals. \n\nTo tackle these challenges, we combine the power of deep learning with reinforcement learning techniques. We train agents to actively seek out new information that reduces their uncertainty about the environment while also effectively utilizing the information they've already gathered.  \n\nOur experiments demonstrate that by carefully combining external rewards (for achieving goals) and internal rewards (for gaining new knowledge), we can shape agents that exhibit intelligent and efficient information-seeking behavior. \n",
    "In a groundbreaking development for natural language processing, researchers have unveiled a novel neural network architecture that significantly enhances language models' ability to leverage recent context.  This breakthrough, inspired by the concept of memory augmentation, enables networks to \"remember\" past information and seamlessly integrate it into their predictions.\n\nThe new model achieves this feat through a streamlined mechanism that efficiently stores and retrieves information from a vast external memory bank. Unlike previous memory-augmented networks, this approach scales gracefully to massive memory sizes, enabling the processing of extensive contextual information.\n\nThe researchers drew parallels between their approach and cache models employed in traditional count-based language models, highlighting the inherent connection between these seemingly distinct paradigms.\n\nRigorous evaluations on multiple language modeling benchmarks have yielded impressive results, with the new model consistently outperforming existing memory-augmented networks. This advancement promises to enhance a wide range of NLP applications, from machine translation to dialogue systems, by empowering AI models with an enhanced ability to understand and generate human-like text. \n",
    "This paper introduces a novel deep generative model for synthesizing images from natural language descriptions, capitalizing on recent advancements in generative adversarial networks (GANs) and attention mechanisms.  Our proposed model adopts an iterative generation process, sequentially drawing image patches on a canvas while attending to relevant words in the input text description. \n\nThe model comprises three key components:\n\n1. **Sentence Embedding Module:**  A recurrent neural network (RNN), such as a GRU or LSTM, encodes the input sentence into a fixed-length vector representation, capturing the global semantic context of the description. \n\n2. **Attention Mechanism:**  At each iteration, a dynamic attention mechanism attends to specific words or phrases in the input sentence based on the current state of the generated image. This focus on relevant words guides the generation process, ensuring semantic alignment between text and image.\n\n3. **Patch-based Image Generation:**  Conditioned on the attended word embeddings and the current canvas state, a deep convolutional decoder network generates the next image patch. This patch is then seamlessly integrated into the canvas, gradually refining the generated image. \n\nWe train our model on the Microsoft COCO dataset, leveraging its rich image-caption pairs.  Evaluation on both image generation and retrieval tasks demonstrates that our model surpasses several strong baseline generative models.  Notably, our model generates higher-quality images with greater visual fidelity and exhibits a remarkable ability to compose novel scenes that correspond to previously unseen captions in the training dataset. This capability highlights the model's capacity to learn meaningful relationships between language and visual concepts, going beyond simple memorization of training examples. \n\n\n",
    "This paper introduces a novel framework for multi-task learning with neural networks.  Unlike traditional approaches that predefine parameter sharing strategies, our method uses a tensor trace norm regularization to encourage automatic, data-driven parameter sharing across all layers of multiple networks trained jointly. This flexible approach allows networks to discover and leverage commonalities in the data, leading to more efficient and effective multi-task learning. \n\n\n",
    "This work takes a big step forward in the world of reinforcement learning, where agents learn to navigate complex environments by trial and error.  We introduce a powerful new agent that combines the strengths of actor-critic methods, deep learning, and experience replay – and the results are really something!\n\nOur agent is a master of learning from its past experiences. It stores those memories efficiently and re-uses them strategically to improve its performance over time.  And the best part? It's incredibly stable, meaning it learns smoothly and consistently, avoiding those frustrating crashes or plateaus that can happen during training. \n\nBut we didn't stop there! We incorporated several clever innovations, including a smarter way to learn from past experiences (truncated importance sampling with bias correction), a more efficient way to represent the value of different actions (stochastic dueling network architectures), and a new method for optimizing the agent's behavior (trust region policy optimization).  \n\nWe put our agent to the test on a variety of challenging tasks, from mastering classic Atari games to controlling robots in continuous action spaces.  The results were remarkable, demonstrating exceptional performance and sample efficiency. This means our agent can achieve impressive results with less training data, making it even more adaptable and versatile.  We're excited about the potential of our agent to accelerate progress in reinforcement learning and unlock new possibilities in AI! \n\n\n\n",
    "We introduce a novel framework for generating pop music using a hierarchical Recurrent Neural Network (RNN).  Our model's architecture reflects a deep understanding of musical composition, with lower layers generating catchy melodies and higher layers adding rhythmic drums and harmonious chords.  \n\nTo evaluate our approach, we conducted human listening studies, and the results were music to our ears!  Participants consistently preferred our generated tracks over those produced by a recent Google model.\n\nBut we didn't stop at just creating music - we explored its potential in exciting new applications. Our framework enables \"neural dancing\" and \"neural karaoke,\" where virtual dancers move seamlessly to the generated beats, and lyrics harmonize with the evolving melodies.  We even delved into the realm of \"neural story singing,\" where our model crafts compelling musical narratives. \n",
    "This work exposes a critical vulnerability in many machine learning classifiers: their susceptibility to adversarial perturbations—carefully crafted alterations designed to deceive while remaining imperceptible to humans. We directly address this vulnerability by introducing three powerful detection methods capable of unmasking these adversarial attacks. \n\nOur methods force attackers to make difficult choices:  either minimize the distortion introduced into the adversarial image, sacrificing attack effectiveness, or create more conspicuous perturbations detectable by our methods. \n\nOur most successful method leverages a key insight: adversarial perturbations leave telltale signatures in the principal component space. These perturbations abnormally emphasize lower-ranked principal components, a vulnerability we ruthlessly exploit for detection.  \n\nAdditional detection methods, along with a revealing analysis based on colorful saliency maps, are provided in the appendix, further solidifying our contribution to the fight against adversarial attacks. \n",
    "Get ready for a revolution in efficient deep learning!  We've developed a game-changing method for building incredibly powerful yet computationally frugal Convolutional Neural Networks (CNNs).  Our secret weapon? Low-rank representations of convolutional filters!\n\nInstead of relying on pre-trained networks, we start from scratch and learn a super-efficient set of \"basis filters.\"  Think of them like building blocks. During training, our network cleverly combines these basis filters into more complex and super-discriminative filters, all optimized for image classification.\n\nAnd here's the kicker: we've also crafted a novel weight initialization scheme that plays beautifully with our low-rank approach, especially for networks with diverse filter shapes.  \n\nBut enough talk, let's see the results!  We put our method to the test on a range of popular CNN architectures and image datasets (CIFAR, ILSVRC, and MIT Places). The outcome? Mind-blowing! \n\nOur method consistently achieves similar or even better accuracy compared to conventional CNNs, all while using significantly less computation.  For example, on ImageNet, our improved VGG-11 network with global max-pooling achieves comparable accuracy using a jaw-dropping 41% less compute and only 24% of the parameters! \n\nBut wait, there's more! We even managed to squeeze out a 1 percentage point accuracy boost on ImageNet compared to our own improved VGG-11, achieving an incredible 89.7% top-5 accuracy while using 16% less computation.\n\nThis is a game-changer, folks! Our method unlocks the potential for deploying state-of-the-art CNNs on devices with limited resources, without compromising on accuracy.  Get ready for a future where powerful AI is accessible everywhere! \n\n\n",
    "Proper weight initialization is crucial for effective training of deep neural networks.  Poor initialization can hinder convergence and lead to sub-optimal performance, especially in very deep networks. This paper introduces Layer-sequential unit-variance (LSUV), a straightforward yet powerful weight initialization method designed to address these challenges. \n\nLSUV consists of two key steps:\n\n1. **Orthonormal Pre-Initialization:**  Each convolutional and fully connected layer is initialized with weights drawn from an orthonormal distribution. This ensures that signals propagate effectively through the network from the outset, mitigating issues like vanishing or exploding gradients.\n\n2. **Layer-wise Variance Normalization:**  Proceeding sequentially from the input to the output layer, the variance of each layer's output is normalized to 1. This critical step maintains a consistent signal strength throughout the network, further contributing to stable and efficient training. \n\nWe validated LSUV's efficacy across a range of activation functions, including maxout, ReLU variants, and tanh. Our experiments demonstrate that LSUV consistently leads to:\n\n* **Fast Convergence:** LSUV achieves training speeds comparable to or exceeding complex initialization schemes specifically designed for very deep networks, such as FitNets and Highway Networks.\n\n* **Competitive or Superior Accuracy:** LSUV matches or surpasses the performance of standard initialization methods, achieving state-of-the-art or near state-of-the-art results on benchmark datasets (MNIST, CIFAR-10/100, and ImageNet).\n\nWe evaluated LSUV on various popular architectures, including GoogLeNet, CaffeNet, FitNets, and Residual Networks, highlighting its broad applicability.  LSUV's combination of simplicity, efficiency, and effectiveness makes it a compelling choice for weight initialization in deep learning. \n\n\n",
    "This work presents a high-performance graph-based dependency parser that achieves state-of-the-art or near state-of-the-art accuracy on six diverse languages. Our parser builds upon the neural attention mechanism of Kiperwasser & Goldberg (2016), employing a larger model with enhanced regularization and biaffine classifiers for arc and label prediction.  Notably, we achieve 95.7% UAS and 94.1% LAS on the English Penn Treebank, surpassing Kiperwasser & Goldberg (2016) by 1.8% and 2.2% respectively. Our results establish a new performance benchmark for graph-based parsers, rivaling the accuracy of the best transition-based systems.  Through rigorous hyperparameter analysis, we identify key factors contributing to these performance gains. \n",
    "Unlocking the secrets of true artificial intelligence hinges on developing machines capable of sophisticated reasoning – a feat that requires comprehending not just the obvious but also the subtle, unspoken relationships within data.  \n\nEnter Dynamic Adaptive Network Intelligence (DANI), our novel approach to efficiently extracting these intricate dependencies, even with minimal guidance. Like a budding Sherlock Holmes, DANI adeptly navigates the world of information, its internal representations dynamically evolving to uncover hidden patterns and subtle connections. \n\nWe've put DANI's impressive skills to the test on the bAbI dataset, a formidable collection of question-answering challenges specifically designed to assess reasoning abilities in machines. These are puzzles that have stumped even the most sophisticated learning representations, as highlighted in Weston et al. (2015).  \n\nYet, DANI shines, surpassing the current state-of-the-art with remarkable performance. It gracefully leaps over hurdles that have tripped others, showcasing its capacity for complex reasoning and signifying a pivotal step towards a future where machines can truly think.  \n\n\n",
    "Convolutional Neural Networks (CNNs), while achieving remarkable success in various domains, often come with a hefty computational cost, limiting their deployment on resource-constrained devices like mobile phones.  Hardware accelerators offer a promising solution by speeding up computation and reducing energy consumption.  However, designing efficient accelerators requires carefully adapting CNN models for optimal hardware utilization.\n\nThis paper introduces Ristretto, a novel model approximation framework specifically designed to bridge the gap between CNNs and hardware accelerators. Ristretto meticulously analyzes a given CNN, focusing on the numerical precision required to represent weights and activations in convolutional and fully connected layers.  \n\nUnlike conventional approaches that rely on resource-intensive floating-point representations, Ristretto explores the potential of fixed-point arithmetic, which is significantly more hardware-friendly.  Ristretto goes beyond simply converting models to fixed-point; it employs a sophisticated analysis to determine the optimal bit-width for each layer while adhering to a user-defined accuracy threshold.  \n\nMoreover, Ristretto incorporates a fine-tuning step to further optimize the condensed fixed-point network, ensuring minimal performance degradation.  Our experiments demonstrate Ristretto's efficacy in reducing the computational demands of popular CNN models like CaffeNet and SqueezeNet.  Remarkably, Ristretto can successfully condense these networks to use only 8-bit fixed-point representations with a minimal accuracy loss of less than 1%.  \n\nTo foster wider adoption and further research in this domain, we have made the source code for Ristretto publicly available.  This framework provides a valuable tool for developers and researchers seeking to deploy powerful CNNs on resource-constrained hardware, paving the way for ubiquitous deep learning on mobile and embedded devices.\n",
    "The world of painting offers a captivating glimpse into the diverse ways artists capture their creative visions.  This rich tapestry of styles presents a unique opportunity to explore the very essence of visual representation.  Can we develop a model that captures this vast vocabulary of artistic expression?\n\nThis work delves into the heart of this challenge, introducing a single, scalable deep network capable of learning and parsimoniously representing diverse painting styles.  We demonstrate that our network can distill the essence of a painting, reducing it to a single point within a learned embedding space. This compressed representation allows us to navigate the world of artistic styles with unprecedented ease. \n\nRemarkably, our model goes beyond simply capturing existing styles; it empowers users to embark on a creative journey by seamlessly blending and combining learned styles, leading to the emergence of entirely new forms of artistic expression. \n\nThis work represents a significant step towards building comprehensive models of paintings and offers a fascinating window into the intricate structure of artistic style as learned by a machine. \n",
    "Sum-Product Networks (SPNs) are powerful but complex models for representing probability distributions.  This work introduces MiniSPN, a fast and practical algorithm for learning the structure of SPNs from data, even with missing values and a mix of continuous and discrete features. We demonstrate MiniSPN's effectiveness on standard benchmarks and challenging real-world datasets from Google's Knowledge Graph. \n\n\n",
    "While the pursuit of ever-increasing accuracy in deep neural networks (DNNs) has yielded remarkable progress, it has often come at the expense of model complexity.  This paper argues that for a desired accuracy level, prioritizing DNN architectures with reduced parameter counts offers significant advantages across multiple domains:\n\n1. **Distributed Training Efficiency:** Smaller DNNs inherently require less inter-server communication during distributed training, leading to faster convergence and reduced communication overhead.\n\n2. **Model Deployment Bandwidth:**  Exporting compact DNN models from centralized cloud environments to edge devices, such as autonomous vehicles, necessitates significantly less bandwidth, facilitating rapid deployment and updates. \n\n3. **Hardware Resource Utilization:**  Memory-constrained hardware platforms, such as FPGAs, benefit immensely from smaller DNNs, enabling efficient on-device inference and broader applicability.\n\nTo concretize these advantages, we introduce SqueezeNet, a novel DNN architecture meticulously designed for parsimony.  SqueezeNet attains accuracy comparable to AlexNet on the ImageNet benchmark while utilizing a staggering 50x fewer parameters.  Furthermore, through the application of model compression techniques, SqueezeNet can be compressed to an astoundingly small footprint of less than 0.5MB, representing a 510x reduction compared to AlexNet. \n\nThe availability of the SqueezeNet architecture at [https://github.com/DeepScale/SqueezeNet](https://github.com/DeepScale/SqueezeNet) facilitates further research and practical applications of this efficient architecture. \n",
    "Reasoning over multiple pieces of information to answer a question is a hallmark of human intelligence.  How can we instill this ability into machines?  This work explores the nuances of multi-hop question answering, where deriving an answer requires synthesizing information from various interconnected facts.\n\nWe introduce the Query-Reduction Network (QRN), a novel architecture that reimagines the traditional Recurrent Neural Network (RNN).  QRN elegantly handles both local and global dependencies within a sequence of facts, mimicking the way humans progressively refine their understanding as they encounter new information.\n\nImagine QRN as a detective piecing together clues. It treats each sentence as a \"state-changing trigger\" that provides a new perspective on the original question.  With every trigger, QRN refines its internal query, gradually honing in on the answer. \n\nOur experiments demonstrate that QRN's unique approach achieves state-of-the-art performance on established question-answering benchmarks like bAbI and excels in complex, real-world goal-oriented dialogues. \n\nFurthermore, QRN's design enables a remarkable degree of parallelization, significantly reducing training and inference time. This efficiency makes QRN a promising candidate for tackling increasingly sophisticated question-answering tasks that demand reasoning over vast amounts of information. \n",
    "Recurrent Neural Networks (RNNs), while demonstrating proficiency in sequence modeling, may benefit from incorporating mechanisms that facilitate long-range contextual integration.  This paper hypothesizes that introducing top-down feedback, enabling predictions to be informed by past errors, can enhance RNN performance, particularly in disambiguating similar local patterns based on broader context. \n\nWe present Surprisal-Driven Recurrent Networks (SDRNNs), a novel architecture that integrates a continuous monitoring of prediction errors into the RNN's hidden state dynamics. SDRNNs utilize a \"surprisal\" signal, derived from the discrepancy between recent predictions and actual observations, to dynamically modulate the network's internal state.  This surprisal-driven feedback mechanism enables the model to adjust its predictions based on both its past performance and the current input. \n\nTo validate our hypothesis, we evaluate SDRNNs on the enwik8 character-level language modeling benchmark. Our results demonstrate that SDRNNs outperform both stochastic and fully deterministic RNN variants, achieving a state-of-the-art test perplexity of 1.37 bits per character. This significant improvement underscores the efficacy of incorporating surprisal-driven feedback for enhanced language modeling performance. \n\n\n",
    "Generative Adversarial Networks (GANs), while powerful, are notorious for their instability and tendency to miss modes in the data distribution.  We identify the root cause of these issues: the peculiar functional shape of GAN discriminators in high-dimensional spaces. This can lead to training stagnation or, worse, push the generated distribution towards regions of higher concentration than the true data distribution, resulting in mode collapse.\n\nOur solution? Introduce novel regularization techniques that dramatically stabilize GAN training. These regularizers not only prevent catastrophic collapse but also promote a more faithful representation of the data distribution by encouraging the generator to cover all modes, even in the early stages of training.  This unified approach addresses both the instability and mode collapse problems that have long plagued GANs. \n",
    "Deploying reinforcement learning in real-world scenarios presents significant challenges, particularly when using complex models like deep neural networks.  Two major hurdles are high sample complexity (requiring lots of real-world data) and safety concerns (ensuring the learning process doesn't lead to undesirable outcomes).\n\nModel-based methods, where a simulated environment is used to pre-train policies before transferring to the real world, offer a promising solution.  However, discrepancies between the simulation and reality can hinder performance.\n\nWe introduce EPOpt, an algorithm that tackles this \"reality gap\" through two key innovations:\n\n1. **Ensemble of Simulated Domains:** EPOpt leverages multiple, diverse simulations rather than a single one. This encourages the agent to learn policies that are robust to variations in the environment.\n\n2. **Adversarial Training:** EPOpt employs a form of adversarial training, where the simulated environments are subtly adjusted to challenge the agent and promote generalization to unseen scenarios.\n\nFurthermore, EPOpt incorporates a mechanism to adapt the distribution over the simulation ensemble using real-world data and Bayesian techniques. This allows the simulations to progressively better approximate the real-world environment, further boosting performance. \n\nBy combining an ensemble of simulations with targeted adaptation, EPOpt achieves both robustness and efficient learning, paving the way for safer and more practical reinforcement learning in real-world applications. \n",
    "Imagine a team of specialists, each bringing unique skills to the table. Now, envision building a neural network that mirrors this principle, embracing diversity to enhance its capabilities.  That's the driving force behind Divnet, our novel approach for creating more efficient and powerful networks.\n\nDivnet doesn't treat neurons as interchangeable cogs. Instead, it views them as individuals with unique strengths.  It employs a powerful mathematical tool called a Determinantal Point Process (DPP) to identify the most diverse and complementary neurons within a layer. \n\nThink of it like a talent scout handpicking the best team for the job. Divnet selects the most promising neurons, fusing the \"redundant\" ones into this elite group. This process not only streamlines the network but also acts as a natural regularizer, preventing overfitting and improving generalization. \n\nThe beauty of Divnet lies in its flexibility. It seamlessly integrates with other model compression techniques, allowing for even greater reductions in memory footprint without sacrificing performance. \n\nAnd the results speak for themselves! When applied to the task of neural network pruning, Divnet emerges as a clear winner, outperforming competing approaches by a significant margin. It's a testament to the power of diversity, even in the realm of artificial intelligence. \n",
    "In the depths of data's embrace, where neural networks learn and grow,\nLurks the shadow of overfitting, a silent, insidious foe. \n\nAugmentation and dropout, valiant attempts to tame its might,\nBut still, it lingers, obscuring the path to pure, predictive light.\n\nFrom this struggle, DeCov emerges, a guardian of balance new,\nEncouraging diversity, a tapestry of perspectives, fresh and true.\n\nNo more shall hidden layers echo with redundant refrain,\nDeCov's gentle guidance breaks the chains, setting representations free again.\n\nCross-covariance, the measure of similarity's sway,\nMinimized with grace, leading overfitting astray.\n\nThrough datasets vast and architectures deep, its influence flows,\nA subtle whisper, reducing overfitting's woes.\n\nGeneralization soars, performance takes flight,\nDeCov's elegant touch, illuminating the path from darkness to light. \n",
    "Deep neural networks rely on stochastic optimization, using batches of data to estimate gradients and update model parameters. While batch size is a well-known hyperparameter, dynamically selecting data points for each batch during training remains an underexplored avenue for potential optimization.\n\nIntuitively, prioritizing data points with higher loss values for inclusion in batches seems beneficial. However, the optimal degree of this prioritization and its evolution throughout training are unclear.\n\nWe propose a straightforward yet effective online batch selection strategy: rank data points based on their most recently computed loss values and assign selection probabilities that decrease exponentially with rank. This strategy ensures that \"harder\" examples contribute more frequently to gradient updates while maintaining some diversity in the batches.\n\nOur experiments on the MNIST dataset demonstrate remarkable speedups for both AdaDelta and Adam optimization algorithms, achieving convergence roughly 5 times faster than conventional fixed-batch training.  These results highlight the significant potential of online batch selection for accelerating deep learning training. \n",
    "Semi-supervised learning on graph-structured data presents a unique challenge in effectively leveraging both labeled and unlabeled data to learn meaningful representations.  This paper introduces a scalable approach to address this challenge by employing an efficient variant of convolutional neural networks specifically designed to operate on graphs.\n\nOur proposed architecture stems from a localized first-order approximation of spectral graph convolutions. This approximation allows our model to retain the expressiveness of spectral methods while achieving linear scalability with respect to the number of edges in the graph. \n\nUnlike traditional graph-based methods that rely solely on node features, our convolutional approach captures both local graph structure and node features within its hidden layer representations.  This combination enables the network to learn richer and more informative representations, improving performance on downstream tasks. \n\nWe demonstrate the efficacy of our method on a variety of benchmark datasets, including citation networks and a large knowledge graph.  Our results consistently outperform existing semi-supervised methods by a substantial margin, highlighting the power and scalability of our approach for learning from graph-structured data. \n\nMore specifically, our contributions include:\n\n* **Localized Spectral Convolution Approximation:** We derive an efficient convolutional operation for graphs by approximating spectral graph convolutions within a localized neighborhood. This simplification reduces computational complexity while preserving the ability to capture local graph structures. \n\n* **Joint Learning of Node Features and Graph Structure:** Our model architecture effectively integrates both local graph structure and node features into its learned representations, leading to improved performance compared to methods that consider these aspects independently.\n\n* **Scalability and Empirical Validation:** Our approach scales linearly with the number of edges in the graph, enabling application to large-scale datasets. We provide thorough experimental validation on benchmark citation networks and a real-world knowledge graph, demonstrating superior performance compared to existing state-of-the-art methods. \n\n\n",
    "Distinguished guests, esteemed colleagues, \n\nI present to you today a novel generative adversarial network framework, which we have termed the \"Energy-based Generative Adversarial Network\" or EBGAN.  \n\nEBGAN departs from conventional GANs by reinterpreting the discriminator as an energy function.  This function is trained to associate low energy values with regions proximate to the data manifold, while assigning progressively higher energies to regions farther from this manifold.\n\nWithin the EBGAN framework, the generator's objective remains akin to that of probabilistic GANs: to produce contrastive samples that minimize the discriminator's assigned energy.  However, conceptualizing the discriminator as an energy function unlocks a wider spectrum of architectural and loss function possibilities beyond the traditional binary classifier with a logistic output. \n\nTo illustrate the versatility of EBGAN, we present a compelling instantiation that employs an auto-encoder architecture as the discriminator. In this configuration, the reconstruction error serves as a proxy for the energy function. Our empirical evaluations reveal that this particular form of EBGAN exhibits enhanced training stability compared to conventional GANs. \n\nFurthermore, we demonstrate that a single-scale EBGAN architecture can be successfully trained to generate high-resolution images, highlighting the potential of this framework for complex generative tasks. \n\nThank you. \n",
    "The world of deep learning is abuzz with innovation, a whirlwind of new architectures emerging at a dizzying pace.  It's like trying to choose from an ever-expanding menu of exotic dishes, each more enticing than the last.  But for newcomers to this exciting field, this abundance of choice can be overwhelming, often leading to a reliance on familiar but perhaps outdated options. \n\nImagine a master chef, sifting through countless recipes, unlocking the secrets behind culinary masterpieces.  That's precisely what we've done with our exploration of CNN architectural design patterns. We've delved into the collective knowledge of the deep learning community, extracting the fundamental principles that underpin successful architectures. \n\nOur exploration has yielded a treasure trove of insights, leading to the development of novel and intriguing architectures: \n\n* **FractalNet:** A network that mirrors the self-similarity found in nature, building complex structures from simple, repeating patterns.\n* **Stagewise Boosting Networks:**  Inspired by the power of ensemble learning, these networks combine multiple weaker models into a robust and highly accurate system.\n* **Taylor Series Networks:**  Like a mathematical symphony, these networks leverage the elegance of Taylor series approximations to efficiently represent complex functions.\n\nTo empower fellow explorers on their deep learning journeys, we've generously shared our Caffe code and prototxt files for these architectures at [https://github.com/iPhysicist/CNNDesignPatterns](https://github.com/iPhysicist/CNNDesignPatterns).  We invite the community to join us in this grand culinary adventure, to experiment with these novel recipes, and to further unlock the secrets of deep learning architectures. \n\n\n",
    "Machine comprehension, the task of answering questions about a given text passage, demands a sophisticated understanding of the intricate interplay between the question and the context. While attention mechanisms have shown promise in tackling this challenge, existing approaches often rely on simplifying assumptions, such as summarizing the context into a fixed-size vector or using unidirectional attention.\n\nThis work introduces the Bi-Directional Attention Flow (BIDAF) network, a multi-stage hierarchical model that captures the context at different levels of granularity.  BIDAF's key innovation lies in its bidirectional attention mechanism, allowing the question and context to interact seamlessly and refine each other's representations without premature summarization.  \n\nThrough rigorous evaluation on the challenging Stanford Question Answering Dataset (SQuAD) and the CNN/DailyMail cloze test, we demonstrate that BIDAF achieves state-of-the-art performance.  Our results underscore the importance of capturing the rich, bidirectional flow of information between questions and context for accurate and robust machine comprehension. \n",
    "Imagine a world where machines can not only learn patterns from data but also dream up entirely new possibilities.  This is the promise of deep generative models, but unlocking their full potential, especially when dealing with discrete data, has remained a challenging quest.\n\nHelmholtz machines, with their intriguing pairing of generative and inference models, have beckoned researchers seeking to unravel the mysteries of generative learning.  Yet, previous attempts to train these machines have often relied on indirect methods, optimizing approximations rather than tackling the true objective head-on.\n\nOur work marks a significant leap forward in this quest. We introduce a novel class of algorithms, inspired by the elegant principles of stochastic approximation, that directly optimize the very heart of the problem: the marginal log-likelihood. This bold approach, aptly named Joint Stochastic Approximation (JSA), simultaneously minimizes the KL-divergence, ensuring that the inference model faithfully captures the essence of the generative process.\n\nTo further empower JSA, we equip it with a powerful Markov Chain Monte Carlo (MCMC) operator, allowing it to explore the vast landscape of possibilities with newfound efficiency.\n\nOur experiments on the MNIST dataset reveal JSA's exceptional prowess.  It consistently outperforms competing algorithms like RWS, gracefully mastering a variety of challenging models. JSA illuminates a path towards a future where machines can not only learn from the world but also imagine new worlds, expanding the boundaries of creativity and discovery. \n\n\n",
    "Object detection with deep neural networks often involves evaluating thousands of potential bounding boxes per image, which is computationally expensive.  We propose a new method to make this process more efficient by analyzing the entire image first to identify and remove unnecessary parts of the network before processing individual bounding boxes.\n\nOur approach focuses on removing neurons that show little to no activation when applied to the whole image. This \"pruning\" technique significantly reduces the number of parameters in the network, leading to faster computation.\n\nExperiments on the PASCAL 2007 object detection benchmark demonstrate that we can eliminate up to 40% of the neurons in some layers without affecting detection accuracy, showcasing the potential for substantial speedups in object detection. \n\n\n",
    "You know how sometimes combining different pieces of information can be way more powerful than just looking at each one separately?  Like, figuring out what movie someone will enjoy by considering not just their favorite genres but also the actors, directors, *and* the time of day they're watching! \n\nWell, we built a cool new model called Exponential Machines (ExM) that's a master at capturing these complex interactions between features.  It can handle interactions of any order, meaning it can consider combinations of two, three, four, or even hundreds of features! \n\nThe secret sauce? We use a clever trick called Tensor Train (TT) decomposition to represent a massive table of parameters in a super compact way. This keeps the model from getting too complicated and lets us control how many parameters it uses. \n\nTo train this beast, we developed a special optimization algorithm that works on weird, curved spaces (think of it like navigating a maze on a roller coaster!).  This lets us train models with an insane number of parameters – like 2^160, which is basically a gazillion!\n\nWe tested ExM on some tough datasets, and it crushed it! It achieved top-notch performance on synthetic data with crazy high-order interactions and performed just as well as those fancy factorization machines on a movie recommendation dataset.  \n\nSo yeah, our ExM model is pretty awesome at understanding how different pieces of information work together to make better predictions. \n\n\n",
    "Imagine stepping into a world of hidden patterns, where complex sequences of events unfold, driven by unseen forces.  This is the realm of latent state-space models, where understanding the underlying dynamics can unlock a deeper understanding of the world around us.\n\nBut deciphering these hidden states, especially from intricate data like image sequences, has long posed a formidable challenge.  Enter Deep Variational Bayes Filters (DVBF), our groundbreaking method for unraveling the mysteries of these complex systems.\n\nDVBF harnesses the power of deep learning and variational inference, a clever technique for approximating intractable probability distributions.  Like a master detective, DVBF meticulously analyzes the data, its neural networks learning to represent the hidden states that govern the observed sequences.\n\nBut DVBF goes a step further, introducing a crucial innovation: backpropagation through transitions.  This allows the model to learn not only the states themselves but also the intricate rules that govern their evolution over time. It's like discovering the secret code that dictates the unfolding narrative.\n\nThe results are remarkable! DVBF not only learns informative latent representations but also enforces the crucial state-space assumptions, ensuring that the learned model accurately reflects the underlying dynamics.  This newfound understanding empowers DVBF to perform realistic long-term predictions, extending its vision far beyond the immediate past.\n\nWith DVBF, we've opened a door to a world of possibilities, empowering researchers to delve into the hidden mechanisms behind complex systems, from financial markets to climate patterns, unlocking a deeper understanding of the world's intricate tapestry. \n\n\n",
    "Building dialog systems that can seamlessly navigate complex real-world tasks, like making a restaurant reservation, has long been a pursuit of artificial intelligence.  Traditional approaches, while effective in limited domains, often rely on extensive handcrafting, making them difficult to scale to new tasks.  \n\nThe allure of end-to-end dialog systems, where all components are learned directly from data, is undeniable.  Their recent success in casual chit-chat raises the question: can these data-driven approaches effectively handle the intricacies of goal-oriented dialogues?\n\nOur work delves into this question, proposing a new testbed to rigorously evaluate the strengths and weaknesses of end-to-end dialog systems in goal-oriented settings.  We chose the domain of restaurant reservations, a task that demands a nuanced understanding of language, the ability to interact with APIs, and the capacity to seamlessly blend sentence-level understanding with symbolic reasoning.\n\nOur experiments reveal a promising yet nuanced picture.  We show that an end-to-end system based on Memory Networks can achieve encouraging results, learning to perform non-trivial operations like extracting information from sentences and issuing API calls.  However, it also highlights the inherent challenges of purely data-driven approaches in handling the complexities of real-world tasks.\n\nComparisons with a hand-crafted slot-filling baseline on benchmark datasets, including the Dialog State Tracking Challenge and data from a real-world online concierge service, confirm these findings. While end-to-end systems hold great promise, bridging the gap to human-level performance in goal-oriented dialogue remains an open and intriguing challenge. \n\n\n\n\n",
    "Adversarial training and virtual adversarial training are regularization techniques used in supervised and semi-supervised learning, respectively. These methods typically involve perturbing input vectors, which may be unsuitable for sparse, high-dimensional data like one-hot word representations. \n\nThis study explores adapting adversarial and virtual adversarial training to text classification by applying perturbations to word embeddings within a recurrent neural network framework.  Experimental results demonstrate state-of-the-art performance on several benchmark datasets for both semi-supervised and supervised learning tasks.  \n\nFurther analysis indicates that the proposed method improves word embedding quality and mitigates overfitting during training. Code for the implementation is publicly available at: https://github.com/tensorflow/models/tree/master/research/adversarial_text. \n",
    "So, teaching computers to learn patterns from data without any labels is a big deal in AI, right?  But finding the right tools for the job can be tricky. We want models that are easy to train, generate samples from, understand, and evaluate.\n\nWe took on this challenge and expanded the toolbox with something called \"real NVP\" transformations. These are like magic spells that can warp and twist data in reversible ways, and we can even teach them to do it automatically! \n\nBy incorporating these real NVP transformations, we created a super cool unsupervised learning algorithm that's a real overachiever:\n\n- **Exact log-likelihood:** It can precisely calculate how well it understands the data. \n- **Exact sampling:** It can create new, realistic samples that look like they came from the original data.\n- **Exact inference:** It can figure out the hidden structure behind the data.\n- **Interpretable latent space:** It organizes the data in a way that makes sense to humans. \n\nWe put our method to the test on four different image datasets and it totally rocked! It generated awesome images, aced the log-likelihood test, and even let us play around with the hidden structure to see how it affects the generated images.  Talk about a versatile learner! \n",
    "Imagine a neural network learning to recognize an object, like a chair, from different angles.  We want to understand how the network's internal representation of that object changes as the viewpoint shifts.  Does it become invariant to viewpoint, recognizing the chair regardless of the angle?  And if so, how does it achieve this feat? \n\nThis work delves into the fascinating world of \"view manifolds\" – the geometric representations of an object as seen from various viewpoints within the network's layers.  We ask several key questions:\n\n* Does the network learn to ignore viewpoint variations?\n* How does it handle these variations? Does it merge different viewpoints into a single representation or separate them while still preserving their structure?\n* At which layer does this viewpoint invariance emerge?\n* How can we measure and quantify the shape and structure of these view manifolds within each layer?\n* What happens to these representations when we fine-tune a pre-trained network on a dataset with multiple viewpoints?\n\nTo answer these questions, we introduce a method for measuring the \"deformation\" and \"degeneracy\" of view manifolds within a CNN. Our analysis reveals insightful answers to the questions above, shedding light on how deep convolutional neural networks learn to represent and recognize objects in a viewpoint-invariant manner. \n\n\n",
    "In the realm of vision, where pixels dance and patterns weave,\nBilinear models rise, with representations rich and deep. \n\nThey capture subtle nuances, connections interleaved,\nUnlocking visual secrets, where answers lie conceived.\n\nBut high dimensions loom, a computational cost so steep,\nLimiting their embrace, in tasks where efficiency we seek.\n\nNow, a new path unfolds, with Hadamard's guiding light,\nLow-rank bilinear pooling, a fusion of power and might. \n\nAttention's gentle touch, a whispered guiding hand,\nSelecting salient features, across modalities grand. \n\nVisual questions posed, their answers we implore,\nOur model stands triumphant, surpassing those before.\n\nOn VQA's challenging ground, its brilliance takes the stage,\nWith parsimonious grace, a new chapter we engage. \n",
    "Importance-weighted autoencoders (IWAEs) are often understood as maximizing a tighter lower bound on the marginal likelihood compared to the standard variational lower bound (ELBO).  This study offers an alternative perspective: IWAEs can be interpreted as optimizing the standard ELBO, but with a more expressive variational distribution implicitly defined by the importance weighting procedure. \n\nWe formally derive this interpretation, presenting a new, even tighter lower bound on the marginal likelihood that further elucidates the relationship between IWAEs and the ELBO.  Additionally, we provide visualizations of the implicit importance-weighted distribution, offering insights into its structure and complexity. \n",
    "This study derives a novel generalization bound for feedforward neural networks, providing theoretical guarantees on their performance on unseen data. The bound is expressed as a function of the product of the spectral norms of the weight matrices and the Frobenius norm of the weights across all layers.  This result is obtained through a PAC-Bayes analysis, a powerful framework for deriving generalization bounds in machine learning. \n",
    "This paper establishes a novel framework for empowering Generative Adversarial Networks (GANs) with the ability to directly estimate the energy of generated samples. We introduce a flexible adversarial training paradigm and rigorously prove that this framework guarantees convergence of the generator to the true data distribution while simultaneously ensuring the discriminator accurately captures the underlying data density at the global optimum.\n\nWe derive the analytical form of the solution induced by our framework and provide a comprehensive analysis of its properties.  To facilitate practical implementation, we introduce two effective approximation techniques that maintain the theoretical guarantees while enabling efficient training. \n\nEmpirical results strongly support our theoretical analysis, demonstrating that the discriminator successfully learns to recover the energy function associated with the true data distribution. This advancement unlocks new possibilities for GAN-based applications, including anomaly detection, energy-based modeling, and enhanced sample quality evaluation. \n",
    "This study explores outlier detection using ensembles of neural networks derived from a Bayesian perspective.  Instead of relying on traditional ensembling techniques, we leverage variational inference to approximate the posterior distribution of weights in a Bayesian neural network.  \n\nOur method employs gradient descent to sample from this approximate posterior, effectively generating an ensemble of diverse neural networks. We demonstrate that our approach achieves outlier detection performance comparable to established ensembling methods. This highlights the potential of Bayesian approaches for creating robust and effective ensembles for anomaly detection. \n",
    "This paper introduces two efficient techniques for reducing the parameter count and training time of large Long Short-Term Memory (LSTM) networks:\n\n1. **Factorized LSTM Matrices:** Decompose the LSTM weight matrices into products of smaller matrices, effectively reducing the number of parameters.\n\n2. **Partitioned LSTMs:** Divide the LSTM weight matrices, input vectors, and hidden states into independent groups, enabling parallel processing and faster training. \n\nBoth approaches significantly accelerate training while achieving near state-of-the-art perplexity on language modeling tasks, demonstrating their effectiveness in compressing and optimizing large LSTM models. \n",
    "Training deep neural networks, especially those with many layers like residual networks, can be a bit of a mystery. We stumbled upon some surprising and previously undocumented behaviors while experimenting with different training techniques.\n\nOur goal is to shed light on these curious phenomena and gain a better understanding of how these complex models actually learn. We used two techniques in particular: Cyclical Learning Rates (CLR), where the learning rate goes up and down during training, and linear network interpolation, where we gradually morph one network into another. \n\nWe observed some really counterintuitive things, like unexpected jumps and drops in the training loss and instances where the network learned incredibly quickly.  For instance, we found that CLR could achieve better results than traditional methods even with very high learning rates!\n\nWe believe these findings are valuable for anyone trying to unlock the secrets of deep learning. To encourage further exploration, we've made our code publicly available so others can replicate our results and dig deeper into these intriguing behaviors. \n\n\n",
    "Machine learning models often face different constraints during deployment compared to training.  For example, models deployed on resource-constrained devices may need to prioritize speed or energy efficiency.\n\nThis work presents a mixture-of-experts model designed to adapt its computational cost at test time based on individual inputs.  We achieve this by leveraging reinforcement learning to dynamically select the appropriate expert for each input, allowing the model to balance accuracy with resource consumption.\n\nOur approach is evaluated on a simplified MNIST-based task, demonstrating the feasibility of using reinforcement learning for dynamic resource allocation in mixture-of-experts models. \n",
    "In the depths of learning deep, where agents roam and seek,\nA shadow lurks, adversarial, a challenge to the meek.\n\nNeural networks, vast and intricate, a tapestry of might,\nYet vulnerable they stand, to perturbations, a ghostly blight.\n\nAdversarial whispers, unseen to human eyes,\nCan twist perception's truth, and lead deep learning astray.\n\nAgainst this subtle foe, a novel quest we undertake,\nTo probe the depths of resilience, for reinforcement learning's sake.\n\nAdversarial examples, crafted with cunning art,\nCompared to random noise, a test to tear defenses apart.\n\nA new defense emerges, guided by value's hand,\nReducing adversarial strikes, a strategic, measured stand.\n\nRe-training's crucible, with noise and perturbations imbued,\nA forge to strengthen agents, against attacks renewed.\n\n\n",
    "Continual learning, the ability to learn new tasks without forgetting previously acquired knowledge, remains a major challenge in artificial intelligence. This paper introduces Variational Continual Learning (VCL), a simple yet powerful framework that addresses this challenge by combining online variational inference with cutting-edge Monte Carlo techniques for neural networks.\n\nVCL seamlessly handles complex scenarios where existing tasks evolve over time and entirely new tasks emerge.  It excels at training both discriminative and generative deep learning models, consistently outperforming state-of-the-art continual learning methods on a variety of tasks.  Impressively, VCL achieves this without the need for manual task boundaries or complex heuristics, automatically mitigating catastrophic forgetting. \n\n\n",
    "Designing neural networks often involves a tedious search for the optimal architecture – a balancing act between model complexity and performance.  Existing methods for automatically determining the ideal network size typically require extensive trial and error, training numerous networks from scratch. \n\nThis work explores a more elegant and efficient approach to this architectural search problem. We introduce *nonparametric neural networks*, a framework that shifts the perspective from discrete architecture selection to continuous optimization over all possible network sizes.  \n\nOur framework, grounded in a sound theoretical foundation, leverages an Lp penalty to constrain network growth.  During training, new units are continuously introduced while redundant units are pruned away through an L2 penalty, dynamically shaping the network architecture.\n\nTo navigate this unique optimization landscape, we introduce *AdaRad*, a novel gradient descent algorithm tailored for handling the radial and angular aspects of network growth.  Preliminary results are encouraging, suggesting that our approach offers a promising avenue for automatically discovering effective neural network architectures within a single training run. \n\n\n",
    "Natural Language Inference (NLI) tasks challenge AI systems to decipher the logical connection between a premise and a hypothesis expressed in natural language.  We propose a new class of neural network architectures called Interactive Inference Networks (IINs) that excel at this complex reasoning task.\n\nIINs achieve a deep understanding of sentence pairs by hierarchically extracting semantic features from an \"interaction space,\" which captures the interplay between the premise and hypothesis. Our key insight is that the attention weights, represented as an interaction tensor, encode crucial semantic information for solving NLI.  Moreover, a denser interaction tensor leads to richer semantic understanding.\n\nOne specific instantiation of our IIN architecture, the Densely Interactive Inference Network (DIIN), achieves state-of-the-art performance on both traditional NLI benchmarks and challenging datasets like MultiNLI.  Impressively, DIIN reduces the error rate on MultiNLI by over 20% compared to the previous best-performing system.  Our results highlight the power of dense, hierarchical interaction modeling for robust and accurate natural language inference. \n",
    "Deep learning models, despite their remarkable capabilities, have a troubling weakness: adversarial examples. These are inputs that have been subtly altered in a way that's imperceptible to humans but causes the model to make incorrect predictions. This vulnerability poses a major obstacle for deploying deep learning in safety-critical applications, where reliability is paramount.\n\nMany techniques have been proposed to defend against adversarial attacks, but most have been quickly circumvented by new attack strategies.  This constant cat-and-mouse game highlights the need for more robust and reliable defense mechanisms.\n\nWe propose a solution based on formal verification, a rigorous mathematical approach to proving the correctness of systems. Our method allows us to construct adversarial examples that are guaranteed to be minimally distorted – meaning they represent the smallest possible change to the original input that still causes the model to fail.\n\nUsing this powerful tool, we can rigorously evaluate the effectiveness of different defenses.  We demonstrate that \"adversarial retraining,\" a popular defense strategy, provably increases the difficulty of constructing adversarial examples by a factor of 4.2.  This provides strong evidence that adversarial retraining truly enhances model robustness, unlike many other defenses that have proven to be superficial.\n\nOur work represents a significant step towards building reliable and trustworthy deep learning systems by leveraging the power of formal verification to analyze and strengthen their resistance to adversarial attacks. \n\n\n",
    "The limitations of traditional Variational Autoencoders (VAEs) with fixed-dimensional Gaussian latent spaces are increasingly evident.  We argue that allowing for a latent space with flexible dimensionality can significantly enhance the expressiveness and representation learning capabilities of VAEs.\n\nThis paper introduces the Stick-Breaking Variational Autoencoder (SB-VAE), a novel generative model that overcomes this limitation.  By extending Stochastic Gradient Variational Bayes to handle the weights of Stick-Breaking processes, we enable the SB-VAE to learn latent representations with *stochastic dimensionality*. This nonparametric approach empowers the model to adapt the complexity of its latent space to the intricacies of the data, leading to more powerful and disentangled representations.\n\nOur empirical results unequivocally demonstrate the advantages of the SB-VAE.  Both the unsupervised and semi-supervised variants consistently outperform standard Gaussian VAEs, achieving superior performance on a variety of tasks. This underscores the crucial role of flexible latent dimensionality in unlocking the full potential of variational autoencoders for representation learning and generative modeling. \n",
    "Imagine training a bunch of neural networks at the same time, but instead of them working in isolation, they can actually learn from each other!  That's the idea behind our new framework for multi-task learning.\n\nThink of it like a group project where everyone has their own task, but they can also share resources and knowledge to do a better job overall.  In our framework, the different neural networks can automatically figure out which parts of their \"brains\" (parameters) are useful for other tasks and reuse them!\n\nWhat's cool about our approach is that we don't tell the networks in advance how to share their knowledge.  Instead, they learn the best sharing strategy directly from the data itself. This means they can be super flexible and adapt to whatever tasks they're given.  It's like having a team of AI experts who can figure out the best way to collaborate on their own! \n",
    "Building agents that can learn to navigate complex environments through trial and error is a central challenge in reinforcement learning.  This work pushes the boundaries of what's possible by introducing a novel deep reinforcement learning agent that combines the strengths of actor-critic architectures, experience replay, and several key innovations.\n\nOur agent tackles the crucial trade-off between exploration and exploitation by efficiently storing and replaying past experiences. This experience replay not only improves sample efficiency but also enhances stability during training, preventing the agent from getting stuck in local optima.\n\nBut we didn't stop there. We incorporated several key innovations to further enhance our agent's performance:\n\n* **Truncated Importance Sampling with Bias Correction:**  This technique improves the accuracy of learning from past experiences by carefully weighting and correcting for bias in the replayed data.\n\n* **Stochastic Dueling Network Architectures:**  This architecture allows for more efficient representation and estimation of the value of different actions, speeding up learning.\n\n* **Trust Region Policy Optimization:**  This new optimization method ensures that the agent's policy updates remain within a safe and controlled region, preventing catastrophic performance drops during training.\n\nThe culmination of these advancements is an agent that achieves remarkable results on a variety of challenging tasks, including mastering classic Atari games and solving complex continuous control problems.  Our work signifies a significant step towards developing more robust, efficient, and capable reinforcement learning agents for real-world applications. \n",
    "Many machine learning models are susceptible to adversarial examples – carefully crafted inputs designed to fool the model while appearing normal to humans.  This vulnerability poses a serious threat to the reliability of these models.\n\nWe address this challenge by introducing three methods for detecting adversarial images.  Our detectors force attackers to make a trade-off: either create less effective adversarial examples that are harder to detect or generate more easily detectable examples that are less likely to fool the original classifier. \n\nOur most effective detection method leverages a key insight: adversarial examples tend to rely heavily on lower-ranked principal components, creating an abnormal pattern that our detector can identify. \n\nWe provide further analysis, including additional detectors and a visually informative saliency map, in the appendix.  Our work contributes to the development of more robust and secure machine learning systems by providing effective tools for detecting and mitigating adversarial attacks. \n",
    "This paper introduces a novel and theoretically sound method for kernel learning, leveraging a Fourier-analytic framework to characterize translation-invariant and rotation-invariant kernels. Our approach generates a sequence of feature maps that iteratively refine the support vector machine (SVM) margin, optimizing the separation between classes.\n\nWe establish rigorous guarantees for both optimality and generalization, interpreting our algorithm as an online equilibrium-finding process within a two-player min-max game.  Empirical evaluations on synthetic and real-world datasets showcase the scalability and superior performance of our method compared to existing random features-based approaches. \n\nOur contributions advance the field of kernel learning by providing a principled and efficient algorithm with strong theoretical foundations and demonstrable empirical advantages. \n",
    "Recurrent neural networks, while dominant in deep reading comprehension, suffer from limited parallelization and slow inference, especially for long texts. This paper proposes a novel convolutional architecture that achieves comparable accuracy to state-of-the-art recurrent models on question answering tasks while enabling up to 100x speedups due to increased parallelization. \n\n\n",
    "This report meticulously examines the reproducibility of the research presented in the paper \"On the regularization of Wasserstein GANs\" (2018). Our investigation focuses on rigorously replicating the key findings and evaluating the computational resources required for successful reproduction.\n\nWe prioritize five crucial aspects of the original work:\n\n1. **Learning Speed:**  We assess the convergence rate of the proposed regularized Wasserstein GAN training procedure compared to the original WGAN.\n\n2. **Stability:**  We evaluate the stability of the training process, focusing on the consistency of performance across multiple runs with different random initializations.\n\n3. **Hyperparameter Robustness:**  We investigate the sensitivity of the proposed method to variations in hyperparameters, such as regularization strength and learning rates.\n\n4. **Wasserstein Distance Estimation:** We reproduce the experiments aimed at estimating the Wasserstein distance between the generated and real data distributions.\n\n5. **Sampling Methods:** We explore the impact of different sampling strategies on the performance of the regularized WGAN.\n\nOur findings provide a detailed assessment of the reproducibility of each aspect, highlighting any discrepancies observed between our results and those reported in the original paper.  We also meticulously document the computational resources required for each experiment, providing valuable insights for researchers seeking to reproduce or build upon this work.  \n\nTo foster transparency and encourage further investigation, we have made the complete source code used for our reproduction study publicly available. This open-source contribution facilitates community engagement and promotes rigorous scientific validation in the field of generative adversarial networks. \n",
    "This study revisits the rate-distortion trade-off inherent in Variational Autoencoders (VAEs), particularly within the context of hierarchical VAEs, which utilize multiple layers of latent variables.  While β-VAEs generalize VAEs beyond probabilistic generative modeling by enabling explicit control over this trade-off, their application to hierarchical models reveals further nuances.\n\nWe identify a general class of hierarchical VAE inference models for which the overall information content (\"rate\") can be decomposed into contributions from individual latent layers. This decomposition allows for independent control over the rate at each layer, facilitating more targeted optimization for specific downstream tasks. \n\nWe derive theoretical bounds on downstream task performance as a function of the individual layer rates, establishing a formal relationship between latent representation complexity and task-specific success.  Extensive empirical evaluations validate our theoretical findings, demonstrating the practical implications of this layer-wise rate control. \n\nOur results provide valuable guidance for practitioners seeking to optimize hierarchical VAEs for specific applications. By understanding the interplay between layer-specific rates and downstream performance, researchers can effectively navigate the rate-distortion trade-off to achieve desired outcomes. \n",
    "Think of a social network like Facebook. You've got all these people connected in different ways, and sometimes you want to figure out who's similar to whom, or predict who might become friends in the future.  That's where node embeddings come in – they're like creating a secret code for each person that captures their connections and characteristics.\n\nOur method, Graph2Gauss, is a super smart way to learn these embeddings, even for massive networks.  But here's the twist: instead of just assigning a single code to each person, we represent them as a fuzzy cloud of possibilities, like acknowledging that we can't know everything about someone with absolute certainty.  \n\nThis \"Gaussian distribution\" embedding lets us capture uncertainty and reveals interesting things about the network.  For example, we can figure out how diverse someone's friend group is or discover hidden communities within the network. \n\nGraph2Gauss is also a master of adaptation.  It can handle different types of networks, with or without extra information about each person (like their interests or age).  And here's the best part: it can even figure out the code for brand new people who join the network, without needing to re-learn everything from scratch!\n\nWe put Graph2Gauss to the test on real-world networks and it crushed it!  It outperformed other methods in predicting links, classifying nodes, and even uncovering hidden network structures.  Turns out, embracing a bit of uncertainty can lead to some pretty amazing insights! \n",
    "Bridging the gap between different visual domains, where images exhibit distinct characteristics, poses a constant challenge in computer vision.  This work explores the potential of self-ensembling, a technique that leverages the model's own predictions to enhance its learning and generalization capabilities. \n\nInspired by the success of temporal ensembling and its mean teacher variant in semi-supervised learning, we adapt these ideas to the realm of visual domain adaptation.  Through careful modifications and refinements, we tailor our approach to address the unique challenges posed by domain shifts. \n\nThe results of our exploration are compelling. Our method achieves state-of-the-art performance on various benchmark datasets, including a winning entry in the VISDA-2017 visual domain adaptation challenge.  Remarkably, on smaller image benchmarks, our approach not only surpasses previous methods but also approaches the accuracy of models trained with full supervision, highlighting the power of self-ensembling for bridging the gap between domains.\n\nThis work underscores the potential of leveraging a model's own internal consistency to improve its understanding and generalization across diverse visual environments. \n\n\n",
    "This paper presents a definitive theoretical framework for understanding the fundamental nature of adversarial examples, a pervasive vulnerability in machine learning classifiers, including deep neural networks. We move beyond ad hoc defense mechanisms and delve into the core topological properties that govern the susceptibility of classifiers to adversarial attacks.\n\nOur analysis, grounded in rigorous mathematical foundations, elucidates the precise conditions under which a classifier (denoted as *f<sub>1</sub>*) can be fooled by an adversarial example, explicitly incorporating the role of an oracle (*f<sub>2</sub>*, analogous to human perception) in this vulnerability.\n\nBy examining the topological relationship between the (pseudo)metric spaces induced by *f<sub>1</sub>* and *f<sub>2</sub>*, we establish necessary and sufficient conditions for determining the robustness of *f<sub>1</sub>* against adversarial examples, as judged by *f<sub>2</sub>*.  Our theorems reveal a critical insight: even a single extraneous feature can render *f<sub>1</sub>* vulnerable to attack. This highlights the crucial importance of appropriate feature representation learning for achieving both accurate and robust classification. \n\nOur work provides a theoretical foundation for developing principled defenses against adversarial examples, shifting the focus from reactive measures to a deep understanding of the underlying geometric and topological properties that govern classifier robustness. \n",
    "This study proposes a framework for evaluating and training agents on their ability to efficiently gather information within partially observable environments. We introduce a set of tasks designed to assess an agent's capacity for strategic exploration, information retrieval, and knowledge integration.  Success in these tasks requires agents to search for fragmented information, synthesize gathered knowledge, and leverage it to achieve specific goals.\n\nOur approach combines deep learning architectures with reinforcement learning techniques.  We utilize a combination of extrinsic rewards, tied to task completion, and intrinsic rewards, encouraging exploration and information acquisition.  Empirical evaluations demonstrate that our trained agents exhibit effective information-seeking behavior. They actively explore the environment to reduce uncertainty, strategically prioritize information gathering, and effectively utilize acquired knowledge for task completion. \n\n\n",
    "Imagine a language model that can not only process words but also remember and recall relevant information from its past encounters.  We introduce a novel extension to neural network language models that bestows upon them this remarkable ability to adapt their predictions based on recent context.\n\nOur model, inspired by the concept of memory augmentation, operates like a highly efficient librarian. It stores a vast collection of past experiences – represented by hidden activations – in a readily accessible memory bank.  When faced with a new word, the model cleverly retrieves relevant memories through a simple dot product operation, seamlessly weaving past context into its predictions.\n\nThis mechanism, both elegant and scalable, enables the model to effortlessly handle vast memory sizes, ensuring that even distant past experiences can contribute to its understanding of the present. We draw a compelling parallel between this external memory mechanism and the cache models employed in traditional count-based language models, highlighting the underlying connections between these seemingly disparate approaches.\n\nOur experiments on various language modeling benchmarks reveal a resounding success.  Our model significantly outperforms recently proposed memory-augmented networks, demonstrating the effectiveness of our streamlined memory access and retrieval mechanism. This advancement paves the way for more contextually aware and sophisticated language models capable of generating human-like text with an enhanced understanding of the past.\n\n\n",
    "This paper proposes a novel training algorithm for Generative Adversarial Networks (GANs) that addresses limitations in the original GAN objective function. Our algorithm iteratively performs density ratio estimation and f-divergence minimization, leading to stronger gradients for the generator and improved training stability.  This approach leverages insights from density ratio estimation research, offering a new perspective on GAN training and opening avenues for incorporating diverse divergence measures and relative density ratios. \n",
    "Ever wondered how AI could compose catchy pop songs?  We've created a cool new system that does just that!  It uses a special kind of neural network, a hierarchical RNN, that's structured like a musical recipe. The lower layers of the network focus on creating memorable melodies, while the higher layers add in the beats and harmonies that make pop music so irresistible.\n\nWe even tested our music on real people, and they loved it!  They preferred our tunes over those generated by another AI system from Google.\n\nBut we didn't stop there. We also used our system to create some fun applications:\n\n* **Neural Dancing:**  Imagine virtual dancers grooving perfectly to AI-generated music! \n* **Neural Karaoke:** Sing along with lyrics that are automatically generated to match the melody.\n* **Neural Story Singing:**  Our system can even compose music that tells a story!\n\nIt's like having an AI band that can write, perform, and even inspire new dance moves! \n\n\n",
    "Understanding the loss landscape of deep neural networks is crucial for analyzing their training dynamics and generalization capabilities.  This work investigates the spectral properties of the Hessian matrix, which captures the curvature of the loss function, providing insights into the model's behavior around local optima.\n\nWe analyze the eigenvalue distribution of the Hessian both before and after training, observing a consistent pattern: a dense concentration of eigenvalues forming a \"bulk\" around zero and a sparser set of \"edge\" eigenvalues located further away from zero. \n\nOur empirical evidence suggests that the bulk eigenvalues reflect the degree of over-parameterization in the network.  A wider bulk, with more eigenvalues clustered near zero, indicates a higher degree of redundancy in the model's parameters. \n\nConversely, the edge eigenvalues, which are more dispersed and influenced by the training data, capture information specific to the learning task. These eigenvalues reflect the curvature along directions that are relevant for separating different classes or fitting the target function. \n\nBy decoupling the effects of over-parameterization and data-dependent learning through the analysis of the Hessian's eigenvalue spectrum, we gain a deeper understanding of how neural networks learn and generalize.  This insight can inform the design of more efficient and robust deep learning models. \n",
    "Imagine peering into the intricate workings of a computer program, deciphering its hidden language to unveil its true intent.  That's the essence of our work, where we unlock the secrets hidden within program execution logs.\n\nOur approach is akin to an artist extracting meaning from abstract shapes.  We first identify complex patterns within a program's behavior graph, those telltale signs that reveal its underlying purpose.  Then, like a sculptor molding clay, we transform these patterns into a continuous, flowing representation using the power of an autoencoder.  \n\nThis transformation unlocks a hidden world of structure and meaning.  We demonstrate the effectiveness of our approach on a real-world challenge: detecting malicious software. Our learned representations not only distinguish between benign and harmful programs but also reveal interpretable structures within the patterns themselves, offering insights into the very nature of malicious behavior.\n\nThis work opens up exciting new possibilities for understanding and analyzing software, empowering us to build more secure systems and unlock the full potential of program analysis. \n\n\n\n\n",
    "Inspired by the remarkable efficiency of insect brains, we put the FlyHash model – a novel sparse neural network – to the test in a challenging embodied navigation task.  Imagine an AI agent learning to navigate a maze by comparing its current view with memories of its previous journey. \n\nOur results were truly exciting!  FlyHash consistently outshone other, non-sparse models, demonstrating exceptional efficiency, particularly in how it encodes and processes visual information.  \n\nThis success highlights the incredible potential of biologically-inspired, sparse architectures for building lean and powerful AI systems.  By emulating the elegance and efficiency found in nature, we can create more capable and resource-aware AI, paving the way for exciting new applications in robotics, autonomous vehicles, and beyond! \n",
    "## Integrating Ranking Information into Peer Review Scores: A Principled Approach\n\n**Problem:**\n\n* Traditional peer review relies on quantized scores from reviewers, leading to a high number of ties and information loss.\n* Conferences are increasingly requesting paper rankings from reviewers to address this, but face challenges:\n    * **Arbitrariness:** Lack of standardized procedures for using ranking information leads to inconsistent application by Area Chairs.\n    * **Inefficiency:**  Existing interfaces and workflows are not designed to effectively incorporate rankings.\n\n**Our Solution:**\n\nWe propose a principled method to integrate ranking information directly into the review scores, producing updated scores that reflect both quantitative and ordinal assessments. \n\n**Advantages:**\n\n* **Mitigates Arbitrariness:** Our method ensures consistent application of ranking information across all papers.\n* **Seamless Integration:** Updated scores can be used within existing peer review interfaces and workflows.\n\n**Empirical Evaluation:**\n\n* Evaluations on synthetic datasets and ICLR 2017 peer review data demonstrate a 30% error reduction compared to the best-performing baseline.\n\n**Conclusion:**\n\nOur method offers a principled and practical solution for enhancing peer review by effectively incorporating ranking information, leading to more accurate and informative evaluation of scientific contributions. \n",
    "Does prestige truly influence academic peer review?  We dove deep into this intriguing question by analyzing a massive dataset of over 5,000 \"borderline\" submissions to the prestigious International Conference on Learning Representations (ICLR) from 2017 to 2022. \n\nOur investigation focused on uncovering any hidden associations between author metadata – those subtle signals of academic pedigree – and the ultimate fate of a paper: acceptance or rejection.  \n\nTo ensure a rigorous and unbiased analysis, we adopted the gold-standard framework of causal inference, carefully defining elements like treatment, timing, and potential outcomes.  Think of it like a detective meticulously examining every clue to piece together a compelling narrative.\n\nOur findings, while subtle, offer a glimpse into the complex dynamics of peer review. We uncovered weak evidence suggesting that author metadata might indeed play a role in decision-making.  \n\nFurther analysis, under a reasonable stability assumption, revealed a more intriguing pattern.  Borderline papers from prestigious institutions, those ranked within the top 30% or 20%, appeared to be slightly *less* favored by area chairs compared to similar papers from less prestigious institutions.  \n\nThis counterintuitive finding sparks further questions about the intricate interplay between human judgment and institutional prestige in the world of academic publishing.  Our work sheds light on the complex, often opaque, world of peer review, urging further exploration and a deeper understanding of the factors that influence scientific discourse. \n",
    "The Information Bottleneck principle, a powerful framework for extracting relevant information from data, has long been hampered by computational challenges.  We introduce Deep Variational Information Bottleneck (Deep VIB), a novel approach that bridges this gap by using variational inference to approximate the Information Bottleneck objective.\n\nThis variational formulation allows us to seamlessly integrate the Information Bottleneck into deep neural networks, enabling efficient training using the reparameterization trick.  Our experiments demonstrate the compelling advantages of Deep VIB. Models trained with this method exhibit superior generalization performance and enhanced robustness to adversarial attacks compared to those trained with traditional regularization techniques. \n\nDeep VIB unlocks the potential of the Information Bottleneck for deep learning, providing a powerful tool for learning robust and informative representations.\n",
    "Imagine you're trying to understand a sentence.  You don't just focus on every single word equally, right? You pay attention to the important parts, like the subject, verb, and object, and how they relate to each other.  \n\nThat's what attention networks do in deep learning – they help models focus on the most relevant parts of the input data. But sometimes, we need to understand more complex relationships, like the grammatical structure of a sentence or the connections between objects in an image.  \n\nWe explored a cool new way to incorporate this \"structural\" knowledge into attention networks.  It's like giving them a grammar guide or a map to help them understand the connections between things.  \n\nWe experimented with two types of structural attention:\n\n- **Linear Chain:** Think of it like highlighting words in a sentence based on their part of speech (noun, verb, adjective, etc.).\n- **Graph-based:** This is like drawing a diagram of how the words in a sentence connect to each other to form meaning.\n\nAnd guess what? It totally worked! Our structured attention networks outperformed regular attention models on all sorts of tasks, from translating languages to answering questions and even understanding the logic behind arguments.\n\nPlus, our models learned some really interesting stuff on their own, even without being explicitly told what to look for. It's like they figured out some hidden grammar rules just by paying attention to the structure of the data!  Pretty cool, right? \n",
    "Imagine a team of expert detectives, each specializing in a different type of crime. When faced with a particularly cunning criminal, their combined expertise and unique perspectives are crucial for cracking the case.  \n\nThat's the inspiration behind our novel approach to defending against adversarial examples – those sneaky inputs designed to fool AI systems.  We propose an ensemble of \"specialist\" models, each trained to excel in distinguishing between specific classes.  \n\nOur key insight is that adversarial examples tend to exploit predictable patterns of confusion.  By analyzing the confusion matrix, we can identify these weaknesses and create specialized models to address them. \n\nWhen these specialists work together, they can effectively identify and flag suspicious inputs.  If the experts disagree – resulting in high entropy or uncertainty – it's a strong signal that the input might be an adversarial example, and we can choose to reject it rather than risk a misclassification.\n\nOur experimental results confirm the power of this \"wisdom of the crowds\" approach.  Instead of trying to correctly classify every single input, which can be a losing battle against crafty adversaries, our ensemble focuses on identifying and rejecting the most suspicious cases.  This strategy significantly enhances the robustness of our system, providing a more secure and reliable defense against adversarial attacks. \n",
    "This paper introduces Neural Phrase-based Machine Translation (NPMT), a novel approach that leverages Sleep-WAke Networks (SWAN) to explicitly model phrase structures in the target language.  To address the monotonic alignment limitations of SWAN, NPMT incorporates a layer for soft local reordering of the input sequence.\n\nUnlike prevalent attention-based neural machine translation (NMT) models, NPMT generates translations by directly outputting phrases in sequential order, enabling linear-time decoding.  Experimental results on IWSLT 2014 German-English/English-German and IWSLT 2015 English-Vietnamese translation tasks show that NPMT achieves comparable or superior performance to strong NMT baselines.  Furthermore, analysis of the generated output indicates that NPMT produces meaningful and coherent phrases. \n",
    "Generating realistic images with AI is a hot topic, and Generative Adversarial Networks (GANs) are leading the charge. But most GANs struggle to create images that capture the relationships between objects and their backgrounds.  \n\nWe introduce LR-GAN, a new GAN that's smarter about scene composition. It breaks down the image generation process into two key steps:\n\n1. **Background Generation:**  First, LR-GAN learns to create realistic backgrounds.\n2. **Foreground Generation and Integration:**  Then, it generates foreground objects (like people, cars, or trees) one by one, learning their appearance, shape, and position.  The clever part is that LR-GAN places these objects onto the background in a way that makes sense – ensuring they fit naturally into the scene.\n\nThe entire process is unsupervised, meaning LR-GAN figures everything out just by looking at images.  We train it using standard gradient descent methods, and the results are impressive! LR-GAN generates images that look more natural and contain objects that are easier for humans to recognize compared to other GANs like DCGAN. \n\n\n\n\n",
    "This paper introduces a novel and surprisingly effective approach for unsupervised skill acquisition in reinforcement learning agents.  Our method, inspired by the concept of self-play, pits two identical agents – Alice and Bob – against each other in a game of exploration and mastery.\n\nAlice, acting as the \"task proposer,\" performs a sequence of actions within the environment.  Bob's challenge is to either undo these actions (in reversible environments) or repeat them (in resettable environments).  This simple yet elegant framework, driven by a carefully crafted reward structure, creates an automatic curriculum of exploration, enabling the agents to learn valuable skills without any external supervision.\n\nThe benefits of this unsupervised training are undeniable. When Bob is subsequently tasked with solving a traditional reinforcement learning problem within the same environment, he exhibits a remarkable advantage.  Not only does he require significantly fewer supervised training episodes to achieve competence, but in some cases, he even surpasses the performance of agents trained solely with supervision.\n\nOur work demonstrates the power of self-play and intrinsic motivation in unlocking unsupervised skill learning, paving the way for more efficient and adaptable reinforcement learning agents capable of tackling complex tasks with minimal external guidance. \n\n\n",
    "Maximum entropy modeling, a powerful framework for building statistical models from limited information, has traditionally relied on optimizing directly over complex probability distributions. This paper introduces a novel and elegant alternative: learning a smooth and invertible transformation that maps a simple, known distribution to the desired maximum entropy distribution. \n\nImagine sculpting a complex shape from a block of clay. Instead of painstakingly molding the clay directly, we learn a series of smooth deformations that transform a simple sphere into the desired intricate form.\n\nThis approach, however, presents a unique challenge: the objective function, entropy, depends on the density itself, making optimization non-trivial. We overcome this hurdle by leveraging the expressive power of normalizing flow networks, a class of models that learn invertible transformations. \n\nThis transformation allows us to recast the maximum entropy problem into a finite-dimensional constrained optimization problem, making it amenable to efficient solutions. We employ a combination of stochastic optimization and the augmented Lagrangian method to navigate this constrained space and arrive at the optimal transformation.\n\nOur approach is not just theoretically appealing; it also delivers impressive results.  Simulations showcase the effectiveness of our method, while applications in finance and computer vision demonstrate its versatility and accuracy.  From modeling financial time series to capturing the intricate patterns in images, maximum entropy flow networks emerge as a powerful and flexible tool for statistical modeling. \n\n\n",
    "Okay, so AI is killing it these days, tackling all sorts of tough problems.  General AI, like the super-smart kind we see in movies, feels within reach, right? But here's the thing: most researchers are laser-focused on specific tasks like image recognition or translation, which is awesome, but kind of limited. \n\nWe think it's because we don't really have a good way to measure how close we are to that truly \"smart\" AI.  So, we're proposing a checklist for general AI – the things a truly intelligent machine should be able to do.  \n\nAnd to make it even better, we built a platform where we can actually test AI systems against this checklist.  We kept things simple and focused on the core abilities of a general AI, without getting bogged down in all the bells and whistles.  Think of it like a training ground for future super-smart AI! \n",
    "Okay, so you know how neural networks are great at processing data like images and text?  Well, some super smart folks figured out how to make them work on graphs too – those networks of nodes and edges.  This is awesome for things like understanding sentence structure (parse trees) or analyzing molecules (molecular graphs). \n\nBut there's a catch: every graph is different, with its own unique shape and size.  This makes it tricky to train and run these graph neural networks efficiently, especially with those popular deep learning libraries that like things neat and predictable.\n\nSo, we came up with a cool trick called \"dynamic batching.\"  It's like taking a bunch of different puzzles, each with different pieces, and figuring out how to fit them together into a bigger picture.  We can even batch operations within a single graph, like grouping similar tasks together. \n\nThe best part?  We can use this dynamic batching to create regular, static graphs that can run smoothly on those popular libraries. It's like translating a messy, ever-changing map into a clear, organized grid!\n\nTo make things even easier, we built a library of building blocks that you can use to create your own dynamic graph models.  It's like having a set of Lego bricks specifically designed for graph neural networks! We even show how to use it to build some popular models from research papers, all running efficiently and in parallel.  Pretty neat, huh? \n",
    "Deep learning models have achieved remarkable success in natural language processing, but their inner workings often remain shrouded in mystery. This lack of transparency makes it difficult to understand the reasoning behind their decisions, treating them as inscrutable black boxes.\n\nThis paper sheds light on the decision-making process of Long Short-Term Memory networks (LSTMs), a popular type of deep learning model for language tasks. We introduce a novel method for tracking the importance of specific input words to the LSTM's final output. \n\nBy identifying consistently influential word patterns, we can distill the knowledge embedded within state-of-the-art LSTMs trained on sentiment analysis and question answering tasks.  This process results in a set of representative phrases that encapsulate the model's learned understanding.\n\nTo validate the effectiveness of our approach, we construct a simple, rule-based classifier using these extracted phrases. This interpretable model mimics the LSTM's behavior, demonstrating that our method successfully captures the essential knowledge learned by the deep learning model.  \n\nOur work contributes to a deeper understanding of deep learning in natural language processing, moving beyond black-box predictions towards a more transparent and interpretable approach. \n",
    "Imagine teaching a robot to play a complex video game.  It's tough because the robot only gets rewards for completing specific tasks, and sometimes those tasks take a really long time to figure out!\n\nWe came up with a cool new way to help robots learn these challenging games faster. It's like giving them a \"training camp\" where they can practice different skills before tackling the actual game.  \n\nHere's how it works:\n\n1. **Skill School:** We create a special practice environment where the robot can learn useful skills, like jumping, grabbing objects, or navigating obstacles.  We give the robot a simple reward for exploring and mastering these skills, without even needing to know the details of the final game.\n\n2. **Game Time:**  Once the robot has learned a bunch of skills, we train a \"boss\" in its brain that can choose which skills to use at the right time.  This boss helps the robot explore the game world more effectively and figure out how to solve those tricky, long-term tasks.\n\nTo teach the robot all these skills quickly, we use special neural networks that are really good at learning from limited experience.  We also add a special ingredient called an \"information-theoretic regularizer,\" which helps the robot learn a diverse set of skills that are easy to understand. \n\nWe tested our method on a bunch of different games, and it worked like a charm! Our robots learned a wide range of skills really quickly and then used them to conquer the games much faster than robots that didn't go to skill school.  It's like giving them a secret cheat code for learning! \n\n\n",
    "The world of deep generative models is buzzing with two rising stars: Generative Adversarial Networks (GANs), known for their stunningly realistic creations, and Variational Autoencoders (VAEs), celebrated for their elegant probabilistic framework.  For years, these two approaches have been seen as distinct, their respective communities exploring separate paths. \n\nBut what if these seemingly disparate paradigms were, in fact, two sides of the same coin?  This paper embarks on a journey to bridge the gap between GANs and VAEs, unveiling their hidden connections through a novel unified formulation.\n\nImagine GANs engaging in a clever game of deception, learning to generate samples that mimic real data.  We recast this generation process as a form of \"posterior inference,\" where the generator seeks to uncover the hidden distribution that gave rise to the data.  \n\nOur key insight is that both GANs and VAEs involve minimizing KL divergences, a measure of dissimilarity between probability distributions.  However, they do so in opposite directions, mirroring the two phases of the classic wake-sleep algorithm.  \n\nThis unified perspective unlocks a treasure trove of possibilities.  We can now analyze a wide range of existing GAN and VAE variants through a shared lens, transferring techniques and insights between these previously separate worlds.  \n\nFor instance, we borrow the \"importance weighting\" trick from the VAE toolbox to boost GAN training, leading to more stable and realistic sample generation. Conversely, we infuse VAEs with an adversarial mechanism, leveraging generated samples to enhance their learning capabilities.\n\nOur experiments confirm the power of this unified view, demonstrating the effectiveness of these cross-pollination techniques.  This harmonious marriage of GANs and VAEs opens up exciting new avenues for research, paving the way for even more powerful and expressive deep generative models. \n",
    "This study addresses the problem of out-of-distribution (OOD) image detection in neural networks. The proposed method, ODIN, is designed to work with pre-trained networks without requiring any modifications to their architecture or training process. \n\nODIN leverages temperature scaling and input perturbations to enhance the separation between softmax score distributions of in-distribution and OOD images. Experiments demonstrate that ODIN consistently outperforms baseline approaches across various network architectures and datasets, achieving state-of-the-art results.  \n\nFor instance, when applied to a DenseNet trained on CIFAR-10, ODIN reduces the false positive rate from 34.7% to 4.3% at a true positive rate of 95%.  These findings highlight the effectiveness of ODIN as a simple yet powerful method for OOD detection. \n",
    "This study presents a framework for unsupervised representation learning in large-scale neural networks based on the infomax principle. Utilizing an asymptotic approximation of Shannon's mutual information, the study demonstrates that a hierarchical infomax approach provides a strong initialization for optimizing the global information-theoretic objective. \n\nThe proposed method employs gradient descent on the objective function to refine the initial solution, enabling the learning of representations for complete, overcomplete, and undercomplete bases. Numerical experiments indicate that the algorithm effectively extracts salient features from datasets, exhibiting robustness and efficiency.\n\nComparisons with existing methods suggest that the proposed algorithm offers advantages in both training speed and robustness for unsupervised representation learning. Additionally, the framework is readily adaptable to supervised and unsupervised learning of deep network architectures. \n",
    "Recurrent Neural Networks (RNNs) are amazing at handling sequential data, but dealing with really long sequences can be a bit of a headache. Things like slow processing, vanishing gradients, and difficulty remembering information from way back in the sequence can make training a real challenge.  \n\nWe've come up with a clever solution: the Skip RNN!  It's like giving your RNN a shortcut button.  Our model learns to strategically skip unnecessary computations, essentially making the processing chain shorter and more efficient.  \n\nYou can even set a \"budget\" for the Skip RNN, encouraging it to be super frugal with its computations.  We put our model through its paces on a bunch of different tasks and were blown away by the results!  It significantly reduced the number of steps the RNN needed to take while maintaining or even improving accuracy compared to regular RNNs.\n\nWe're so excited about the potential of Skip RNNs to make sequence modeling faster and more powerful. And to share the love, we've made our code available for everyone to use and explore! \n\n\n",
    "This paper introduces a simple yet effective \"warm restart\" technique for Stochastic Gradient Descent (SGD) optimization, designed to improve the performance of deep neural network training.  Our method, called SGDR, periodically resets the learning rate during training, allowing the optimizer to escape local optima and converge faster. \n\nWe demonstrate state-of-the-art results on CIFAR-10 (3.14% error) and CIFAR-100 (16.21% error), showcasing the effectiveness of SGDR.  Further experiments on EEG data and a downsampled ImageNet dataset confirm its benefits across diverse tasks.  The source code for our method is publicly available at: https://github.com/loshchil/SGDR. \n\n\n",
    "Imagine an AI agent learning to navigate a complex world, making decisions and learning from its successes and failures. This is the realm of reinforcement learning, where policy gradient methods have emerged as powerful tools for training these intelligent agents.  \n\nHowever, these methods often struggle with a frustrating problem: noisy estimates of how good their actions are, leading to slow and inefficient learning.  It's like trying to learn a new skill with a coach who gives inconsistent and unreliable feedback. \n\nOur work introduces a clever solution inspired by a mathematical tool called Stein's identity.  We call it a \"control variate\" method, and it acts like a noise-canceling headphone for the agent's learning process.  \n\nPrevious control variate methods were limited in their ability to filter out noise. We've overcome this limitation by introducing more flexible and adaptable baseline functions that can better account for the complexity of the agent's actions.  \n\nThe result?  A dramatic boost in learning speed and efficiency!  Our method significantly improves the sample efficiency of state-of-the-art policy gradient algorithms, allowing agents to learn more effectively from their experiences.  It's like giving our AI agents a clearer path to mastery, enabling them to navigate the world with greater confidence and skill. \n",
    "Skip connections have revolutionized deep learning, enabling the training of exceptionally deep networks and becoming a cornerstone of modern architectures. However, a comprehensive understanding of their success remains elusive. This paper presents a novel explanation for the benefits of skip connections, focusing on their ability to mitigate singularities in the loss landscape that hinder deep network training.\n\nOur analysis identifies three primary types of singularities:\n\n1. **Overlap Singularities:**  Arise from the permutation symmetry of nodes within a layer, where different permutations of nodes result in the same function, creating a degenerate manifold in the loss landscape.\n\n2. **Elimination Singularities:**  Occur when nodes are consistently deactivated, essentially eliminating them from the network and leading to a loss of representational capacity.\n\n3. **Linear Dependence Singularities:**  Arise from linear dependencies between nodes, reducing the effective dimensionality of the learned representation.\n\nThese singularities create challenging regions in the loss landscape, characterized by flat or poorly conditioned areas that impede gradient-based optimization algorithms. \n\nWe argue that skip connections effectively address these singularities through multiple mechanisms:\n\n* **Symmetry Breaking:** Skip connections break the permutation symmetry of nodes, reducing the prevalence of overlap singularities.\n\n* **Node Elimination Prevention:** By providing alternative pathways for information flow, skip connections reduce the likelihood of node elimination, mitigating elimination singularities.\n\n* **Reduced Linear Dependence:** Skip connections encourage diversity in node activations, reducing linear dependence and alleviating associated singularities.\n\nFurthermore, for common initialization schemes, skip connections shift the network away from the regions most affected by these singularities. They effectively \"sculpt\" the loss landscape, smoothing out problematic areas and facilitating smoother optimization.\n\nWe provide evidence for these hypotheses through a combination of theoretical analysis using simplified models and empirical validation on deep networks trained on real-world datasets. Our findings offer a new perspective on the role of skip connections in deep learning, emphasizing their ability to reshape the loss landscape and enable efficient training of very deep architectures. \n",
    "Embarking on a journey to replicate the groundbreaking results of the \"Natural Language Inference over Interaction Space\" paper for the ICLR 2018 Reproducibility Challenge, we initially faced the task of recreating the model from scratch, unaware of the availability of the original code.  \n\nOur independent implementation, a testament to the paper's clear and insightful description, achieved a commendable 86.38% accuracy on the Stanford NLI dataset.  While this fell slightly short of the 88.0% accuracy reported in the paper, our investigation revealed the likely culprits: differences in optimization techniques and model selection strategies. \n\nThis exercise highlights the subtle yet significant impact of these often-overlooked aspects of deep learning research.  While our reproduction closely approached the original results, it underscores the importance of transparency and detailed documentation for ensuring complete reproducibility in the rapidly evolving field of AI. \n",
    "Replicating existing research is a fundamental aspect of scientific progress, ensuring the robustness and reliability of published findings.  We undertook the task of reproducing the \"Learn to Pay Attention\" model, an innovative approach for integrating attention mechanisms within convolutional neural networks. \n\nOur implementation faithfully recreated the model's architecture and training procedures. We then carefully evaluated its performance on the core tasks of image classification and fine-grained recognition, mirroring the experimental setup of the original paper.  \n\nOur results successfully replicated the key findings reported in the original study, providing independent validation of the \"Learn to Pay Attention\" model's effectiveness. This exercise not only reinforces the original contribution but also deepens our own understanding of the intricacies of attention mechanisms and their application in computer vision. \n\n\n",
    "Imagine capturing the essence of a sentence, its meaning distilled into a powerful code that unlocks a world of possibilities.  That's the quest for universal sentence representations, a fundamental challenge in natural language processing.\n\nOur work introduces a novel approach to learning these powerful representations, focusing on the subtle cues hidden within the suffixes of word sequences.  Think of it like deciphering a secret language where the endings of words hold the key to unlocking deeper meaning.\n\nWe trained our model on the massive Stanford Natural Language Inference dataset, teaching it to discern subtle relationships between sentences.  And the results are truly inspiring! \n\nOur approach surpasses existing methods on several transfer tasks in the SentEval benchmark, a testament to its ability to capture the core meaning of a sentence.  This breakthrough paves the way for a new era of natural language understanding, where AI can grasp the nuances of human communication with unprecedented accuracy. \n\n\n",
    "We're always looking for ways to help neural networks learn more effectively, and one common strategy is to enrich their understanding by creating new features from existing ones.  Think of it like giving them extra building blocks to work with!\n\nWe decided to investigate the impact of using polynomial features – think of them as combinations of existing features raised to different powers – in the context of natural language inference. This task involves understanding the relationship between two sentences, which can be quite challenging!\n\nWe experimented with different polynomial degrees and, interestingly, discovered that scaling up those degree 2 features had the most significant positive impact. It's like finding the sweet spot for combining information!  In our best models, this simple adjustment led to a 5% reduction in classification error, showing that even small tweaks can make a big difference in helping neural networks learn more effectively. \n",
    "Here are the main points of the text as bullet points:\n\n* This study presents a generalization bound for feedforward neural networks. \n* The bound is expressed in terms of:\n    * The product of the spectral norm of the weight matrices for each layer.\n    * The Frobenius norm of the weights across all layers.\n* The generalization bound is derived using a PAC-Bayes analysis. \n",
    "Batch Normalization, a ubiquitous technique in deep learning, has long been shrouded in empirical success, its theoretical underpinnings remaining somewhat elusive.  This work embarks on a journey to unveil the probabilistic foundations of Batch Normalization, illuminating its inner workings through a novel probabilistic interpretation. \n\nWe introduce an elegant probabilistic model and demonstrate that Batch Normalization can be viewed as maximizing a lower bound on its marginal log-likelihood.  This insight provides a fresh perspective on this widely adopted technique, grounding it in the principles of probabilistic inference.\n\nGuided by this new probabilistic understanding, we devise an algorithm that operates consistently during both training and testing, eliminating the discrepancies that often plague traditional Batch Normalization.  However, this ideal algorithm comes at the cost of computational efficiency.  \n\nTo address this challenge, we introduce Stochastic Batch Normalization, an efficient approximation that faithfully captures the essence of the ideal inference procedure while maintaining practical computational demands.  This innovation unlocks a valuable tool for estimating uncertainty in deep learning models, a crucial aspect for building robust and reliable AI systems.\n\nWe showcase the prowess of Stochastic Batch Normalization on a variety of popular architectures, including convolutional behemoths like VGG and ResNets, achieving compelling results on benchmark datasets such as MNIST and CIFAR-10. Our work illuminates the path towards a deeper understanding of Batch Normalization, empowering practitioners with a more principled and robust approach to training deep neural networks. \n",
    "In deep learning's realm, where convolutions reign supreme,\nA belief persists, information lost, a fleeting dream.\n\nUninformative variance, cast aside with every stride,\nLeaving behind a trace, where input secrets hide.\n\nBut what if knowledge held, could flow in both directions, true?\nAn invertible path, where information's dance anew.\n\nWe challenge dogma's hold, with i-RevNet's graceful art,\nA cascade of homeomorphisms, a masterpiece, set apart.\n\nNo information discarded, a mirror to reflect,\nFrom input to representation, a journey we dissect.\n\nIll-conditioned inversions, a hurdle to surmount,\nWith explicit inverses, a solution we recount.\n\nContraction and separation, a dance of form so deep,\nAn alternate explanation, for deep networks' secrets to keep.\n\nLinear interpolations, like brushstrokes on a canvas bright,\nRevealing i-RevNet's vision, in shades of purest light. \n",
    "Deep latent variable models are really good at learning meaningful representations of data, like uncovering the hidden factors that explain what we observe.  We took a close look at a popular model called the \"deep information bottleneck\" and noticed some areas where it could be improved.  \n\nWe introduced a clever trick called a \"copula transformation,\" which essentially reshapes the data in a way that makes it easier for the model to separate out the underlying factors.  Think of it like untangling a messy ball of yarn!  \n\nThis transformation has a cool side effect: it encourages the model to use only a small number of latent variables to represent the data, making the representation more compact and efficient.  We call this \"sparsity.\"\n\nWe put our new model to the test on both artificial and real datasets and the results were impressive!  It learned more disentangled and meaningful representations than the original deep information bottleneck model, demonstrating the power of our copula transformation for improving representation learning. \n\n\n",
    "Building upon the success of the MAC model for visual question answering, we introduce a streamlined variant that achieves comparable accuracy while boasting faster training times.  Our simplified architecture, a testament to efficient design, retains the core strengths of MAC while reducing computational complexity. \n\nWe put both models through their paces on the challenging CLEVR and CoGenT datasets, visual question answering benchmarks that test a model's ability to reason about complex scenes.  Our results showcase the power of transfer learning.  By fine-tuning our models on these datasets, we achieve a remarkable 15-point accuracy boost, matching state-of-the-art performance. \n\nHowever, our exploration also reveals a cautionary tale.  We demonstrate that improper fine-tuning can actually lead to a *decrease* in accuracy, highlighting the importance of careful consideration and meticulous implementation when adapting pre-trained models to new tasks. \n\nOur work underscores the delicate balance between model complexity and performance in visual question answering.  By simplifying the MAC architecture while retaining its core principles, we unlock faster training and maintain competitive accuracy, paving the way for more efficient and effective visual reasoning systems. \n\n\n",
    "Adaptive Computation Time (ACT) for Recurrent Neural Networks (RNNs) has emerged as a powerful approach for enabling dynamic computation in sequence modeling.  ACT allows RNNs to process individual input elements multiple times, adaptively determining the optimal number of computational steps based on the input complexity.\n\nThis paper investigates an alternative approach to variable computation in RNNs, which we call Repeat-RNN.  Unlike ACT, which dynamically adjusts the number of repetitions, Repeat-RNN processes each input element a fixed number of times, determined as a hyperparameter during training. \n\nWe conduct a comparative analysis of ACT and Repeat-RNN on a range of sequence modeling tasks.  Surprisingly, our results reveal that Repeat-RNN achieves comparable performance to ACT across these tasks, suggesting that the dynamic halting mechanism of ACT may not be essential for achieving strong performance in certain scenarios. \n\nOur findings challenge the prevailing assumption that adaptive computation is always superior to fixed computation in RNNs.  Repeat-RNN's simplicity and competitive performance suggest it may be a viable alternative to ACT in situations where computational resources are limited or deterministic computation is preferred. \n\nTo encourage further investigation and facilitate reproducible research, we provide open-source implementations of both ACT and Repeat-RNN in popular deep learning frameworks, TensorFlow and PyTorch, at: https://imatge-upc.github.io/danifojo-2018-repeatrnn/ \n\n\n",
    "This study investigates the potential of Generative Adversarial Networks (GANs) for anomaly detection, a task where their ability to model complex, high-dimensional data distributions is particularly advantageous. We leverage recent advancements in GAN architectures to develop a novel anomaly detection method that achieves state-of-the-art performance on benchmark image and network intrusion datasets.  \n\nSignificantly, our approach boasts a dramatic speedup of several hundred-fold during test time compared to the only previously published GAN-based anomaly detection method.  This efficiency, coupled with superior accuracy, positions GANs as a powerful and practical tool for anomaly detection in diverse domains. \n\n\n",
    "This paper addresses the Natural Language Inference (NLI) task, which involves determining the logical relationship between a premise and a hypothesis expressed in natural language.  The authors propose Interactive Inference Network (IIN), a class of neural network architectures designed for NLI.\n\nIINs extract hierarchical semantic features from an \"interaction space,\" representing the relationship between the premise and hypothesis. The study highlights that the interaction tensor, corresponding to attention weights, contains semantic information relevant to NLI, with denser tensors encoding richer information.\n\nOne specific IIN architecture, Densely Interactive Inference Network (DIIN), is evaluated on several large-scale NLI datasets.  Results indicate that DIIN achieves state-of-the-art performance, including a greater than 20% error reduction on the Multi-Genre NLI (MultiNLI) dataset compared to the previous best-performing system. \n\n\n",
    "## Enhancing Neural Network Robustness through Formal Verification\n\n**Problem:**\n\n- Adversarial examples, subtly modified inputs designed to cause misclassification, severely limit the deployment of neural networks in safety-critical applications.\n- Numerous proposed defenses against adversarial attacks have proven ineffective, with many quickly circumvented by new attack strategies. \n\n**Our Approach:**\n\n- We leverage formal verification techniques to rigorously analyze and enhance neural network robustness. \n\n**Key Contributions:**\n\n* **Provably Minimal Adversarial Examples:** We develop a method for constructing adversarial examples with provable minimal distortion, providing a powerful tool for evaluating defense mechanisms. \n* **Formal Verification of Adversarial Retraining:**  We demonstrate that adversarial retraining, a popular defense technique, provably increases the distortion required to generate successful adversarial examples by a factor of 4.2, providing strong evidence for its effectiveness. \n\n**Impact:**\n\nOur work signifies a crucial step towards building trustworthy and reliable neural networks for real-world applications by employing formal verification to create robust defenses against adversarial attacks.  \n",
    "Deep neural networks are like brilliant but mysterious artists, capable of producing amazing results but often leaving us wondering, \"How did they do that?\"  Their complex inner workings have earned them the label of \"black boxes,\" limiting our trust and understanding of their decisions.\n\nThis work unveils the secrets behind these enigmatic creations, introducing a powerful tool called Agglomerative Contextual Decomposition (ACD).  Imagine peering into the mind of a neural network, uncovering the hidden connections and patterns it uses to make predictions.\n\nACD acts like an art critic, meticulously dissecting a neural network's prediction by creating a hierarchical clustering of the input features.  Think of it like grouping brushstrokes on a canvas based on their contribution to the overall composition. Each cluster represents a meaningful grouping of features that the network has learned are important for making accurate predictions. \n\nBut ACD goes beyond mere visualization; it empowers us to:\n\n* **Diagnose Errors:** Uncover why a network might be making mistakes, like identifying biases in the data or highlighting inconsistencies in the model's reasoning.\n* **Compare Models:** Determine which of two networks is more accurate and trustworthy, providing insights for model selection.\n* **Build Trust:** Help users understand and trust a network's predictions by revealing its underlying logic.\n\nRemarkably, ACD's hierarchical interpretations are robust to adversarial attacks – those subtle manipulations designed to fool AI systems. This resilience suggests that ACD captures the true essence of the input, ignoring superficial noise and focusing on fundamental patterns.  \n\nOur work illuminates the path towards more transparent and trustworthy deep learning, empowering us to harness the power of these complex models with greater confidence and understanding. \n\n\n",
    "In music's vibrant tapestry, where timbre's hues entwine,\nWe seek to shift its colors, while melody remains divine.\n\nFrom instrument to instrument, a sonic transformation,\nPreserving pitch and rhythm, a delicate orchestration.\n\nLike brushstrokes on a canvas, in time-frequency's embrace,\nWe paint with style transfer, in this audio space.\n\nTimbreTron, our creation, a fusion of art and might,\nUnveiling timbre's secrets, in waves of sound and light.\n\nCQT, our chosen palette, with pitch equivariance bright,\nA canvas for convolutions, to sculpt sounds just right.\n\nHuman ears, our judges, in a symphony of tests,\nConfirm TimbreTron's mastery, exceeding all requests.\n\nMonophonic whispers, polyphonic grand displays,\nTimbre transformed, yet music's soul, in every note, it stays. \n",
    "The art of language modeling, predicting the intricate dance of words, has long relied on capturing both the immediate context and the broader narrative flow.  We explore a novel approach that harmoniously blends these two temporal perspectives, enriching the representation of language with a touch of dynamic adaptation.\n\nOur work delves into the realm of language models with dynamically evolving weights, extending this paradigm by casting language modeling as an \"online learning-to-learn\" challenge.  Imagine a meta-learner, a conductor orchestrating the evolution of a language model's weights through the elegant guidance of gradient descent.\n\nThis meta-learner, a master of adaptation, continuously refines the model's internal representation, allowing it to seamlessly integrate both short-term, hidden-state-based memories and medium-term knowledge encoded within the dynamic weights. This harmonious fusion of temporal scales paves the way for more expressive and contextually aware language models, capable of capturing the subtle nuances of human communication. \n",
    "Okay, so GANs are these awesome AI models that can learn to create super realistic images, like they're figuring out the secret recipe for making pictures that look like the real world. \n\nWe thought, \"Hey, since GANs are so good at understanding the structure of images, could we use them to make our models even smarter?\"\n\nSo, we came up with this cool trick called \"manifold regularization.\"  It's like adding a special ingredient that helps the GAN learn smoother and more consistent representations of the data.  \n\nWe used a clever technique to approximate something called the Laplacian norm, which basically measures how smooth the GAN's understanding of the image manifold is. And the best part is, it's super easy to calculate using the GAN itself!\n\nWe combined this with another awesome GAN called Improved GAN, and bam!  We achieved state-of-the-art results for semi-supervised learning on the CIFAR-10 image dataset.  That means our model can learn from just a little bit of labeled data and a lot of unlabeled data, which is super helpful when labeling data is expensive. \n\nPlus, our method is way easier to use than other fancy techniques out there. It's like giving you a superpower for training better GANs without breaking a sweat! \n",
    "Certain over-parameterized deep neural networks, utilizing standard activation functions and trained with cross-entropy loss, exhibit a remarkable property: the absence of detrimental local valleys in their loss landscapes.  We prove that for these networks, a continuous path always exists from any point in parameter space along which the cross-entropy loss monotonically decreases, approaching zero arbitrarily closely.  This finding implies that such networks are devoid of sub-optimal strict local minima, ensuring that gradient-based optimization algorithms can consistently converge to globally optimal solutions. \n",
    "Imagine asking an AI, \"How many zebras are in this picture?\" It sounds simple, but counting objects in images has actually been a tough challenge for Visual Question Answering (VQA) models.\n\nWe discovered that the way these models use \"soft attention\" – like gently focusing on different parts of the image – makes it hard for them to count accurately. \n\nSo, we designed a special neural network component that's a counting whiz! It works by looking at object proposals – basically, guesses about where objects might be in the picture – and then counting them up reliably.  \n\nWe tested our component on a simple counting task, and it aced it!  Then we incorporated it into a VQA model and saw some amazing results. Our model achieved top-notch accuracy on counting questions in the VQA v2 dataset, even beating out those bulky ensemble models!  Plus, it didn't mess up the accuracy on other types of questions, which is super important.\n\nOur counting component also made a huge difference on a really tough metric called \"balanced pair accuracy,\" boosting performance by a whopping 6.6%.  It's like giving VQA models a superpowered counting lens! \n\n\n\n",
    "## Spectral Normalization for Generative Adversarial Networks\n\n**Challenge:**\n\n* Generative Adversarial Networks (GANs) are known for their unstable training dynamics.\n\n**Solution:**\n\n* We introduce Spectral Normalization (SN), a novel weight normalization technique specifically designed to stabilize GAN training. \n\n**Advantages:**\n\n* **Computationally Efficient:** SN is lightweight and easy to integrate into existing GAN implementations.\n* **Improved Image Quality:**  SN-GANs (GANs with spectral normalization) generate images of comparable or superior quality compared to previous stabilization techniques.\n\n**Empirical Validation:**\n\n* Extensive experiments on CIFAR-10, STL-10, and ILSVRC2012 datasets demonstrate the effectiveness of SN in stabilizing GAN training and enhancing image generation quality. \n\n\n",
    "Imagine turning complex networks, like social media connections or molecular structures, into a language that AI can understand! That's the power of node embedding algorithms, which represent each node in a graph as a point in a multi-dimensional space. \n\nWhile this field is relatively new compared to the well-established world of natural language processing, we're excited to explore the potential of these algorithms and shed light on their unique characteristics. \n\nWe conducted a comprehensive study, examining the performance of four popular node embedding algorithms across diverse graphs, characterized by different centrality measures.  Think of centrality as a measure of a node's importance within the network. \n\nOur experiments, spanning six datasets and a range of graph centralities, revealed fascinating insights into the strengths and weaknesses of different embedding algorithms.  This newfound knowledge provides a valuable foundation for further research and development, paving the way for more effective and insightful network analysis.\n\nThe future of graph representation learning is bright, and we're eager to continue uncovering its potential to unlock the hidden patterns and knowledge within complex networks. \n",
    "This paper introduces a novel dataset for evaluating logical entailment in AI models.  We benchmark a range of popular sequence processing architectures, including convolutional networks, LSTM RNNs, and tree-structured networks, against a new model class called PossibleWorldNets, which computes entailment via a \"convolution over possible worlds.\"  \n\nOur findings reveal that:\n\n* Convolutional networks lack the appropriate inductive bias for logical reasoning tasks.\n* Tree-structured networks outperform LSTMs due to their ability to exploit syntactic structure.\n* PossibleWorldNets achieve superior performance, demonstrating the effectiveness of our proposed approach for capturing the nuances of logical entailment. \n\nThis dataset and our analysis provide valuable insights for developing AI systems capable of robust and accurate logical reasoning. \n\n\n",
    "Imagine discovering a \"winning lottery ticket\" hidden within a massive neural network – a smaller, more efficient subnetwork that's just as capable as its larger counterpart!  That's the exciting discovery we unveil in this paper.\n\nWe reveal that a standard pruning technique, typically used to shrink trained networks, can actually uncover these \"winning tickets\" – subnetworks with exceptional learning abilities due to their lucky initializations.  \n\nThis finding leads to the \"lottery ticket hypothesis\": dense, randomly-initialized networks contain subnetworks that, when trained in isolation, achieve comparable accuracy to the original network in a similar number of iterations.  These winning tickets have hit the jackpot of initialization, their initial weights setting them on a path to rapid and effective learning.\n\nWe present a simple algorithm for identifying these winning tickets and provide compelling evidence to support the lottery ticket hypothesis through a series of experiments. Our results consistently demonstrate the existence of winning tickets that are a mere 10-20% the size of the original networks, across various architectures and datasets, including MNIST and CIFAR10.\n\nAmazingly, above this threshold, these winning tickets not only learn faster but also achieve *higher* test accuracy than their original, larger counterparts. This discovery opens up exciting possibilities for creating more efficient and powerful deep learning models, harnessing the power of these fortunate initializations. \n\n\n",
    "This work provides a novel analysis of the singular values of the linear transformation represented by a typical 2D multi-channel convolutional layer. We derive a characterization that enables efficient computation of these singular values, a crucial step for understanding the layer's spectral properties.\n\nBuilding upon this characterization, we develop an algorithm for projecting a convolutional layer onto a ball defined by the operator norm. This projection serves as a powerful regularization technique, constraining the layer's transformation to prevent excessive amplification of input signals.\n\nEmpirical evaluations demonstrate the effectiveness of our proposed regularization method.  For instance, applying it to a deep residual network with batch normalization on the CIFAR-10 dataset improves the test error from 6.2% to 5.3%. This result highlights the potential of our approach for enhancing the generalization performance of deep convolutional neural networks. \n",
    "Deep convolutional neural networks (DCNNs) work incredibly well in practice, but we still struggle to fully understand why. This paper introduces a new theoretical framework for analyzing these complex networks, specifically those using the popular ReLU activation function.\n\nOur framework, based on a \"teacher-student\" setup, allows us to analyze a student network's learning process by comparing it to a more knowledgeable \"teacher\" network. Unlike previous approaches, our method avoids unrealistic assumptions about the data and is compatible with common techniques like Batch Normalization.\n\nThis framework offers a powerful tool for investigating key aspects of deep learning, such as overfitting, generalization, and how networks learn to separate different features in the data. We believe this work will pave the way for a deeper theoretical understanding of DCNNs and their remarkable success.  \n",
    "This paper introduces Neural Program Search, a novel algorithm for synthesizing programs from natural language descriptions and a limited set of input-output examples. Our approach synergistically combines advancements in deep learning and program synthesis by leveraging a carefully designed domain-specific language (DSL) and a sophisticated search algorithm guided by a Seq2Tree model.\n\nTo rigorously evaluate the efficacy of Neural Program Search, we introduce a semi-synthetic dataset comprising natural language descriptions, corresponding programs, and accompanying test cases.  Empirical evaluations demonstrate that our algorithm significantly outperforms a strong sequence-to-sequence baseline with attention, highlighting its superior capacity for program synthesis from natural language specifications.\n\nOur contributions advance the field of program synthesis by demonstrating the feasibility of generating complex programs from natural language descriptions, paving the way for more intuitive and accessible programming paradigms. \n",
    "Attention mechanisms have become a cornerstone of modern neural machine translation systems, enabling them to focus on relevant parts of the input sentence during translation. However, most attention models operate at the word level, neglecting the importance of phrasal alignments that were crucial for the success of earlier statistical machine translation techniques.\n\nThis paper introduces novel phrase-based attention mechanisms that consider groups of words (n-grams) as attention units.  We integrate these phrase-based attentions into the powerful Transformer architecture and demonstrate significant improvements in translation quality.  \n\nOur experiments on the WMT newstest2014 English-German and German-English tasks show that incorporating phrase-level information leads to gains of up to 1.3 BLEU points. These results underscore the importance of capturing phrasal relationships for achieving high-quality machine translation. \n",
    "Imagine an AI that can not only understand language but also learn the subtle art of editing, capturing the essence of changes and applying them to new text.  That's the exciting frontier we explore with our novel approach to learning distributed representations of edits.\n\nOur system comprises two key components: a \"neural editor\" that learns to make edits based on desired outcomes and an \"edit encoder\" that distills the essence of these edits into a compact, meaningful representation.  Think of it like a master editor working alongside a meticulous note-taker, capturing the nuances of each revision. \n\nWe trained our models on a rich tapestry of edits, encompassing both natural language and source code, pushing them to decipher the underlying structure and semantics of changes.  \n\nThe results are captivating! Our models exhibit a remarkable ability to learn the art of editing, capturing the essence of changes and applying them to new inputs with promising accuracy. It's like witnessing a machine grasp the subtle dance of revision, understanding not just *what* has changed but *why*.  \n\nWe believe this intriguing task opens up a world of possibilities for AI-assisted writing, code refactoring, and beyond.  We invite the research community to join us on this exciting journey, to further explore the potential of learning from the art of editing and unlock new frontiers in machine intelligence. \n",
    "Unlocking the power of kernel learning, a cornerstone of machine learning, often involves a challenging search for the optimal kernel function. This work presents an elegant and principled solution, grounded in the rich mathematical framework of Fourier analysis. \n\nOur method leverages a deep understanding of translation-invariant and rotation-invariant kernels, allowing us to systematically construct a sequence of increasingly powerful feature maps.  These maps, like skilled artisans, iteratively refine the decision boundary of a Support Vector Machine (SVM), maximizing the separation between different classes.\n\nWe provide strong theoretical guarantees for both optimality and generalization, demonstrating the soundness of our approach.  Our algorithm, interpreted as a dynamic game between two players seeking equilibrium, elegantly navigates the complex landscape of kernel learning.\n\nBut our method doesn't just shine in theory; it excels in practice too!  Evaluations on diverse datasets, both synthetic and real-world, showcase its impressive scalability and consistent superiority over existing methods that rely on random features.  \n\nThis work opens up exciting possibilities for applying kernel methods to a wider range of challenging machine learning tasks, offering a robust and efficient solution for discovering the optimal kernel function. \n",
    "Imagine a mind that never stops learning, effortlessly absorbing new knowledge without forgetting the lessons of the past. This is the dream of continual learning, a quest to build artificial intelligence that mirrors the boundless adaptability of the human brain.\n\nThis paper introduces Variational Continual Learning (VCL), a framework that breathes life into this dream.  Like a master weaver, VCL intertwines the threads of online variational inference and cutting-edge Monte Carlo techniques, creating a tapestry of continuous learning.\n\nVCL empowers deep learning models, both the discriminative kind that classify and the generative kind that imagine, to navigate the ever-changing landscape of knowledge. Existing tasks may evolve, their contours shifting with time, while entirely new challenges emerge on the horizon.  VCL embraces this dynamic flow, gracefully adapting to the shifting tides of knowledge.\n\nAnd the results are magical!  VCL surpasses the limitations of its predecessors, outperforming state-of-the-art continual learning methods on a diverse array of tasks.  The dreaded curse of catastrophic forgetting, where old knowledge is washed away by new experiences, is banished.  VCL achieves this feat with an effortless grace, a testament to its inherent elegance and power. \n\nWith VCL, we take a bold step toward a future where AI systems can learn and grow continuously, expanding their horizons without losing sight of their past, just as we humans do. \n",
    "This report investigates the reproducibility of the paper \"On the regularization of Wasserstein GANs\" (2018), focusing on five key aspects: learning speed, training stability, hyperparameter robustness, Wasserstein distance estimation, and sampling methods. We assess the reproducibility of each aspect and detail the computational resources required. All source code is publicly available to facilitate transparency and further research. \n",
    "We've developed a new way to analyze computer programs and detect malicious software!  Our method works by:\n\n1. **Extracting Patterns:**  We automatically identify complex patterns in the way a program behaves.\n2. **Creating Embeddings:**  We use a neural network called an autoencoder to represent these patterns as points in a continuous space.  Think of it like translating the program's behavior into a secret code.\n\nWe tested our method on a real-world task of identifying malicious software and achieved great results.  Plus, the \"secret code\" we learned actually captures meaningful information about the different parts of the patterns, allowing us to better understand how malicious software behaves. \n\n\n",
    "This paper introduces a powerful and versatile generative model capable of conditional data synthesis, addressing a critical gap in existing deep learning approaches.  Our proposed model, based on a variational autoencoder architecture,  exhibits a remarkable ability to generate diverse and realistic samples conditioned on arbitrary subsets of observed features, encompassing both continuous and categorical data.\n\nTrained via stochastic variational Bayes, our model learns a rich latent representation that captures the underlying relationships between different features. This enables it to perform \"one-shot\" generation, seamlessly imputing missing values or completing partially observed data points with unprecedented fidelity.\n\nExtensive empirical evaluations on synthetic data, feature imputation benchmarks, and image inpainting tasks unequivocally demonstrate the efficacy and versatility of our approach.  The generated samples exhibit remarkable diversity and realism, highlighting the model's capacity to capture the underlying data distribution and generate plausible completions for partially observed inputs. \n\nOur work represents a significant advancement in conditional generative modeling, providing a robust and flexible framework for tackling a wide range of applications, including data imputation, image editing, and creative content generation. \n",
    "In a breakthrough for the field of deep learning, researchers have unveiled a new method for optimizing hierarchical Variational Autoencoders (VAEs), a powerful class of generative models. This innovation promises to enhance the performance of VAEs in a wide range of applications, from representation learning to data compression.\n\nWhile traditional VAEs have been primarily used for generating new data, recent advancements, such as the introduction of β-VAEs, have expanded their utility to encompass tasks like clustering and lossy data compression. These models achieve this versatility by allowing users to fine-tune the trade-off between the information content of the model's internal representation and the accuracy of data reconstruction. \n\nThis new research delves deeper into this trade-off, focusing on hierarchical VAEs, which employ multiple layers of latent variables. The researchers have identified a way to precisely control the information flow through each layer, enabling more targeted optimization for specific tasks.\n\nThrough rigorous mathematical analysis and large-scale experiments, they have established a clear link between the information content of each layer and the performance on downstream tasks. This understanding provides valuable guidance for practitioners, enabling them to fine-tune their hierarchical VAEs for optimal results.  \n\nThis breakthrough promises to unlock the full potential of hierarchical VAEs, paving the way for more efficient and effective applications across diverse domains. \n",
    "Deep within the hidden layers of neural networks, a silent battle unfolds. Adversarial examples, those cunningly crafted inputs designed to deceive, lurk in the shadows, threatening to unravel the very fabric of artificial intelligence.\n\nUnderstanding these adversarial subspaces, the treacherous regions where models falter, is crucial for building robust and reliable AI systems.  Enter Local Intrinsic Dimensionality (LID), a promising tool for illuminating these shadowy corners.  \n\nBut is LID truly the silver bullet we've been searching for?  \n\nOur investigation unveils a darker truth.  We subject LID to a series of rigorous tests, probing its limits in the face of two formidable challenges:\n\n1. **Oblivious Attacks:**  When confronted with attacks that operate without knowledge of the target model, LID's effectiveness crumbles.  Even more alarming, its performance proves highly sensitive to the attacker's confidence level.  Ensembles of adversarial examples with varying confidence levels, a seemingly robust strategy, surprisingly cripple LID's ability to discern the adversarial subspace.\n\n2. **Black-box Transfer Attacks:**  When adversarial examples are crafted from a different model, LID's defenses are rendered utterly ineffective.  It stumbles blindly, unable to penetrate the veil of these transferred attacks.\n\nThese findings cast a shadow of doubt on LID's ability to truly grasp the nature of adversarial subspaces.  Our work serves as a stark warning, a call to arms for the development of more sophisticated and robust tools to combat the ever-evolving threat of adversarial attacks.  The battle for trustworthy AI has only just begun. \n\n\n",
    "Generative Adversarial Networks (GANs) are powerful generative models, but training them remains notoriously challenging.  While much research has focused on improving the GAN objective function, the optimization algorithms used for training have received less attention. \n\nThis paper addresses this gap by:\n\n* **Framing GAN optimization as a variational inequality problem**, drawing connections to the rich mathematical programming literature.\n* **Debunking common misconceptions about saddle point optimization.**\n* **Adapting advanced optimization techniques for variational inequalities to GAN training.**\n* **Applying averaging, extrapolation, and a novel \"extrapolation from the past\" technique to both SGD and Adam optimizers.**\n\nOur work highlights the importance of considering specialized optimization methods for improving GAN training stability and efficiency. \n\n\n",
    "Recent advancements in neural message passing algorithms have significantly improved semi-supervised classification on graphs. However, these methods typically rely on a limited neighborhood around the target node for classification, hindering their ability to capture long-range dependencies.\n\nThis paper leverages the connection between graph convolutional networks (GCNs) and PageRank to introduce an enhanced propagation scheme based on personalized PageRank.  This scheme forms the basis for two new models: personalized propagation of neural predictions (PPNP) and its computationally efficient approximation, APPNP. \n\nThese models exhibit comparable or faster training times and utilize a similar or smaller number of parameters compared to existing methods.  Crucially, they leverage a larger and adjustable neighborhood for classification, enabling them to capture more global graph information.  Furthermore, PPNP and APPNP are modular and can be easily integrated with any neural network architecture. \n\nExtensive evaluations demonstrate that PPNP and APPNP outperform several recently proposed methods for semi-supervised classification, establishing a new state-of-the-art for GCN-based models. An implementation of these models is publicly available. \n",
    "Defenses against adversarial examples, those subtle manipulations that can fool AI systems, often create a deceptive sense of security.  We expose a phenomenon called \"obfuscated gradients\" – a form of gradient masking that tricks attackers into believing a defense is effective when it's not. \n\nWhile defenses exhibiting obfuscated gradients appear to thwart iterative optimization-based attacks, we demonstrate that they are ultimately vulnerable.  We identify three distinct types of obfuscated gradients and develop targeted attack techniques to overcome each one.\n\nOur investigation reveals that obfuscated gradients are surprisingly prevalent.  In a case study of defenses presented at ICLR 2018, we found that 7 out of 9 \"secure\" defenses relied on this deceptive tactic.  Our new attack methods successfully circumvented 6 of these defenses entirely and partially broke another one, all within the original threat model claimed by the authors.\n\nThese findings highlight the need for a more rigorous evaluation of defenses against adversarial examples. Obfuscated gradients can create a false sense of security, leading to the deployment of vulnerable systems.  Our work provides valuable insights for developing more robust and reliable defenses by exposing the limitations of gradient masking techniques. \n",
    "Imagine a world where AI can decipher the hidden language of networks, from social connections to intricate biological pathways. That's the promise of node embedding, where we transform each node in a graph into a meaningful code, unlocking a treasure trove of insights.\n\nOur approach, Graph2Gauss, takes this a step further, embracing the inherent uncertainty in real-world networks. Instead of representing nodes as rigid points, we embrace a more nuanced perspective, capturing each node as a cloud of possibilities – a Gaussian distribution.\n\nThis \"fuzzy\" representation allows Graph2Gauss to:\n\n* **Master diverse networks:** It effortlessly handles various graph types, whether it's a simple network of friends or a complex map of protein interactions.\n* **Learn from both structure and attributes:** It cleverly combines information about connections and individual node characteristics, painting a richer picture of each node's role.\n* **Adapt to new arrivals:**  It effortlessly welcomes new nodes to the network, instantly understanding their place without needing a lengthy retraining process.\n* **Estimate uncertainty:**  It quantifies the fuzziness of each node's representation, revealing fascinating insights about neighborhood diversity and hidden network structures.\n\nWe put Graph2Gauss to the test on real-world networks, and the results were astounding! It outperformed state-of-the-art methods in predicting links, classifying nodes, and uncovering hidden network dimensions.  By embracing uncertainty, Graph2Gauss achieves a new level of accuracy and reveals a deeper understanding of the complex relationships within networks. \n\n\n",
    "Convolutional Neural Networks (CNNs) have taken the world of 2D image analysis by storm! Now, get ready for a new revolution: Spherical CNNs are here to tackle the exciting challenges of analyzing data on a sphere. \n\nThink of omnidirectional vision for drones and robots, understanding the intricate shapes of molecules, or modeling global weather patterns – these are just a few of the fascinating applications that demand a new approach to deep learning. \n\nWe introduce the building blocks for constructing these powerful Spherical CNNs. Our key innovation is a clever definition of spherical cross-correlation that's both expressive and rotation-equivariant, meaning it can capture complex patterns regardless of how the sphere is rotated.  \n\nWe also unlock the power of efficient computation by leveraging a generalized Fourier theorem and a super-fast non-commutative Fast Fourier Transform (FFT) algorithm. It's like giving Spherical CNNs a turbo boost!\n\nOur experiments on 3D model recognition and atomization energy regression showcase the incredible accuracy, speed, and versatility of Spherical CNNs.  They're poised to revolutionize a wide range of fields, opening up a whole new dimension for deep learning! \n\n\n",
    "Imagine teaching a computer to understand the language of molecules!  That's what we did by combining the power of natural language processing (NLP) with the world of chemistry.\n\nYou see, molecules can be represented as text using something called SMILES notation.  It's like a secret code that describes the molecule's structure. We realized that if we treat these SMILES strings like sentences, we could use NLP techniques to analyze them.\n\nWe focused on a crucial task in drug discovery: predicting how well a molecule will interact with a target protein.  Think of it like figuring out if a key will fit a specific lock.  \n\nBy applying NLP methods to the SMILES strings, our model not only achieved better results than previous methods but also revealed the hidden logic behind its predictions.  It's like the AI learned to \"read\" the molecular language and tell us why certain molecules are good candidates for drugs.\n\nThis breakthrough opens up exciting new possibilities for accelerating drug discovery and designing better medicines! \n\n\n",
    "Here are the main points from the text:\n\n* Computer Vision and Deep Learning are being used in agriculture to improve harvest quality and productivity.\n* Sorting fruits and vegetables after harvest is important for export and quality control.\n* Apples are prone to various defects that can occur during or after harvesting.\n* This paper aims to assist farmers in post-harvest handling.\n* The study explores using YOLOv3, a computer vision and deep learning model, to detect defects in apples. \n",
    "Training large LSTM networks, renowned for their ability to model complex sequences, often comes at the cost of significant computational resources and time.  We present two straightforward yet effective strategies for addressing this challenge, enabling faster training and more efficient models without compromising performance.\n\nOur first approach reimagines the structure of the LSTM matrices themselves, decomposing them into products of smaller matrices. This \"matrix factorization by design\" reduces the overall parameter count, leading to leaner and more computationally efficient models.\n\nThe second strategy, partitioning, focuses on dividing the LSTM's core components – the weight matrices, input vectors, and hidden states – into independent groups. This partitioning allows for parallel processing, dramatically accelerating training by distributing the workload across multiple computational units.\n\nBoth methods, while conceptually simple, yield impressive results. They enable the training of large LSTM networks to near state-of-the-art perplexity levels, a measure of language modeling performance, while significantly reducing both the number of parameters and training time.  \n\nOur work underscores the importance of continually exploring new avenues for optimizing deep learning models, seeking elegant solutions that balance computational efficiency with expressive power. \n",
    "While recurrent neural networks have become the dominant force in deep reading comprehension, their inherently sequential nature poses a critical limitation:  a lack of parallelization that hinders both training efficiency and deployment in latency-sensitive applications.  This bottleneck becomes especially pronounced when processing long texts, where the sequential processing inherent to RNNs becomes prohibitively slow.\n\nWe argue that convolutional architectures, with their inherent parallelism and ability to capture long-range dependencies, offer a compelling alternative for deep reading comprehension. This paper introduces a novel convolutional architecture based on dilated convolutional units, demonstrating that it can achieve comparable accuracy to state-of-the-art recurrent models on benchmark question answering tasks.\n\nCrucially, our approach unlocks significant speedups of up to two orders of magnitude during inference. This dramatic improvement in computational efficiency paves the way for deploying sophisticated reading comprehension models in real-time applications where low latency is paramount.  Our findings challenge the dominance of recurrent networks in this domain, highlighting the potential of convolutional architectures for achieving both accuracy and efficiency in deep reading comprehension. \n",
    "This study investigates the reinstatement mechanism proposed by Ritter et al. (2018) in the context of episodic meta-reinforcement learning (meta-RL). We analyze the neuronal representations within an episodic Long Short-Term Memory (epLSTM) cell, the agent's working memory, during training on an episodic variant of the Harlow visual fixation task.\n\nOur analysis reveals the emergence of two distinct classes of neurons:\n\n1. **Abstract Neurons:** These neurons encode task-agnostic knowledge, representing information relevant across multiple episodes and tasks.\n\n2. **Episodic Neurons:** These neurons exhibit task-specific representations, encoding information pertinent to the current episode's unique task demands.\n\nThese findings provide insights into the functional organization of working memory in episodic meta-RL agents, highlighting the distinct roles of abstract and episodic representations in supporting flexible and adaptive behavior. \n",
    "The rate-distortion-perception function (RDPF) provides a valuable framework for evaluating the trade-off between compression rate, distortion, and perceptual quality in lossy compression.  However, a key question has remained unanswered: can practical encoders and decoders achieve the theoretical limits suggested by the RDPF?\n\nThis work addresses this fundamental question, building upon the theoretical foundation laid by Li and El Gamal (2018).  We demonstrate that the RDPF is indeed achievable using a specific class of codes: stochastic, variable-length codes.  \n\nOur contributions are twofold:\n\n1. **Achievability Proof:** We prove the existence of stochastic, variable-length codes that can achieve the rate specified by the RDPF, bridging the gap between theory and practice.\n\n2. **Lower Bound:** We further establish that the RDPF serves as a lower bound on the achievable rate for this class of codes, meaning that no code within this class can compress data at a rate lower than that dictated by the RDPF while maintaining the desired perceptual quality.\n\nThese results significantly advance our understanding of the RDPF and its practical implications for lossy compression.  By proving the achievability of the RDPF and establishing its role as a lower bound, we provide a strong theoretical foundation for developing new and efficient compression algorithms that optimize for both rate and perceptual quality.  \n",
    "This paper presents Neural Phrase-based Machine Translation (NPMT), a machine translation model that explicitly models phrase structures in the target language using Sleep-WAke Networks (SWAN).  To overcome the monotonic alignment limitation of SWAN, NPMT incorporates a layer for local reordering of the input sequence.\n\nUnlike most neural machine translation (NMT) systems that rely on attention mechanisms, NPMT generates translations by sequentially outputting phrases, enabling linear-time decoding. Experiments on IWSLT 2014 German-English/English-German and IWSLT 2015 English-Vietnamese datasets show that NPMT achieves comparable or better performance than strong NMT baselines. Analysis of the generated output indicates that NPMT produces semantically coherent phrases. \n\n\n",
    "This paper establishes the critical role of sparse representations in enhancing the robustness of deep neural networks (DNNs) against adversarial attacks. We present a compelling theoretical and empirical case for the efficacy of sparsity as a defense mechanism, demonstrating its ability to significantly mitigate the impact of adversarial perturbations.\n\nFor linear classifiers, we provide rigorous mathematical proofs demonstrating that a sparsifying front end provably reduces the output distortion caused by  ℓ∞-bounded attacks. This reduction in distortion is proportional to K/N, where N represents the data dimensionality and K denotes the sparsity level.\n\nExtending this concept to the realm of DNNs, we introduce a \"locally linear\" model that provides a theoretical foundation for understanding and analyzing adversarial attacks and defenses.  Our framework enables the development of principled strategies for both crafting more effective attacks and designing robust defenses. \n\nEmpirical evaluations on the MNIST dataset validate our theoretical findings, showcasing the efficacy of the proposed sparsifying front end in mitigating the impact of adversarial perturbations.  Our work lays a strong foundation for leveraging sparsity as a powerful tool for enhancing the robustness and reliability of deep learning systems in the face of adversarial attacks. \n",
    "Imagine training a robot to walk, but instead of stumbling through countless trial-and-error attempts, it learns from a carefully crafted set of instructions. That's the essence of Supervised Policy Update (SPU), our novel approach for teaching AI agents new skills with remarkable efficiency.\n\nSPU begins by observing the agent's current behavior, gathering data on its successes and missteps. Then, it formulates a plan for improvement, solving a constrained optimization problem in a simplified \"policy space.\" Think of it like creating a blueprint for better actions.\n\nUsing the power of supervised learning, SPU translates this blueprint into a set of actionable instructions for the agent, guiding it towards more effective behavior.  This process is repeated, creating a cycle of observation, optimization, and refinement.\n\nThe beauty of SPU lies in its versatility.  It works seamlessly with both discrete actions (like pressing buttons) and continuous actions (like smoothly controlling a robot arm). It can also handle various constraints, ensuring the agent's learning process remains safe and efficient.\n\nWe've shown that SPU can tackle even the most challenging reinforcement learning problems, outperforming established methods like TRPO and PPO on complex robotic control tasks and classic Atari games.  And the best part? It's surprisingly simple to implement!\n\nSPU opens up exciting new possibilities for training AI agents, allowing them to learn complex skills with fewer stumbles and greater efficiency.  It's like giving robots a shortcut to mastery!\n\n\n",
    "This paper introduces Moving Symbols, a parameterized synthetic video dataset for evaluating video prediction models.  We demonstrate how controlled variations within the dataset can expose limitations in existing approaches.  We also propose a new semantically meaningful performance metric to enhance the interpretability of experimental results.  Moving Symbols provides standardized test cases to facilitate better understanding and development of video prediction models.  Code is available at: https://github.com/rszeto/moving-symbols \n\n\n"
  ]
}