Why Size Matters: Feature Coding as Nystrom Sampling
Oriol Vinyals, Yangqing Jia, Trevor Darrell
23 Jan 2013 · arXiv · ICLR 2013 Workshop Track · 3 Comments
Recently, the computer vision and machine learning communities have favored feature extraction pipelines that rely on a coding step followed by a linear classifier, because of their overall simplicity, the well understood properties of linear classifiers, and their computational efficiency. In this paper we propose a novel view of this pipeline based on kernel methods and Nystrom sampling. In particular, we focus on the coding of a data point with a local representation based on a dictionary with fewer elements than the number of data points, and view it as an approximation to the actual function that would compute pair-wise similarity to all data points (often too many to compute in practice), followed by a Nystrom sampling step that selects a subset of all data points. Since bounds are known on the approximation power of Nystrom sampling as a function of how many samples (i.e., dictionary size) we consider, we can derive bounds on the approximation of the exact (but expensive to compute) kernel matrix, and use them as a proxy to predict accuracy as a function of dictionary size, which has been observed to increase but also to saturate as the dictionary grows. This model may help explain the positive effect of codebook size and justify the need to stack more layers (often referred to as deep learning), as flat models empirically saturate as we add more complexity.
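As a rough illustration of the view described in the abstract, the following numpy sketch (illustrative only; the RBF kernel, the random sampling scheme, and all function names are assumptions rather than the paper's code) builds the standard Nystrom approximation K_hat = C W^+ C^T of a full kernel matrix from a dictionary of m sampled points, and shows the approximation error shrinking, then saturating, as m grows.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel values between the rows of X and the rows of Y.
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def nystrom_kernel(X, m, gamma=1.0, seed=0):
    # Dictionary = m randomly sampled exemplars; the "code" of a point is its
    # kernel similarity to the dictionary, and the induced approximate kernel
    # is K_hat = C @ pinv(W) @ C.T (the standard Nystrom construction).
    rng = np.random.default_rng(seed)
    D = X[rng.choice(len(X), size=m, replace=False)]
    C = rbf_kernel(X, D, gamma)            # n x m codes (similarity to dictionary)
    W = rbf_kernel(D, D, gamma)            # m x m kernel among dictionary atoms
    return C @ np.linalg.pinv(W) @ C.T

X = np.random.default_rng(1).normal(size=(500, 20))
K = rbf_kernel(X, X)
for m in (16, 64, 256):
    err = np.linalg.norm(K - nystrom_kernel(X, m)) / np.linalg.norm(K)
    print(f"m={m:4d}  relative Frobenius error = {err:.3f}")
```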
State            | From                     | To (Cc)                  | Subject                                                                             | Date        | Due
Completed        | Yangqing Jia             | ICLR 2013 Workshop Track | Request for Endorsed for oral presentation: Why Size Matters: Feature Coding as... | 23 Jan 2013 |
Fulfill          | Yangqing Jia             | ICLR 2013 Workshop Track | Fulfilled: ICLR 2013 call for workshop papers                                       | 23 Jan 2013 |
Reveal: document | Yangqing Jia             |                          | Revealed: document: Why Size Matters: Feature Coding as Nystrom Sampling            | 05 Feb 2013 |
Completed        | Aaron Courville          | Anonymous 998c           | Request for review of Why Size Matters: Feature Coding as Nystrom Sampling          | 05 Feb 2013 | 01 Mar 2013
Completed        | Aaron Courville          | Anonymous 1024           | Request for review of Why Size Matters: Feature Coding as Nystrom Sampling          | 05 Feb 2013 | 01 Mar 2013
Withdrawn        | Aaron Courville          | Anonymous ecba           | Request for review of Why Size Matters: Feature Coding as Nystrom Sampling          | 05 Feb 2013 | 01 Mar 2013
Withdrawn        | Aaron Courville          | Anonymous 1540           | Request for review of Why Size Matters: Feature Coding as Nystrom Sampling          | 05 Feb 2013 | 01 Mar 2013
Withdraw         | Aaron Courville          | Anonymous ecba           | Request withdrawn: Request for review of Why Size Matters: Feature Coding as...     | 06 Mar 2013 |
Withdraw         | Aaron Courville          | Anonymous 1540           | Request withdrawn: Request for review of Why Size Matters: Feature Coding as...     | 06 Mar 2013 |
Reveal: document | ICLR 2013 Workshop Track |                          | Revealed: document: Endorsed for oral presentation: Why Size Matters: Feature...    | 27 Mar 2013 |
Fulfill          | ICLR 2013 Workshop Track | Yangqing Jia             | Fulfilled: Request for Endorsed for oral presentation: Why Size Matters: Feature... | 27 Mar 2013 |

3 Comments

Anonymous 998c 01 Mar 2013
This paper presents a theoretical analysis and empirical validation of a novel view of feature extraction systems based on the idea of Nystrom sampling for kernel methods. The main idea is to analyze the kernel matrix for a feature space defined by an off-the-shelf feature extraction system. In such a system, a bound is identified for the error in representing the "full" dictionary composed of all data points by a Nystrom-approximated version (i.e., represented by subsampling the data points randomly). The bound is then extended to show that the approximate kernel matrix obtained using the Nystrom-sampled dictionary is close to the true kernel matrix, and it is argued that the quality of the approximation is a reasonable proxy for the classification error we can expect after training. It is shown that this approximation model qualitatively predicts the monotonic rise in accuracy of feature extraction with larger dictionaries and the saturation of performance observed in experiments.

This is a short paper, but the main idea and analysis are interesting. It is nice to have some theoretical machinery to talk about the empirical finding of rising, saturating performance. In some places more detail would have been useful. One undiscussed point is that many dictionary-learning methods do more than populate the dictionary with exemplars, so it is possible that a "learning" method might do substantially better (perhaps reaching top performance much sooner). This does not appear to be terribly important in low-dimensional spaces, where sampling strategies work about as well as learning, but it could be critical in high-dimensional spaces (where sampling might asymptote much more slowly than learning). It seems worth explaining the limitations of this analysis and how it relates to learning.

A few other questions / comments: The calibration of constants for the bound in the experiments was not clear to me. How is the mapping from the bound (Eq. 2) to classification accuracy actually done? The empirical validation of the lower bound relies on a calibration procedure that, as I understand it, effectively ends up rescaling a fixed-shape curve to fit the observed trend in accuracy on the real problem. As a result, it seems like we could come up with a "nonsense" bound that happened to have such a shape and then make a similar empirical claim. Is there a way to extend the analysis to rule this out? Or perhaps I misunderstand the origin of the shape of this curve.

Pros: (1) A novel view of feature extraction that appears to yield a reasonable explanation for the widely observed performance curves of these methods. I don't know how much profit this view might yield, but perhaps that will be made clear by the "overshooting" method foreshadowed in the conclusion. (2) A pleasingly short read adequate to cover the main idea (though a few more details might be nice).

Cons: (1) How this bound relates to the more common case of "trained" dictionaries is unclear. (2) The empirical validation shows the basic relationship qualitatively, but it is possible that this does not adequately validate the theoretical ideas and their connection to the observed phenomenon.
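One way to probe the reviewer's point about learned versus sampled dictionaries is the following sketch (not from the paper; scikit-learn, the RBF kernel, and the k-means choice are assumptions): use k-means centroids instead of random exemplars as the Nystrom landmarks and compare the resulting kernel approximation error at the same dictionary size.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def nystrom_error(X, landmarks, gamma=0.05):
    # Relative Frobenius error of the Nystrom approximation built on the given landmarks.
    K = rbf_kernel(X, X, gamma=gamma)
    C = rbf_kernel(X, landmarks, gamma=gamma)
    W = rbf_kernel(landmarks, landmarks, gamma=gamma)
    K_hat = C @ np.linalg.pinv(W) @ C.T
    return np.linalg.norm(K - K_hat) / np.linalg.norm(K)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 30))
for m in (16, 64, 128):
    random_landmarks = X[rng.choice(len(X), size=m, replace=False)]   # sampled dictionary
    learned_landmarks = KMeans(n_clusters=m, n_init=10,               # "learned" dictionary
                               random_state=0).fit(X).cluster_centers_
    print(f"m={m:3d}  random={nystrom_error(X, random_landmarks):.3f}"
          f"  k-means={nystrom_error(X, learned_landmarks):.3f}")
```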
Anonymous 1024 01 Mar 2013
The authors provide an analysis of the accuracy bounds of feature coding + linear classifier pipelines. They predict an approximate accuracy bound given the dictionary size and correctly capture the phenomenon observed in the literature, where accuracy increases with dictionary size but also saturates.

Pros: - Demonstrates limitations of shallow models and analytically justifies the use of deeper models.
Oriol Vinyals, Yangqing Jia, Trevor Darrell 14 Mar 2013
We agree with the reviewer regarding the existence of better dictionary learning methods, and note that many of these are also related to corresponding advanced Nystrom sampling methods, such as [Zhang et al., Improved Nystrom low-rank approximation and error analysis, ICML 2008]. These methods could improve performance in absolute terms, but that is orthogonal to our main results. Nonetheless, we think this is a valuable observation, and we will include a discussion of these points in the final version of the paper.

The relationship between a kernel approximation error bound and classification accuracy is discussed in more detail in [Cortes et al., On the Impact of Kernel Approximation on Learning Accuracy, AISTATS 2010]. The main result is that the bounds are proportional, verifying our empirical claims. We will add this reference to the paper.

Regarding the comment on fitting the shape of the curve: we use only the first two points to fit the "constants" given in the bound, so the fact that the curve extrapolates well across many tasks gives us confidence that the bound is accurate.
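A minimal sketch of the calibration step described in this reply, under the assumption that the bound induces an accuracy curve of the generic Nystrom-style form acc(m) = a - b / sqrt(m) (the paper's Eq. 2 gives the actual expression); the accuracy values below are placeholders, not results from the paper.

```python
import numpy as np

# Dictionary sizes and placeholder accuracies -- NOT values from the paper,
# only an illustration of the two-point calibration procedure described above.
dict_sizes = np.array([50, 100, 200, 400, 800, 1600])
observed_acc = np.array([0.52, 0.58, 0.62, 0.645, 0.66, 0.668])

# Assume acc(m) ~= a - b / sqrt(m) and calibrate the two constants a, b
# using only the first two (dictionary size, accuracy) pairs ...
m1, m2 = dict_sizes[:2]
y1, y2 = observed_acc[:2]
b = (y2 - y1) / (1 / np.sqrt(m1) - 1 / np.sqrt(m2))
a = y1 + b / np.sqrt(m1)

# ... then extrapolate to the remaining dictionary sizes and compare.
predicted = a - b / np.sqrt(dict_sizes)
for m, obs, pred in zip(dict_sizes, observed_acc, predicted):
    print(f"m={m:5d}  observed={obs:.3f}  predicted={pred:.3f}")
```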
ICLR 2013 Workshop Track 27 Mar 2013
Endorsed for oral presentation: Why Size Matters: Feature Coding as Nystrom Sampling