Semester,Question Number,Part,Points,Topic,Type,Question,Solution
Cornell Spring 2017,1,1,1,Decision Tree,Text,(T/F) Random forests is one of the few machine learning algorithms that makes no assumptions on the data.,"False, every machine learning algorithm makes assumptions. RF assumes that similar inputs have similar labels."
Cornell Spring 2017,1,2,1,Optimization,Text,"(T/F) One implication of the curse of dimensionality is that if you sample $n$ data points uniformly at random within a hyper cube of dimensionality $d$, all pairwise distances converge to 0 as $n \rightarrow \infty$.","False, they become concentrated around the average distance as $d\to\infty$."
Cornell Spring 2017,1,3,1,Neural Networks,Text,"(T/F) During training, in a linearly separable data set, the perceptron algorithm never misclassifies the same input twice.","False, it can iterate many times over the data set and get the same points wrong repeatedly."
Cornell Spring 2017,1,4,1,Optimization,Text,"(T/F) You have a biased coin and toss it $n$ times. The MAP estimate with $+1$ smoothing of the probability of getting ""head"" is $\frac{n_{H}+1}{n+1}$, where $n_{H}$ is the number of occurrences of ""head"" amongst your $n$ throws.","False, it is $\frac{n_{H}+1}{n+2}$."
Cornell Spring 2017,1,5,1,Classifiers,Text,(T/F) The multinomial Naive Bayes algorithm is a linear classifier.,True.
Cornell Spring 2017,1,6,1,Optimization,Text,"(T/F) MAP inference maximizes $P(\mathbf{w} \mid \text{Data})$ whereas MLE maximizes $P(\text{Data}; \mathbf{w})$, where $\mathbf{w}$ represents the model parameters.",True.
Cornell Spring 2017,1,7,1,Optimization,Text,(T/F) Newton's Method diverges only if the Hessian matrix is not invertible.,"False, it can also diverge with invertible Hessian matrices."
Cornell Spring 2017,1,8,1,Regression,Text,"(T/F) Linear (ordinary least squares) regression can be solved in closed form, although sometimes that is computationally impractical or even infeasible.",True.
Cornell Spring 2017,1,9,1,Classifiers,Text,(T/F) SVMs maximize the margin between the training and testing data.,"False, they maximize the margin between the training data and the separating hyperplane."
Cornell Spring 2017,1,10,1,Optimization,Text,"(T/F) In order for gradient descent to converge, the loss function has to be convex and differentiable everywhere.","False, if it is not convex it will still converge, but to a local minimum."
Cornell Spring 2017,1,11,1,Model Selection,Text,"(T/F) The bias variance trade-off decomposes the error obtained by a classifier into (squared) bias, variance, and noise. The noise term cannot possibly be addressed, even by changing the feature representation of the data.","False, changing the feature representation of the data will affect the noise. For example, if all features are removed the error is only noise (which would be very large)."
Cornell Spring 2017,1,12,1,Model Selection,Text,"(T/F) In a setting of high bias, a great remedy is to add more training data.","False, more training data does not help with bias."
Cornell Spring 2017,1,13,1,Ensemble Methods,Text,(T/F) Bagging reduces variance.,True.
Cornell Spring 2017,1,14,1,Ensemble Methods,Text,(T/F) Boosting reduces noise.,"False, it reduces bias (and sometimes even variance a little)."
Cornell Spring 2017,1,15,1,Classifiers,Text,"(T/F) Learning with kernels is expensive, because the data is mapped into a very high dimensional space and therefore storing the transformed data consumes a lot of storage.","False, the mapping is performed implicitly."
Cornell Spring 2017,1,16,1,Regression,Text,(T/F) The mean prediction of Gaussian processes is identical to kernelized linear regression.,True.
Cornell Spring 2017,1,17,1,Optimization,Text,(T/F) One popular application of Gaussian Processes is to find hyper-parameters of machine learning algorithms.,True.
Cornell Spring 2017,1,18,1,Optimization,Text,(T/F) Ball-Trees are a data structure to speed up the perceptron algorithm.,"False, they can speed up nearest neighbor searches, but such a search is never performed in the Perceptron."
Cornell Spring 2017,1,19,1,Decision Tree,Text,(T/F) Decision Trees stop splitting when the impurity function can no longer be improved with a single split.,"False, e.g. in the XOR data set the first split does not improve the impurity function. The splitting stops if the maximum depth (or number of nodes) is reached, or all inputs are identical."
Cornell Spring 2017,1,20,1,Decision Tree,Text,(T/F) Random Forests are bagged decision trees with one additional modification: Each splitting dimension is chosen completely uniformly at random.,"False, the best splitting dimension is selected amongst $k$ random dimensions."
Cornell Spring 2017,1,21,1,Ensemble Methods,Text,"(T/F) Provided each weak learner can classify a weighted version of the training data set with better than 0.5 accuracy, in AdaBoost the training error reduces exponentially.",True.
Cornell Spring 2017,1,22,1,Neural Networks,Text,"(T/F) Deep neural networks are great on many data sets, but do not work competitively on image classification tasks.","False, they are particularly good at image classification tasks."
Cornell Spring 2017,1,23,1,Neural Networks,Text,(T/F) The optimization of deep neural networks is a convex minimization problem.,"False, it is non-convex because of the non-linear transition functions."
Cornell Spring 2017,2,1,3,Model Selection,Text,"Write down the bias (squared), variance, noise decomposition of the expected test error $\mathbb{E}_{x, y, D}\left[\left(h_{D}(x)-y\right)^{2}\right]$.","Variance: $E_{x, D}\left[\left(h_{D}(x)-\bar{h}(x)\right)^{2}\right]$ Bias squared: $E_{x}\left[(\bar{h}(x)-\bar{y}(x))^{2}\right]$ Noise: $E_{x, y}\left[(\bar{y}(x)-y(x))^{2}\right]$"
Cornell Spring 2017,2,2,5,Model Selection,Text,Describe how to detect settings with high bias and provide three approaches that could help reduce the bias.,"Detect high bias if the training error is above the goal error (plot training and testing error vs. the number of data points). Reduce bias by increasing model complexity, adding features, or using boosting. Rubric: 2 points for correct detection, 3x1 point for correct remedies."
Cornell Spring 2017,2,3a.,4.333333333,Model Selection,Text,"For each of the following scenarios, determine if the model has low/high bias and variance. Explain your choice. Logistic regression on linearly separable data and non-linearly separable data.","Linearly separable: low bias, low variance. Non-linearly separable: high bias, low variance."
Cornell Spring 2017,2,3b.,4.333333333,Model Selection,Text,"For each of the following scenarios, determine if the model has low/high bias and variance. Explain your choice. kNN with small k and large k.","Small k: low bias, high variance. Large k: high bias, low variance."
Cornell Spring 2017,2,3c.,4.333333333,Model Selection,Text,"For each of the following scenarios, determine if the model has low/high bias and variance. Explain your choice. Uniform random labeling.","low bias, high variance"
Cornell Spring 2017,3,1a.,5,Classifiers,Text,"Suppose you are using a kernel SVM with the $\mathrm{RBF}$ kernel $k(\mathbf{x}, \mathbf{z})=\exp \left(-\frac{\|\mathbf{x}-\mathbf{z}\|_{2}^{2}}{\sigma^{2}}\right)$ to do classification. Recall that the kernel SVM is trained by solving the dual optimization problem: $$
\begin{aligned}
\min _{\alpha_{1}, \ldots, \alpha_{n}} & \frac{1}{2} \sum_{i, j} \alpha_{i} \alpha_{j} y_{i} y_{j} \mathbf{K}_{i j}-\sum_{i} \alpha_{i} \\
\text { s.t. } & 0 \leq \alpha_{i} \leq C \\
& \sum_{i} \alpha_{i} y_{i}=0
\end{aligned}
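$$ Assume you can either set $C$ and $\sigma^{2}$ to a very large value $(\gg 0)$ or a very small value $(\epsilon)$. Provide a setting with high bias and one with high variance. Briefly explain your answers.","High variance: small $\sigma^{2}$ and large $C$. A narrow kernel makes each training point influence only its immediate neighborhood and a large $C$ penalizes slack heavily, so the decision boundary bends around individual points. High bias: large $\sigma^{2}$ and small $C$. A wide kernel makes all kernel values nearly equal and a small $C$ tolerates many margin violations, so the decision boundary is very smooth and underfits."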
Cornell Spring 2017,3,1b.,3,Classifiers,Text,$\mathbf{x}_{i}$ turns out to be a support vector. What can you say about its corresponding optimal value $\alpha_{i}^{*}$ and the margin between the hyperplane and $\mathbf{x}_{i}$?,$\alpha_{i}^{*} > 0$ and the normalized margin must be 1.
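Cornell Spring 2017,3,1c.,5,Classifiers,Text,"In order to apply the classifier to a test point, we need the hyper-plane bias $b$. Show how $b$ can be recovered from $\alpha_{1}^{*}, \ldots, \alpha_{n}^{*}$ with the help of the support vector $\mathbf{x}_{i}$ and label $y_{i} \in\{-1,+1\}$.","For a support vector $\mathbf{x}_{i}$ with $0<\alpha_{i}^{*}<C$ the margin constraint is tight, i.e. $y_{i}\left(\sum_{j} \alpha_{j}^{*} y_{j} k(\mathbf{x}_{j}, \mathbf{x}_{i})+b\right)=1$. Multiplying by $y_{i}$ and using $y_{i}^{2}=1$ gives $b=y_{i}-\sum_{j} \alpha_{j}^{*} y_{j} k(\mathbf{x}_{j}, \mathbf{x}_{i})$. In practice one can average this value over all such support vectors for numerical stability."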
Cornell Spring 2017,3,2,8,Classifiers,Text,"For this question, you will find the following rules about recursively building kernels helpful. Given kernels $k_{1}(\mathbf{x}, \mathbf{z})$ and $k_{2}(\mathbf{x}, \mathbf{z})$, the following are well-defined kernels:

$$
\begin{aligned}
k(\mathbf{x}, \mathbf{z}) &=\mathbf{x}^{\top} A \mathbf{z}, A \succeq 0 \\
k(\mathbf{x}, \mathbf{z}) &=c k_{1}(\mathbf{x}, \mathbf{z}) \\
k(\mathbf{x}, \mathbf{z}) &=\exp \left(k_{1}(\mathbf{x}, \mathbf{z})\right) \\
k(\mathbf{x}, \mathbf{z}) &=f(\mathbf{x}) k_{1}(\mathbf{x}, \mathbf{z}) f(\mathbf{z}) \\
k(\mathbf{x}, \mathbf{z}) &=k_{1}(\mathbf{x}, \mathbf{z})+k_{2}(\mathbf{x}, \mathbf{z})
\end{aligned}
$$

Suppose that $\mathbf{x}, \mathbf{z} \in \mathbb{R}^{2}$. Let $[\mathbf{x}]_{1}$ and $[\mathbf{x}]_{2}$ denote the first and second coordinates of $\mathbf{x}$, respectively. Show that

$$
k(\mathbf{x}, \mathbf{z})=\exp \left(-\frac{\left\|[\mathbf{x}]_{1}-[\mathbf{z}]_{1}\right\|_{2}^{2}}{\sigma^{2}}\right)+\exp \left(-\frac{\left\|[\mathbf{x}]_{2}-[\mathbf{z}]_{2}\right\|_{2}^{2}}{\sigma^{2}}\right)
$$

is a kernel.

Hint: You may find the following two matrices helpful:

$$
A_{1}=\left[\begin{array}{ll}
1 & 0 \\
0 & 0
\end{array}\right], \quad A_{2}=\left[\begin{array}{ll}
0 & 0 \\
0 & 1
\end{array}\right] .
$$

You can assume they are positive semi-definite (i.e. $A_{1} \succeq 0, A_{2} \succeq 0$ ).","The trick here is to define

$$
\begin{aligned}
A_{1} &=\left[\begin{array}{ll}
1 & 0 \\
0 & 0
\end{array}\right] \\
A_{2} &=\left[\begin{array}{ll}
0 & 0 \\
0 & 1
\end{array}\right]
\end{aligned}
$$

so that $\mathbf{x}^{\top} A_{1} \mathbf{z}=[\mathbf{x}]_{1}[\mathbf{z}]_{1}$ and $\mathbf{x}^{\top} A_{2} \mathbf{z}=[\mathbf{x}]_{2}[\mathbf{z}]_{2}$ (these matrices are psd with eigenvalues 0 and 1). The rest of the proof is identical to the proof for the RBF kernel in class:

(a) $k_{1}(\mathbf{x}, \mathbf{z})=\mathbf{x}^{\top} A_{1} \mathbf{z}=[\mathbf{x}]_{1}[\mathbf{z}]_{1}$, rule (1)

(b) $k_{2}(\mathbf{x}, \mathbf{z})=\frac{2}{\sigma^{2}} k_{1}(\mathbf{x}, \mathbf{z})=\frac{2}{\sigma^{2}}[\mathbf{x}]_{1}[\mathbf{z}]_{1}$, rule (2) with $c=\frac{2}{\sigma^{2}} \geq 0$

(c) $k_{3}(\mathbf{x}, \mathbf{z})=\exp \left(k_{2}(\mathbf{x}, \mathbf{z})\right)=\exp \left(\frac{2[\mathbf{x}]_{1}[\mathbf{z}]_{1}}{\sigma^{2}}\right)$, rule (3)

(d) $k_{4}(\mathbf{x}, \mathbf{z})=\exp \left(-\frac{[\mathbf{x}]_{1}[\mathbf{x}]_{1}}{\sigma^{2}}\right) \exp \left(\frac{2[\mathbf{x}]_{1}[\mathbf{z}]_{1}}{\sigma^{2}}\right) \exp \left(-\frac{[\mathbf{z}]_{1}[\mathbf{z}]_{1}}{\sigma^{2}}\right)=\exp \left(-\frac{\left\|[\mathbf{x}]_{1}-[\mathbf{z}]_{1}\right\|_{2}^{2}}{\sigma^{2}}\right)$, rule (4) with $f(\mathbf{x})=\exp \left(-\frac{[\mathbf{x}]_{1}[\mathbf{x}]_{1}}{\sigma^{2}}\right)$

(e) Repeating the above with $A_{2}, k_{5}(\mathbf{x}, \mathbf{z})=\exp \left(-\frac{\left\|[\mathbf{x}]_{2}-[\mathbf{z}]_{2}\right\|_{2}^{2}}{\sigma^{2}}\right)$ is a kernel.

(f) Finally, $k_{4}+k_{5}$ is a kernel by rule (5)."
Cornell Spring 2017,4,1,2,Decision Tree,Text,Imagine you build a KD-Tree and label each leaf with the most common label amongst all training points that fall into this leaf. Why would this not be a desirable classifier?,"Because many leaves would not be pure, which makes the most common label a bad estimate."
Cornell Spring 2017,4,2,4,Decision Tree,Text,Name two reasons why Random Forests are such popular classifiers amongst practitioners?,"1. They only have two hyper-parameters (the number of trees m and the number of features K), but both are really easy to set. You can set $K = \sqrt{d}$ and m as large as you can afford. 2. RF are based on decision/regression trees and require no feature scaling or any of the typical pre-processing of the data. Features can be in completely different units and can be categorical or real valued."
Cornell Spring 2017,4,3,2,Decision Tree,Text,"Assume you pre-process all your features in the following way: you sort each feature independently. For each feature, you then assign all those inputs that share the lowest feature value a new feature value of 1, all those with the second lowest value a 2, etc. How does this affect the trees that you construct?",It doesn't.
Cornell Spring 2017,4,4,4,Decision Tree,Text,Under what conditions on your training set will a CART tree (with unlimited depth) obtain 0% training error.,If there are no two training inputs with identical features but different labels.
Cornell Spring 2017,4,5,3,Decision Tree,Text,"You are building a regression tree with the squared loss impurity. i.e. the labels in the leaf are $L=\left\{y_{1}, \ldots, y_{m}\right\}$ and the loss, under prediction $t$, is $\sum_{y \in L}(y-t)^{2}$. Prove that the average label $t=\frac{1}{m} \sum_{i=1}^{m} y_{i}$ minimizes the loss at a leaf.","$$
t=\operatorname{argmin}_{t} \sum_{i=1}^{m}\left(t-y_{i}\right)^{2}
$$

The loss is convex in $t$ (its second derivative is $2m>0$), so setting the derivative with respect to $t$ to 0 gives the minimizer:

$$
\begin{aligned}
2 \sum_{i=1}^{m}\left(t-y_{i}\right) &=0 \\
2 m t-2 \sum_{i=1}^{m} y_{i} &=0 \\
t &=\frac{1}{m} \sum_{i=1}^{m} y_{i}
\end{aligned}
$$"
Cornell Spring 2017,4,6,6,Decision Tree,Text,"You are now considering minimizing the absolute loss instead: $\sum_{y \in L}|y-t|$. Define $L_{\leq}=\{y \in L: y \leq t\}$ and $L_{>}=\{y \in L: y>t\}$. Prove that setting $t$ to the median of $L$ minimizes this loss. To simplify things you can assume you have an odd number of samples (i.e. $m=2 r+1$ ) and that all $y_{i} \in L$ are distinct (i.e. $y_{i} \neq y_{j}$ for any $y_{i}, y_{j} \in L$ ). (Without loss of generality it is sufficient to show there is no better splitting value $t^{\prime}$ that is larger than the median. )","Let $t$ be the median of $L$. Since $m=2r+1$ and all labels are distinct, $|L_{\leq}|=r+1$ (the median itself belongs to $L_{\leq}$) and $|L_{>}|=r$. Suppose $t'>t$ and compare the two losses term by term. For every $y \in L_{\leq}$ we have $y \leq t<t'$, so $|y-t'|=|y-t|+(t'-t)$: each of these $r+1$ terms grows by exactly $t'-t$. For every $y \in L_{>}$ the triangle inequality gives $|y-t'| \geq|y-t|-(t'-t)$: each of these $r$ terms shrinks by at most $t'-t$. Hence $\sum_{y \in L}|y-t'|-\sum_{y \in L}|y-t| \geq(r+1)(t'-t)-r(t'-t)=t'-t>0$, so no $t'$ larger than the median achieves a lower loss. The case $t'<t$ is symmetric, so the median minimizes the absolute loss."
Cornell Spring 2017,5,1,3,Ensemble Methods,Text,Name two algorithms for which boosting will be ineffective. Briefly justify why.,"E.g. k-NN classification, unlimited-depth decision trees, kernel SVMs. They have high variance and essentially zero bias. Rubric: one point for each algorithm, one point for correct justification. Maximum 3 points. Special cases: (1) Linear classifiers also gain 1 point, since if the data set isn't linearly separable there is not much use in ensembling linear classifiers. (2) Naive Bayes doesn't get a point. (3) Random labeling gets 1 point, since it's not a weak learner (it doesn't learn). (4) Model labeling doesn't get points."
Cornell Spring 2017,5,2,5,Ensemble Methods,Text,Describe what happens in AdaBoost if two training inputs (in a binary classification problem) are identical in features but have different labels.,"Both points will obtain very high weights and will eventually dominate the training data set. Weak learners will no longer be able to classify the weighted data set with better than 50% accuracy and the algorithm will stop. Rubric: minimum 1 point for any answer that shows effort. (+1) for stating that the algorithm will stop. (+2) for stating that these two data points will gain weight. (+1) for stating that the weak learner eventually won't be able to distinguish these two data points. Maximum 5 points."
Cornell Spring 2017,5,3,3,Ensemble Methods,Text,"In neural networks bagging can be performed without random subsampling of the data. i.e., one trains m neural networks independently and ensembles their results. Can you explain why the subsampling is unnecessary in this case?","The random initialization and non-convexity of neural networks ensure that independently trained models end up in different local minima and obtain different results. The effect is similar to training on slightly different data sets. Rubric: minimum 1 point for any answer that shows effort. (+2) for stating that neural networks have random initialization. (+1) for stating that neural networks converge to a local minimum due to non-convexity. Special case (1): 2 points for mentioning that stochastic gradient descent randomly samples training data and leads to different weights. Special case (2): 2 points for mentioning that layers such as dropout introduce additional randomness."
Cornell Spring 2017,5,4a.,4,Ensemble Methods,Text,"Assume you have weak learners $h \in \mathcal{H}$ s.t. $h(\mathbf{x}) \in\{+1,-1\}$ for any $\mathbf{x}$. You are trying to apply boosting with the logistic loss function

$$
\mathcal{L}(H)=\sum_{i=1}^{n} \ln \left(1+e^{-y_{i} H\left(\mathbf{x}_{i}\right)}\right) .
$$

(remember, ln here refers to the natural logarithm)

Compute the derivative $\frac{\partial \mathcal{L}(H)}{\partial H\left(\mathbf{x}_{i}\right)}$.","$$
\frac{\partial \mathcal{L}(H)}{\partial H\left(\mathbf{x}_{i}\right)}=\frac{-y_{i}}{1+e^{-y_{i} H\left(\mathbf{x}_{i}\right)}} e^{-y_{i} H\left(\mathbf{x}_{i}\right)}=-\frac{y_{i}}{1+e^{y_{i} H\left(\mathbf{x}_{i}\right)}}
$$
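
As a sanity check on the closed form above, the following minimal sketch compares it against central finite differences. The numbers in `H` and `y` are made up for illustration, and each entry of `H` simply stands in for one value $H(\mathbf{x}_{i})$; none of this is part of the original exam.

```python
import math

def logistic_loss(H, y):
    # L(H) = sum_i ln(1 + exp(-y_i * H_i)), where H_i stands in for H(x_i)
    return sum(math.log(1.0 + math.exp(-yi * Hi)) for Hi, yi in zip(H, y))

def analytic_grad(H, y):
    # Closed form from the solution: dL/dH_i = -y_i / (1 + exp(y_i * H_i))
    return [-yi / (1.0 + math.exp(yi * Hi)) for Hi, yi in zip(H, y)]

def numeric_grad(H, y, eps=1e-6):
    # Central finite differences, perturbing one coordinate at a time
    grads = []
    for i in range(len(H)):
        up = list(H)
        up[i] += eps
        down = list(H)
        down[i] -= eps
        grads.append((logistic_loss(up, y) - logistic_loss(down, y)) / (2 * eps))
    return grads

H = [0.3, -1.2, 2.0]  # hypothetical ensemble outputs
y = [1, -1, -1]       # hypothetical labels in {+1, -1}
max_gap = max(abs(a - b) for a, b in zip(analytic_grad(H, y), numeric_grad(H, y)))
print(max_gap)  # should be tiny if the closed form is right
```

The magnitude of each gradient entry is exactly the weight $w_{i}=\frac{1}{1+e^{y_{i} H\left(\mathbf{x}_{i}\right)}}$, so badly misclassified points receive weights close to 1.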

Rubric: minimum 1 point for any answer that shows effort. $(+3)$ if the answer is correct. $(+2)$ if the answer has only a minor mistake (e.g. a flipped sign)."
Cornell Spring 2017,5,4b.,6,Ensemble Methods,Text,"Assume you have weak learners $h \in \mathcal{H}$ s.t. $h(\mathbf{x}) \in\{+1,-1\}$ for any $\mathbf{x}$. You are trying to apply boosting with the logistic loss function

$$
\mathcal{L}(H)=\sum_{i=1}^{n} \ln \left(1+e^{-y_{i} H\left(\mathbf{x}_{i}\right)}\right) .
$$

(remember, ln here refers to the natural logarithm) Let $w_{i}=\frac{1}{1+e^{y_{i} H\left(\mathbf{x}_{i}\right)}}$ and let $\epsilon(h)=\sum_{i: h\left(\mathbf{x}_{i}\right) \neq y_{i}} w_{i}$ be the weighted error of the training set. For simplicity assume we are using a fixed step-size of 1. Show that the next classifier to be added to the ensemble $H$ in order to minimize the loss function is $h=\operatorname{argmin}_{h} \mathcal{L}(H+h)=\arg \min _{h} \epsilon(h)$.","$$
\begin{aligned}
h &=\operatorname{argmin}_{h} \sum_{i=1}^{n} h\left(\mathbf{x}_{i}\right) \frac{\partial \mathcal{L}(H)}{\partial H\left(\mathbf{x}_{i}\right)} \\
&=\operatorname{argmin}_{h}-\sum_{i=1}^{n} w_{i} y_{i} h\left(\mathbf{x}_{i}\right) \\
&=\operatorname{argmin}_{h} \sum_{i: h\left(\mathbf{x}_{i}\right) \neq y_{i}} w_{i}-\sum_{i: h\left(\mathbf{x}_{i}\right)=y_{i}} w_{i} \\
&=\operatorname{argmin}_{h} \epsilon(h)-(W-\epsilon(h)), \quad W=\sum_{i=1}^{n} w_{i} \text { (constant in } h \text {) } \\
&=\operatorname{argmin}_{h} \epsilon(h)
\end{aligned}
$$"
Cornell Spring 2017,6,1,5,Neural Networks,Text,"Assume you are given a neural network with $L$ layers to minimize a loss function $\mathcal{L}$

$$
\begin{aligned}
h(\mathbf{x}) &=\mathbf{w}^{\top} \phi_{1}(\mathbf{x}) \\
\phi_{1}(\mathbf{x}) &=\sigma\left(\mathbf{U}_{1} \phi_{2}(\mathbf{x})\right) \\
& \vdots \\
\phi_{\ell}(\mathbf{x}) &=\sigma\left(\mathbf{U}_{\ell} \phi_{\ell+1}(\mathbf{x})\right) \\
& \vdots \\
\phi_{L}(\mathbf{x}) &=\sigma\left(\mathbf{U}_{L} \mathbf{x}\right)
\end{aligned}
$$

(Note that the subscript of $\phi$ starts at 1 at the end of the network, and increases to $L$ as we make our way back to the start) Let us define $a_{\ell}=\mathbf{U}_{\ell} \phi_{\ell+1}(\mathbf{x})$ such that $\phi_{\ell}=\sigma\left(a_{\ell}\right)$. Let $\delta_{\ell}=\frac{\partial \mathcal{L}}{\partial a_{\ell}}$. Express $\frac{\partial \mathcal{L}}{\partial \mathbf{U}_{\ell}}$ in terms of $\delta_{\ell}$. (assume $1<\ell<L$ )","$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \mathbf{U}_{\ell}} &=\frac{\partial \mathcal{L}}{\partial a_{\ell}} \frac{\partial a_{\ell}}{\partial \mathbf{U}_{\ell}} \\
&=\delta_{\ell} \phi_{\ell+1}(\mathbf{x})^{T}
\end{aligned}
$$"
Cornell Spring 2017,6,2,5,Neural Networks,Text,"Assume you are given a neural network with $L$ layers to minimize a loss function $\mathcal{L}$

$$
\begin{aligned}
h(\mathbf{x}) &=\mathbf{w}^{\top} \phi_{1}(\mathbf{x}) \\
\phi_{1}(\mathbf{x}) &=\sigma\left(\mathbf{U}_{1} \phi_{2}(\mathbf{x})\right) \\
& \vdots \\
\phi_{\ell}(\mathbf{x}) &=\sigma\left(\mathbf{U}_{\ell} \phi_{\ell+1}(\mathbf{x})\right) \\
& \vdots \\
\phi_{L}(\mathbf{x}) &=\sigma\left(\mathbf{U}_{L} \mathbf{x}\right)
\end{aligned}
$$

(Note that the subscript of $\phi$ starts at 1 at the end of the network, and increases to $L$ as we make our way back to the start) Assume that the derivative of $\sigma(z)$ is given as $\sigma^{\prime}(z)$. Define $\delta_{\ell+1}$ as a function of $\delta_{\ell}$. (assume $1<\ell<L$ ) where $x=\phi_{L+1}$","$$
\begin{aligned}
\delta_{\ell+1} &=\frac{\partial \mathcal{L}}{\partial a_{\ell+1}} \\
&=\frac{\partial \mathcal{L}}{\partial \phi_{\ell+1}} \frac{\partial \phi_{\ell+1}}{\partial a_{\ell+1}} \\
&=\frac{\partial \mathcal{L}}{\partial a_{\ell}} \frac{\partial a_{\ell}}{\partial \phi_{\ell+1}} \frac{\partial \phi_{\ell+1}}{\partial a_{\ell+1}} \\
&=\sigma^{\prime}\left(a_{\ell+1}\right) \odot \mathbf{U}_{\ell}^{T} \delta_{\ell} \\
&=\sigma^{\prime}\left(\mathbf{U}_{\ell+1} \phi_{\ell+2}\right) \odot \mathbf{U}_{\ell}^{T} \delta_{\ell}
\end{aligned}
$$"
Cornell Spring 2017,6,3,3,Neural Networks,Text,Provide one reason why stochastic gradient descent can be better than traditional (batch) gradient descent when applied to neural networks.,"SGD can jump out of local minima more easily, since it's more noisy. Alternatively, you can note that as you increase your batch size, your update gradient asymptotically approaches the true gradient. Thus, you can split your batch into n parts, yielding n updates with generally better than $\frac{1}{n}$ accuracy relative to the true gradient, yielding more progress per computation time. SGD takes this to an extreme. Both answers are correct, but not equivalent."
Cornell Spring 2017,6,4,4,Classifiers,Text,Assume you make all transition functions the identity (i.e. $\sigma(z)=z$ ). Prove that the final classifier is simply a linear classifier of the form $h(\mathbf{x})=\hat{\mathbf{w}}^{\top} \mathbf{x}$ for some vector $\hat{\mathbf{w}}$.,"$$
\begin{aligned}
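h(\mathbf{x}) &=\mathbf{w}^{T} \mathbf{U}_{1} \mathbf{U}_{2} \cdots \mathbf{U}_{L} \mathbf{x} \\
&=\hat{\mathbf{w}}^{T} \mathbf{x}, \quad \hat{\mathbf{w}}=\left(\mathbf{U}_{1} \mathbf{U}_{2} \cdots \mathbf{U}_{L}\right)^{T} \mathbf{w}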
\end{aligned}
$$"
Cornell Spring 2017,6,5,4,Optimization,Text,ML-practitioners tend to drop the learning rate during training. Explain why and what effect it has.,"Starting out with a large learning rate has two advantages: 1. it prevents you from getting trapped in sharp local minima, because the weights ""jump around"" too much with each step; and 2. it moves you quickly ""down-hill"" because you take larger steps. Switching to a smaller learning rate afterwards allows the network to converge to the local minimum closest to the current weight position."