﻿Semester,Question Number,Part,Points,Topic,Type,Question,Solution
Cornell Fall 2018,1,1,3,Bonus,Text,I filled out the course evaluation for CS4780?,
Cornell Fall 2018,1,2,2,Model Selection,Text,"(T/F) The fewer assumptions an algorithm makes, the better it is. In practice the best algorithm is Generic Programming which makes no assumptions at all.","False, all algorithms make assumptions."
Cornell Fall 2018,1,3,1,Decision Trees,Text,(T/F) With Random Forests there is no need to perform a training/validation split.,True.
Cornell Fall 2018,1,4,2,Logistic Regression,Text,"(T/F) MLE is great to learn the parameters of a binomial distribution, but it cannot be used to learn the parameters of a separating hyper-plane.","False, the logistic loss in Logistic Regression is derived through MLE to learn the best separating hyperplane."
Cornell Fall 2018,1,5,2,Classifiers,Text,(T/F) The Naive Bayes classifier assumes that all features are independent.,"False, it assumes all features are conditionally independent given the label."
Cornell Fall 2018,1,6,2,Logistic Regression,Text,"(T/F) Logistic Regression converges whenever a separating hyper-plane exists, otherwise it may run forever.",False. Logistic regression solves a convex optimization problem and always converges.
Cornell Fall 2018,1,7,2,Classifiers,Text,(T/F) The set of support vectors consists of all the training data points an SVM cannot classify correctly.,"False, they also include all training points with a margin of $\leq 1$."
Cornell Fall 2018,1,8,1,Classifiers,Text,(T/F) A learned kernel SVM model (with RBF kernel) requires you to store some of the training data.,True (the support vectors)
Cornell Fall 2018,1,9,1,Classifiers,Text,(T/F) The decision boundary of a dual SVM classifier with linear kernel is identical to that of a primal SVM classifier.,True.
Cornell Fall 2018,1,10,1,Loss Functions,Text,(T/F) The l1 regularizer encourages sparse solutions.,True.
Cornell Fall 2018,1,11,2,Classifiers,Text,(T/F) In SVMs l2 regularization minimizes the squared bias term $b^2$.,"False, the bias term is not regularized."
Cornell Fall 2018,1,12,2,Classifiers,Text,(T/F) Linear classifiers have as parameters the hyper-plane normal $\mathbf{w}$ and a bias term $b$. Reducing this bias term $b$ will often increase the variance of the classifier.,"False, the bias term is different from the bias in the bias/variance trade-off."
Cornell Fall 2018,1,13,1,Regression,Text,(T/F) The conditional distribution $P(y|x)$ of Gaussian Process Regression is itself a Gaussian distribution.,True.
Cornell Fall 2018,1,14,1,Regression,Text,(T/F) Kernelized linear regression (with RBF kernel) is a non-parametric algorithm.,True.
Cornell Fall 2018,1,15,1,Decision Trees,Text,"(T/F) CART trees, if learned to full depth, are non-parametric algorithms.",True.
Cornell Fall 2018,1,16,2,Ensemble Methods,Text,"(T/F) In bagging, each classifier in the ensemble is trained on a data set that is independently and identically distributed.","False, the data is not independently sampled."
Cornell Fall 2018,1,17,1,Ensemble Methods,Text,(T/F) One advantage of bagging is that all ensemble members (i.e. classifiers) can be trained in parallel.,True.
Cornell Fall 2018,1,18,2,Ensemble Methods,Text,(T/F) AdaBoost with decision trees (depth 3) is non-parametric.,"False, the set of parameters is not a function of the number of training instances, $n$."
Cornell Fall 2018,1,19,2,Ensemble Methods,Text,(T/F) AdaBoost terminates the moment it reaches $0\%$ training error.,"False, as long as there is a weak learner with $< 0.5$ weighted training error, AdaBoost keeps boosting."
Cornell Fall 2018,1,20,1,Decision Trees,Text,(T/F) One advantage of Random Forests is that you obtain meaningful probability estimates as your output predictions $P(y|x)$.,True.
Cornell Fall 2018,1,21,1,Neural Networks,Text,(T/F) Deep convolutional neural networks are particularly well suited for image classification tasks.,True.
Cornell Fall 2018,1,22,2,Neural Networks,Text,(T/F) The optimization of deep neural networks is a convex minimization problem.,"False, it is non-convex because of the non-linear transition functions."
Cornell Fall 2018,2,1,3,Model Selection,Text,Your Decision Tree classifier has a training error of $0\%$ and a testing error of $87\%$. What can you say about the bias/variance trade-off (assuming the data is not noisy)? Name two possible interventions to reduce the testing error.,"High Variance, Low Bias. You could prune the tree, or use bagging."
Cornell Fall 2018,2,2,3,Model Selection,Text,"For k-fold cross validation, describe the positive and negative effects as $k \rightarrow n$. When would you be most inclined to use $k = n$?",The error decreases (as each fold leaves you more training data) but as $k \rightarrow n$ the validation procedure also becomes much slower. You would use $k = n$ if you have very little training data (e.g. $n = 20$).
Cornell Fall 2018,2,3,3,Model Selection,Text,The expected regression error decomposes into three terms. Write down the mathematical decomposition and label each term.,"$$
\underbrace{E_{\mathbf{x}, y, D}\left[\left(h_{D}(\mathbf{x})-y\right)^{2}\right]}_{\text{Expected Test Error}}=\underbrace{E_{\mathbf{x}, D}\left[\left(h_{D}(\mathbf{x})-\bar{h}(\mathbf{x})\right)^{2}\right]}_{\text{Variance}}+\underbrace{E_{\mathbf{x}, y}\left[(\bar{y}(\mathbf{x})-y)^{2}\right]}_{\text{Noise}}+\underbrace{E_{\mathbf{x}}\left[(\bar{h}(\mathbf{x})-\bar{y}(\mathbf{x}))^{2}\right]}_{\text{Bias}^{2}}
$$"
Cornell Fall 2018,2,4,3,Model Selection,Text,Explain why adding more training data does not always help reduce your testing error below a desired threshold $\epsilon>0$. Describe such a scenario.,The training error is a lower bound on the testing error. Adding more data increases the training error. If your training error is already too high $(>\epsilon)$ adding more data will not help bring the testing error below $\epsilon$ as it is bounded by the training error.
Cornell Fall 2018,2,5a,1,Model Selection,Text,"Consider the following algorithms and highlighted hyper-parameters. Decide whether increasing these parameters could help reduce overfitting. Answer with ""Yes"" or ""No"". The number of hidden units in the Neural Network.",No.
Cornell Fall 2018,2,5b,1,Model Selection,Text,"Consider the following algorithms and highlighted hyper-parameters. Decide whether increasing these parameters could help reduce overfitting. Answer with ""Yes"" or ""No"". The maximum depth in Decision Trees.",No.
Cornell Fall 2018,2,5c,1,Model Selection,Text,"Consider the following algorithms and highlighted hyper-parameters. Decide whether increasing these parameters could help reduce overfitting. Answer with ""Yes"" or ""No"". $\lambda$ in Logistic Regression, trained with a $\lambda \sum_{j} w_{j}^{2}$ penalty in the objective.",Yes.
Cornell Fall 2018,2,5d,1,Model Selection,Text,"Consider the following algorithms and highlighted hyper-parameters. Decide whether increasing these parameters could help reduce overfitting. Answer with ""Yes"" or ""No"". The number of iterations $T$ in Boosting.",No.
Cornell Fall 2018,3,1,2,Classifiers,Text,Name one condition that is necessary and sufficient for a matrix $\mathbf{K}$ to be positive semi-definite.,"$\forall \mathbf{q}, \mathbf{q}^{\top} \mathbf{K} \mathbf{q} \geq 0$ or $K=L^{\top} L$ for some real matrix $L$, or $K$ only has non-negative eigenvalues."
Cornell Fall 2018,3,2,3,Classifiers,Text,"Which of the following algorithms can be kernelized: a) Decision Trees, b) Linear Regression, c) Gaussian Processes. Justify your answer.","b) and c) not a). b) and c) access data points only through inner-products, whereas a) splits on feature values and needs the feature realization of the data."
Cornell Fall 2018,3,3,4,Classifiers,Image,Consider the following data set. Draw the decision boundary you would obtain with a hard margin linear SVM? Circle all the support vectors!,Solution
Cornell Fall 2018,3,4,3,Classifiers,Image,Add two blue points (\#1 and \#2) such that \#1 would and \#2 would not affect the decision boundary if the SVM was re-trained.,Solution
Cornell Fall 2018,3,5,4,Classifiers,Text,Let $m$ be the number of support vectors of an SVM trained on $n$ data points (with RBF kernel). For a fixed $n$ imagine you increase the dimensionality $d$ of the data until it becomes very large. How would you expect the ratio $\frac{m}{n}$ to change as $d \gg 0$ ?,It approaches 1 because of the curse of dimensionality. All training points will be very far away from each other and close to the decision boundary.
Cornell Fall 2018,3,6,2,Classifiers,Text,Describe a scenario in which you may want to use a kernel SVM with linear kernel instead of a standard linear (primal) SVM.,"If your dimensionality is very large, once the kernel is computed the computational complexity of kernel SVMs is independent of $d$."
Cornell Fall 2018,3,7,2,Classifiers,Image,Consider the following data set. Draw a plausible decision boundary for a hard-margin SVM with polynomial kernel.,Solution
Cornell Fall 2018,3,8,2,Classifiers,Text,You are given a non-linear regression data set. You are deciding between training a Gaussian Process or kernelized linear regression (both with $\mathrm{RBF}$ Kernel). Which one will have lower testing / training error?,They are identical.
Cornell Fall 2018,4,1,4,Decision Trees,Text,Name two advantages of decision trees over nearest neighbor algorithms.,"(1) Once the tree is constructed, the training data does not need to be stored. Instead, we can simply store how many points of each label ended up in each leaf; typically the leaves are pure, so we just have to store one label per leaf. (2) Decision trees are very fast during test time, as test inputs simply need to traverse down the tree to a leaf; the prediction is the majority label of the leaf. (3) Decision trees require no metric because the splits are based on feature thresholds and not distances."
Cornell Fall 2018,4,2,2,Decision Trees,Text,Name the CART stopping criteria (with unlimited depth).,all labels are identical or all features are identical
Cornell Fall 2018,4,3a,4,Decision Trees,Text,"Consider the classification dataset $S$ with $|S|=9$ visualized in the following figure and table: \begin{tabular}{lll}
\hline
$\mathrm{i}$ & $\mathbf{x}_{i}$ & $y_{i}$ \\
\hline
1 & $(1,1)$ & $+1$ \\
2 & $(1,2)$ & $-1$ \\
3 & $(1,3)$ & $-1$ \\
4 & $(2,1)$ & $-1$ \\
5 & $(2,2)$ & $-1$ \\
6 & $(2,3)$ & $-1$ \\
7 & $(3,1)$ & $-1$ \\
8 & $(3,2)$ & $+1$ \\
9 & $(3,3)$ & $+1$ \\
\hline
\end{tabular}
Compute the Gini impurity for this dataset before any split.",Gini impurity: $I_{G}(S)=\frac{1}{3} * \frac{2}{3}+\frac{2}{3} * \frac{1}{3}=\frac{4}{9}$.
Cornell Fall 2018,4,3b,4,Decision Trees,Image,"Consider the classification dataset $S$ with $|S|=9$ visualized in the following figure and table: \begin{tabular}{lll}
\hline
$\mathrm{i}$ & $\mathbf{x}_{i}$ & $y_{i}$ \\
\hline
1 & $(1,1)$ & $+1$ \\
2 & $(1,2)$ & $-1$ \\
3 & $(1,3)$ & $-1$ \\
4 & $(2,1)$ & $-1$ \\
5 & $(2,2)$ & $-1$ \\
6 & $(2,3)$ & $-1$ \\
7 & $(3,1)$ & $-1$ \\
8 & $(3,2)$ & $+1$ \\
9 & $(3,3)$ & $+1$ \\
\hline
\end{tabular}
Perform the CART algorithm with Gini impurity on $S$. Please draw a resulting tree (with splitting values and features) and also draw the corresponding hyper-planes in the previous figure.",Solution
Cornell Fall 2018,5,1,3,Ensemble Methods,Text,What loss function does AdaBoost minimize? (Write down the precise mathematical form.),The exponential loss $\frac{1}{n} \sum_{i=1}^{n} e^{-y_{i} H\left(x_{i}\right)}$ (the $\frac{1}{n}$ is optional).
Cornell Fall 2018,5,2,2,Ensemble Methods,Text,Imagine $10 \%$ of your binary training data (all points unique) are accidentally mislabeled. What is the training error that AdaBoost will converge to after sufficient rounds of boosting?,$0 \%$
Cornell Fall 2018,5,3,2,Ensemble Methods,Text,Describe a data scenario in which AdaBoost is not a good choice. Justify your answer.,If the data exhibits label noise. The exponential loss will ensure that the mislabeled data points will also be classified correctly and the algorithm will overfit (badly).
Cornell Fall 2018,5,4a,2,Ensemble Methods,Text,"Given a distribution $P$ you can sample a training set $D$ and obtain a classifier $h$. Imagine you train $m$ such classifiers $h_{1}, \ldots, h_{m}$ on $m$ data sets $D_{1}, \ldots, D_{m}$, each drawn i.i.d. from the data distribution $P$. As you increase $m$ from $m=1$ to $m \gg 0$, show how you can use these models to obtain a low variance classifier $\hat{h}$.",You average them: $\hat{h}=\frac{1}{m} \sum_{i=1}^{m} h_{i}$.
Cornell Fall 2018,5,4b,2,Ensemble Methods,Text,"Given a distribution $P$ you can sample a training set $D$ and obtain a classifier $h$. Imagine you train $m$ such classifiers $h_{1}, \ldots, h_{m}$ on $m$ data sets $D_{1}, \ldots, D_{m}$, each drawn i.i.d. from the data distribution $P$. As you increase $m$ from $m=1$ to $m \gg 0$, what happens to the variance of $\hat{h}$ in the limit, $m \gg 0$?","By the weak law of large numbers the average $\hat{h}$ will approach the expected classifier $\bar{h}$ as $m \gg 0$ and $E_{\mathbf{x}, D}\left[\left(h_{D}(\mathbf{x})-\bar{h}(\mathbf{x})\right)^{2}\right] \rightarrow 0$."
Cornell Fall 2018,5,4c,2,Ensemble Methods,Text,"Given a distribution $P$ you can sample a training set $D$ and obtain a classifier $h$. Imagine you train $m$ such classifiers $h_{1}, \ldots, h_{m}$ on $m$ data sets $D_{1}, \ldots, D_{m}$, each drawn i.i.d. from the data distribution $P$. As you increase $m$ from $m=1$ to $m \gg 0$, how does the bias of $\hat{h}$ compare to the bias of $h$?","The bias is unaffected, i.e. the bias of $\hat{h}$ is identical to the bias of $h$, because $E[\hat{h}]=E[h]$."
Cornell Fall 2018,5,5,4,Ensemble Methods,Text,"After two iterations of AdaBoost, with step sizes $\alpha_{1}, \alpha_{2}$ respectively and weak learners $h_{1}, h_{2}$, what are all possible weights that could potentially be assigned to a training data point (ignore normalization).","$e^{-\alpha_{1}-\alpha_{2}}, e^{-\alpha_{1}+\alpha_{2}}, e^{+\alpha_{1}-\alpha_{2}}, e^{\alpha_{1}+\alpha_{2}}$"
Cornell Fall 2018,5,6,4,Ensemble Methods,Text,"Robin is trying to use AdaBoost on full CART trees without depth limit (all training points are distinct). Although the code seems correct, it crashes in the very first round. What do you think is the problem?","The CART tree has zero classification error, yielding an infinite step-size $\alpha=\frac{1}{2} \ln \left(\frac{1-\epsilon}{\epsilon}\right)$ and a division by zero."
Cornell Fall 2018,6,1,2,Neural Networks,Text,Name two reasons why Newton's Method typically is not used to train deep neural networks.,1. too many parameters to store the Hessian; 2. it converges quickly to the closest local minimum / saddle point and not to a wide minimum
Cornell Fall 2018,6,2,2,Loss Functions,Text,Let the loss function be $\ell(\mathbf{w})=\frac{1}{2 n} \sum_{i=1}^{n}\left(\mathbf{x}_{i}^{\top} \mathbf{w}-y_{i}\right)^{2}$. Write down the update for Stochastic Gradient Descent and Gradient Descent.,$G_{GD}=\frac{1}{n} \sum_{i=1}^{n}\left(\mathbf{x}_{i}^{\top} \mathbf{w}-y_{i}\right) \mathbf{x}_{i}$ whereas the SGD update is $G_{SGD}=\frac{1}{m} \sum_{i=1}^{m}\left(\mathbf{x}_{s_{i}}^{\top} \mathbf{w}-y_{s_{i}}\right) \mathbf{x}_{s_{i}}$ for randomly picked $s_{i} \in[n]$.
Cornell Fall 2018,6,3,2,CNNs,Text,"Suppose you have a convolutional filter of size $k \times k$. When you apply this filter to an $n \times n$ input image, what is the dimension of the output feature map with no padding?",$(n-k+1) \times(n-k+1)$
Cornell Fall 2018,6,4,2,CNNs,Text,"Suppose you have a $3 \times 3$ matrix $I$ from one patch of an image. Each matrix value corresponds to a pixel. 
$$
I=\left[\begin{array}{lll}
3 & 1 & 1 \\
3 & 0 & 2 \\
4 & 4 & 0
\end{array}\right]
$$
and filter kernel
$$
k=\left[\begin{array}{ll}
1 & 0 \\
1 & 1
\end{array}\right]
$$
What is the output matrix after convolving the input $I$ with $k$ (no flipping of the kernel in case you learned that in your computer vision/signal processing class)? We don't consider padding and stride here. The output should be a $2 \times 2$ matrix.","$$
\left[\begin{array}{cc}
6 & 3 \\
11 & 4
\end{array}\right]
$$"
Cornell Fall 2018,6,5,4,Neural Networks,Text,"Consider the following neural network:
\begin{itemize}
\item Input layer: 80 units

\item First hidden layer: 20 hidden units

\item Second hidden layer: 60 hidden units

\item Third hidden layer: 20 hidden units

\item Output layer: 80 units

\item Sigmoidal activation for each hidden layer and the output

\item Loss function: logistic loss

\end{itemize}
Each layer has a bias. How many parameters does this neural network have? You can leave your answer as an expression.","$$
\# \text { params }=80 \cdot 20+20 \cdot 60+60 \cdot 20+20 \cdot 80+20+60+20+80=5780
$$"