Semester,Question Number,Part,Points,Topic,Type,Question,Solution
MIT Spring 2018,1,a.i,3,Classifiers,Text,"In this problem, we will consider two-dimensional input data vectors $x=\left[x_{1}, x_{2}\right]^{T}$. We will explore the impact of a feature transformation on various learning methods. We will be using the feature transformation $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Consider the following $2 \mathrm{D}$ data sets: Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$. Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$. For each dataset, is there a vector $\theta^{T}$ for a linear classifier through the origin $\left(\theta^{T} x\right)$ that perfectly separates the data? Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$.",No. The data is not linearly separable.
MIT Spring 2018,1,a.ii,3,Classifiers,Text,"In this problem, we will consider two-dimensional input data vectors $x=\left[x_{1}, x_{2}\right]^{T}$. We will explore the impact of a feature transformation on various learning methods. We will be using the feature transformation $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Consider the following $2 \mathrm{D}$ data sets: Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$. Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$. For each dataset, is there a vector $\theta^{T}$ for a linear classifier through the origin $\left(\theta^{T} x\right)$ that perfectly separates the data? Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$.",No. The data is not linearly separable.
MIT Spring 2018,1,b.i,3,Classifiers,Text,"In this problem, we will consider two-dimensional input data vectors $x=\left[x_{1}, x_{2}\right]^{T}$. We will explore the impact of a feature transformation on various learning methods. We will be using the feature transformation $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Consider the following $2 \mathrm{D}$ data sets: Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$. Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$  For each dataset, is there a vector $\theta^{T}$ for a linear classifier through the origin $\left(\theta^{T} \phi(x)\right)$ that perfectly separates the data? $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$. ",Yes
MIT Spring 2018,1,b.ii,3,Classifiers,Text,"In this problem, we will consider two-dimensional input data vectors $x=\left[x_{1}, x_{2}\right]^{T}$. We will explore the impact of a feature transformation on various learning methods. We will be using the feature transformation $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Consider the following $2 \mathrm{D}$ data sets: Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$. Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$  For each dataset, is there a vector $\theta^{T}$ for a linear classifier through the origin $\left(\theta^{T} \phi(x)\right)$ that perfectly separates the data? $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$",Yes
MIT Spring 2018,1,c,4,Classifiers,Text,"In this problem, we will consider two-dimensional input data vectors $x=\left[x_{1}, x_{2}\right]^{T}$. We will explore the impact of a feature transformation on various learning methods. We will be using the feature transformation $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Consider the following $2 \mathrm{D}$ data sets: Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$. Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$. For the dataset indicated below, could a one-hidden-layer neural network with $x_{1}$ and $x_{2}$ as inputs, a layer of up to four relu units and a final tanh output unit be trained to separate the data set? The network is specified as follows: $$ \begin{aligned} &z=W^{T} x+W_{0} \\ &o=\tanh \left(V^{T} \operatorname{relu}(z)+V_{0}\right) \end{aligned} $$ Assuming you use $k \leq 4$ hidden units, $W$ is $2 \times k$, $W_{0}$ is $k \times 1$ and $V$ is $k \times 1$ and $V_{0}$ is $1 \times 1$. Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$","Yes. The simplest way is to use four ReLU units. The $i$th ReLU unit has a positive pre-activation when given the $i$th input, and a negative pre-activation (hence zero output) when given any of the other three inputs. The weight connecting the $i$th ReLU unit to the tanh output should be a large positive number when the $i$th label is $+1$, and a large negative number when the $i$th label is $-1$."
MIT Spring 2018,1,d,4,Classifiers,Text,"In this problem, we will consider two-dimensional input data vectors $x=\left[x_{1}, x_{2}\right]^{T}$. We will explore the impact of a feature transformation on various learning methods. We will be using the feature transformation $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Consider the following $2 \mathrm{D}$ data sets: Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$. Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$  For the dataset indicated below, could a one-hidden-layer neural network with the entries in $\phi(x)$ as inputs, a layer of up to four relu units and a final tanh output unit be trained to separate the data set? If yes, show the network with weights, including offsets if any. If no, explain briefly why not. Make sure that the prediction has the correct sign. The network is specified as follows: $$ \begin{aligned} &z=W^{T} \phi(x)+W_{0} \\ &o=\tanh \left(V^{T} \operatorname{relu}(z)+V_{0}\right) \end{aligned} $$ Assuming you use $k \leq 4$ hidden units, $W$ is $6 \times k, W_{0}$ is $k \times 1$ and $V$ is $k \times 1$ and $V_{0}$ is $1 \times 1$. $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$",Yes
MIT Spring 2018,2,a,3,Decision Trees,Image,"We will continue the example from the previous question.
For the dataset indicated below, construct a decision tree (using the algorithm from class, based on weighted entropy) with the original features $x=\left[x_{1}, x_{2}\right]^{T}$. Use tests of the form $f<v$. If there is a tie in the choice of split, first prefer $x_{1}$ and then smaller thresholds. You do not need to provide numerical values of the weighted entropy.
Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$.","$x_{1}<0.5$
  T: $x_{2}<0.5$
    T: $-1$
    F: $+1$
  F: $x_{2}<0.5$
    T: $+1$
    F: $-1$"
MIT Spring 2018,2,b,3,Decision Trees,Image,"We will continue the example from the previous question. For the dataset indicated below, construct a decision tree (using the algorithm from class, based on weighted entropy) with features from $\phi(x)$. If there is a tie in the choice of split, first prefer features that appear earlier in the $\phi(x)$ vector and then smaller thresholds. Use tests of the form $f<v$. You do not need to provide numerical values of the weighted entropy.
$$
\phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T}
$$
Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$.","$x_{1} x_{2}<0.5$
  T: $x_{1}<0.5$
    T: $x_{2}<0.5$
      T: $-1$
      F: $+1$
    F: $+1$
  F: $-1$"
MIT Spring 2018,2,c,2,Decision Trees,Image,"We will continue the example from the previous question. For any dataset with only positive-valued $x_{1}$ and $x_{2}$, which features in $\phi(x)$ cannot possibly appear in a decision tree computed by the algorithm from class? Assume the splitting rule described earlier: if there is a tie in the choice of split, first prefer features that appear earlier in the $\phi(x)$ vector and then smaller thresholds. Explain your answer.
$$
\phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T}
$$","Features $1, x_{1}^{2}$ and $x_{2}^{2}$ cannot appear. The first one provides no information and the square terms (for positive data values) create the same splits in the data as the $x_{1}$ and $x_{2}$ features."
MIT Spring 2018,3,a,2,Neural Networks,Image,"Assume two data sets are sampled from the same distribution where data set 1 has 1,000 elements and data set 2 has 10,000 elements. Also assume we randomly construct train and test sets from both data sets by dividing them into $90 \%$ training and $10 \%$ testing.

We will explore the effect of using models of increasing complexity (you can think of this as decreasing regularization).
- Draw two curves, for training error and test error, for each data set with the $y$-axis denoting the error and the $x$-axis denoting the model complexity.
- You should have a total of 4 curves: one training error and one test error curve for each dataset.
- Draw all 4 of them in the same diagram below. We have included the true error value on the diagram; this is the error that the correct model has on this data.
- Clearly mark your curves with the labels: $1 \mathrm{~K}$ train, $1 \mathrm{~K}$ test, $10 \mathrm{~K}$ train, $10 \mathrm{~K}$ test.
The following factors will be used for grading:
- The general shape of the curves.
- The relative ordering of the curves in the ""Prediction Error"" direction.","- Training error is lower than the true error (with sufficient model complexity), while test error is higher, as we are fitting to the training data
- Training error decreases with increasing model complexity, as we have increased capacity to fit the data
- Test error initially decreases with increasing model complexity and then increases, as we start to fit the data better and then proceed to overfit
- The $10 \mathrm{k}$ dataset makes it more difficult to overfit, so training error is higher and test error lower compared to their $1 \mathrm{k}$ counterparts."
MIT Spring 2018,3,b,2,Neural Networks,Image,"Consider these training and test curves as a function of training dataset size. These are for two models: one simple and one complex. Which is which? Explain your choice.
Left: $\checkmark$ simple, $\bigcirc$ complex. Right: $\bigcirc$ simple, $\checkmark$ complex.","The complex model more easily overfits, so test error is initially worse (and training error better), but with sufficient data (to prevent overfitting) the more complex model performs better."
MIT Spring 2018,3,c.i,2,Neural Networks,Text,"In some cases, we will have a validation set in addition to training and test sets. Assume the validation set is approximately the same size as the test set. This validation set is often used to tune hyperparameters such as $\lambda$. Imagine we have trained a classifier using regularization, with $\lambda$ chosen based on performance on the validation set. Which will have the highest accuracy: the training set, the validation set, or the test set?",Validation set
MIT Spring 2018,3,c.ii,2,Neural Networks,Text,"In some cases, we will have a validation set in addition to training and test sets. Assume the validation set is approximately the same size as the test set. This validation set is often used to tune hyperparameters such as $\lambda$. Imagine we have trained a classifier using regularization, with $\lambda$ chosen based on performance on the validation set. Which will have the lowest accuracy: the training set, the validation set, or the test set?",Test set
MIT Spring 2018,3,d.i,2,Neural Networks,Text,"In some cases, we will have a validation set in addition to training and test sets. Assume the validation set is approximately the same size as the test set. This validation set is often used to tune hyperparameters such as $\lambda$. Imagine we have trained a classifier using regularization, with $\lambda$ chosen based on performance on the training set. Which will have the highest accuracy: the training set, the validation set, or the test set?",Training set
MIT Spring 2018,3,d.ii,2,Neural Networks,Text,"In some cases, we will have a validation set in addition to training and test sets. Assume the validation set is approximately the same size as the test set. This validation set is often used to tune hyperparameters such as $\lambda$. Imagine we have trained a classifier using regularization, with $\lambda$ chosen based on performance on the training set. Which will have the lowest accuracy: the training set, the validation set, or the test set?",Test set
MIT Spring 2018,3,e,2,Neural Networks,Text,"An alternative to cross-validation for estimating prediction error is to use ""bootstrap samples"". These are datasets constructed by randomly sampling points from the original training set with replacement, that is, we do not remove previously sampled points, so a data point could appear more than once in a bootstrap sample. Consider the following alternative methodologies, assuming the training dataset contains $N$ samples. 1. Generate $K$ bootstrap samples of size $N$, train on each sample and evaluate on the original training dataset. Return average of results. 2. Generate $K$ bootstrap samples of size $N$, train on the original training dataset and evaluate on each sample. Return average of results. 3. Generate $K$ bootstrap samples of size $N$, train on each sample and evaluate on points in the original training dataset but not in the sample (assume there are always some such points). Return average of results. Order these (from best to worst) by how accurate you expect the estimates of prediction error on unseen test data to be. Explain your answer.","$3, 1, 2$. The less the evaluation points overlap with the points used for training, the closer the estimate is to the error on unseen test data. Method 3 evaluates only on points held out of the bootstrap sample; method 1 evaluates on a set that partly overlaps the sample it trained on; method 2 evaluates only on points that were all used for training."
MIT Spring 2018,4,a.i,2,Classifiers,Text,"Consider a classification problem in which there are $K$ possible output classes, $1, \ldots, K$. We have studied using NLL as a loss function in such cases, but that assumes that all mistaken classifications are equally significant. Instead, we'll consider a case where some mistakes are worse than others (e.g., mis-identifying a cow as a horse is not as bad as calling it a mouse). Define the cost matrix $c_{g, a}$ to be the cost for guessing class $g$ when the actual class is $a$. For convenience, we'll write $c_{j}$ for the column of the matrix $\left[c_{1, j}, c_{2, j}, \ldots, c_{K, j}\right]^{T}$ representing the costs of all the possible guesses when $j$ is the actual value. We will use a simple neural network with a softmax activation function, so our prediction $p$ will be a $K \times 1$ vector: $$ \begin{aligned} &p=\operatorname{softmax}(z) \\ &z=W^{T} x \end{aligned} $$ Assume inputs are $d \times 1$ so $W$ is $d \times K$. Our loss function, for a prediction vector $p$ when the target output is value $y \in\{1, \ldots, K\}$ is the expected cost of the prediction: $$ L_{c}(p, y)=\sum_{k=1}^{K} p_{k} c_{k y}=p^{T} c_{y} $$ So, the overall training objective is to minimize, over a data set of $n$ points, $$ J_{c}(W)=\sum_{i=1}^{n} L_{c}\left(p^{(i)}, y^{(i)}\right) $$ (a) Select which of the following cost matrices $c$ corresponds to each situation described below. A. $\left[\begin{array}{llll}1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1\end{array}\right]$ B. $\left[\begin{array}{llll}0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0\end{array}\right]$ C. $\left[\begin{array}{llll}0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0\end{array}\right]$ D. $\left[\begin{array}{cccc}0 & .5 & .5 & .5 \\ 2 & 0 & 1 & 1 \\ 2 & 1 & 0 & 1 \\ 2 & 1 & 1 & 0\end{array}\right]$ E. 
$\left[\begin{array}{cccc}0 & 2 & 2 & 2 \\ .5 & 0 & 1 & 1 \\ .5 & 1 & 0 & 1 \\ .5 & 1 & 1 & 0\end{array}\right]$. Select which matrix (A, B, C, D, E) corresponds to a situation where any output that is not the single preferred answer is penalized equally.",B
MIT Spring 2018,4,a.ii,2,Classifiers,Text,"Consider a classification problem in which there are $K$ possible output classes, $1, \ldots, K$. We have studied using NLL as a loss function in such cases, but that assumes that all mistaken classifications are equally significant. Instead, we'll consider a case where some mistakes are worse than others (e.g., mis-identifying a cow as a horse is not as bad as calling it a mouse). Define the cost matrix $c_{g, a}$ to be the cost for guessing class $g$ when the actual class is $a$. For convenience, we'll write $c_{j}$ for the column of the matrix $\left[c_{1, j}, c_{2, j}, \ldots, c_{K, j}\right]^{T}$ representing the costs of all the possible guesses when $j$ is the actual value. We will use a simple neural network with a softmax activation function, so our prediction $p$ will be a $K \times 1$ vector: $$ \begin{aligned} &p=\operatorname{softmax}(z) \\ &z=W^{T} x \end{aligned} $$ Assume inputs are $d \times 1$ so $W$ is $d \times K$. Our loss function, for a prediction vector $p$ when the target output is value $y \in\{1, \ldots, K\}$ is the expected cost of the prediction: $$ L_{c}(p, y)=\sum_{k=1}^{K} p_{k} c_{k y}=p^{T} c_{y} $$ So, the overall training objective is to minimize, over a data set of $n$ points, $$ J_{c}(W)=\sum_{i=1}^{n} L_{c}\left(p^{(i)}, y^{(i)}\right) $$ (a) Select which of the following cost matrices $c$ corresponds to each situation described below. A. $\left[\begin{array}{llll}1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1\end{array}\right]$ B. $\left[\begin{array}{llll}0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0\end{array}\right]$ C. $\left[\begin{array}{llll}0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0\end{array}\right]$ D. $\left[\begin{array}{cccc}0 & .5 & .5 & .5 \\ 2 & 0 & 1 & 1 \\ 2 & 1 & 0 & 1 \\ 2 & 1 & 1 & 0\end{array}\right]$ E. 
$\left[\begin{array}{cccc}0 & 2 & 2 & 2 \\ .5 & 0 & 1 & 1 \\ .5 & 1 & 0 & 1 \\ .5 & 1 & 1 & 0\end{array}\right]$. Select which matrix (A, B, C, D, E) corresponds to a situation where there are two pairs of outputs that are interchangeable with no penalty.",C
MIT Spring 2018,4,a.iii,2,Classifiers,Text,"Consider a classification problem in which there are $K$ possible output classes, $1, \ldots, K$. We have studied using NLL as a loss function in such cases, but that assumes that all mistaken classifications are equally significant. Instead, we'll consider a case where some mistakes are worse than others (e.g., mis-identifying a cow as a horse is not as bad as calling it a mouse). Define the cost matrix $c_{g, a}$ to be the cost for guessing class $g$ when the actual class is $a$. For convenience, we'll write $c_{j}$ for the column of the matrix $\left[c_{1, j}, c_{2, j}, \ldots, c_{K, j}\right]^{T}$ representing the costs of all the possible guesses when $j$ is the actual value. We will use a simple neural network with a softmax activation function, so our prediction $p$ will be a $K \times 1$ vector: $$ \begin{aligned} &p=\operatorname{softmax}(z) \\ &z=W^{T} x \end{aligned} $$ Assume inputs are $d \times 1$ so $W$ is $d \times K$. Our loss function, for a prediction vector $p$ when the target output is value $y \in\{1, \ldots, K\}$ is the expected cost of the prediction: $$ L_{c}(p, y)=\sum_{k=1}^{K} p_{k} c_{k y}=p^{T} c_{y} $$ So, the overall training objective is to minimize, over a data set of $n$ points, $$ J_{c}(W)=\sum_{i=1}^{n} L_{c}\left(p^{(i)}, y^{(i)}\right) $$ (a) Select which of the following cost matrices $c$ corresponds to each situation described below. A. $\left[\begin{array}{llll}1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1\end{array}\right]$ B. $\left[\begin{array}{llll}0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0\end{array}\right]$ C. $\left[\begin{array}{llll}0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0\end{array}\right]$ D. $\left[\begin{array}{cccc}0 & .5 & .5 & .5 \\ 2 & 0 & 1 & 1 \\ 2 & 1 & 0 & 1 \\ 2 & 1 & 1 & 0\end{array}\right]$ E. 
$\left[\begin{array}{cccc}0 & 2 & 2 & 2 \\ .5 & 0 & 1 & 1 \\ .5 & 1 & 0 & 1 \\ .5 & 1 & 1 & 0\end{array}\right]$. Select which matrix (A, B, C, D, E) corresponds to a situation where it is worse to miss predicting a particular bad outcome than to predict that outcome by mistake.",D
MIT Spring 2018,4,b,4,Classifiers,Text,"Consider a classification problem in which there are $K$ possible output classes, $1, \ldots, K$. We have studied using NLL as a loss function in such cases, but that assumes that all mistaken classifications are equally significant. Instead, we'll consider a case where some mistakes are worse than others (e.g., mis-identifying a cow as a horse is not as bad as calling it a mouse). Define the cost matrix $c_{g, a}$ to be the cost for guessing class $g$ when the actual class is $a$. For convenience, we'll write $c_{j}$ for the column of the matrix $\left[c_{1, j}, c_{2, j}, \ldots, c_{K, j}\right]^{T}$ representing the costs of all the possible guesses when $j$ is the actual value. We will use a simple neural network with a softmax activation function, so our prediction $p$ will be a $K \times 1$ vector: $$ \begin{aligned} &p=\operatorname{softmax}(z) \\ &z=W^{T} x \end{aligned} $$ Assume inputs are $d \times 1$ so $W$ is $d \times K$. Our loss function, for a prediction vector $p$ when the target output is value $y \in\{1, \ldots, K\}$ is the expected cost of the prediction: $$ L_{c}(p, y)=\sum_{k=1}^{K} p_{k} c_{k y}=p^{T} c_{y} $$ So, the overall training objective is to minimize, over a data set of $n$ points, $$ J_{c}(W)=\sum_{i=1}^{n} L_{c}\left(p^{(i)}, y^{(i)}\right) $$ (a) Select which of the following cost matrices $c$ corresponds to each situation described below. A. $\left[\begin{array}{llll}1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1\end{array}\right]$ B. $\left[\begin{array}{llll}0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0\end{array}\right]$ C. $\left[\begin{array}{llll}0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0\end{array}\right]$ D. $\left[\begin{array}{cccc}0 & .5 & .5 & .5 \\ 2 & 0 & 1 & 1 \\ 2 & 1 & 0 & 1 \\ 2 & 1 & 1 & 0\end{array}\right]$ E. 
$\left[\begin{array}{cccc}0 & 2 & 2 & 2 \\ .5 & 0 & 1 & 1 \\ .5 & 1 & 0 & 1 \\ .5 & 1 & 1 & 0\end{array}\right]$. What would the change to the weights $W$ be, in one step of stochastic gradient descent on $J_{c}$, with input $x$ and target output $y$, and step size $\eta$ ? Computing $\partial p / \partial z$ is kind of hairy. It is a $K \times K$ matrix. You can write your answer in terms of it without computing it. You may also use $x, y, W$, and/or $c$ in your solution.","$$
-\eta \cdot x \cdot\left(\frac{\partial p}{\partial z} \cdot c_{y}\right)^{T}
$$
To calculate the SGD update, we first need to calculate $\frac{\partial J_{c}}{\partial W}$ for the sampled point. We use the chain rule.
$$
\frac{\partial J_{c}}{\partial W}=\frac{\partial J_{c}}{\partial L_{c}} \frac{\partial L_{c}}{\partial p} \frac{\partial p}{\partial z} \frac{\partial z}{\partial W}=1 \cdot\left(c_{y}^{T}\right)\left(\frac{\partial p^{T}}{\partial z}\right) \cdot x=x \cdot\left(\frac{\partial p}{\partial z} \cdot c_{y}\right)^{T}
$$
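As a numerical sanity check on the gradient above, here is a minimal pure-Python sketch (not part of the original solution). The helper names softmax, softmax_jacobian, logits, expected_cost, grad_W and sgd_step are illustrative assumptions, and $W$ is assumed to be stored as a list of $d$ rows with $K$ entries each.

```python
import math

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(p):
    # dp/dz is K x K with entries p_i * (delta_ij - p_j); it is symmetric.
    K = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(K)]
            for i in range(K)]

def logits(W, x):
    # z = W^T x, where W has d rows and K columns.
    d, K = len(W), len(W[0])
    return [sum(W[i][k] * x[i] for i in range(d)) for k in range(K)]

def expected_cost(W, x, c_y):
    # L_c(p, y) = p^T c_y
    p = softmax(logits(W, x))
    return sum(pk * ck for pk, ck in zip(p, c_y))

def grad_W(W, x, c_y):
    # x * (dp/dz * c_y)^T, a d x K matrix
    p = softmax(logits(W, x))
    J = softmax_jacobian(p)
    K = len(p)
    g = [sum(J[k][j] * c_y[j] for j in range(K)) for k in range(K)]
    return [[xi * gk for gk in g] for xi in x]

def sgd_step(W, x, c_y, eta):
    # W <- W - eta * x * (dp/dz * c_y)^T
    G = grad_W(W, x, c_y)
    return [[W[i][k] - eta * G[i][k] for k in range(len(W[0]))]
            for i in range(len(W))]
```

Comparing grad_W against central finite differences of expected_cost is one way to confirm that the outer-product form has the right shape and sign.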
The SGD update is then
$$
-\eta \cdot x \cdot\left(\frac{\partial p}{\partial z} \cdot c_{y}\right)^{T}
$$"
MIT Spring 2018,5,a,3,MDPs,Image,"Consider the following Markov decision process:
Assume:
- Reward is 0 in all states, except $+10$ in s6 and $+5$ in s5; the reward is received when exiting the state.
- Transitions out of s0 are deterministic, and depend on the choice of action (A or B).
Assume in this part that all transitions are deterministic, following the arrows indicated with probability 1. When horizon $=3$ and discount factor $\gamma=1$, provide values for:
i. $Q\left(s_{0}, A\right)$
ii. $Q\left(s_{0}, B\right)$","i. 0
ii. 5"
MIT Spring 2018,5,b,3,MDPs,Image,"Consider the following Markov decision process:
Assume:
- Reward is 0 in all states, except $+10$ in s6 and $+5$ in s5; the reward is received when exiting the state.
- Transitions out of s0 are deterministic, and depend on the choice of action (A or B). Still assuming that all transitions are deterministic, but letting horizon $=5$ and discount factor $\gamma=1$, provide values for:
i. $Q\left(s_{0}, A\right)$
ii. $Q\left(s_{0}, B\right)$","i. 10
ii. 5"
MIT Spring 2018,5,c,2,MDPs,Image,"Consider the following Markov decision process:
Assume:
- Reward is 0 in all states, except $+10$ in s6 and $+5$ in s5; the reward is received when exiting the state.
- Transitions out of s0 are deterministic, and depend on the choice of action (A or B). Now, assume that transitions out of s0 are deterministic, but that all other transitions follow the arrows indicated with probability $0.9$ and stay in the current state with probability $0.1$.

For policy $\pi\left(s_{0}\right)=B$, write a system of equations that can be solved in order to compute $V_{\pi}\left(s_{0}\right)$ when the horizon is infinite and $\gamma=0.8$.
Do not solve the equations!","$$
\begin{aligned}
&v_{0}=0.8 v_{4} \\
&v_{4}=0.8\left(0.1 v_{4}+0.9 v_{5}\right) \\
&v_{5}=5+0.8\left(0.1 v_{5}+0.9 v_{0}\right)
\end{aligned}
$$"
MIT Spring 2018,6,a,3,Reinforcement Learning,Image,"We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\gamma=0.8$ and learning rate $\alpha=1$.

Note: do not assume anything more about the MDP from which this experience was generated. It is not necessarily the same as the one from question 5.
(a) The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state.
$$
\begin{array}{r}
\left(s_{0}, a_{2}, 0, s_{2}\right) \\
\left(s_{2}, a_{1}, 0, s_{3}\right) \\
\left(s_{3}, a_{1}, 0, s_{1}\right) \\
\left(s_{1}, a_{1}, 10, s_{0}\right) \\
\left(s_{0}, a_{1}, 0, s_{5}\right) \\
\left(s_{5}, a_{1}, 0, s_{4}\right) \\
\left(s_{4}, a_{1}, 5, s_{0}\right)
\end{array}
$$
Fill in the resulting $Q$ values in the following table:
\begin{tabular}{l|l|l|l|l|l|l|} 
& $s_{0}$ & $s_{1}$ & $s_{2}$ & $s_{3}$ & $s_{4}$ & $s_{5}$ \\
\hline $a_{1}$ & 0 & 10 & 0 & 0 & 5 & 0 \\
\hline $a_{2}$ & 0 & 0 & 0 & 0 & 0 & 0 \\
\hline
\end{tabular}","\begin{tabular}{l|l|l|l|l|l|l|} 
& $s_{0}$ & $s_{1}$ & $s_{2}$ & $s_{3}$ & $s_{4}$ & $s_{5}$ \\
\hline $a_{1}$ & 0 & 10 & 0 & 0 & 5 & 0 \\
\hline $a_{2}$ & 0 & 0 & 0 & 0 & 0 & 0 \\
\hline
\end{tabular}"
MIT Spring 2018,6,b,2,Reinforcement Learning,Text,"We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\gamma=0.8$ and learning rate $\alpha=1$. Note: do not assume anything more about the MDP from which this experience was generated. It is not necessarily the same as the one from question 5. The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state. $$ \begin{array}{r} \left(s_{0}, a_{2}, 0, s_{2}\right) \\ \left(s_{2}, a_{1}, 0, s_{3}\right) \\ \left(s_{3}, a_{1}, 0, s_{1}\right) \\ \left(s_{1}, a_{1}, 10, s_{0}\right) \\ \left(s_{0}, a_{1}, 0, s_{5}\right) \\ \left(s_{5}, a_{1}, 0, s_{4}\right) \\ \left(s_{4}, a_{1}, 5, s_{0}\right) \end{array} $$ Iyaz suggests that, rather than getting new experience, it would be a good idea to replay this data over several times using the regular Q-learning update. What's the minimum number of times you would have to iterate through this data before $Q\left(s_{0}, a_{2}\right)>Q\left(s_{0}, a_{1}\right)$?
Note: it should be possible to answer this question by thinking about the structure of the problem, rather than by grinding through more Q-learning update calculations.","4 (including the first pass, whose values we recorded in the table)."
MIT Spring 2018,6,c.i,1,Reinforcement Learning,Text,"We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\gamma=0.8$ and learning rate $\alpha=1$. Note: do not assume anything more about the MDP from which this experience was generated. It is not necessarily the same as the one from question 5. The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state. $$ \begin{array}{r} \left(s_{0}, a_{2}, 0, s_{2}\right) \\ \left(s_{2}, a_{1}, 0, s_{3}\right) \\ \left(s_{3}, a_{1}, 0, s_{1}\right) \\ \left(s_{1}, a_{1}, 10, s_{0}\right) \\ \left(s_{0}, a_{1}, 0, s_{5}\right) \\ \left(s_{5}, a_{1}, 0, s_{4}\right) \\ \left(s_{4}, a_{1}, 5, s_{0}\right) \end{array} $$ Now, imagine that you have a very large memory full of experience that explores the whole environment effectively. In the following questions, you can assume that a neural network nn has the following methods:
- nn.train(data, epochs) takes a list of $(x, y)$ pairs and does supervised training on it for the specified number of epochs. If epochs is None it trains until convergence.
- nn.predict(x) takes an input $x$ and returns the predicted $y$ value.
- nn.init() randomly reassigns all the weights in the network.
Consider the following code.
# nn = dictionary of neural networks, one for each action
# each nn[a] maps state s into Q(s, a)
gamma = 0.8
for t in range(max_iterations):
    for a in actions:
        data[a] = [(s, <fill in>) for (s, a_prime, r, s_prime) in memory if a_prime == a]
    for a in actions:
        nn[a].train(data[a], <epochs>)
What is a correct expression for <fill in> above?",r + 0.8 * max([nn[a_prime].predict(s_prime) for a_prime in actions])
MIT Spring 2018,6,c.ii,1,Reinforcement Learning,Text,"We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\gamma=0.8$ and learning rate $\alpha=1$. The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state. $$ \begin{array}{r} \left(s_{0}, a_{2}, 0, s_{2}\right) \\ \left(s_{2}, a_{1}, 0, s_{3}\right) \\ \left(s_{3}, a_{1}, 0, s_{1}\right) \\ \left(s_{1}, a_{1}, 10, s_{0}\right) \\ \left(s_{0}, a_{1}, 0, s_{5}\right) \\ \left(s_{5}, a_{1}, 0, s_{4}\right) \\ \left(s_{4}, a_{1}, 5, s_{0}\right) \end{array} $$. We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\gamma=0.8$ and learning rate $\alpha=1$. Note: do not assume anything more about the MDP from which this experience was generated. It is not necessarily the same as the one from question $5 .$ (a) The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state. $$ \begin{array}{r} \left(s_{0}, a_{2}, 0, s_{2}\right) \\ \left(s_{2}, a_{1}, 0, s_{3}\right) \\ \left(s_{3}, a_{1}, 0, s_{1}\right) \\ \left(s_{1}, a_{1}, 10, s_{0}\right) \\ \left(s_{0}, a_{1}, 0, s_{5}\right) \\ \left(s_{5}, a_{1}, 0, s_{4}\right) \\ \left(s_{4}, a_{1}, 5, s_{0}\right) \end{array} $$ Now, imagine that you have a very large memory full of experience that explores the whole environment effectively. In the following questions, you can assume that a neural network $\mathrm{nn}$ has the following methods:
- nn.train(data, epochs) takes a list of $(x, y)$ pairs and does supervised training on it for the specified number of epochs. If epochs is None, it trains until convergence.
- nn.predict(x) takes an input $x$ and returns the predicted $y$ value.
- nn.init() randomly reassigns all the weights in the network.
Consider the following code.
# nn = dictionary of neural networks, one for each action
# each nn[a] maps state s into $Q(s, a)$
gamma = 0.8
for t in range(max_iterations):
    for a in actions:
        data[a] = [(s, <fill in>) for (s, a_prime, r, s_prime) in memory if a_prime == a]
    for a in actions:
        nn[a].train(data[a], <epochs>)
What is an appropriate value for <epochs> above?",None. We want to train until convergence.
MIT Spring 2018,6,d,2,Reinforcement Learning,Text,"We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\gamma=0.8$ and learning rate $\alpha=1$. Note: do not assume anything more about the MDP from which this experience was generated. It is not necessarily the same as the one from question 5. The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state. $$ \begin{array}{r} \left(s_{0}, a_{2}, 0, s_{2}\right) \\ \left(s_{2}, a_{1}, 0, s_{3}\right) \\ \left(s_{3}, a_{1}, 0, s_{1}\right) \\ \left(s_{1}, a_{1}, 10, s_{0}\right) \\ \left(s_{0}, a_{1}, 0, s_{5}\right) \\ \left(s_{5}, a_{1}, 0, s_{4}\right) \\ \left(s_{4}, a_{1}, 5, s_{0}\right) \end{array} $$ Now, imagine that you have a very large memory full of experience that explores the whole environment effectively. In the following questions, you can assume that a neural network $\mathrm{nn}$ has the following methods:
- nn.train(data, epochs) takes a list of $(x, y)$ pairs and does supervised training on it for the specified number of epochs. If epochs is None, it trains until convergence.
- nn.predict(x) takes an input $x$ and returns the predicted $y$ value.
- nn.init() randomly reassigns all the weights in the network.
Consider the following code.
# nn = dictionary of neural networks, one for each action
# each nn[a] maps state s into $Q(s, a)$
gamma = 0.8
for t in range(max_iterations):
    for a in actions:
        data[a] = [(s, <fill in>) for (s, a_prime, r, s_prime) in memory if a_prime == a]
    for a in actions:
        nn[a].train(data[a], <epochs>)
If we change the loop to have the form
for t in range(max_iterations):
    for (s, a, r, s_prime) in memory:
        data = [(s, <fill in>)]  # a single data point
        nn[a].train(data, <epochs>)
Provide a value for <epochs> above that will cause this algorithm to converge to a correct solution, or explain why no such value exists.",1. With 1 epoch we will look at every piece of experience in memory once per iteration.
MIT Spring 2018,6,e,2,Reinforcement Learning,Text,"We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\gamma=0.8$ and learning rate $\alpha=1$. Note: do not assume anything more about the MDP from which this experience was generated. It is not necessarily the same as the one from question 5. The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state. $$ \begin{array}{r} \left(s_{0}, a_{2}, 0, s_{2}\right) \\ \left(s_{2}, a_{1}, 0, s_{3}\right) \\ \left(s_{3}, a_{1}, 0, s_{1}\right) \\ \left(s_{1}, a_{1}, 10, s_{0}\right) \\ \left(s_{0}, a_{1}, 0, s_{5}\right) \\ \left(s_{5}, a_{1}, 0, s_{4}\right) \\ \left(s_{4}, a_{1}, 5, s_{0}\right) \end{array} $$ Now, imagine that you have a very large memory full of experience that explores the whole environment effectively. In the following questions, you can assume that a neural network $\mathrm{nn}$ has the following methods:
- nn.train(data, epochs) takes a list of $(x, y)$ pairs and does supervised training on it for the specified number of epochs. If epochs is None, it trains until convergence.
- nn.predict(x) takes an input $x$ and returns the predicted $y$ value.
- nn.init() randomly reassigns all the weights in the network.
Consider the following code.
# nn = dictionary of neural networks, one for each action
# each nn[a] maps state s into $Q(s, a)$
gamma = 0.8
for t in range(max_iterations):
    for a in actions:
        data[a] = [(s, <fill in>) for (s, a_prime, r, s_prime) in memory if a_prime == a]
    for a in actions:
        nn[a].train(data[a], <epochs>)
Would it be okay to call nn[a].init() on the line before calling train in the code loop?",Yes
MIT Spring 2018,6,f,2,Reinforcement Learning,Text,"We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\gamma=0.8$ and learning rate $\alpha=1$. Note: do not assume anything more about the MDP from which this experience was generated. It is not necessarily the same as the one from question 5. The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state. $$ \begin{array}{r} \left(s_{0}, a_{2}, 0, s_{2}\right) \\ \left(s_{2}, a_{1}, 0, s_{3}\right) \\ \left(s_{3}, a_{1}, 0, s_{1}\right) \\ \left(s_{1}, a_{1}, 10, s_{0}\right) \\ \left(s_{0}, a_{1}, 0, s_{5}\right) \\ \left(s_{5}, a_{1}, 0, s_{4}\right) \\ \left(s_{4}, a_{1}, 5, s_{0}\right) \end{array} $$ Now, imagine that you have a very large memory full of experience that explores the whole environment effectively. In the following questions, you can assume that a neural network $\mathrm{nn}$ has the following methods:
- nn.train(data, epochs) takes a list of $(x, y)$ pairs and does supervised training on it for the specified number of epochs. If epochs is None, it trains until convergence.
- nn.predict(x) takes an input $x$ and returns the predicted $y$ value.
- nn.init() randomly reassigns all the weights in the network.
Consider the following code.
# nn = dictionary of neural networks, one for each action
# each nn[a] maps state s into $Q(s, a)$
gamma = 0.8
for t in range(max_iterations):
    for a in actions:
        data[a] = [(s, <fill in>) for (s, a_prime, r, s_prime) in memory if a_prime == a]
    for a in actions:
        nn[a].train(data[a], <epochs>)
Would it be okay to call nn.init() on the line before calling train in the code loop?",No
MIT Spring 2018,6,g,2,Reinforcement Learning,Text,"We often use $\epsilon$-greedy exploration in Q-learning, in which we execute the action with the highest Q value in the current state with probability $1-\epsilon$ and execute a random action with probability $\epsilon$. What problem might occur if we set $\epsilon$ to be too small?",We might get stuck for a long time making a sub-optimal action choice due to lack of exploration.
MIT Spring 2018,7,a,4,Neural Networks,Text,"Consider a recommender system that has $n$ users and $m$ movies, with sparse ratings available for each user and movie. Our typical strategy is to look for matrices $U$ and $V$ such that $U V^{T}$ is a rank $k$ approximation to the true matrix at the locations where it has been observed. M.A. Trix suggests a new decomposition of the solution matrix $X$ into $U W V^{T}$ where $W$ is a $k \times k$ matrix, and $U$ and $V$ are as in the original approach. Is M.A. Trix's approach able to represent: A richer class of models than the original? A smaller class? The same class? Choose one and provide a short concrete justification of your answer.",The same class. You could just multiply $W$ directly into $U$ or $V^{T}$ and end up with the original model.
MIT Spring 2018,7,b.i,1,Neural Networks,Text,"Consider a recommender system that has $n$ users and $m$ movies, with sparse ratings available for each user and movie. Our typical strategy is to look for matrices $U$ and $V$ such that $U V^{T}$ is a rank $k$ approximation to the true matrix at the locations where it has been observed. Otto N. Coder thinks there's a whole different and interesting way to approach this problem. Consider a neural network with two layers of weights and tanh activation functions: $$ \begin{aligned} a &=\tanh \left(W^{a^{T}} x\right) \\ x^{\prime} &=\tanh \left(W^{b^{T}} a\right) \end{aligned} $$ where $x$ is an $m \times 1$ vector representing a single user's movie-watching experience. We will assume just binary ratings ( $+1$ means the user liked the movie and $-1$ that they did not; a value of 0 indicates that the user has not yet rated the movie). The vector $a$ is $k \times 1$ where $k$ is significantly less than $m$. We can make a data-set with $n$ such vectors, one for each user, and then train this network as an ""auto-encoder"", which takes in an $x$ vector and attempts to recreate it as its output, but which is forced to go through a much smaller representation. To train this network, we would use ordinary supervised training, but with pairs $(x, x)$ with one $x$ vector for each user in the training data, used as both the input and the desired output of the network.

The loss function $L\left(x^{\prime}, x\right)$, where $x$ is the true output vector and $x^{\prime}$ is the prediction, would be
$$
L\left(x^{\prime}, x\right)=\sum_{j=1}^{m} \begin{cases}0 & \text { if } x_{j}=0 \\ \left(x_{j}-x_{j}^{\prime}\right)^{2} & \text { otherwise }\end{cases}
$$
What is $L\left(x^{\prime}, x\right)$ if this user has never watched any movies?",0
MIT Spring 2018,7,b.ii,1,Neural Networks,Text,"Consider a recommender system that has $n$ users and $m$ movies, with sparse ratings available for each user and movie. Our typical strategy is to look for matrices $U$ and $V$ such that $U V^{T}$ is a rank $k$ approximation to the true matrix at the locations where it has been observed. Otto N. Coder thinks there's a whole different and interesting way to approach this problem. Consider a neural network with two layers of weights and tanh activation functions: $$ \begin{aligned} a &=\tanh \left(W^{a^{T}} x\right) \\ x^{\prime} &=\tanh \left(W^{b^{T}} a\right) \end{aligned} $$ where $x$ is an $m \times 1$ vector representing a single user's movie-watching experience. We will assume just binary ratings ( $+1$ means the user liked the movie and $-1$ that they did not; a value of 0 indicates that the user has not yet rated the movie). The vector $a$ is $k \times 1$ where $k$ is significantly less than $m$. We can make a data-set with $n$ such vectors, one for each user, and then train this network as an ""auto-encoder"", which takes in an $x$ vector and attempts to recreate it as its output, but which is forced to go through a much smaller representation. To train this network, we would use ordinary supervised training, but with pairs $(x, x)$ with one $x$ vector for each user in the training data, used as both the input and the desired output of the network.

The loss function $L\left(x^{\prime}, x\right)$, where $x$ is the true output vector and $x^{\prime}$ is the prediction, would be
$$
L\left(x^{\prime}, x\right)=\sum_{j=1}^{m} \begin{cases}0 & \text { if } x_{j}=0 \\ \left(x_{j}-x_{j}^{\prime}\right)^{2} & \text { otherwise }\end{cases}
$$
How much, if any, more loss is incurred, with respect to a particular movie, for predicting $+1$ when the answer should be $-1$ than is incurred for predicting $-1$ when the answer should be $+1$?",0
MIT Spring 2018,7,b.iii,1,Neural Networks,Text,"Consider a recommender system that has $n$ users and $m$ movies, with sparse ratings available for each user and movie. Our typical strategy is to look for matrices $U$ and $V$ such that $U V^{T}$ is a rank $k$ approximation to the true matrix at the locations where it has been observed. Otto N. Coder thinks there's a whole different and interesting way to approach this problem. Consider a neural network with two layers of weights and tanh activation functions: $$ \begin{aligned} a &=\tanh \left(W^{a^{T}} x\right) \\ x^{\prime} &=\tanh \left(W^{b^{T}} a\right) \end{aligned} $$ where $x$ is an $m \times 1$ vector representing a single user's movie-watching experience. We will assume just binary ratings ( $+1$ means the user liked the movie and $-1$ that they did not; a value of 0 indicates that the user has not yet rated the movie). The vector $a$ is $k \times 1$ where $k$ is significantly less than $m$. We can make a data-set with $n$ such vectors, one for each user, and then train this network as an ""auto-encoder"", which takes in an $x$ vector and attempts to recreate it as its output, but which is forced to go through a much smaller representation. To train this network, we would use ordinary supervised training, but with pairs $(x, x)$ with one $x$ vector for each user in the training data, used as both the input and the desired output of the network.

The loss function $L\left(x^{\prime}, x\right)$, where $x$ is the true output vector and $x^{\prime}$ is the prediction, would be
$$
L\left(x^{\prime}, x\right)=\sum_{j=1}^{m} \begin{cases}0 & \text { if } x_{j}=0 \\ \left(x_{j}-x_{j}^{\prime}\right)^{2} & \text { otherwise }\end{cases}
$$
In terms of making good predictions, would it be disastrous, just fine, or only mildly bad if we were to leave out the tanh activation function on the output layer? Explain.","Only mildly bad. We would get predictions that go outside the bounds of $+1$ and $-1$, but they would probably be usable for picking the max. Note that choosing the max is the ""right"" thing to do here since we want to make recommendations and the thing to recommend should have the maximum prediction value."
MIT Spring 2018,7,c,3,Neural Networks,Text,"Consider a recommender system that has $n$ users and $m$ movies, with sparse ratings available for each user and movie. Our typical strategy is to look for matrices $U$ and $V$ such that $U V^{T}$ is a rank $k$ approximation to the true matrix at the locations where it has been observed. Otto N. Coder thinks there's a whole different and interesting way to approach this problem. Consider a neural network with two layers of weights and tanh activation functions: $$ \begin{aligned} a &=\tanh \left(W^{a^{T}} x\right) \\ x^{\prime} &=\tanh \left(W^{b^{T}} a\right) \end{aligned} $$ where $x$ is an $m \times 1$ vector representing a single user's movie-watching experience. We will assume just binary ratings ( $+1$ means the user liked the movie and $-1$ that they did not; a value of 0 indicates that the user has not yet rated the movie). The vector $a$ is $k \times 1$ where $k$ is significantly less than $m$. We can make a data-set with $n$ such vectors, one for each user, and then train this network as an ""auto-encoder"", which takes in an $x$ vector and attempts to recreate it as its output, but which is forced to go through a much smaller representation. To train this network, we would use ordinary supervised training, but with pairs $(x, x)$ with one $x$ vector for each user in the training data, used as both the input and the desired output of the network.

The loss function $L\left(x^{\prime}, x\right)$, where $x$ is the true output vector and $x^{\prime}$ is the prediction, would be
$$
L\left(x^{\prime}, x\right)=\sum_{j=1}^{m} \begin{cases}0 & \text { if } x_{j}=0 \\ \left(x_{j}-x_{j}^{\prime}\right)^{2} & \text { otherwise }\end{cases}
$$
After training this network, we could feed in a particular user's $x$ vector and receive an output $x^{\prime}$. How could we use the $x^{\prime}$ value to select the best movie to recommend to that user?
Provide your answer in completely detailed math, code, or English that could be unambiguously converted into math or code.","$$
j^{*}=\operatorname{argmax}_{\left\{j \mid x_{j}=0\right\}} x_{j}^{\prime}
$$"
MIT Spring 2018,8,a,3,RNNs,Image,"One of the RNN architectures we studied was
$$
\begin{aligned}
&s_{t}=f_{1}\left(W^{s s} s_{t-1}+W^{s x} x_{t}\right) \\
&y_{t}=f_{2}\left(W^{o} s_{t}\right)
\end{aligned}
$$
where $W^{s s}$ is $m \times m, W^{s x}$ is $m \times l$ and $W^{o}$ is $n \times m$. Assume $f_{i}$ can be any of our standard activation functions. We omit the offset parameters for simplicity (set them to zero). Suppose we modify the original architecture as follows:
$$
s_{t}=f_{1}\left(W^{s s 1} f_{3}\left(W^{s s 2} s_{t-1}\right)+W^{s x} x_{t}\right)
$$
i. Provide values for the original $W^{s s}$ that make the original architecture equivalent to this one, or explain why none exist.
$$
W^{s s}=
$$

ii. Provide values for $W^{s s 2}, f_{3}$ and $W^{s s 1}$ that make this new architecture equivalent to the original, or explain why none exist.","i. None exist. The modified architecture can represent state machines that the original architecture cannot, because the class of state transition functions it can model is larger.

ii. $f_{3}$ linear (the identity), $W^{s s 2}=W^{s s}$ and $W^{s s 1}=I$, so that $W^{s s 1} f_{3}\left(W^{s s 2} s_{t-1}\right)=W^{s s} s_{t-1}$."
MIT Spring 2018,8,b,2,RNNs,Image,"One of the RNN architectures we studied was
$$
\begin{aligned}
&s_{t}=f_{1}\left(W^{s s} s_{t-1}+W^{s x} x_{t}\right) \\
&y_{t}=f_{2}\left(W^{o} s_{t}\right)
\end{aligned}
$$
where $W^{s s}$ is $m \times m, W^{s x}$ is $m \times l$ and $W^{o}$ is $n \times m$. Assume $f_{i}$ can be any of our standard activation functions. We omit the offset parameters for simplicity (set them to zero). Now, we'll consider two strategies for making the RNN generate two output symbols for each input symbol. Assume the symbols are drawn from a vocabulary of size $n$.
Model A: We use a separate softmax output for each symbol, so
$$
\begin{aligned}
&y_{t}^{1}=\operatorname{softmax}\left(W^{o 1} s_{t}\right) \\
&y_{t}^{2}=\operatorname{softmax}\left(W^{o 2} s_{t}\right)
\end{aligned}
$$
where $W^{o 1}$ and $W^{o 2}$ are $n \times m$.
Model B: We use a single softmax output, but it ranges over $n^{2}$ possible pairs of symbols, so
$$
y_{t}^{1}, y_{t}^{2}=\operatorname{softmax}\left(W^{o 3} s_{t}\right)
$$
i. What would the dimensions of $W^{o 3}$ need to be?
ii. Which of the following is true:
Models A and B can express exactly the same set of RNN models.
Model A is more expressive than model B.
Model B is more expressive than model A.","i. $n^{2} \times m$
ii. Model B is more expressive than model A."
MIT Spring 2018,8,c,2,RNNs,Image,Image,
MIT Spring 2018,9,a,3,CNNs,Image,"We will explore how convolutional neural networks operate by designing one. Our objective is to be able to locate the pattern
in an image. Throughout this problem, treat dark squares as having value $+1$ and light squares as having value $-1$. Consider the image that would result from convolving the image below with a filter that is the same as the pattern above. (Use our definition of convolution, in which we slide the filter over the image and compute the dot product.) Assume that the edges are padded with $-1$ and that we use a stride of 1.

Indicate which pixel in the resulting image will have the maximum value by writing the resulting pixel value in the appropriate cell of the image on the right below.",Image filling
MIT Spring 2018,9,b,3,CNNs,Image,"We will explore how convolutional neural networks operate by designing one. Our objective is to be able to locate the pattern
in an image. Throughout this problem, treat dark squares as having value $+1$ and light squares as having value $-1$. In order to detect this pattern, we would create a network that has
- a convolutional layer with a single filter, corresponding to the desired pattern,
- a max-pooling layer with input size equal to the image size, and finally
- a single ReLU unit.
Provide a value for the offset $W_{o}$ on the input to the ReLU that, for any image, would guarantee the output of the ReLU is positive if and only if there is a perfect instance of this pattern in the image.","$-8$
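As a quick check of the arithmetic behind this offset, here is a minimal sketch. It assumes a 3 x 3 pattern of $+1$ / $-1$ values; the exam's actual pattern is an image that is not reproduced here, so the pattern below and the names match_score and relu are hypothetical and for illustration only.

```python
def relu(z):
    # Standard rectified linear unit on a scalar.
    return max(z, 0)

def match_score(filt, patch):
    # Dot product of the filter with an image patch of the same size.
    return sum(f * p for frow, prow in zip(filt, patch) for f, p in zip(frow, prow))

# Hypothetical 3x3 pattern (assumption): dark = +1, light = -1.
pattern = [[1, -1, 1],
           [-1, 1, -1],
           [1, -1, 1]]

# A perfect instance of the pattern, and a copy with exactly one pixel flipped.
perfect_patch = [row[:] for row in pattern]
one_wrong_patch = [row[:] for row in pattern]
one_wrong_patch[0][0] *= -1

perfect_score = match_score(pattern, perfect_patch)
one_wrong_score = match_score(pattern, one_wrong_patch)

offset = -8
fires_on_perfect = relu(perfect_score + offset) > 0
fires_on_one_wrong = relu(one_wrong_score + offset) > 0
```

Because max-pooling over the whole image returns the best match score, only the two highest possible scores matter when choosing the offset.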
A perfect score is 9. The next best match would be 8 correct and 1 wrong, which would total 7. Any offset $W_{o}$ with $-9<W_{o} \leq-7$ would be correct here."
MIT Spring 2018,9,c,2,CNNs,Image,"Kanye Volution thinks that instead of having this single convolution layer with a single filter matching the whole desired pattern, it would be better to start with a convolutional layer with four smaller filters, shown below:

The following images are the result of convolving the original image with these 4 simple filters and running through a ReLU. Black squares have value $+1$, grey squares have value $+0.5$, and the rest have value 0 .

It is slightly unusual to have $2 \times 2$ filters (usually they have odd dimension). When we apply them, we place the upper-left pixel of the filter on top of the image pixel whose value we are computing.

The next layer of Kanye's network now takes an input of depth 4 and applies a single $2 \times 2 \times 4$ filter. Specify a filter on the output of the simple filters that will generate an image with a high value at the pixel located at the upper left corner of the pattern and lower values elsewhere. Fill in weight values (either $+1$ or $-1$) in the squares below.",Image filling