Semester,Question Number,Part,Points,Topic,Type,Question,Solution
MIT Fall 2017,1,a,2,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the ordinary least-squares formula, we can compute $$ W_{\text{ols}}=\left(X X^{T}\right)^{-1} X Y^{T} $$ Using ridge regression, we can compute $$ W_{\text{ridge}}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T} $$ We decide to try to use these methods to initialize a single-unit ""neural network"" with a linear activation function. Assume that $X X^{T}$ is neither singular nor equal to the identity matrix, and that neither $W_{\text{ols}}$ nor $W_{\text{ridge}}$ is equal to $(0,0, \ldots, 0)$. If we initialized our neuron with $W_{\text{ols}}$ and did batch gradient descent (summing the error over all the data points) with squared loss and a fixed small step size, which of the following would most typically happen:
A. The weights would change and then converge back to the original value
B. Weights would not change
C. Weights would make small oscillations around the initial weights
D. The weights would converge to a different value
E. Something else would happen",B. The weights would not change
MIT Fall 2017,1,b,1,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the ordinary least-squares formula, we can compute $$ W_{\text{ols}}=\left(X X^{T}\right)^{-1} X Y^{T} $$ Using ridge regression, we can compute $$ W_{\text{ridge}}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T} $$ We decide to try to use these methods to initialize a single-unit ""neural network"" with a linear activation function. Assume that $X X^{T}$ is neither singular nor equal to the identity matrix, and that neither $W_{\text{ols}}$ nor $W_{\text{ridge}}$ is equal to $(0,0, \ldots, 0)$. If we initialized our neuron with $W_{\text{ols}}$ and did batch gradient descent (summing the error over all the data points) with squared loss and a fixed small step size, explain why the weights would not change.",These weights are an optimum of the objective and the gradient will be (nearly) zero.
MIT Fall 2017,1,c,1,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the ordinary least-squares formula, we can compute $$ W_{\text{ols}}=\left(X X^{T}\right)^{-1} X Y^{T} $$ Using ridge regression, we can compute $$ W_{\text{ridge}}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T} $$ We decide to try to use these methods to initialize a single-unit ""neural network"" with a linear activation function. Assume that $X X^{T}$ is neither singular nor equal to the identity matrix, and that neither $W_{\text{ols}}$ nor $W_{\text{ridge}}$ is equal to $(0,0, \ldots, 0)$. If we initialized our neuron with $W_{\text{ols}}$ and did stochastic gradient descent (one data point at a time) with squared loss and a fixed small step size, which of the following would most typically happen?
A. The weights would change and then converge back to the original value
B. Weights would not change
C. Weights would make small oscillations around the initial weights
D. The weights would converge to a different value
E. Something else would happen",C. The weights would make small oscillations around the initial weights
MIT Fall 2017,1,d,1,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the ordinary least-squares formula, we can compute $$ W_{\text{ols}}=\left(X X^{T}\right)^{-1} X Y^{T} $$ Using ridge regression, we can compute $$ W_{\text{ridge}}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T} $$ We decide to try to use these methods to initialize a single-unit ""neural network"" with a linear activation function. Assume that $X X^{T}$ is neither singular nor equal to the identity matrix, and that neither $W_{\text{ols}}$ nor $W_{\text{ridge}}$ is equal to $(0,0, \ldots, 0)$. If we initialized our neuron with $W_{\text{ols}}$ and did stochastic gradient descent (one data point at a time) with squared loss and a fixed small step size, explain why the weights would make small oscillations around the initial weights.",In expectation the steps should be small motions around the optimum. If someone says it will (or might) do something else (like hop out of this and get stuck somewhere else) that's most of the credit if it is well reasoned and explained.
MIT Fall 2017,1,e,1,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the ordinary least-squares formula, we can compute $$ W_{\text{ols}}=\left(X X^{T}\right)^{-1} X Y^{T} $$ Using ridge regression, we can compute $$ W_{\text{ridge}}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T} $$ We decide to try to use these methods to initialize a single-unit ""neural network"" with a linear activation function. Assume that $X X^{T}$ is neither singular nor equal to the identity matrix, and that neither $W_{\text{ols}}$ nor $W_{\text{ridge}}$ is equal to $(0,0, \ldots, 0)$. Consider a neuron initialized with $W_{\text{ridge}}$. Provide an objective function $J(W)$ that depends on the data, such that batch gradient descent to minimize $J$ will have no effect on the weights, or argue that one does not exist.",$J(W)=\left\|W^{T} X-Y\right\|^{2}+\lambda\|W\|^{2}$
MIT Fall 2017,1,f,1,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the ordinary least-squares formula, we can compute $$ W_{\text{ols}}=\left(X X^{T}\right)^{-1} X Y^{T} $$ Using ridge regression, we can compute $$ W_{\text{ridge}}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T} $$ We decide to try to use these methods to initialize a single-unit ""neural network"" with a linear activation function. Assume that $X X^{T}$ is neither singular nor equal to the identity matrix, and that neither $W_{\text{ols}}$ nor $W_{\text{ridge}}$ is equal to $(0,0, \ldots, 0)$. Reggie has solved many problems like this before and the solution has typically been close to $W_{0}=(1, \ldots, 1)^{T}$. Define an objective function that would result in good estimates for Reggie's next problem, even with very little data.",$J(W)=\left\|W^{T} X-Y\right\|^{2}+\lambda\|W-\mathbf{1}\|^{2}$
MIT Fall 2017,2,a,2.5,MDPs,Image,"Consider the MDP shown above. It has states $S_{0}, \ldots, S_{6}$ and actions $A, B$. Each arrow is labeled with one or more actions, and a probability value: this means that if any of those actions is chosen from the state at the start of the arrow, then it will make a transition to the state at the end of the arrow with the associated probability.

Rewards are associated with states, and independent, in this example, from the action that is taken in that state. Remember that with horizon $H=1$, the agent can collect the reward associated with the state it is in, and then terminates.
Consider the policy $\pi$ that takes action $B$ in $S_{0}$ and action $A$ in $S_{2}$. If the system starts in $S_{0}$ or $S_{2}$, then under that policy, only those two states ($S_{0}$ and $S_{2}$) are reachable.
Assuming the discount factor $\gamma=0.5$, what are the values of $V_{\pi}\left(S_{0}\right)$ and $V_{\pi}\left(S_{2}\right)$ ? It is sufficient to write out a small system of linear equations that determine the values of those two variables; you do not have to take the time to solve them numerically.","$$
\begin{aligned}
&V_{\pi}\left(S_{0}\right)=0+0.5 \cdot V_{\pi}\left(S_{2}\right) \\
&V_{\pi}\left(S_{2}\right)=1+0.5 \cdot\left(0.9 V_{\pi}\left(S_{2}\right)+0.1 V_{\pi}\left(S_{0}\right)\right)
\end{aligned}
$$"
MIT Fall 2017,2,b,2.5,MDPs,Image,"Consider the MDP shown above. It has states $S_{0}, \ldots, S_{6}$ and actions $A, B$. Each arrow is labeled with one or more actions, and a probability value: this means that if any of those actions is chosen from the state at the start of the arrow, then it will make a transition to the state at the end of the arrow with the associated probability.

Rewards are associated with states, and independent, in this example, from the action that is taken in that state. Remember that with horizon $H=1$, the agent can collect the reward associated with the state it is in, and then terminates.
What is the optimal value $V(s)=\max _{a} Q(s, a)$ for each state for horizon $H=1$ with no discounting?","i. $S_{0}$ 0
ii. $S_{1}$ 0
iii. $S_{2}$ 1
iv. $S_{3}$ 2
v. $S_{4}$ 1
vi. $S_{5}$ 10
vii. $S_{6}$ 0"
MIT Fall 2017,2,c,2.5,MDPs,Image,"Consider the MDP shown above. It has states $S_{0}, \ldots, S_{6}$ and actions $A, B$. Each arrow is labeled with one or more actions, and a probability value: this means that if any of those actions is chosen from the state at the start of the arrow, then it will make a transition to the state at the end of the arrow with the associated probability.

Rewards are associated with states, and independent, in this example, from the action that is taken in that state. Remember that with horizon $H=1$, the agent can collect the reward associated with the state it is in, and then terminates.

What is the optimal action and value $V(s)$ for each state for horizon $H=2$ with no discounting?","i. $S_{0}$ $a$: $v$: 1
ii. $S_{1}$ $a$: A $v$: 2
iii. $S_{2}$ $a$: A $v$: 1.9
iv. $S_{3}$ $a$: A or B $v$: 2
v. $S_{4}$ $a$: A or B $v$: 1
vi. $S_{5}$ $a$: A or B $v$: 10
vii. $S_{6}$ $a$: A or B $v$:"
MIT Fall 2017,2,d,2.5,MDPs,Image,"Consider the MDP shown above. It has states $S_{0}, \ldots, S_{6}$ and actions $A, B$. Each arrow is labeled with one or more actions, and a probability value: this means that if any of those actions is chosen from the state at the start of the arrow, then it will make a transition to the state at the end of the arrow with the associated probability.

Rewards are associated with states, and independent, in this example, from the action that is taken in that state. Remember that with horizon $H=1$, the agent can collect the reward associated with the state it is in, and then terminates.

Are there any policies that result in infinite-horizon $Q_{\pi}$ values that are finite for all states even when $\gamma=1$ ? If so, provide such a policy. If not, explain why not.","i. $S_{0}$ A
ii. $S_{1}-2$ A
iii. $S_{2}$ A or B
iv. $S_{3}$ A or B
v. $S_{4}$ A or B
vi. $S_{5}$ A or B
vii. $S_{6}$ A or B"
MIT Fall 2017,3,a.i,1,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
Provide the Q-learning value for Q(A, Move).",0
MIT Fall 2017,3,a.ii,1,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
B, Move, 0 
Provide the Q-learning value for Q(B, Move).",0
MIT Fall 2017,3,a.iii,1,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
B, Move, 0 
C, Move, 1 
With Q-values all starting at 0, we run the Q-learning algorithm on that state-action sequence. Provide the Q-learning value for Q(C, Move).",1
MIT Fall 2017,3,a.iv,1,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
B, Move, 0 
C, Move, 1 
A, Move, 0 
With Q-values all starting at 0, we run the Q-learning algorithm on that state-action sequence. Provide the Q-learning value for Q(A, Move).",0
MIT Fall 2017,3,a.v,1,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
B, Move, 0 
C, Move, 1 
A, Move, 0 
B, Move, 0
With Q-values all starting at 0, we run the Q-learning algorithm on that state-action sequence. Provide the Q-learning value for Q(B, Move).",0.9
MIT Fall 2017,3,b,1,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
B, Move, 0 
C, Move, 1 
A, Move, 0 
B, Move, 0
With Q-values all starting at 0, we run the Q-learning algorithm on that state-action sequence. Characterize the weakness of Q-learning demonstrated by this example, which would be worse if there were a long sequence of states $B_{1}, \ldots, B_{100}$ between A and C. Very briefly describe a strategy for overcoming this weakness.",Q-learning propagates value back only one step per update; it doesn't propagate it all the way back along the chain. Do the updates backward along the trajectory; or save your experience and replay it.
MIT Fall 2017,3,c.i,1,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
B, Move, 0 
C, Move, 1 
A, Move, 0 
B, Move, 0
A, Move, 0
B, Move, 0
With Q-values all starting at 0, we run the Q-learning algorithm on that state-action sequence. Provide the Q-learning value for Q(A, Move).","Q(A, Move) = 0.81"
MIT Fall 2017,3,c.ii,1,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
B, Move, 0 
C, Move, 1 
A, Move, 0 
B, Move, 0
A, Move, 0
B, Move, 0
With Q-values all starting at 0, we run the Q-learning algorithm on that state-action sequence. Provide the Q-learning value for Q(B, Move).","Q(B, Move) = 0"
MIT Fall 2017,3,d,1,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
B, Move, 0 
C, Move, 1 
A, Move, 0 
B, Move, 0
A, Move, 0
B, Move, 0
With Q-values all starting at 0, we run the Q-learning algorithm on that state-action sequence. What problem with our algorithm is revealed by this example? Very briefly explain a small change to the method or parameters we are using that will solve this problem.","With $\alpha=1$, each update completely overwrites the previous estimate, so a single transition can erase a value that was already learned. Use a smaller learning rate."
MIT Fall 2017,4,a,2,Neural Networks,Image,"Consider just the general shape of the following plots. For each of the following possible interpretations of the quantities being plotted on the $\mathrm{X}$ and $Y$ axes, indicate which of the plots would most typically be the result, or mark ""none"" if none are appropriate.

Assume all quantities other than $\mathrm{X}$ are held constant during the experiment. Error quantities reported are averages over the data set they are being reported on.
Provide a one-sentence justification for each answer.

X axis: Number of training examples; Y axis: test set error. Options: A, B, C, D, none.","A.
With more training data, we are better able to find a good hypothesis."
MIT Fall 2017,4,b,2,Neural Networks,Image,"Consider just the general shape of the following plots. For each of the following possible interpretations of the quantities being plotted on the $\mathrm{X}$ and $Y$ axes, indicate which of the plots would most typically be the result, or mark ""none"" if none are appropriate.

Assume all quantities other than $\mathrm{X}$ are held constant during the experiment. Error quantities reported are averages over the data set they are being reported on.
Provide a one-sentence justification for each answer.

X axis: Number of training examples; Y axis: training error. Options: A, B, C, D, none.","B.
It's easy to fit a small amount of data exactly; harder as we get more data."
MIT Fall 2017,4,c,2,Neural Networks,Image,"Consider just the general shape of the following plots. For each of the following possible interpretations of the quantities being plotted on the $\mathrm{X}$ and $Y$ axes, indicate which of the plots would most typically be the result, or mark ""none"" if none are appropriate.

Assume all quantities other than $\mathrm{X}$ are held constant during the experiment. Error quantities reported are averages over the data set they are being reported on.
Provide a one-sentence justification for each answer.

X axis: Order of polynomial feature set; Y axis: test set error. Options: A, B, C, D, none.","C.
The model underfits if the order is too low and overfits if it is too high."
MIT Fall 2017,4,d,2,Neural Networks,Image,"Consider just the general shape of the following plots. For each of the following possible interpretations of the quantities being plotted on the $\mathrm{X}$ and $Y$ axes, indicate which of the plots would most typically be the result, or mark ""none"" if none are appropriate.

Assume all quantities other than $\mathrm{X}$ are held constant during the experiment. Error quantities reported are averages over the data set they are being reported on.
Provide a one-sentence justification for each answer.

X axis: Order of polynomial feature set; Y axis: training set error. Options: A, B, C, D, none.","A.
More features make it easier to fit complex data."
MIT Fall 2017,4,e,2,Neural Networks,Image,"Consider just the general shape of the following plots. For each of the following possible interpretations of the quantities being plotted on the $\mathrm{X}$ and $Y$ axes, indicate which of the plots would most typically be the result, or mark ""none"" if none are appropriate.

Assume all quantities other than $\mathrm{X}$ are held constant during the experiment. Error quantities reported are averages over the data set they are being reported on.
Provide a one-sentence justification for each answer.

X axis: Order of polynomial feature set; Y axis: cross validation error. Options: A, B, C, D, none.","C.
Same as test set error."
MIT Fall 2017,5,a,2,CNNs,Text,"We'll consider here a simple one-dimensional convolutional neural network layer. This is a feature map created by a single filter whose parameters we must learn. The filter represents a local pattern detector that is applied in every position of the input signal. The feature map therefore transforms an input vector (one dimensional signal) into another vector (one-dimensional feature map). To train such a layer on its own, i.e., not as part of a bigger network as we typically do, we can imagine having training pairs $(x, y)$ where $x$ is the input signal as a vector and $y$ is a binary vector representing whether the relevant pattern appeared in a particular position or not. Specifically,
- Input $x$ is a one-dimensional vector of length $d$.
- Target $y$ is also a one-dimensional vector of length $d$. One ""pixel"" in the output, $y_{j}$, has value 1 if the input pixels $x_{j-1}, x_{j}, x_{j+1}$, centered at $j$, exhibit the target pattern and 0 if they do not.
- The filter is represented by a weight vector $w$ consisting of three values.
- The output of the network is a vector $\hat{y}$ whose $j^{\text{th}}$ coordinate (pixel) is $\hat{y}_{j}=\sigma\left(z_{j}\right)$ where $z_{j}=\left[x_{j-1} ; x_{j} ; x_{j+1}\right]^{T} w$ and $\sigma(\cdot)$ is the sigmoid function. Assume that $x_{0}$ and $x_{d+1}$ are 0 for the purposes of computing outputs.
- We have a training set $D=\left(x^{(1)}, y^{(1)}\right), \ldots,\left(x^{(n)}, y^{(n)}\right)$.
- We measure the loss between the target binary vector $y$ and the network output $\hat{y}$ pixel by pixel using cross-entropy (Negative Log-Likelihood or NLL). The aggregate loss over the whole training set is
$$
L(w, D)=\sum_{i=1}^{n} \sum_{j=1}^{d} \operatorname{NLL}\left(y_{j}^{(i)}, \hat{y}_{j}^{(i)}\right)
$$
What is the ""stride"" for this feature map?",1
MIT Fall 2017,5,b,2,CNNs,Text,"We'll consider here a simple one-dimensional convolutional neural network layer. This is a feature map created by a single filter whose parameters we must learn. The filter represents a local pattern detector that is applied in every position of the input signal. The feature map therefore transforms an input vector (one dimensional signal) into another vector (one-dimensional feature map). To train such a layer on its own, i.e., not as part of a bigger network as we typically do, we can imagine having training pairs $(x, y)$ where $x$ is the input signal as a vector and $y$ is a binary vector representing whether the relevant pattern appeared in a particular position or not. Specifically,
- Input $x$ is a one-dimensional vector of length $d$.
- Target $y$ is also a one-dimensional vector of length $d$. One ""pixel"" in the output, $y_{j}$, has value 1 if the input pixels $x_{j-1}, x_{j}, x_{j+1}$, centered at $j$, exhibit the target pattern and 0 if they do not.
- The filter is represented by a weight vector $w$ consisting of three values.
- The output of the network is a vector $\hat{y}$ whose $j^{\text{th}}$ coordinate (pixel) is $\hat{y}_{j}=\sigma\left(z_{j}\right)$ where $z_{j}=\left[x_{j-1} ; x_{j} ; x_{j+1}\right]^{T} w$ and $\sigma(\cdot)$ is the sigmoid function. Assume that $x_{0}$ and $x_{d+1}$ are 0 for the purposes of computing outputs.
- We have a training set $D=\left(x^{(1)}, y^{(1)}\right), \ldots,\left(x^{(n)}, y^{(n)}\right)$.
- We measure the loss between the target binary vector $y$ and the network output $\hat{y}$ pixel by pixel using cross-entropy (Negative Log-Likelihood or NLL). The aggregate loss over the whole training set is
$$
L(w, D)=\sum_{i=1}^{n} \sum_{j=1}^{d} \operatorname{NLL}\left(y_{j}^{(i)}, \hat{y}_{j}^{(i)}\right)
$$ Provide a formula for $\nabla_{w} \mathrm{NLL}\left(y_{j}, \hat{y}_{j}\right)$, which is the gradient of the loss for pixel $j$ of an example with respect to $w=\left[w_{1}, w_{2}, w_{3}\right]^{T}$, in terms of $x, y$, and $z$ values only.","$$
\left(\sigma\left(z_{j}\right)-y_{j}\right)\left[\begin{array}{c}
x_{j-1} \\
x_{j} \\
x_{j+1}
\end{array}\right]
$$"
MIT Fall 2017,5,c.i,2,CNNs,Text,"We'll consider here a simple one-dimensional convolutional neural network layer. This is a feature map created by a single filter whose parameters we must learn. The filter represents a local pattern detector that is applied in every position of the input signal. The feature map therefore transforms an input vector (one dimensional signal) into another vector (one-dimensional feature map). To train such a layer on its own, i.e., not as part of a bigger network as we typically do, we can imagine having training pairs $(x, y)$ where $x$ is the input signal as a vector and $y$ is a binary vector representing whether the relevant pattern appeared in a particular position or not. Specifically,
- Input $x$ is a one-dimensional vector of length $d$.
- Target $y$ is also a one-dimensional vector of length $d$. One ""pixel"" in the output, $y_{j}$, has value 1 if the input pixels $x_{j-1}, x_{j}, x_{j+1}$, centered at $j$, exhibit the target pattern and 0 if they do not.
- The filter is represented by a weight vector $w$ consisting of three values.
- The output of the network is a vector $\hat{y}$ whose $j^{\text{th}}$ coordinate (pixel) is $\hat{y}_{j}=\sigma\left(z_{j}\right)$ where $z_{j}=\left[x_{j-1} ; x_{j} ; x_{j+1}\right]^{T} w$ and $\sigma(\cdot)$ is the sigmoid function. Assume that $x_{0}$ and $x_{d+1}$ are 0 for the purposes of computing outputs.
- We have a training set $D=\left(x^{(1)}, y^{(1)}\right), \ldots,\left(x^{(n)}, y^{(n)}\right)$.
- We measure the loss between the target binary vector $y$ and the network output $\hat{y}$ pixel by pixel using cross-entropy (Negative Log-Likelihood or NLL). The aggregate loss over the whole training set is
$$
L(w, D)=\sum_{i=1}^{n} \sum_{j=1}^{d} \operatorname{NLL}\left(y_{j}^{(i)}, \hat{y}_{j}^{(i)}\right)
$$ We'd like to use a simple stochastic gradient descent (SGD) algorithm for estimating the filter parameters $w$. But wait... how is the algorithm stochastic? Given the framing of our problem there may be multiple ways to write a valid SGD update.
For each of the following cases of SGD, write an update rule for $w$, in terms of step size $\eta, \nabla_{w} \mathrm{NLL}$, target pixel values $y_{j}^{(i)}$ and actual pixel values $\hat{y}_{j}^{(i)}$.
Update based on all pixels of example $i$","Select $i$ (example) at random, $w \leftarrow w-\eta \sum_{j=1}^{d} \nabla_{w} \mathrm{NLL}\left(y_{j}^{(i)}, \hat{y}_{j}^{(i)}\right)$"
MIT Fall 2017,5,c.ii,2,CNNs,Text,"We'll consider here a simple one-dimensional convolutional neural network layer. This is a feature map created by a single filter whose parameters we must learn. The filter represents a local pattern detector that is applied in every position of the input signal. The feature map therefore transforms an input vector (one dimensional signal) into another vector (one-dimensional feature map). To train such a layer on its own, i.e., not as part of a bigger network as we typically do, we can imagine having training pairs $(x, y)$ where $x$ is the input signal as a vector and $y$ is a binary vector representing whether the relevant pattern appeared in a particular position or not. Specifically,
- Input $x$ is a one-dimensional vector of length $d$.
- Target $y$ is also a one-dimensional vector of length $d$. One ""pixel"" in the output, $y_{j}$, has value 1 if the input pixels $x_{j-1}, x_{j}, x_{j+1}$, centered at $j$, exhibit the target pattern and 0 if they do not.
- The filter is represented by a weight vector $w$ consisting of three values.
- The output of the network is a vector $\hat{y}$ whose $j^{\text{th}}$ coordinate (pixel) is $\hat{y}_{j}=\sigma\left(z_{j}\right)$ where $z_{j}=\left[x_{j-1} ; x_{j} ; x_{j+1}\right]^{T} w$ and $\sigma(\cdot)$ is the sigmoid function. Assume that $x_{0}$ and $x_{d+1}$ are 0 for the purposes of computing outputs.
- We have a training set $D=\left(x^{(1)}, y^{(1)}\right), \ldots,\left(x^{(n)}, y^{(n)}\right)$.
- We measure the loss between the target binary vector $y$ and the network output $\hat{y}$ pixel by pixel using cross-entropy (Negative Log-Likelihood or NLL). The aggregate loss over the whole training set is
$$
L(w, D)=\sum_{i=1}^{n} \sum_{j=1}^{d} \operatorname{NLL}\left(y_{j}^{(i)}, \hat{y}_{j}^{(i)}\right)
$$ We'd like to use a simple stochastic gradient descent (SGD) algorithm for estimating the filter parameters $w$. But wait... how is the algorithm stochastic? Given the framing of our problem there may be multiple ways to write a valid SGD update.
For each of the following cases of SGD, write an update rule for $w$, in terms of step size $\eta, \nabla_{w} \mathrm{NLL}$, target pixel values $y_{j}^{(i)}$ and actual pixel values $\hat{y}_{j}^{(i)}$.
Update based on pixel $j$ of all examples","Select $j$ (position) at random, $w \leftarrow w-\eta \sum_{i=1}^{n} \nabla_{w} \mathrm{NLL}\left(y_{j}^{(i)}, \hat{y}_{j}^{(i)}\right)$"
MIT Fall 2017,5,c.iii,2,CNNs,Text,"We'll consider here a simple one-dimensional convolutional neural network layer. This is a feature map created by a single filter whose parameters we must learn. The filter represents a local pattern detector that is applied in every position of the input signal. The feature map therefore transforms an input vector (one dimensional signal) into another vector (one-dimensional feature map). To train such a layer on its own, i.e., not as part of a bigger network as we typically do, we can imagine having training pairs $(x, y)$ where $x$ is the input signal as a vector and $y$ is a binary vector representing whether the relevant pattern appeared in a particular position or not. Specifically,
- Input $x$ is a one-dimensional vector of length $d$.
- Target $y$ is also a one-dimensional vector of length. $d$. One ""pixel"" in the output, $y_{j}$, has value 1 if the input pixels $x_{j-1}, x_{j}, x_{j+1}$, centered at $j$, exhibit the target pattern and 0 if they do not.
- The filter is represented by a weight vector $w$ consisting of three values.
- The output of the network is a vector $\hat{y}$ whose $j^{t h}$ coordinate (pixel) is $\hat{y}_{j}=\sigma\left(z_{j}\right)$ where $z_{j}=\left[x_{j-1} ; x_{j} ; x_{j+1}\right]^{T} w$ and $\sigma(\cdot)$ is the sigmoid function. Assume that $x_{0}$ and $x_{d+1}$ are 0 for the purposes of computing outputs.
- We have a training set $D=\left(x^{(1)}, y^{(1)}\right), \ldots,\left(x^{(n)}, y^{(n)}\right)$.
- We measure the loss between the target binary vector $y$ and the network output $\hat{y}$ pixel by pixel using cross-entropy (Negative Log-Likelihood or NLL). The aggregate loss over the whole training set is
$$
L(w, D)=\sum_{i=1}^{n} \sum_{j=1}^{d} \operatorname{NLL}\left(y_{j}^{(i)}, \hat{y}_{j}^{(i)}\right)
$$
We'd like to use a simple stochastic gradient descent (SGD) algorithm for estimating the filter parameters $w$. But wait... how is the algorithm stochastic? Given the framing of our problem, there may be multiple ways to write a valid SGD update.
For each of the following cases of SGD, write an update rule for $w$, in terms of step size $\eta, \nabla_{w} \mathrm{NLL}$, target pixel values $y_{j}^{(i)}$ and actual pixel values $\hat{y}_{j}^{(i)}$.
Update based on pixel $j$ of example $i$","Select $i$ (example) and $j$ (position) at random, $w \leftarrow w-\eta \nabla_{w} \operatorname{NLL}\left(y_{j}^{(i)}, \hat{y}_{j}^{(i)}\right)$"
MIT Fall 2017,6,a,2,CNNs,Text,"We are going to consider two different simple convolutional networks over one-dimensional (vector) inputs. Each network has a single convolutional layer with a single filter of size 3 and stride 1. Let $\left(z_{1}, \ldots, z_{d}\right)$ be the output of this convolutional layer, i.e., $\left(z_{1}, \ldots, z_{d}\right)$ represents the feature map constructed from the input vector $\left(x_{1}, \ldots, x_{d}\right)$. For simplicity, you can think of $z_{j}$ just as a linear map $z_{j}=\left[x_{j-1} ; x_{j} ; x_{j+1}\right]^{T} w$ where $w$ are the filter parameters. Our two networks differ in terms of how the feature map values are pooled to a single output value.

Network A has a single max-pooling layer with input size $d$, so that the output of the network $\hat{y}=\sigma\left(\max \left(z_{1}, \ldots, z_{d}\right)\right)$ where $\sigma(\cdot)$ is the sigmoid function.
Network B has a single min-pooling layer with input size $d$, so that the output of the network
$$
\hat{y}=\sigma\left(\min \left(z_{1}, \ldots, z_{d}\right)\right)
$$
When the filter's output value is high it represents a positive detection of some pattern of interest. For which network does a high output value correspond, qualitatively, to ""every pixel in $x$ corresponds to an instance of the desired pattern"" A, B, or none?",B
MIT Fall 2017,6,b,2,CNNs,Text,"We are going to consider two different simple convolutional networks over one-dimensional (vector) inputs. Each network has a single convolutional layer with a single filter of size 3 and stride 1. Let $\left(z_{1}, \ldots, z_{d}\right)$ be the output of this convolutional layer, i.e., $\left(z_{1}, \ldots, z_{d}\right)$ represents the feature map constructed from the input vector $\left(x_{1}, \ldots, x_{d}\right)$. For simplicity, you can think of $z_{j}$ just as a linear map $z_{j}=\left[x_{j-1} ; x_{j} ; x_{j+1}\right]^{T} w$ where $w$ are the filter parameters. Our two networks differ in terms of how the feature map values are pooled to a single output value.

Network A has a single max-pooling layer with input size $d$, so that the output of the network $\hat{y}=\sigma\left(\max \left(z_{1}, \ldots, z_{d}\right)\right)$ where $\sigma(\cdot)$ is the sigmoid function.
Network B has a single min-pooling layer with input size $d$, so that the output of the network
$$
\hat{y}=\sigma\left(\min \left(z_{1}, \ldots, z_{d}\right)\right)
$$
When the filter's output value is high it represents a positive detection of some pattern of interest. For which network does a high output value correspond, qualitatively, to ""at least half of the pixels in $x$ correspond to an instance of the desired pattern"" A, B, or none?",None
MIT Fall 2017,6,c,2,CNNs,Text,"We are going to consider two different simple convolutional networks over one-dimensional (vector) inputs. Each network has a single convolutional layer with a single filter of size 3 and stride 1. Let $\left(z_{1}, \ldots, z_{d}\right)$ be the output of this convolutional layer, i.e., $\left(z_{1}, \ldots, z_{d}\right)$ represents the feature map constructed from the input vector $\left(x_{1}, \ldots, x_{d}\right)$. For simplicity, you can think of $z_{j}$ just as a linear map $z_{j}=\left[x_{j-1} ; x_{j} ; x_{j+1}\right]^{T} w$ where $w$ are the filter parameters. Our two networks differ in terms of how the feature map values are pooled to a single output value.

Network A has a single max-pooling layer with input size $d$, so that the output of the network $\hat{y}=\sigma\left(\max \left(z_{1}, \ldots, z_{d}\right)\right)$ where $\sigma(\cdot)$ is the sigmoid function.
Network B has a single min-pooling layer with input size $d$, so that the output of the network
$$
\hat{y}=\sigma\left(\min \left(z_{1}, \ldots, z_{d}\right)\right)
$$
When the filter's output value is high it represents a positive detection of some pattern of interest. For which network does a high output value correspond, qualitatively, to ""there is at least one instance of the desired pattern in this image"" A, B, or none?",A
MIT Fall 2017,6,d,2,CNNs,Text,"We are going to consider two different simple convolutional networks over one-dimensional (vector) inputs. Each network has a single convolutional layer with a single filter of size 3 and stride 1. Let $\left(z_{1}, \ldots, z_{d}\right)$ be the output of this convolutional layer, i.e., $\left(z_{1}, \ldots, z_{d}\right)$ represents the feature map constructed from the input vector $\left(x_{1}, \ldots, x_{d}\right)$. For simplicity, you can think of $z_{j}$ just as a linear map $z_{j}=\left[x_{j-1} ; x_{j} ; x_{j+1}\right]^{T} w$ where $w$ are the filter parameters. Our two networks differ in terms of how the feature map values are pooled to a single output value.

Network A has a single max-pooling layer with input size $d$, so that the output of the network $\hat{y}=\sigma\left(\max \left(z_{1}, \ldots, z_{d}\right)\right)$ where $\sigma(\cdot)$ is the sigmoid function.
Network B has a single min-pooling layer with input size $d$, so that the output of the network
$$
\hat{y}=\sigma\left(\min \left(z_{1}, \ldots, z_{d}\right)\right)
$$
When the filter's output value is high it represents a positive detection of some pattern of interest. Assume for simplicity that all $z_{1}, \ldots, z_{d}$ have distinct values (we can ignore the corner cases where some of the values are equal). What is $\partial \hat{y} / \partial z_{i}$ for network A?","$\sigma\left(z_{i}\right)\left(1-\sigma\left(z_{i}\right)\right)$ if $z_{i}=\max \left(z_{1}, \ldots, z_{d}\right)$, and 0 otherwise."
MIT Fall 2017,6,e.i,0.5,CNNs,Text,"We are going to consider two different simple convolutional networks over one-dimensional (vector) inputs. Each network has a single convolutional layer with a single filter of size 3 and stride 1. Let $\left(z_{1}, \ldots, z_{d}\right)$ be the output of this convolutional layer, i.e., $\left(z_{1}, \ldots, z_{d}\right)$ represents the feature map constructed from the input vector $\left(x_{1}, \ldots, x_{d}\right)$. For simplicity, you can think of $z_{j}$ just as a linear map $z_{j}=\left[x_{j-1} ; x_{j} ; x_{j+1}\right]^{T} w$ where $w$ are the filter parameters. Our two networks differ in terms of how the feature map values are pooled to a single output value.

Network A has a single max-pooling layer with input size $d$, so that the output of the network $\hat{y}=\sigma\left(\max \left(z_{1}, \ldots, z_{d}\right)\right)$ where $\sigma(\cdot)$ is the sigmoid function.
Network B has a single min-pooling layer with input size $d$, so that the output of the network
$$
\hat{y}=\sigma\left(\min \left(z_{1}, \ldots, z_{d}\right)\right)
$$
When the filter's output value is high it represents a positive detection of some pattern of interest. Now, suppose we are just given a single training pair $(x, y)$ where the target $y$ is binary $0 / 1$. The loss that we are minimizing is again just
$$
\mathrm{NLL}(y, \hat{y})=-y \log \hat{y}-(1-y) \log (1-\hat{y})
$$
which is minimized when $\hat{y}$ matches the target $y$. We are interested in understanding qualitatively how the filter parameters $w$ get updated in the two networks if we use simple gradient descent to minimize $\operatorname{NLL}(y, \hat{y})$. Specify whether the behavior would occur in:
- Which network (A, B, or it doesn't matter)
- Target $y$ (1, 0, or it doesn't matter). 
The behavior is the filter weights $w$ become increasingly aligned in the direction of a particular triplet $\left[x_{j-1} ; x_{j} ; x_{j+1}\right]$.","A, 1"
MIT Fall 2017,6,e.ii,0.5,CNNs,Text,"We are going to consider two different simple convolutional networks over one-dimensional (vector) inputs. Each network has a single convolutional layer with a single filter of size 3 and stride 1. Let $\left(z_{1}, \ldots, z_{d}\right)$ be the output of this convolutional layer, i.e., $\left(z_{1}, \ldots, z_{d}\right)$ represents the feature map constructed from the input vector $\left(x_{1}, \ldots, x_{d}\right)$. For simplicity, you can think of $z_{j}$ just as a linear map $z_{j}=\left[x_{j-1} ; x_{j} ; x_{j+1}\right]^{T} w$ where $w$ are the filter parameters. Our two networks differ in terms of how the feature map values are pooled to a single output value.

Network A has a single max-pooling layer with input size $d$, so that the output of the network $\hat{y}=\sigma\left(\max \left(z_{1}, \ldots, z_{d}\right)\right)$ where $\sigma(\cdot)$ is the sigmoid function.
Network B has a single min-pooling layer with input size $d$, so that the output of the network
$$
\hat{y}=\sigma\left(\min \left(z_{1}, \ldots, z_{d}\right)\right)
$$
When the filter's output value is high it represents a positive detection of some pattern of interest. Now, suppose we are just given a single training pair $(x, y)$ where the target $y$ is binary $0 / 1$. The loss that we are minimizing is again just
$$
\mathrm{NLL}(y, \hat{y})=-y \log \hat{y}-(1-y) \log (1-\hat{y})
$$
which is minimized when $\hat{y}$ matches the target $y$. We are interested in understanding qualitatively how the filter parameters $w$ get updated in the two networks if we use simple gradient descent to minimize $\operatorname{NLL}(y, \hat{y})$. Specify whether the behavior would occur in:
- Which network (A, B, or it doesn't matter)
- Target $y$ (1, 0, or it doesn't matter). 
The behavior is the filter weights $w$ become increasingly aligned in the negative direction of a particular triplet $\left[x_{j-1} ; x_{j} ; x_{j+1}\right]$.","B, 0"
MIT Fall 2017,6,e.iii,0.5,CNNs,Text,"We are going to consider two different simple convolutional networks over one-dimensional (vector) inputs. Each network has a single convolutional layer with a single filter of size 3 and stride 1. Let $\left(z_{1}, \ldots, z_{d}\right)$ be the output of this convolutional layer, i.e., $\left(z_{1}, \ldots, z_{d}\right)$ represents the feature map constructed from the input vector $\left(x_{1}, \ldots, x_{d}\right)$. For simplicity, you can think of $z_{j}$ just as a linear map $z_{j}=\left[x_{j-1} ; x_{j} ; x_{j+1}\right]^{T} w$ where $w$ are the filter parameters. Our two networks differ in terms of how the feature map values are pooled to a single output value.

Network A has a single max-pooling layer with input size $d$, so that the output of the network $\hat{y}=\sigma\left(\max \left(z_{1}, \ldots, z_{d}\right)\right)$ where $\sigma(\cdot)$ is the sigmoid function.
Network B has a single min-pooling layer with input size $d$, so that the output of the network
$$
\hat{y}=\sigma\left(\min \left(z_{1}, \ldots, z_{d}\right)\right)
$$
When the filter's output value is high it represents a positive detection of some pattern of interest. Now, suppose we are just given a single training pair $(x, y)$ where the target $y$ is binary $0 / 1$. The loss that we are minimizing is again just
$$
\mathrm{NLL}(y, \hat{y})=-y \log \hat{y}-(1-y) \log (1-\hat{y})
$$
which is minimized when $\hat{y}$ matches the target $y$. We are interested in understanding qualitatively how the filter parameters $w$ get updated in the two networks if we use simple gradient descent to minimize $\operatorname{NLL}(y, \hat{y})$. Specify whether the behavior would occur in:
- Which network (A, B, or it doesn't matter)
- Target $y$ (1, 0, or it doesn't matter). 
The behavior is each update moves the filter weights $w$ in the direction of some triplet but the specific triplet keeps changing from one update to another.","B, 1"
MIT Fall 2017,6,e.iv,0.5,CNNs,Text,"We are going to consider two different simple convolutional networks over one-dimensional (vector) inputs. Each network has a single convolutional layer with a single filter of size 3 and stride 1. Let $\left(z_{1}, \ldots, z_{d}\right)$ be the output of this convolutional layer, i.e., $\left(z_{1}, \ldots, z_{d}\right)$ represents the feature map constructed from the input vector $\left(x_{1}, \ldots, x_{d}\right)$. For simplicity, you can think of $z_{j}$ just as a linear map $z_{j}=\left[x_{j-1} ; x_{j} ; x_{j+1}\right]^{T} w$ where $w$ are the filter parameters. Our two networks differ in terms of how the feature map values are pooled to a single output value.

Network A has a single max-pooling layer with input size $d$, so that the output of the network $\hat{y}=\sigma\left(\max \left(z_{1}, \ldots, z_{d}\right)\right)$ where $\sigma(\cdot)$ is the sigmoid function.
Network B has a single min-pooling layer with input size $d$, so that the output of the network
$$
\hat{y}=\sigma\left(\min \left(z_{1}, \ldots, z_{d}\right)\right)
$$
When the filter's output value is high it represents a positive detection of some pattern of interest. Now, suppose we are just given a single training pair $(x, y)$ where the target $y$ is binary $0 / 1$. The loss that we are minimizing is again just
$$
\mathrm{NLL}(y, \hat{y})=-y \log \hat{y}-(1-y) \log (1-\hat{y})
$$
which is minimized when $\hat{y}$ matches the target $y$. We are interested in understanding qualitatively how the filter parameters $w$ get updated in the two networks if we use simple gradient descent to minimize $\operatorname{NLL}(y, \hat{y})$. Specify whether the behavior would occur in:
- Which network (A, B, or it doesn't matter)
- Target $y$ (1, 0, or it doesn't matter). 
The behavior is each update causes the filter weights $w$ to move in the negative direction of some triplet but the specific triplet may change from one update to the next.","A, 0"
MIT Fall 2017,7,a,2,Reinforcement Learning,Text,"Chris wants to use Q-learning to solve a video-game problem in which there is a ball moving on an $\mathrm{n}$ by $\mathrm{n}$ pixel screen, similar to the one we studied in class. However, instead of moving a paddle up and down along the right wall, there is a ""photon cannon"" fixed in the middle of the right-hand side, and the player is allowed to instantaneously set the angle of the cannon and to try to shoot. If the photon beam hits the ball, the ball will reflect backwards. It takes 10 time steps for the cannon to recharge after being fired, however, before it can be fired again. Our goal here is to try to understand how to apply deep Q-learning to this problem.
The state of the system is composed of five parts:
- Ball position x $(1, \ldots, n)$
- Ball position y $(1, \ldots, n)$
- Ball velocity x $(-1,1)$
- Ball velocity y $(-1,0,1)$
- Number of time steps until the cannon is ready to shoot again $(0, \ldots, 10)$
The possible actions at each time step involve both the aim and whether to try to shoot:
- Cannon angle in degrees $(-60,-30,0,30,60)$
- Shoot cannon $(1,0)$
The options for us in terms of solving the game include how we represent the states and actions and how these are mapped to Q-values. We won't worry about the exploration problem here, only about representing $Q$-values. Our learning algorithm performs gradient descent steps on the squared Bellman error with respect to the parameters in the $Q$-values. How many states and actions are there in this game?",$n^{2} * 2 * 3 * 11$ states and $5 * 2$ actions
MIT Fall 2017,7,b,2,Reinforcement Learning,Text,"Chris wants to use Q-learning to solve a video-game problem in which there is a ball moving on an $\mathrm{n}$ by $\mathrm{n}$ pixel screen, similar to the one we studied in class. However, instead of moving a paddle up and down along the right wall, there is a ""photon cannon"" fixed in the middle of the right-hand side, and the player is allowed to instantaneously set the angle of the cannon and to try to shoot. If the photon beam hits the ball, the ball will reflect backwards. It takes 10 time steps for the cannon to recharge after being fired, however, before it can be fired again. Our goal here is to try to understand how to apply deep Q-learning to this problem.
The state of the system is composed of five parts:
- Ball position x $(1, \ldots, n)$
- Ball position y $(1, \ldots, n)$
- Ball velocity x $(-1,1)$
- Ball velocity y $(-1,0,1)$
- Number of time steps until the cannon is ready to shoot again $(0, \ldots, 10)$
The possible actions at each time step involve both the aim and whether to try to shoot:
- Cannon angle in degrees $(-60,-30,0,30,60)$
- Shoot cannon $(1,0)$
The options for us in terms of solving the game include how we represent the states and actions and how these are mapped to Q-values. We won't worry about the exploration problem here, only about representing $Q$-values. Our learning algorithm performs gradient descent steps on the squared Bellman error with respect to the parameters in the $Q$-values. Let $S$ be the set of states, and $A$ the set of actions. Suppose we construct a simple one layer neural network to represent Q-values. The network has $|S|+|A|$ input units, no hidden units, and just one linear output unit to represent the associated $Q$-value. The pair $(s, a)$ is fed into the model by concatenating a one-hot vector for $s$ and a one-hot vector for $a$. Could this model learn to match the correct Q-values for each state-action pair? Briefly describe why/why not.","No, since the model is restricted. The value is predicted as a sum of a state-dependent part and an action-dependent part, without any cross-talk between them."
MIT Fall 2017,7,c,2,Reinforcement Learning,Text,"Chris wants to use Q-learning to solve a video-game problem in which there is a ball moving on an $\mathrm{n}$ by $\mathrm{n}$ pixel screen, similar to the one we studied in class. However, instead of moving a paddle up and down along the right wall, there is a ""photon cannon"" fixed in the middle of the right-hand side, and the player is allowed to instantaneously set the angle of the cannon and to try to shoot. If the photon beam hits the ball, the ball will reflect backwards. It takes 10 time steps for the cannon to recharge after being fired, however, before it can be fired again. Our goal here is to try to understand how to apply deep Q-learning to this problem.
The state of the system is composed of five parts:
- Ball position x $(1, \ldots, n)$
- Ball position y $(1, \ldots, n)$
- Ball velocity x $(-1,1)$
- Ball velocity y $(-1,0,1)$
- Number of time steps until the cannon is ready to shoot again $(0, \ldots, 10)$
The possible actions at each time step involve both the aim and whether to try to shoot:
- Cannon angle in degrees $(-60,-30,0,30,60)$
- Shoot cannon $(1,0)$
The options for us in terms of solving the game include how we represent the states and actions and how these are mapped to Q-values. We won't worry about the exploration problem here, only about representing $Q$-values. Our learning algorithm performs gradient descent steps on the squared Bellman error with respect to the parameters in the $Q$-values. Suppose we modify the network a bit by giving it $|S|$ input units and $|A|$ output units where the output units represent the Q-values $Q(s, a), a \in A$, for the state $s$ fed in as a one-hot vector. Again, we have no hidden units. Could this model match the correct Q-values? Why/why not?","Yes, it could. We can specify arbitrary outgoing weights for each input state, and thus can set the Q-values without restriction."
MIT Fall 2017,7,d,2,Reinforcement Learning,Text,"Chris wants to use Q-learning to solve a video-game problem in which there is a ball moving on an $\mathrm{n}$ by $\mathrm{n}$ pixel screen, similar to the one we studied in class. However, instead of moving a paddle up and down along the right wall, there is a ""photon cannon"" fixed in the middle of the right-hand side, and the player is allowed to instantaneously set the angle of the cannon and to try to shoot. If the photon beam hits the ball, the ball will reflect backwards. It takes 10 time steps for the cannon to recharge after being fired, however, before it can be fired again. Our goal here is to try to understand how to apply deep Q-learning to this problem.
The state of the system is composed of five parts:
- Ball position x $(1, \ldots, n)$
- Ball position y $(1, \ldots, n)$
- Ball velocity x $(-1,1)$
- Ball velocity y $(-1,0,1)$
- Number of time steps until the cannon is ready to shoot again $(0, \ldots, 10)$
The possible actions at each time step involve both the aim and whether to try to shoot:
- Cannon angle in degrees $(-60,-30,0,30,60)$
- Shoot cannon $(1,0)$
The options for us in terms of solving the game include how we represent the states and actions and how these are mapped to Q-values. We won't worry about the exploration problem here, only about representing $Q$-values. Our learning algorithm performs gradient descent steps on the squared Bellman error with respect to the parameters in the $Q$-values. Suppose we modify the network a bit by giving it 5 input units (one for each part of the state), $m$ hidden units, and $|A|$ output units, where the output units represent the Q-values $Q(s, a), a \in A$, for the state $s$ fed in as a five-part vector. Could this model match the correct Q-values? Why/why not?","Yes, it could, provided that $m$ is large enough."
MIT Fall 2017,7,e,2,Reinforcement Learning,Text,"Chris wants to use Q-learning to solve a video-game problem in which there is a ball moving on an $\mathrm{n}$ by $\mathrm{n}$ pixel screen, similar to the one we studied in class. However, instead of moving a paddle up and down along the right wall, there is a ""photon cannon"" fixed in the middle of the right-hand side, and the player is allowed to instantaneously set the angle of the cannon and to try to shoot. If the photon beam hits the ball, the ball will reflect backwards. It takes 10 time steps for the cannon to recharge after being fired, however, before it can be fired again. Our goal here is to try to understand how to apply deep Q-learning to this problem.
The state of the system is composed of five parts:
- Ball position x $(1, \ldots, n)$
- Ball position y $(1, \ldots, n)$
- Ball velocity x $(-1,1)$
- Ball velocity y $(-1,0,1)$
- Number of time steps until the cannon is ready to shoot again $(0, \ldots, 10)$
The possible actions at each time step involve both the aim and whether to try to shoot:
- Cannon angle in degrees $(-60,-30,0,30,60)$
- Shoot cannon $(1,0)$
The options for us in terms of solving the game include how we represent the states and actions and how these are mapped to Q-values. We won't worry about the exploration problem here, only about representing $Q$-values. Our learning algorithm performs gradient descent steps on the squared Bellman error with respect to the parameters in the $Q$-values. If we increase $n$ and also include many more angle gradations for the aim, so that $|S|$ and $|A|$ are very large, which of the following architectures would we prefer for representing Q-values? Choose from:
A. $|S|$ input units (one-hot vector for $s$), $|A|$ output units,
B. 5 input units, some $m$ hidden units, $|A|$ output units,
C. $5+2$ input units for the five-part state and two-part action, $m$ hidden units, and one output unit,
D. 5 input units, some $m$ hidden units, and two output units.","C. $5+2$ input units for the five-part state and two-part action, $m$ hidden units, and one output unit"
MIT Fall 2017,8,a,4,RNNs,Text,"Consider three RNN variants:
1. The basic RNN architecture we studied was
$$
\begin{aligned}
&s_{t}=f\left(W^{s s} s_{t-1}+W^{s x} x_{t}\right) \\
&y_{t}=W^{o} s_{t}
\end{aligned}
$$
where $W^{s s}$ is $m \times m, W^{s x}$ is $m \times d$, and $f$ is an activation function to be specified later. We omit the offset parameters for simplicity (set them to zero).
2. Ranndy thinks the basic RNN is representationally weak, and it would be better not to decompose the state update in this way. Ranndy's proposal is to instead
$$
\begin{aligned}
&s_{t}=f\left(W^{s s x} \operatorname{concat}\left(s_{t-1}, x_{t}\right)\right) \\
&y_{t}=W^{o} s_{t}
\end{aligned}
$$
where $\operatorname{concat}\left(s_{t-1}, x_{t}\right)$ is a vector of length $m+d$ obtained by concatenating $s_{t-1}$ and $x_{t}$, so $W^{s s x}$ has dimensions $m \times(m+d)$.
3. Orenn wants to try yet another model, of the form:
$$
\begin{aligned}
&s_{t}=f\left(W^{s s} s_{t-1}\right)+f\left(W^{s x} x_{t}\right) \\
&y_{t}=W^{o} s_{t}
\end{aligned}
$$
Lec Surer insists on understanding these models a bit better, and how they might relate.
(a) Select the correct claim and answer the associated question.
(1) Claim: The three models are all equivalent when $f(z)=z$. In this case, define $W^{s s x}$ in terms of $W^{s s}$ and $W^{s x}$.
(2) Claim: The three models are not all equivalent when $f(z)=z$. In this case, assume $m=d=1$ and provide one setting of $W^{s s x}$ in Ranndy's model such that $W^{s s}$ and $W^{s x}$ cannot be chosen to make the basic and Orenn's models the same as Ranndy's.","Claim 1: $W^{s s x}=\operatorname{hstack}\left(W^{s s}, W^{s x}\right)$"
MIT Fall 2017,8,b,3,RNNs,Image,"Consider three RNN variants:
1. The basic RNN architecture we studied was
$$
\begin{aligned}
&s_{t}=f\left(W^{s s} s_{t-1}+W^{s x} x_{t}\right) \\
&y_{t}=W^{o} s_{t}
\end{aligned}
$$
where $W^{s s}$ is $m \times m$, $W^{s x}$ is $m \times d$, and $f$ is an activation function to be specified later. We omit the offset parameters for simplicity (set them to zero).
2. Ranndy thinks the basic RNN is representationally weak, and it would be better not to decompose the state update in this way. Ranndy's proposal is to instead
$$
\begin{aligned}
&s_{t}=f\left(W^{s s x} \operatorname{concat}\left(s_{t-1}, x_{t}\right)\right) \\
&y_{t}=W^{o} s_{t}
\end{aligned}
$$
where $\operatorname{concat}\left(s_{t-1}, x_{t}\right)$ is a vector of length $m+d$ obtained by concatenating $s_{t-1}$ and $x_{t}$, so $W^{s s x}$ has dimensions $m \times(m+d)$.
3. Orenn wants to try yet another model, of the form:
$$
\begin{aligned}
&s_{t}=f\left(W^{s s} s_{t-1}\right)+f\left(W^{s x} x_{t}\right) \\
&y_{t}=W^{o} s_{t}
\end{aligned}
$$
Lec Surer insists on understanding these models a bit better, and how they might relate.

Lec Surer thinks that something interesting happens with Orenn's model when $f(z)=$ $\tanh (z)$. Specifically, it supposedly corresponds to the architecture shown in the figure below, which includes an additional hidden layer. Specify what $W, W^{\prime}$, and $m^{\prime}$ are so that this architecture indeed corresponds to Orenn's model.
Ignore the dimensions written on the figure above; they are backwards.
i. $m^{\prime}$
ii. $W$
iii. $W^{\prime}$
","i. $2 m$
ii.  A block-diagonal matrix of the form
$$
\left[\begin{array}{cc}
W^{s s} & 0 \\
0 & W^{s x}
\end{array}\right]
$$
iii. hstack $(I(m) ; I(m))$"
MIT Fall 2017,8,c,3,RNNs,Image,"Consider three RNN variants:
1. The basic RNN architecture we studied was
$$
\begin{aligned}
&s_{t}=f\left(W^{s s} s_{t-1}+W^{s x} x_{t}\right) \\
&y_{t}=W^{o} s_{t}
\end{aligned}
$$
where $W^{s s}$ is $m \times m$, $W^{s x}$ is $m \times d$, and $f$ is an activation function to be specified later. We omit the offset parameters for simplicity (set them to zero).
2. Ranndy thinks the basic RNN is representationally weak, and it would be better not to decompose the state update in this way. Ranndy's proposal is to instead
$$
\begin{aligned}
&s_{t}=f\left(W^{s s x} \operatorname{concat}\left(s_{t-1}, x_{t}\right)\right) \\
&y_{t}=W^{o} s_{t}
\end{aligned}
$$
where $\operatorname{concat}\left(s_{t-1}, x_{t}\right)$ is a vector of length $m+d$ obtained by concatenating $s_{t-1}$ and $x_{t}$, so $W^{s s x}$ has dimensions $m \times(m+d)$.
3. Orenn wants to try yet another model, of the form:
$$
\begin{aligned}
&s_{t}=f\left(W^{s s} s_{t-1}\right)+f\left(W^{s x} x_{t}\right) \\
&y_{t}=W^{o} s_{t}
\end{aligned}
$$
Lec Surer insists on understanding these models a bit better, and how they might relate.

Assume again that $f(z)=\tanh (z)$. Suppose $s_{0}=0$ (vector) and we feed $x_{1}, \ldots, x_{n}$ as the input sequence to Orenn's model, obtaining $y_{1}, \ldots, y_{n}$ as the associated output sequence. If we change the input sequence to $-x_{1}, \ldots,-x_{n}$, which of the following is the best characterization of the resulting output sequence?

The new output sequence will alternate between positive and negative values.
The new output sequence depends on the parameters.
The new output is just the negative of the previous output sequence
",The new output is just the negative of the previous output sequence
MIT Fall 2017,9,a.i,1,Classifiers,Text,"Here is a dataset in the form of (x, y) coordinates. P1 = (1, 1), P2 = (6, 6), P3 = (4, 6), P4 = (6, 4) and the separator is a line at y = 3.5. What is the margin for P1 with respect to the separator?",2.5
MIT Fall 2017,9,a.ii,1,Classifiers,Text,"Here is a dataset in the form of (x, y) coordinates. P1 = (1, 1), P2 = (6, 6), P3 = (4, 6), P4 = (6, 4) and the separator is a line at y = 3.5. How far is P2 from the separator?",2.5
MIT Fall 2017,9,a.iii,1,Classifiers,Text,"Here is a dataset in the form of (x, y) coordinates. P1 = (1, 1), P2 = (6, 6), P3 = (4, 6), P4 = (6, 4) and the separator is a line at y = 3.5. What is the margin for P3 with respect to the separator?",2.5
MIT Fall 2017,9,a.iv,1,Classifiers,Text,"Here is a dataset in the form of (x, y) coordinates. P1 = (1, 1), P2 = (6, 6), P3 = (4, 6), P4 = (6, 4) and the separator is a line at y = 3.5. What is the margin for P4 with respect to the separator?",0.5
MIT Fall 2017,9,b,2,Classifiers,Text,"Here is a dataset in the form of (x, y) coordinates. P1 = (1, 1), P2 = (6, 6), P3 = (4, 6), P4 = (6, 4) and the separator is a line at y = 3.5. Let our objective be the regularized average hinge loss with respect to $\gamma_{\text {ref }}$ :
$$
J\left(\theta, \theta_{0}, \gamma_{\text {ref }}\right)=\frac{1}{n} \sum_{i=1}^{n} L_{H}\left(\frac{\gamma\left(x, y, \theta, \theta_{0}\right)}{\gamma_{\text {ref }}}\right)+\lambda \frac{1}{\gamma_{\text {ref }}^{2}}
$$
If $\lambda=0$, and we're still referring to the separator at y=3.5, what range of values of $\gamma_{\text {ref }}$ achieves the optimal value of $J$?",$0<\gamma_{\text {ref }} \leq 0.5$
MIT Fall 2017,9,c,2,Classifiers,Text,"Here is a dataset in the form of (x, y) coordinates. P1 = (1, 1), P2 = (6, 6), P3 = (4, 6), P4 = (6, 4) and the separator is a line at y = 3.5. Let our objective be the regularized average hinge loss with respect to $\gamma_{\text {ref }}$ :
$$
J\left(\theta, \theta_{0}, \gamma_{\text {ref }}\right)=\frac{1}{n} \sum_{i=1}^{n} L_{H}\left(\frac{\gamma\left(x, y, \theta, \theta_{0}\right)}{\gamma_{\text {ref }}}\right)+\lambda \frac{1}{\gamma_{\text {ref }}^{2}}
$$
Now let $\lambda=\epsilon$, a very small value, and consider that same separator. What value of $\gamma_{\text {ref }}$ achieves the closest to optimal value of $J$? Choose from the following list (-10, -1, -.5, 0, .5, 1, 10)",0.5
MIT Fall 2017,9,d,2,Classifiers,Text,"Here is a dataset in the form of (x, y) coordinates. P1 = (1, 1), P2 = (6, 6), P3 = (4, 6), P4 = (6, 4) and the separator is a line at y = 3.5. Let our objective be the regularized average hinge loss with respect to $\gamma_{\text {ref }}$ :
$$
J\left(\theta, \theta_{0}, \gamma_{\text {ref }}\right)=\frac{1}{n} \sum_{i=1}^{n} L_{H}\left(\frac{\gamma\left(x^{(i)}, y^{(i)}, \theta, \theta_{0}\right)}{\gamma_{\text {ref }}}\right)+\lambda \frac{1}{\gamma_{\text {ref }}^{2}}
$$
When $\lambda=\epsilon$, and using the maximum margin separator, supply a value of $\gamma_{\text {ref }}$ that approximately minimizes $J$.",$\gamma_{\text {ref }}=2 \sqrt{2}$
MIT Fall 2017,9,e,2,Classifiers,Text,"Here is a dataset in the form of (x, y) coordinates. P1 = (1, 1), P2 = (6, 6), P3 = (4, 6), P4 = (6, 4) and the separator is a line at y = 3.5. Let our objective be the regularized average hinge loss with respect to $\gamma_{\text {ref }}$ :
$$
J\left(\theta, \theta_{0}, \gamma_{\text {ref }}\right)=\frac{1}{n} \sum_{i=1}^{n} L_{H}\left(\frac{\gamma\left(x^{(i)}, y^{(i)}, \theta, \theta_{0}\right)}{\gamma_{\text {ref }}}\right)+\lambda \frac{1}{\gamma_{\text {ref }}^{2}}
$$
Why might we prefer a maximum-margin separator over the one originally provided?",We expect it will generalize better because it is not as dependent on the data points (a small variation in the data probably won't change the result too much).
MIT Fall 2017,10,a.i,2,Classifiers,Text,"Consider the following data in (coordinate, sign) format: ((-4, 4), +1), ((4, -4), +1), ((-1, -1), -1), ((1, 1), -1). Clearly, the points are not linearly separable, so we will try three alternative approaches to solving the problem. Approach 1: Nested linear classifiers. Let $w=\left[\begin{array}{l}+1 \\ -1\end{array}\right]$ and $a=\left[\begin{array}{l}a_{1} \\ a_{2}\end{array}\right]$ where $$ \begin{aligned} a_{1} &=\operatorname{sign}\left(w^{T} x+4\right) \\ a_{2} &=\operatorname{sign}\left(w^{T} x-4\right) \\ y_{\text {predicted }} &=\operatorname{sign}\left(v_{1} a_{1}+v_{2} a_{2}+v_{0}\right) \end{aligned} $$ (a) Select the value of $v_{1}$ so that the nested classifier correctly predicts the values in the data set.",-1
MIT Fall 2017,10,a.ii,1,Classifiers,Text,"Consider the following data in (coordinate, sign) format: ((-4, 4), +1), ((4, -4), +1), ((-1, -1), -1), ((1, 1), -1). Clearly, the points are not linearly separable, so we will try three alternative approaches to solving the problem. Approach 1: Nested linear classifiers. Let $w=\left[\begin{array}{l}+1 \\ -1\end{array}\right]$ and $a=\left[\begin{array}{l}a_{1} \\ a_{2}\end{array}\right]$ where $$ \begin{aligned} a_{1} &=\operatorname{sign}\left(w^{T} x+4\right) \\ a_{2} &=\operatorname{sign}\left(w^{T} x-4\right) \\ y_{\text {predicted }} &=\operatorname{sign}\left(v_{1} a_{1}+v_{2} a_{2}+v_{0}\right) \end{aligned} $$ (a) Select the value of $v_{2}$ so that the nested classifier correctly predicts the values in the data set.",1
MIT Fall 2017,10,a.iii,1,Classifiers,Text,"Consider the following data in (coordinate, sign) format: ((-4, 4), +1), ((4, -4), +1), ((-1, -1), -1), ((1, 1), -1). Clearly, the points are not linearly separable, so we will try three alternative approaches to solving the problem. Approach 1: Nested linear classifiers. Let $w=\left[\begin{array}{l}+1 \\ -1\end{array}\right]$ and $a=\left[\begin{array}{l}a_{1} \\ a_{2}\end{array}\right]$ where $$ \begin{aligned} a_{1} &=\operatorname{sign}\left(w^{T} x+4\right) \\ a_{2} &=\operatorname{sign}\left(w^{T} x-4\right) \\ y_{\text {predicted }} &=\operatorname{sign}\left(v_{1} a_{1}+v_{2} a_{2}+v_{0}\right) \end{aligned} $$ (a) Select the value of $v_{0}$ so that the nested classifier correctly predicts the values in the data set.",0.5
MIT Fall 2017,10,b.i,1,Classifiers,Text,"Consider the following data in (coordinate, sign) format: ((-4, 4), +1), ((4, -4), +1), ((-1, -1), -1), ((1, 1), -1). Clearly, the points are not linearly separable, so we will try three alternative approaches to solving the problem. Suppose $K\left(x, x^{\prime}\right)=e^{-\left\|x-x^{\prime}\right\|^{2}}$ and for the points in our data set, $K\left(p_{i}, p_{j}\right)=\mathbf{K}_{i j}$ where the matrix $$ \mathbf{K} \approx\left[\begin{array}{cccc} 1 & \exp (-15) & \exp (-15) & \exp (-56) \\ \exp (-15) & 1 & \exp (-4) & \exp (-15) \\ \exp (-15) & \exp (-4) & 1 & \exp (-15) \\ \exp (-56) & \exp (-15) & \exp (-15) & 1 \end{array}\right] $$ (b) Now we will use the kernel perceptron algorithm to find the $\alpha$ values in the classifier, which has the form $$ y_{\text {predicted }}=\operatorname{sign}\left(\sum_{i=1}^{4} \alpha_{i} y^{(i)} K\left(x^{(i)}, x\right)\right) $$ Assuming that we go through the points in order $p_{1}, p_{2}, p_{3}, p_{4}$ as many times as necessary to correctly classify the data, what is the value of $\alpha_{1}$?",1
MIT Fall 2017,10,b.ii,1,Classifiers,Text,"Consider the following data in (coordinate, sign) format: ((-4, 4), +1), ((4, -4), +1), ((-1, -1), -1), ((1, 1), -1). Clearly, the points are not linearly separable, so we will try three alternative approaches to solving the problem. Suppose $K\left(x, x^{\prime}\right)=e^{-\left\|x-x^{\prime}\right\|^{2}}$ and for the points in our data set, $K\left(p_{i}, p_{j}\right)=\mathbf{K}_{i j}$ where the matrix $$ \mathbf{K} \approx\left[\begin{array}{cccc} 1 & \exp (-15) & \exp (-15) & \exp (-56) \\ \exp (-15) & 1 & \exp (-4) & \exp (-15) \\ \exp (-15) & \exp (-4) & 1 & \exp (-15) \\ \exp (-56) & \exp (-15) & \exp (-15) & 1 \end{array}\right] $$ (b) Now we will use the kernel perceptron algorithm to find the $\alpha$ values in the classifier, which has the form $$ y_{\text {predicted }}=\operatorname{sign}\left(\sum_{i=1}^{4} \alpha_{i} y^{(i)} K\left(x^{(i)}, x\right)\right) $$ Assuming that we go through the points in order $p_{1}, p_{2}, p_{3}, p_{4}$ as many times as necessary to correctly classify the data, what is the value of $\alpha_{2}$?",1
MIT Fall 2017,10,b.iii,1,Classifiers,Text,"Consider the following data in (coordinate, sign) format: ((-4, 4), +1), ((4, -4), +1), ((-1, -1), -1), ((1, 1), -1). Clearly, the points are not linearly separable, so we will try three alternative approaches to solving the problem. Suppose $K\left(x, x^{\prime}\right)=e^{-\left\|x-x^{\prime}\right\|^{2}}$ and for the points in our data set, $K\left(p_{i}, p_{j}\right)=\mathbf{K}_{i j}$ where the matrix $$ \mathbf{K} \approx\left[\begin{array}{cccc} 1 & \exp (-15) & \exp (-15) & \exp (-56) \\ \exp (-15) & 1 & \exp (-4) & \exp (-15) \\ \exp (-15) & \exp (-4) & 1 & \exp (-15) \\ \exp (-56) & \exp (-15) & \exp (-15) & 1 \end{array}\right] $$ (b) Now we will use the kernel perceptron algorithm to find the $\alpha$ values in the classifier, which has the form $$ y_{\text {predicted }}=\operatorname{sign}\left(\sum_{i=1}^{4} \alpha_{i} y^{(i)} K\left(x^{(i)}, x\right)\right) $$ Assuming that we go through the points in order $p_{1}, p_{2}, p_{3}, p_{4}$ as many times as necessary to correctly classify the data, what is the value of $\alpha_{3}$?",0
MIT Fall 2017,10,b.iv,1,Classifiers,Text,"Consider the following data in (coordinate, sign) format: ((-4, 4), +1), ((4, -4), +1), ((-1, -1), -1), ((1, 1), -1). Clearly, the points are not linearly separable, so we will try three alternative approaches to solving the problem. Suppose $K\left(x, x^{\prime}\right)=e^{-\left\|x-x^{\prime}\right\|^{2}}$ and for the points in our data set, $K\left(p_{i}, p_{j}\right)=\mathbf{K}_{i j}$ where the matrix $$ \mathbf{K} \approx\left[\begin{array}{cccc} 1 & \exp (-15) & \exp (-15) & \exp (-56) \\ \exp (-15) & 1 & \exp (-4) & \exp (-15) \\ \exp (-15) & \exp (-4) & 1 & \exp (-15) \\ \exp (-56) & \exp (-15) & \exp (-15) & 1 \end{array}\right] $$ (b) Now we will use the kernel perceptron algorithm to find the $\alpha$ values in the classifier, which has the form $$ y_{\text {predicted }}=\operatorname{sign}\left(\sum_{i=1}^{4} \alpha_{i} y^{(i)} K\left(x^{(i)}, x\right)\right) $$ Assuming that we go through the points in order $p_{1}, p_{2}, p_{3}, p_{4}$ as many times as necessary to correctly classify the data, what is the value of $\alpha_{4}$?",1
MIT Fall 2017,10,c,4,Classifiers,Image,"Consider the following data. Clearly, the points are not linearly separable, so we will try three alternative approaches to solving the problem.
Approach 1: Nested linear classifiers
Let $w=\left[\begin{array}{l}+1 \\ -1\end{array}\right]$ and $a=\left[\begin{array}{l}a_{1} \\ a_{2}\end{array}\right]$ where
$$
\begin{aligned}
a_{1} &=\operatorname{sign}\left(w^{T} x+4\right) \\
a_{2} &=\operatorname{sign}\left(w^{T} x-4\right) \\
y_{\text {predicted }} &=\operatorname{sign}\left(v_{1} a_{1}+v_{2} a_{2}+v_{0}\right)
\end{aligned}
$$

We can classify the points correctly if $f$ is tanh. Assume that
$w_{11}=w_{12}=+1$ and
$w_{21}=w_{22}=-1$.
Provide the rest of the weights so this network will correctly classify the given points.
i. $v_{1}$
ii. $v_{2}$
iii. $v_{0}$
iv. $w_{01}$
v. $w_{02}$","i. -1
ii. 1
iii. 0.5
iv. 4
v. -4"
MIT Spring 2018,1,a.i,3,Classifiers,Text,"In this problem, we will consider two-dimensional input data vectors $x=\left[x_{1}, x_{2}\right]^{T}$. We will explore the impact of a feature transformation on various learning methods. We will be using the feature transformation $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Consider the following $2 \mathrm{D}$ data sets: Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$. Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$. For each dataset, is there a vector $\theta^{T}$ for a linear classifier through the origin $\left(\theta^{T} x\right)$ that perfectly separates the data? Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$.",No. The data is not linearly separable.
MIT Spring 2018,1,a.ii,3,Classifiers,Text,"In this problem, we will consider two-dimensional input data vectors $x=\left[x_{1}, x_{2}\right]^{T}$. We will explore the impact of a feature transformation on various learning methods. We will be using the feature transformation $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Consider the following $2 \mathrm{D}$ data sets: Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$. Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$. For each dataset, is there a vector $\theta^{T}$ for a linear classifier through the origin $\left(\theta^{T} x\right)$ that perfectly separates the data? Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$.",No. The data is not linearly separable.
MIT Spring 2018,1,b.i,3,Classifiers,Text,"In this problem, we will consider two-dimensional input data vectors $x=\left[x_{1}, x_{2}\right]^{T}$. We will explore the impact of a feature transformation on various learning methods. We will be using the feature transformation $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Consider the following $2 \mathrm{D}$ data sets: Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$. Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$  For each dataset, is there a vector $\theta^{T}$ for a linear classifier through the origin $\left(\theta^{T} \phi(x)\right)$ that perfectly separates the data? $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$. ",Yes
MIT Spring 2018,1,b.ii,3,Classifiers,Text,"In this problem, we will consider two-dimensional input data vectors $x=\left[x_{1}, x_{2}\right]^{T}$. We will explore the impact of a feature transformation on various learning methods. We will be using the feature transformation $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Consider the following $2 \mathrm{D}$ data sets: Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$. Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$  For each dataset, is there a vector $\theta^{T}$ for a linear classifier through the origin $\left(\theta^{T} \phi(x)\right)$ that perfectly separates the data? $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$",Yes
MIT Spring 2018,1,c,4,Classifiers,Text,"In this problem, we will consider two-dimensional input data vectors $x=\left[x_{1}, x_{2}\right]^{T}$. We will explore the impact of a feature transformation on various learning methods. We will be using the feature transformation $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Consider the following $2 \mathrm{D}$ data sets: Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$. Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$. For the dataset indicated below, could a one-hidden-layer neural network with $x_{1}$ and $x_{2}$ as inputs, a layer of up to four relu units and a final tanh output unit be trained to separate the data set? The network is specified as follows: $$ \begin{aligned} &z=W^{T} x+W_{0} \\ &o=\tanh \left(V^{T} \operatorname{relu}(z)+V_{0}\right) \end{aligned} $$ Assuming you use $k \leq 4$ hidden units, $W$ is $2 \times k$, $W_{0}$ is $k \times 1$ and $V$ is $k \times 1$ and $V_{0}$ is $1 \times 1$. Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$","Yes. The simplest way is to have four ReLU units. The $i$th ReLU unit is responsible for being active (positive pre-activation $z_{i}$, hence positive output) when given the $i$th input, and inactive (negative $z_{i}$, hence zero output) when given any of the other three inputs. The weight connecting the $i$th ReLU unit to the tanh output unit should be a large positive number when the $i$th label is $+1$, and a large negative number when the $i$th label is $-1$."
MIT Spring 2018,1,d,4,Classifiers,Text,"In this problem, we will consider two-dimensional input data vectors $x=\left[x_{1}, x_{2}\right]^{T}$. We will explore the impact of a feature transformation on various learning methods. We will be using the feature transformation $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Consider the following $2 \mathrm{D}$ data sets: Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$. Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$  For the dataset indicated below, could a one-hidden-layer neural network with the entries in $\phi(x)$ as inputs, a layer of up to four relu units and a final tanh output unit be trained to separate the data set? If yes, show the network with weights, including offsets if any. If no, explain briefly why not. Make sure that the prediction has the correct sign. The network is specified as follows: $$ \begin{aligned} &z=W^{T} \phi(x)+W_{0} \\ &o=\tanh \left(V^{T} \operatorname{relu}(z)+V_{0}\right) \end{aligned} $$ Assuming you use $k \leq 4$ hidden units, $W$ is $6 \times k, W_{0}$ is $k \times 1$ and $V$ is $k \times 1$ and $V_{0}$ is $1 \times 1$. $$ \phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T} $$ Signed XOR: positive: $(-1,1),(1,-1)$ and negative: $(-1,-1),(1,1)$",Yes
MIT Spring 2018,2,a,3,Decision Trees,Image,"We will continue the example from the previous question.
For the dataset indicated below, construct a decision tree (using the algorithm from class, based on weighted entropy) with the original features $x=\left[x_{1}, x_{2}\right]^{T}$. Use tests of the form $f<v$. If there is a tie in the choice of split, first prefer $x_{1}$ and then smaller thresholds. You do not need to provide numerical values of the weighted entropy.
Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$.","$x_{1}<0.5$
T: $x_{2}<0.5$
    T: $-1$
    F: $+1$
F: $x_{2}<0.5$
    T: $+1$
    F: $-1$"
MIT Spring 2018,2,b,3,Decision Trees,Image,"We will continue the example from the previous question. For the dataset indicated below, construct a decision tree (using the algorithm from class, based on weighted entropy) with features from $\phi(x)$. If there is a tie in the choice of split, first prefer features that appear earlier in the $\phi(x)$ vector and then smaller thresholds. Use tests of the form $f<v$. You do not need to provide numerical values of the weighted entropy.
$$
\phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T}
$$
Classic XOR: positive: $(0,1),(1,0)$ and negative: $(0,0),(1,1)$.","$x_{1} x_{2}<0.5$
T: $x_{1}<0.5$
    T: $x_{2}<0.5$
        T: $-1$
        F: $+1$
    F: $+1$
F: $-1$"
MIT Spring 2018,2,c,2,Decision Trees,Image,"We will continue the example from the previous question. For any dataset with only positive-valued $x_{1}$ and $x_{2}$, what features in $\phi(x)$ cannot possibly appear in a decision tree computed by the algorithm from class. Assume the splitting rule described earlier: if there is a tie in the choice of split, first prefer features that appear earlier in the $\phi(x)$ vector and then smaller thresholds. Explain your answer.
$$
\phi(x)=\left[1, x_{1}, x_{2}, x_{1} x_{2}, x_{1}^{2}, x_{2}^{2}\right]^{T}
$$","Features $1$, $x_{1}^{2}$, and $x_{2}^{2}$ cannot appear. The constant feature provides no information, and the square terms (for positive data values) create the same splits in the data as the $x_{1}$ and $x_{2}$ features, which appear earlier in $\phi(x)$ and so win the tie."
MIT Spring 2018,3,a,2,Neural Networks,Image,"Assume two data sets are sampled from the same distribution where data set 1 has 1,000 elements and data set 2 has 10,000 elements. Also assume we randomly construct train and test sets from both data sets by dividing them into $90 \%$ training and $10 \%$ testing.

We will explore the effect of using models of increasing complexity (you can think of this as decreasing regularization).
- Draw two curves, for training error and test error, for each data set with the $y$-axis denoting the error and the $x$-axis denoting the model complexity.
- You should have a total of 4 curves: one training error and one test error curve for each dataset.
- Draw all 4 of them in the same diagram below. We have included the true error value on the diagram; this is the error that the correct model has on this data.
- Clearly mark your curves with the labels: $1 \mathrm{~K}$ train, $1 \mathrm{~K}$ test, $10 \mathrm{~K}$ train, $10 \mathrm{~K}$ test.
The following factors will be used for grading:
- The general shape of the curves.
- The relative ordering of the curves in the ""Prediction Error"" direction.","- Training error is lower than the true error (with sufficient model complexity), while test error is higher, as we are fitting to the training data
- Training error decreases with increasing model complexity, as we have increased capacity to fit the data
- Test error initially decreases with increasing model complexity and then increases, as we start to fit the data better and then proceed to overfit
- The $10 \mathrm{k}$ dataset makes it more difficult to overfit, so training error is higher and test error lower compared to their $1 \mathrm{k}$ counterparts."
MIT Spring 2018,3,b,2,Neural Networks,Image,"Consider these training and test curves as a function of training dataset size. These are for two models: one simple and one complex. Which is which? Explain your choice.
Left: simple or complex? Right: simple or complex?","Left: simple; right: complex. The complex model more easily overfits, so test error is initially worse (and training error better), but with sufficient data (to prevent overfitting) the more complex model performs better."
MIT Spring 2018,3,c.i,2,Neural Networks,Text,"In some cases, we will have a validation set in addition to training and test sets. Assume the validation set is approximately the same size as the test set. This validation set is often used to tune hyperparameters such as $\lambda$. Imagine we have trained a classifier using regularization, with $\lambda$ chosen based on performance on the validation set. Which will have the highest accuracy: the training set, the validation set, or the test set?",Validation set
MIT Spring 2018,3,c.ii,2,Neural Networks,Text,"In some cases, we will have a validation set in addition to training and test sets. Assume the validation set is approximately the same size as the test set. This validation set is often used to tune hyperparameters such as $\lambda$. Imagine we have trained a classifier using regularization, with $\lambda$ chosen based on performance on the validation set. Which will have the lowest accuracy: the training set, the validation set, or the test set?",test set
MIT Spring 2018,3,d.i,2,Neural Networks,Text,"In some cases, we will have a validation set in addition to training and test sets. Assume the validation set is approximately the same size as the test set. This validation set is often used to tune hyperparameters such as $\lambda$. Imagine we have trained a classifier using regularization, with $\lambda$ chosen based on performance on the training set. Which will have the highest accuracy: the training set, the validation set, or the test set?",training set
MIT Spring 2018,3,d.ii,2,Neural Networks,Text,"In some cases, we will have a validation set in addition to training and test sets. Assume the validation set is approximately the same size as the test set. This validation set is often used to tune hyperparameters such as $\lambda$. Imagine we have trained a classifier using regularization, with $\lambda$ chosen based on performance on the training set. Which will have the lowest accuracy: the training set, the validation set, or the test set?",test set
MIT Spring 2018,3,e,2,Neural Networks,Text,"An alternative to cross-validation for estimating prediction error is to use ""bootstrap samples"". These are datasets constructed by randomly sampling points from the original training set with replacement, that is, we do not remove previously sampled points, so a data point could appear more than once in a bootstrap sample. Consider the following alternative methodologies, assuming the training dataset contains $N$ samples. 1. Generate $K$ bootstrap samples of size $N$, train on each sample and evaluate on the original training dataset. Return average of results. 2. Generate $K$ bootstrap samples of size $N$, train on the original training dataset and evaluate on each sample. Return average of results. 3. Generate $K$ bootstrap samples of size $N$, train on each sample and evaluate on points in the original training dataset but not in the sample (assume there are always some such points). Return average of results. Order these (from best to worst) by how accurate you expect the estimates of prediction error on unseen test data to be. Explain your answer.","$3, 1, 2$. The less the evaluation data overlaps with the data used for training, the closer the estimate will be to the error on unseen test data: method 3 evaluates only on points absent from the training sample, method 1 evaluates on a set that partly overlaps the training sample, and method 2 evaluates only on points that were all seen during training."
MIT Spring 2018,4,a.i,2,Classifiers,Text,"Consider a classification problem in which there are $K$ possible output classes, $1, \ldots, K$. We have studied using NLL as a loss function in such cases, but that assumes that all mistaken classifications are equally significant. Instead, we'll consider a case where some mistakes are worse than others (e.g., mis-identifying a cow as a horse is not as bad as calling it a mouse). Define the cost matrix $c_{g, a}$ to be the cost for guessing class $g$ when the actual class is $a$. For convenience, we'll write $c_{j}$ for the column of the matrix $\left[c_{1, j}, c_{2, j}, \ldots, c_{K, j}\right]^{T}$ representing the costs of all the possible guesses when $j$ is the actual value. We will use a simple neural network with a softmax activation function, so our prediction $p$ will be a $K \times 1$ vector: $$ \begin{aligned} &p=\operatorname{softmax}(z) \\ &z=W^{T} x \end{aligned} $$ Assume inputs are $d \times 1$ so $W$ is $d \times K$. Our loss function, for a prediction vector $p$ when the target output is value $y \in\{1, \ldots, K\}$ is the expected cost of the prediction: $$ L_{c}(p, y)=\sum_{k=1}^{K} p_{k} c_{k y}=p^{T} c_{y} $$ So, the overall training objective is to minimize, over a data set of $n$ points, $$ J_{c}(W)=\sum_{i=1}^{n} L_{c}\left(p^{(i)}, y^{(i)}\right) $$ (a) Select which of the following cost matrices $c$ corresponds to each situation described below. A. $\left[\begin{array}{llll}1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1\end{array}\right]$ B. $\left[\begin{array}{llll}0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0\end{array}\right]$ C. $\left[\begin{array}{llll}0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0\end{array}\right]$ D. $\left[\begin{array}{cccc}0 & .5 & .5 & .5 \\ 2 & 0 & 1 & 1 \\ 2 & 1 & 0 & 1 \\ 2 & 1 & 1 & 0\end{array}\right]$ E. 
$\left[\begin{array}{cccc}0 & 2 & 2 & 2 \\ .5 & 0 & 1 & 1 \\ .5 & 1 & 0 & 1 \\ .5 & 1 & 1 & 0\end{array}\right]$. Select which matrix (A, B, C, D, E) corresponds to a situation where any output that is not the single preferred answer is penalized equally.",B
MIT Spring 2018,4,a.ii,2,Classifiers,Text,"Consider a classification problem in which there are $K$ possible output classes, $1, \ldots, K$. We have studied using NLL as a loss function in such cases, but that assumes that all mistaken classifications are equally significant. Instead, we'll consider a case where some mistakes are worse than others (e.g., mis-identifying a cow as a horse is not as bad as calling it a mouse). Define the cost matrix $c_{g, a}$ to be the cost for guessing class $g$ when the actual class is $a$. For convenience, we'll write $c_{j}$ for the column of the matrix $\left[c_{1, j}, c_{2, j}, \ldots, c_{K, j}\right]^{T}$ representing the costs of all the possible guesses when $j$ is the actual value. We will use a simple neural network with a softmax activation function, so our prediction $p$ will be a $K \times 1$ vector: $$ \begin{aligned} &p=\operatorname{softmax}(z) \\ &z=W^{T} x \end{aligned} $$ Assume inputs are $d \times 1$ so $W$ is $d \times K$. Our loss function, for a prediction vector $p$ when the target output is value $y \in\{1, \ldots, K\}$ is the expected cost of the prediction: $$ L_{c}(p, y)=\sum_{k=1}^{K} p_{k} c_{k y}=p^{T} c_{y} $$ So, the overall training objective is to minimize, over a data set of $n$ points, $$ J_{c}(W)=\sum_{i=1}^{n} L_{c}\left(p^{(i)}, y^{(i)}\right) $$ (a) Select which of the following cost matrices $c$ corresponds to each situation described below. A. $\left[\begin{array}{llll}1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1\end{array}\right]$ B. $\left[\begin{array}{llll}0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0\end{array}\right]$ C. $\left[\begin{array}{llll}0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0\end{array}\right]$ D. $\left[\begin{array}{cccc}0 & .5 & .5 & .5 \\ 2 & 0 & 1 & 1 \\ 2 & 1 & 0 & 1 \\ 2 & 1 & 1 & 0\end{array}\right]$ E. 
$\left[\begin{array}{cccc}0 & 2 & 2 & 2 \\ .5 & 0 & 1 & 1 \\ .5 & 1 & 0 & 1 \\ .5 & 1 & 1 & 0\end{array}\right]$. Select which matrix (A, B, C, D, E) corresponds to a situation where there are two pairs of outputs that are interchangeable with no penalty.",C
MIT Spring 2018,4,a.iii,2,Classifiers,Text,"Consider a classification problem in which there are $K$ possible output classes, $1, \ldots, K$. We have studied using NLL as a loss function in such cases, but that assumes that all mistaken classifications are equally significant. Instead, we'll consider a case where some mistakes are worse than others (e.g., mis-identifying a cow as a horse is not as bad as calling it a mouse). Define the cost matrix $c_{g, a}$ to be the cost for guessing class $g$ when the actual class is $a$. For convenience, we'll write $c_{j}$ for the column of the matrix $\left[c_{1, j}, c_{2, j}, \ldots, c_{K, j}\right]^{T}$ representing the costs of all the possible guesses when $j$ is the actual value. We will use a simple neural network with a softmax activation function, so our prediction $p$ will be a $K \times 1$ vector: $$ \begin{aligned} &p=\operatorname{softmax}(z) \\ &z=W^{T} x \end{aligned} $$ Assume inputs are $d \times 1$ so $W$ is $d \times K$. Our loss function, for a prediction vector $p$ when the target output is value $y \in\{1, \ldots, K\}$ is the expected cost of the prediction: $$ L_{c}(p, y)=\sum_{k=1}^{K} p_{k} c_{k y}=p^{T} c_{y} $$ So, the overall training objective is to minimize, over a data set of $n$ points, $$ J_{c}(W)=\sum_{i=1}^{n} L_{c}\left(p^{(i)}, y^{(i)}\right) $$ (a) Select which of the following cost matrices $c$ corresponds to each situation described below. A. $\left[\begin{array}{llll}1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1\end{array}\right]$ B. $\left[\begin{array}{llll}0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0\end{array}\right]$ C. $\left[\begin{array}{llll}0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0\end{array}\right]$ D. $\left[\begin{array}{cccc}0 & .5 & .5 & .5 \\ 2 & 0 & 1 & 1 \\ 2 & 1 & 0 & 1 \\ 2 & 1 & 1 & 0\end{array}\right]$ E. 
$\left[\begin{array}{cccc}0 & 2 & 2 & 2 \\ .5 & 0 & 1 & 1 \\ .5 & 1 & 0 & 1 \\ .5 & 1 & 1 & 0\end{array}\right]$. Select which matrix (A, B, C, D, E) corresponds to a situation where it is worse to miss predicting a particular bad outcome than to predict that outcome by mistake.",D
MIT Spring 2018,4,b,4,Classifiers,Text,"Consider a classification problem in which there are $K$ possible output classes, $1, \ldots, K$. We have studied using NLL as a loss function in such cases, but that assumes that all mistaken classifications are equally significant. Instead, we'll consider a case where some mistakes are worse than others (e.g., mis-identifying a cow as a horse is not as bad as calling it a mouse). Define the cost matrix $c_{g, a}$ to be the cost for guessing class $g$ when the actual class is $a$. For convenience, we'll write $c_{j}$ for the column of the matrix $\left[c_{1, j}, c_{2, j}, \ldots, c_{K, j}\right]^{T}$ representing the costs of all the possible guesses when $j$ is the actual value. We will use a simple neural network with a softmax activation function, so our prediction $p$ will be a $K \times 1$ vector: $$ \begin{aligned} &p=\operatorname{softmax}(z) \\ &z=W^{T} x \end{aligned} $$ Assume inputs are $d \times 1$ so $W$ is $d \times K$. Our loss function, for a prediction vector $p$ when the target output is value $y \in\{1, \ldots, K\}$ is the expected cost of the prediction: $$ L_{c}(p, y)=\sum_{k=1}^{K} p_{k} c_{k y}=p^{T} c_{y} $$ So, the overall training objective is to minimize, over a data set of $n$ points, $$ J_{c}(W)=\sum_{i=1}^{n} L_{c}\left(p^{(i)}, y^{(i)}\right) $$ (a) Select which of the following cost matrices $c$ corresponds to each situation described below. A. $\left[\begin{array}{llll}1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1\end{array}\right]$ B. $\left[\begin{array}{llll}0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0\end{array}\right]$ C. $\left[\begin{array}{llll}0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0\end{array}\right]$ D. $\left[\begin{array}{cccc}0 & .5 & .5 & .5 \\ 2 & 0 & 1 & 1 \\ 2 & 1 & 0 & 1 \\ 2 & 1 & 1 & 0\end{array}\right]$ E. 
$\left[\begin{array}{cccc}0 & 2 & 2 & 2 \\ .5 & 0 & 1 & 1 \\ .5 & 1 & 0 & 1 \\ .5 & 1 & 1 & 0\end{array}\right]$. What would the change to the weights $W$ be, in one step of stochastic gradient descent on $J_{c}$, with input $x$ and target output $y$, and step size $\eta$ ? Computing $\partial p / \partial z$ is kind of hairy. It is a $K \times K$ matrix. You can write your answer in terms of it without computing it. You may also use $x, y, W$, and/or $c$ in your solution.","$$
-\eta \cdot x \cdot\left(\frac{\partial p}{\partial z} \cdot c_{y}\right)^{T}
$$
To calculate the SGD update, we first need to calculate $\frac{\partial J_{c}}{\partial W}$ for the single data point $(x, y)$. We use the chain rule.
$$
\frac{\partial J_{c}}{\partial W}=\frac{\partial J_{c}}{\partial L_{c}} \frac{\partial L_{c}}{\partial p} \frac{\partial p}{\partial z} \frac{\partial z}{\partial W}=1 \cdot\left(c_{y}^{T}\right)\left(\frac{\partial p^{T}}{\partial z}\right) \cdot x=x \cdot\left(\frac{\partial p}{\partial z} \cdot c_{y}\right)^{T}
$$
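As a sanity check on the chain-rule result above, here is a minimal NumPy sketch (not part of the original solution) that compares $x \cdot\left(\frac{\partial p}{\partial z} \cdot c_{y}\right)^{T}$ against a finite-difference gradient of $L_{c}$. The function names, the random test values, and the illustrative cost column are assumptions made for this example; the softmax Jacobian is taken to be $\operatorname{diag}(p)-p p^{T}$.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def expected_cost(W, x, c_y):
    # L_c(p, y) = p^T c_y with p = softmax(W^T x)
    return softmax(W.T @ x) @ c_y

def analytic_grad(W, x, c_y):
    p = softmax(W.T @ x)
    dp_dz = np.diag(p) - np.outer(p, p)   # K x K softmax Jacobian (symmetric)
    return np.outer(x, dp_dz @ c_y)       # d x K, i.e. x (dp/dz c_y)^T

def numeric_grad(W, x, c_y, eps=1e-6):
    # Central finite differences, one weight at a time.
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = eps
        G[idx] = (expected_cost(W + step, x, c_y)
                  - expected_cost(W - step, x, c_y)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
d, K, eta = 3, 4, 0.1                     # illustrative sizes and step size
W = rng.normal(size=(d, K))
x = rng.normal(size=d)
c_y = np.array([0.0, 2.0, 2.0, 2.0])      # illustrative cost column c_y
grad = analytic_grad(W, x, c_y)
max_gap = np.max(np.abs(grad - numeric_grad(W, x, c_y)))
W_new = W - eta * grad                    # one SGD step: W changes by -eta * grad
```

If the derivation holds, max_gap should be at the level of floating-point noise, and W_new differs from W by exactly the update stated next.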
The SGD update is then
$$
-\eta \cdot x \cdot\left(\frac{\partial p}{\partial z} \cdot c_{y}\right)^{T}
$$"
MIT Spring 2018,5,a,3,MDPs,Image,"Consider the following Markov decision process:
Assume:
- Reward is 0 in all states, except $+10$ in s6 and $+5$ in s5; the reward is received when exiting the state.
- Transitions out of s0 are deterministic, and depend on the choice of action (A or B). Assume in this part that all transitions are deterministic, following the arrows indicated with probability 1. When horizon $=3$ and discount factor $\gamma=1$, provide values for:
i. $Q\left(s_{0}, A\right)$
ii. $Q\left(s_{0}, B\right)$","i. 0
ii. 5"
MIT Spring 2018,5,b,3,MDPs,Image,"Consider the following Markov decision process:
Assume:
- Reward is 0 in all states, except $+10$ in s6 and $+5$ in s5; the reward is received when exiting the state.
- Transitions out of s0 are deterministic, and depend on the choice of action (A or B). Still assuming that all transitions are deterministic, but letting horizon $=5$ and discount factor $\gamma=1$, provide values for:
i. $Q\left(s_{0}, A\right)$
ii. $Q\left(s_{0}, B\right)$","i. 10
ii. 5"
MIT Spring 2018,5,c,2,MDPs,Image,"Consider the following Markov decision process:
Assume:
- Reward is 0 in all states, except $+10$ in s6 and $+5$ in s5; the reward is received when exiting the state.
- Transitions out of s0 are deterministic, and depend on the choice of action (A or B). Now, assume that transitions out of s0 are deterministic, but that all other transitions follow the arrows indicated with probability $0.9$ and stay in the current state with probability $0.1$.

For policy $\pi\left(s_{0}\right)=B$, write a system of equations that can be solved in order to compute $V_{\pi}\left(s_{0}\right)$ when the horizon is infinite and $\gamma=0.8$.
Do not solve the equations!","$$
\begin{aligned}
&v_{0}=0.8 v_{4} \\
&v_{4}=0.8\left(0.1 v_{4}+0.9 v_{5}\right) \\
&v_{5}=5+0.8\left(0.1 v_{5}+0.9 v_{0}\right)
\end{aligned}
$$"
MIT Spring 2018,6,a,3,Reinforcement Learning,Image,"We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\gamma=0.8$ and learning rate $\alpha=1$.

Note: do not assume anything more about the MDP from which this experience was generated. It is not necessarily the same as the one from question 5.
(a) The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state.
$$
\begin{array}{r}
\left(s_{0}, a_{2}, 0, s_{2}\right) \\
\left(s_{2}, a_{1}, 0, s_{3}\right) \\
\left(s_{3}, a_{1}, 0, s_{1}\right) \\
\left(s_{1}, a_{1}, 10, s_{0}\right) \\
\left(s_{0}, a_{1}, 0, s_{5}\right) \\
\left(s_{5}, a_{1}, 0, s_{4}\right) \\
\left(s_{4}, a_{1}, 5, s_{0}\right)
\end{array}
$$
Fill in the resulting $Q$ values in the following table:
\begin{tabular}{l|l|l|l|l|l|l|} 
& $s_{0}$ & $s_{1}$ & $s_{2}$ & $s_{3}$ & $s_{4}$ & $s_{5}$ \\
\hline $a_{1}$ & 0 & 10 & 0 & 0 & 5 & 0 \\
\hline $a_{2}$ & 0 & 0 & 0 & 0 & 0 & 0 \\
\hline
\end{tabular}","\begin{tabular}{l|l|l|l|l|l|l|} 
& $s_{0}$ & $s_{1}$ & $s_{2}$ & $s_{3}$ & $s_{4}$ & $s_{5}$ \\
\hline $a_{1}$ & 0 & 10 & 0 & 0 & 5 & 0 \\
\hline $a_{2}$ & 0 & 0 & 0 & 0 & 0 & 0 \\
\hline
\end{tabular}"
MIT Spring 2018,6,b,2,Reinforcement Learning,Text,"We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\gamma=0.8$ and learning rate $\alpha=1$. Note: do not assume anything more about the MDP from which this experience was generated. It is not necessarily the same as the one from question 5. The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state. $$ \begin{array}{r} \left(s_{0}, a_{2}, 0, s_{2}\right) \\ \left(s_{2}, a_{1}, 0, s_{3}\right) \\ \left(s_{3}, a_{1}, 0, s_{1}\right) \\ \left(s_{1}, a_{1}, 10, s_{0}\right) \\ \left(s_{0}, a_{1}, 0, s_{5}\right) \\ \left(s_{5}, a_{1}, 0, s_{4}\right) \\ \left(s_{4}, a_{1}, 5, s_{0}\right) \end{array} $$ Iyaz suggests that, rather than getting new experience, it would be a good idea to replay this data several times using the regular Q-learning update. What's the minimum number of times you would have to iterate through this data before $Q\left(s_{0}, a_{2}\right)>Q\left(s_{0}, a_{1}\right)$?
Note: it should be possible to answer this question by thinking about the structure of the problem, rather than by grinding through more Q-learning update calculations.","4 (including the first pass, whose values we recorded in the table)."
MIT Spring 2018,6,c.i,1,Reinforcement Learning,Text,"We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\gamma=0.8$ and learning rate $\alpha=1$. Note: do not assume anything more about the MDP from which this experience was generated. It is not necessarily the same as the one from question 5. The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state. $$ \begin{array}{r} \left(s_{0}, a_{2}, 0, s_{2}\right) \\ \left(s_{2}, a_{1}, 0, s_{3}\right) \\ \left(s_{3}, a_{1}, 0, s_{1}\right) \\ \left(s_{1}, a_{1}, 10, s_{0}\right) \\ \left(s_{0}, a_{1}, 0, s_{5}\right) \\ \left(s_{5}, a_{1}, 0, s_{4}\right) \\ \left(s_{4}, a_{1}, 5, s_{0}\right) \end{array} $$ Now, imagine that you have a very large memory full of experience that explores the whole environment effectively. In the following questions, you can assume that a neural network nn has the following methods:
- nn.train(data, epochs) takes a list of $(x, y)$ pairs and does supervised training on it for the specified number of epochs. If epochs is None, it trains until convergence.
- nn.predict(x) takes an input $x$ and returns the predicted $y$ value.
- nn.init() randomly reassigns all the weights in the network.
Consider the following code.
# nn = dictionary of neural networks, one for each action
# each nn[a] maps state s into Q(s, a)
gamma = 0.8
for t in range(max_iterations):
    for a in actions:
        data[a] = [(s, <fill in>) for (s, a_prime, r, s_prime) in memory if a_prime == a]
    for a in actions:
        nn[a].train(data[a], <epochs>)
What is a correct expression for <fill in> above?",r + 0.8 * max([nn[a_prime].predict(s_prime) for a_prime in actions])
MIT Spring 2018,6,c.ii,1,Reinforcement Learning,Text,"We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\gamma=0.8$ and learning rate $\alpha=1$. Note: do not assume anything more about the MDP from which this experience was generated. It is not necessarily the same as the one from question 5. The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state. $$ \begin{array}{r} \left(s_{0}, a_{2}, 0, s_{2}\right) \\ \left(s_{2}, a_{1}, 0, s_{3}\right) \\ \left(s_{3}, a_{1}, 0, s_{1}\right) \\ \left(s_{1}, a_{1}, 10, s_{0}\right) \\ \left(s_{0}, a_{1}, 0, s_{5}\right) \\ \left(s_{5}, a_{1}, 0, s_{4}\right) \\ \left(s_{4}, a_{1}, 5, s_{0}\right) \end{array} $$ Now, imagine that you have a very large memory full of experience that explores the whole environment effectively. In the following questions, you can assume that a neural network nn has the following methods:
- nn.train(data, epochs) takes a list of $(x, y)$ pairs and does supervised training on it for the specified number of epochs. If epochs is None, it trains until convergence.
- nn.predict(x) takes an input $x$ and returns the predicted $y$ value.
- nn.init() randomly reassigns all the weights in the network.
Consider the following code.
# nn = dictionary of neural networks, one for each action
# each nn[a] maps state s into Q(s, a)
gamma = 0.8
for t in range(max_iterations):
    for a in actions:
        data[a] = [(s, <fill in>) for (s, a_prime, r, s_prime) in memory if a_prime == a]
    for a in actions:
        nn[a].train(data[a], <epochs>)
What is an appropriate value for <epochs> above?",None. We want to train until convergence.
MIT Spring 2018,6,d,2,Reinforcement Learning,Text,"We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\gamma=0.8$ and learning rate $\alpha=1$. Note: do not assume anything more about the MDP from which this experience was generated. It is not necessarily the same as the one from question 5. The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state. $$ \begin{array}{r} \left(s_{0}, a_{2}, 0, s_{2}\right) \\ \left(s_{2}, a_{1}, 0, s_{3}\right) \\ \left(s_{3}, a_{1}, 0, s_{1}\right) \\ \left(s_{1}, a_{1}, 10, s_{0}\right) \\ \left(s_{0}, a_{1}, 0, s_{5}\right) \\ \left(s_{5}, a_{1}, 0, s_{4}\right) \\ \left(s_{4}, a_{1}, 5, s_{0}\right) \end{array} $$ Now, imagine that you have a very large memory full of experience that explores the whole environment effectively. In the following questions, you can assume that a neural network nn has the following methods:
- nn.train(data, epochs) takes a list of $(x, y)$ pairs and does supervised training on it for the specified number of epochs. If epochs is None, it trains until convergence.
- nn.predict(x) takes an input $x$ and returns the predicted $y$ value.
- nn.init() randomly reassigns all the weights in the network.
Consider the following code.
# nn = dictionary of neural networks, one for each action
# each nn[a] maps state s into Q(s, a)
gamma = 0.8
for t in range(max_iterations):
    for a in actions:
        data[a] = [(s, <fill in>) for (s, a_prime, r, s_prime) in memory if a_prime == a]
    for a in actions:
        nn[a].train(data[a], <epochs>)
If we change the loop to have the form
for t in range(max_iterations):
    for (s, a, r, s_prime) in memory:
        data = [(s, <fill in>)]  # a single data point
        nn[a].train(data, <epochs>)
provide a value for <epochs> above that will cause this algorithm to converge to a correct solution, or explain why no such value exists.",1. With 1 epoch we will look at every piece of experience in memory once per iteration.
MIT Spring 2018,6,e,2,Reinforcement Learning,Text,"We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\gamma=0.8$ and learning rate $\alpha=1$. Note: do not assume anything more about the MDP from which this experience was generated. It is not necessarily the same as the one from question 5. The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state. $$ \begin{array}{r} \left(s_{0}, a_{2}, 0, s_{2}\right) \\ \left(s_{2}, a_{1}, 0, s_{3}\right) \\ \left(s_{3}, a_{1}, 0, s_{1}\right) \\ \left(s_{1}, a_{1}, 10, s_{0}\right) \\ \left(s_{0}, a_{1}, 0, s_{5}\right) \\ \left(s_{5}, a_{1}, 0, s_{4}\right) \\ \left(s_{4}, a_{1}, 5, s_{0}\right) \end{array} $$ Now, imagine that you have a very large memory full of experience that explores the whole environment effectively. In the following questions, you can assume that a neural network nn has the following methods:
- nn.train(data, epochs) takes a list of $(x, y)$ pairs and does supervised training on it for the specified number of epochs. If epochs is None, it trains until convergence.
- nn.predict(x) takes an input $x$ and returns the predicted $y$ value.
- nn.init() randomly reassigns all the weights in the network.
Consider the following code.
# nn = dictionary of neural networks, one for each action
# each nn[a] maps state s into Q(s, a)
gamma = 0.8
for t in range(max_iterations):
    for a in actions:
        data[a] = [(s, <fill in>) for (s, a_prime, r, s_prime) in memory if a_prime == a]
    for a in actions:
        nn[a].train(data[a], <epochs>)
Would it be okay to call nn[a].init() on the line before calling train in the code loop?",Yes
MIT Spring 2018,6,f,2,Reinforcement Learning,Text,"We will be performing Q-learning in an MDP with states $s_{0}$ through $s_{5}$, and actions $a_{1}$ and $a_{2}$. Let the discount factor $\gamma=0.8$ and learning rate $\alpha=1$. Note: do not assume anything more about the MDP from which this experience was generated. It is not necessarily the same as the one from question 5. The Q-learning algorithm begins with $Q(s, a)=0$ for all $s$ and $a$. It receives the following experience, in order. Each tuple represents a state, action, reward, and next state. $$ \begin{array}{r} \left(s_{0}, a_{2}, 0, s_{2}\right) \\ \left(s_{2}, a_{1}, 0, s_{3}\right) \\ \left(s_{3}, a_{1}, 0, s_{1}\right) \\ \left(s_{1}, a_{1}, 10, s_{0}\right) \\ \left(s_{0}, a_{1}, 0, s_{5}\right) \\ \left(s_{5}, a_{1}, 0, s_{4}\right) \\ \left(s_{4}, a_{1}, 5, s_{0}\right) \end{array} $$ Now, imagine that you have a very large memory full of experience that explores the whole environment effectively. In the following questions, you can assume that a neural network nn has the following methods:
- nn.train(data, epochs) takes a list of $(x, y)$ pairs and does supervised training on it for the specified number of epochs. If epochs is None, it trains until convergence.
- nn.predict(x) takes an input $x$ and returns the predicted $y$ value.
- nn.init() randomly reassigns all the weights in the network.
Consider the following code.
# nn = dictionary of neural networks, one for each action
# each nn[a] maps state s into Q(s, a)
gamma = 0.8
for t in range(max_iterations):
    for a in actions:
        data[a] = [(s, <fill in>) for (s, a_prime, r, s_prime) in memory if a_prime == a]
    for a in actions:
        nn[a].train(data[a], <epochs>)
Would it be okay to call nn.init() on the line before calling train in the code loop?",No
MIT Spring 2018,6,g,2,Reinforcement Learning,Text,"We often use $\epsilon$-greedy exploration in Q-learning, in which we execute the action with the highest Q value in the current state with probability $1-\epsilon$ and execute a random action with probability $\epsilon$. What problem might occur if we set $\epsilon$ to be too small?",We might get stuck for a long time doing a sub-optimal action choice due to lack of exploration.
MIT Spring 2018,7,a,4,Neural Networks,Text,"Consider a recommender system that has $n$ users and $m$ movies, with sparse ratings available for each user and movie. Our typical strategy is to look for matrices $U$ and $V$ such that $U V^{T}$ is a rank $k$ approximation to the true matrix at the locations it has been observed. M.A. Trix suggests a new decomposition of the solution matrix $X$ into $U W V^{T}$ where $W$ is a $k \times k$ matrix, and $U$ and $V$ are as in the original approach. Is M.A. Trix's approach able to represent: A richer class of models than the original? A smaller class? The same class? Choose 1 and provide a short concrete justification of your answer.",The same class. You could just multiply $W$ directly into $U$ or $V^{T}$ and end up with the original model.
MIT Spring 2018,7,b.i,1,Neural Networks,Text,"Consider a recommender system that has $n$ users and $m$ movies, with sparse ratings available for each user and movie. Our typical strategy is to look for matrices $U$ and $V$ such that $U V^{T}$ is a rank $k$ approximation to the true matrix at the locations it has been observed. Otto N. Coder thinks there's a whole different and interesting way to approach this problem. Consider a neural network with two layers of weights and tanh activation functions: $$ \begin{aligned} a &=\tanh \left(W^{a T} x\right) \\ x^{\prime} &=\tanh \left(W^{b^{T}} a\right) \end{aligned} $$ where $x$ is a $m \times 1$ vector representing a single user's movie-watching experience. We will assume just binary ratings ( $+1$ means the user liked the movie and $-1$ that they did not; a value of 0 indicates that the user has not yet rated the movie). The vector $a$ is $k \times 1$ where $k$ is significantly less than $m$. We can make a data-set with $n$ such vectors, one for each user, and then train this network as an ""auto-encoder"", which takes in an $x$ vector and attempts to recreate it as its output, but which is forced to go through a much smaller representation. To train this network, we would use ordinary supervised training, but with pairs $(x, x)$ with one $x$ vector for each user in the training data, used as both the input and the desired output of the network.

The loss function $L\left(x^{\prime}, x\right)$ where $x$ is the true output vector and $x^{\prime}$ is the prediction, would be
$$
L\left(x^{\prime}, x\right)=\sum_{j=1}^{m} \begin{cases}0 & \text { if } x_{j}=0 \\ \left(x_{j}-x_{j}^{\prime}\right)^{2} & \text { otherwise }\end{cases}
$$
What is $L\left(x^{\prime}, x\right)$ if this user has never watched any movies?",0
MIT Spring 2018,7,b.ii,1,Neural Networks,Text,"Consider a recommender system that has $n$ users and $m$ movies, with sparse ratings available for each user and movie. Our typical strategy is to look for matrices $U$ and $V$ such that $U V^{T}$ is a rank $k$ approximation to the true matrix at the locations it has been observed. Otto N. Coder thinks there's a whole different and interesting way to approach this problem. Consider a neural network with two layers of weights and tanh activation functions: $$ \begin{aligned} a &=\tanh \left(W^{a T} x\right) \\ x^{\prime} &=\tanh \left(W^{b^{T}} a\right) \end{aligned} $$ where $x$ is a $m \times 1$ vector representing a single user's movie-watching experience. We will assume just binary ratings ( $+1$ means the user liked the movie and $-1$ that they did not; a value of 0 indicates that the user has not yet rated the movie). The vector $a$ is $k \times 1$ where $k$ is significantly less than $m$. We can make a data-set with $n$ such vectors, one for each user, and then train this network as an ""auto-encoder"", which takes in an $x$ vector and attempts to recreate it as its output, but which is forced to go through a much smaller representation. To train this network, we would use ordinary supervised training, but with pairs $(x, x)$ with one $x$ vector for each user in the training data, used as both the input and the desired output of the network.

The loss function $L\left(x^{\prime}, x\right)$ where $x$ is the true output vector and $x^{\prime}$ is the prediction, would be
$$
L\left(x^{\prime}, x\right)=\sum_{j=1}^{m} \begin{cases}0 & \text { if } x_{j}=0 \\ \left(x_{j}-x_{j}^{\prime}\right)^{2} & \text { otherwise }\end{cases}
$$
How much, if any, more loss is incurred, with respect to a particular movie, for predicting $+1$ when the answer should be $-1$ than is incurred for predicting $-1$ when the answer should be $+1$?",0
MIT Spring 2018,7,b.iii,1,Neural Networks,Text,"Consider a recommender system that has $n$ users and $m$ movies, with sparse ratings available for each user and movie. Our typical strategy is to look for matrices $U$ and $V$ such that $U V^{T}$ is a rank $k$ approximation to the true matrix at the locations it has been observed. Otto N. Coder thinks there's a whole different and interesting way to approach this problem. Consider a neural network with two layers of weights and tanh activation functions: $$ \begin{aligned} a &=\tanh \left(W^{a T} x\right) \\ x^{\prime} &=\tanh \left(W^{b^{T}} a\right) \end{aligned} $$ where $x$ is a $m \times 1$ vector representing a single user's movie-watching experience. We will assume just binary ratings ( $+1$ means the user liked the movie and $-1$ that they did not; a value of 0 indicates that the user has not yet rated the movie). The vector $a$ is $k \times 1$ where $k$ is significantly less than $m$. We can make a data-set with $n$ such vectors, one for each user, and then train this network as an ""auto-encoder"", which takes in an $x$ vector and attempts to recreate it as its output, but which is forced to go through a much smaller representation. To train this network, we would use ordinary supervised training, but with pairs $(x, x)$ with one $x$ vector for each user in the training data, used as both the input and the desired output of the network.

The loss function $L\left(x^{\prime}, x\right)$ where $x$ is the true output vector and $x^{\prime}$ is the prediction, would be
$$
L\left(x^{\prime}, x\right)=\sum_{j=1}^{m} \begin{cases}0 & \text { if } x_{j}=0 \\ \left(x_{j}-x_{j}^{\prime}\right)^{2} & \text { otherwise }\end{cases}
$$
In terms of making good predictions, would it be disastrous, just fine, or only mildly bad if we were to leave out the tanh activation function on the output layer? Explain.","Only mildly bad. We would get predictions that go outside the bounds of $+1$ and $-1$, but they would probably be usable for picking the max. Note that choosing the max is the ""right"" thing to do here since we want to make recommendations and the thing to recommend should have the maximum prediction value."
MIT Spring 2018,7,c,3,Neural Networks,Text,"Consider a recommender system that has $n$ users and $m$ movies, with sparse ratings available for each user and movie. Our typical strategy is to look for matrices $U$ and $V$ such that $U V^{T}$ is a rank $k$ approximation to the true matrix at the locations it has been observed. Otto N. Coder thinks there's a whole different and interesting way to approach this problem. Consider a neural network with two layers of weights and tanh activation functions: $$ \begin{aligned} a &=\tanh \left(W^{a T} x\right) \\ x^{\prime} &=\tanh \left(W^{b T} a\right) \end{aligned} $$ where $x$ is an $m \times 1$ vector representing a single user's movie-watching experience. We will assume just binary ratings ( $+1$ means the user liked the movie and $-1$ that they did not; a value of 0 indicates that the user has not yet rated the movie). The vector $a$ is $k \times 1$ where $k$ is significantly less than $m$. We can make a data-set with $n$ such vectors, one for each user, and then train this network as an ""auto-encoder"", which takes in an $x$ vector and attempts to recreate it as its output, but which is forced to go through a much smaller representation. To train this network, we would use ordinary supervised training, but with pairs $(x, x)$ with one $x$ vector for each user in the training data, used as both the input and the desired output of the network.

The loss function $L\left(x^{\prime}, x\right)$ where $x$ is the true output vector and $x^{\prime}$ is the prediction, would be
$$
L\left(x^{\prime}, x\right)=\sum_{j=1}^{m} \begin{cases}0 & \text { if } x_{j}=0 \\ \left(x_{j}-x_{j}^{\prime}\right)^{2} & \text { otherwise }\end{cases}
$$
After training this network, we could feed in a particular user's $x$ vector and receive an output $x^{\prime}$. How could we use the $x^{\prime}$ value to select the best movie to recommend to that user?
Provide your answer in completely detailed math, code, or English that could be unambiguously converted into math or code.","$$
i^{*}=\operatorname{argmax}_{\left\{i \mid x_{i}=0\right\}} x_{i}^{\prime}
$$"
MIT Spring 2018,8,a,3,RNNS,Image,"One of the RNN architectures we studied was
$$
\begin{aligned}
&s_{t}=f_{1}\left(W^{s s} s_{t-1}+W^{s x} x_{t}\right) \\
&y_{t}=f_{2}\left(W^{o} s_{t}\right)
\end{aligned}
$$
where $W^{s s}$ is $m \times m, W^{s x}$ is $m \times l$ and $W^{o}$ is $n \times m$. Assume $f_{i}$ can be any of our standard activation functions. We omit the offset parameters for simplicity (set them to zero). Suppose we modify the original architecture as follows:
$$
s_{t}=f_{1}\left(W^{s s 1} f_{3}\left(W^{s s 2} s_{t-1}\right)+W^{s x} x_{t}\right)
$$
i. Provide values for the original $W^{s s}$ that make the original architecture equivalent to this one, or explain why none exist.
$$
W^{s s}=
$$

ii. Provide values for $W^{s s 2}, f_{3}$ and $W^{s s 1}$ that make this new architecture equivalent to the original, or explain why none exist.","i. This architecture can represent state machines that can't be represented by the original architecture, because the class of state transition functions that can be modeled in the modified architecture is bigger.

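ii. Take $f_{3}$ to be linear (the identity), and set one of $W^{s s 1}$, $W^{s s 2}$ to $W^{s s}$ and the other to $I$."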
MIT Spring 2018,8,b,2,RNNs,Image,"One of the RNN architectures we studied was
$$
\begin{aligned}
&s_{t}=f_{1}\left(W^{s s} s_{t-1}+W^{s x} x_{t}\right) \\
&y_{t}=f_{2}\left(W^{o} s_{t}\right)
\end{aligned}
$$
where $W^{s s}$ is $m \times m, W^{s x}$ is $m \times l$ and $W^{o}$ is $n \times m$. Assume $f_{i}$ can be any of our standard activation functions. We omit the offset parameters for simplicity (set them to zero). Now, we'll consider two strategies for making the RNN generate two output symbols for each input symbol. Assume the symbols are drawn from a vocabulary of size $n$.
Model A: We use a separate softmax output for each symbol, so
$$
\begin{aligned}
&y_{t}^{1}=\operatorname{softmax}\left(W^{o 1} s_{t}\right) \\
&y_{t}^{2}=\operatorname{softmax}\left(W^{o 2} s_{t}\right)
\end{aligned}
$$
where $W^{o 1}$ and $W^{o 2}$ are $n \times m$.
Model B: We use a single softmax output, but it ranges over $n^{2}$ possible pairs of symbols, so
$$
y_{t}^{1}, y_{t}^{2}=\operatorname{softmax}\left(W^{o 3} s_{t}\right)
$$
i. What would the dimension of $W^{o 3}$ need to be?
ii. Which of the following is true:
- Models A and B can express exactly the same set of RNN models.
- Model A is more expressive than model B.
- Model B is more expressive than model A.","i. $n^{2} \times m$
ii. Model B is more expressive than model A."
MIT Spring 2018,8,c,2,RNNs,Image,Image,
MIT Spring 2018,9,a,3,CNNs,Image,"We will explore how convolutional neural networks operate by designing one. Our objective is to be able to locate the pattern
in an image. Throughout this problem, treat dark squares as having value $+1$ and light squares as having value $-1$. Consider the image that would result from convolving the image below with a filter that is the same as the pattern above. (Use our definition of convolution, in which we slide the filter over the image and compute the dot product.) Assume that the edges are padded with $-1$ and that we use a stride of 1.

Indicate which pixel in the resulting image will have the maximum value by writing the resulting pixel value in the appropriate cell of the image on the right below.",Image filling
MIT Spring 2018,9,b,3,CNNs,Image,"We will explore how convolutional neural networks operate by designing one. Our objective is to be able to locate the pattern
in an image. Throughout this problem, treat dark squares as having value $+1$ and light squares as having value $-1$. In order to detect this pattern, we would create a network that has
- a convolutional layer with a single filter, corresponding to the desired pattern,
- a max-pooling layer with input size equal to the image size, and finally
- a single ReLU unit.
Provide a value for the offset $W_{o}$ on the input to the ReLU that, for any image, would guarantee the output of the ReLU is positive if and only if there is a perfect instance of this pattern in the image.","$-8$
A perfect match scores 9. The next best match would have 8 squares correct and 1 wrong, which totals 7. Since the ReLU output is positive only when its input is positive, any offset $W_{o}$ with $-9<W_{o} \leq-7$ would be correct here."
MIT Spring 2018,9,c,2,CNNs,Image,"Kanye Volution thinks that instead of having this single convolution layer with a single filter matching the whole desired pattern, it would be better to start with a convolutional layer with four smaller filters, shown below:

The following images are the result of convolving the original image with these 4 simple filters and running through a ReLU. Black squares have value $+1$, grey squares have value $+0.5$, and the rest have value 0 .

It is slightly unusual to have $2 \times 2$ filters (usually they have odd dimension). When we apply them, we place the upper-left pixel of the filter on top of the image pixel whose value we are computing.

The next layer of Kanye's network now takes an input of depth 4 and applies a single $2 \times 2 \times 4$ filter. Specify a filter on the output of the simple filters that will generate an image with a high value at the pixel located at the upper left corner of the pattern and lower values elsewhere. Fill weight values (either $+1$ or $-1$) into the squares below.",Image filling
MIT Fall 2018,2,a,2.5,Neural Networks,Text,"We will consider a neural network with a slightly unusual structure. Let the input $x$ be $d \times 1$ and let the weights be represented as $k$ vectors of size $1 \times d$, $W^{(1)}, \ldots, W^{(k)}$. Then the final output is
$$
\hat{y}=\prod_{i=1}^{k} \sigma\left(W^{(i)} x\right)=\sigma\left(W^{(1)} x\right) \times \cdots \times \sigma\left(W^{(k)} x\right)
$$
Define $a^{(j)}=\sigma\left(W^{(j)} x\right)$.
What is $\partial L(\hat{y}, y) / \partial a^{(j)}$ for some $j$ ? Since we have not specified the loss function, you can express your answer in terms of $\partial L(\hat{y}, y) / \partial \hat{y}$.","$$
\frac{\partial L(\hat{y}, y)}{\partial \hat{y}} \prod_{i \neq j} \sigma\left(W^{(i)} x\right)
$$"
MIT Fall 2018,2,b,2.5,Neural Networks,Text,"We will consider a neural network with a slightly unusual structure. Let the input $x$ be $d \times 1$ and let the weights be represented as $k$ vectors of size $1 \times d$, $W^{(1)}, \ldots, W^{(k)}$. Then the final output is
$$
\hat{y}=\prod_{i=1}^{k} \sigma\left(W^{(i)} x\right)=\sigma\left(W^{(1)} x\right) \times \cdots \times \sigma\left(W^{(k)} x\right)
$$
Define $a^{(j)}=\sigma\left(W^{(j)} x\right)$.
What are the dimensions of $\partial a^{(j)} / \partial W^{(j)}$ ?","Because $a^{(j)}$ is a scalar, they are the same as for $W^{(j)}$, which is $1 \times d$."
MIT Fall 2018,2,c,2.5,Neural Networks,Text,"We will consider a neural network with a slightly unusual structure. Let the input $x$ be $d \times 1$ and let the weights be represented as $k$ vectors of size $1 \times d$, $W^{(1)}, \ldots, W^{(k)}$. Then the final output is
$$
\hat{y}=\prod_{i=1}^{k} \sigma\left(W^{(i)} x\right)=\sigma\left(W^{(1)} x\right) \times \cdots \times \sigma\left(W^{(k)} x\right)
$$
Define $a^{(j)}=\sigma\left(W^{(j)} x\right)$.
What is $\partial a^{(j)} / \partial W^{(j)}$ ? (Recall that $d \sigma(v) / d v=\sigma(v)(1-\sigma(v))$.)","$$
a^{(j)}\left(1-a^{(j)}\right) x^{T}
$$"
MIT Fall 2018,2,d,2.5,Neural Networks,Text,"We will consider a neural network with a slightly unusual structure. Let the input $x$ be $d \times 1$ and let the weights be represented as $k$ vectors of size $1 \times d$, $W^{(1)}, \ldots, W^{(k)}$. Then the final output is
$$
\hat{y}=\prod_{i=1}^{k} \sigma\left(W^{(i)} x\right)=\sigma\left(W^{(1)} x\right) \times \cdots \times \sigma\left(W^{(k)} x\right)
$$
Define $a^{(j)}=\sigma\left(W^{(j)} x\right)$.
What would the form of a stochastic gradient descent update rule be for $W^{(j)}$ ? Express your answer in terms of $\partial L(\hat{y}, y) / \partial a^{(j)}$ and $\partial a^{(j)} / \partial W^{(j)}$. Use $\eta$ for the step size.","$$
W^{(j)}=W^{(j)}-\eta \frac{\partial L(\hat{y}, y)}{\partial a^{(j)}} \frac{\partial a^{(j)}}{\partial W^{(j)}}
$$"
MIT Fall 2018,3,a,1.666666667,CNNs,Image,"Consider the following image (on the left) and filter (on the right):
Consider what results from filtering this image with this filter, assuming that the input image is padded with zeros, and using a stride of 1 . To compute the output value of a particular pixel $(i, j)$, apply the filter with its center on pixel $(i, j)$ of the input image.
Assume dark pixels have a value of 1 and light pixels have a value of -1.
i. What is the output value for the top-left image pixel (that is, the pixel with indices $(1,1)$ in one-based indexing)?
ii. What element of the output image will have the highest value? (Assume the rows and columns of the image are numbered starting with 1.)
","i. -2
ii. 3,1"
MIT Fall 2018,3,b,1.666666667,CNNs,Text,"If for a Convolutional Neural Network we used 5 different filters with size 3x3 and stride 1 on this image, what would the dimensions of the resulting output be?",4x4x5
MIT Fall 2018,3,c.i,1.666666667,CNNs,Image,"What would be the result of applying max-pooling with size $k=2$ and stride 2 on the original, unfiltered image above?
i. What are the dimensions of the resulting image?",$2 \times 2$
MIT Fall 2018,3,c.ii,1.666666667,CNNs,Image,"What would be the result of applying max-pooling with size $k=2$ and stride 2 on the original, unfiltered image above?
ii. Draw the actual image with numerical values for each pixel in the space below.","$$
\left[\begin{array}{ll}
1 & 1 \\
-1 & 1
\end{array}\right]
$$"
MIT Fall 2018,3,d.i,1.666666667,CNNs,Text,"Dana has an idea for a new kind of network called a ModConv NN. If the network is $n \times n$, we will use a filter of size $n/k$ (assume $k$ evenly divides $n$). To compute entry $(a, b)$ of the resulting image, we apply this filter to the ""subimage"" of pixels $(i, j)$ from the original image, where $i \bmod k = a$ and $j \bmod k = b$. Could we train the weights of a ModConv NN using gradient descent? Explain why or why not.",Sure. It is just another parametric model.
MIT Fall 2018,3,d.ii,1.666666667,CNNs,Text,"Dana has an idea for a new kind of network called a ModConv NN. If the network is $n \times n$, we will use a filter of size $n/k$ (assume $k$ evenly divides $n$). To compute entry $(a, b)$ of the resulting image, we apply this filter to the ""subimage"" of pixels $(i, j)$ from the original image, where $i \bmod k = a$ and $j \bmod k = b$. What underlying assumption about patterns in images is built into a regular convolutional network, but not this one?",This one does not encode the fact that nearby groups of pixels work together to encode information (that there is spatial locality of useful patterns in an image).
MIT Fall 2018,4,a,3,Neural Networks,Text,"You are working on a new system that will replace Keras for building neural networks. It is founded on the ideas of series and parallel combination. For simplicity, in this problem, we will assume all of our modules have input and output dimension $n$.
A series combination of two modules looks like this:
If you think of each module as a function, then the final output
$$
\hat{y}=M_{2}\left(M_{1}\left(x ; W_{1}\right) ; W_{2}\right) .
$$
A parallel combination of two modules looks like this (we added the outputs of the two modules to keep the input and output dimensions equal).
If you think of each module as a function, then the final output
$$
\hat{y}=M_{1}\left(x ; W_{1}\right)+M_{2}\left(x ; W_{2}\right)
$$
We won't assume that we know anything about the modules, except that they are feed-forward, have some collection of parameters $W_{i}$, which we will treat as a single vector, and that we can compute
$$
M_{\mathrm{i}}\left(v ; W_{i}\right), \frac{\partial M_{i}\left(v ; W_{\mathrm{i}}\right)}{\partial W_{\mathrm{i}}} \text { and } \frac{\partial M_{\mathrm{i}}\left(v ; W_{i}\right)}{\partial v}
$$
for each module, where $v$ is the input to that module. Assume that our loss function is squared loss, so
$$
L(\hat{y}, y)=\frac{1}{2}(\hat{y}-y)^{2}
$$

What is $\partial L(\hat{y}, y) / \partial W_{1}$ for a series combination of $M_{1}$ and $M_{2}$ ? Write your answer in terms of input $x$, target output $y$, and weights $W_{1}$ and $W_{2}$, using the given forward and gradient functions.","$$
\left(\frac{\partial M_{2}\left(a_{1} ; W_{2}\right)}{\partial a_{1}} \frac{\partial M_{1}\left(x ; W_{1}\right)}{\partial W_{1}}\right)^{T}\left(M_{2}\left(M_{1}\left(x ; W_{1}\right) ; W_{2}\right)-y\right)
$$
where $a_{1}=M_{1}\left(x ; W_{1}\right)$."
MIT Fall 2018,4,b,3,Neural Networks,Text,"You are working on a new system that will replace Keras for building neural networks. It is founded on the ideas of series and parallel combination. For simplicity, in this problem, we will assume all of our modules have input and output dimension $n$.
A series combination of two modules looks like this:
If you think of each module as a function, then the final output
$$
\hat{y}=M_{2}\left(M_{1}\left(x ; W_{1}\right) ; W_{2}\right) .
$$
A parallel combination of two modules looks like this (we added the outputs of the two modules to keep the input and output dimensions equal).
If you think of each module as a function, then the final output
$$
\hat{y}=M_{1}\left(x ; W_{1}\right)+M_{2}\left(x ; W_{2}\right)
$$
We won't assume that we know anything about the modules, except that they are feed-forward, have some collection of parameters $W_{i}$, which we will treat as a single vector, and that we can compute
$$
M_{\mathrm{i}}\left(v ; W_{i}\right), \frac{\partial M_{i}\left(v ; W_{\mathrm{i}}\right)}{\partial W_{\mathrm{i}}} \text { and } \frac{\partial M_{\mathrm{i}}\left(v ; W_{i}\right)}{\partial v}
$$
for each module, where $v$ is the input to that module. Assume that our loss function is squared loss, so
$$
L(\hat{y}, y)=\frac{1}{2}(\hat{y}-y)^{2}
$$


What is $\partial L / \partial W_{1}$ for a parallel combination of $M_{1}$ and $M_{2}$ ? Write your answer in terms of input $x$, target output $y$, and weights $W_{1}$ and $W_{2}$, using the given forward and gradient functions.","$$
\left(\frac{\partial M_{1}\left(x ; W_{1}\right)}{\partial W_{1}}\right)^{T}\left(M_{1}\left(x ; W_{1}\right)+M_{2}\left(x ; W_{2}\right)-y\right)
$$"
MIT Fall 2018,5,a,1.857142857,MDPs,Image,"Consider the following MDP with $k+4$ states. There are two actions, $a_{1}$ and $a_{2}$. Arrows with no labels represent a transition for both actions with probability 1. Arrows labeled $a / p$ make the transition on action $a$ with probability $p$. States with no label have reward 0. Two states have reward $+1$, obtained when taking an action in that state. There are $k-2$ states between $s_{1}$ and $s_{k}$, with a deterministic transition on any action (so that once you are in $s_{1}$ you are guaranteed to end up in $s_{k}$ in $k-1$ steps).
We are interested in the infinite-horizon discounted values of some states in this MDP. What is $V\left(s_{1}\right)$ as a function of $k$ when $\gamma=0$ ?",0
MIT Fall 2018,5,b,1.857142857,MDPs,Image,"Consider the following MDP with $k+4$ states. There are two actions, $a_{1}$ and $a_{2}$. Arrows with no labels represent a transition for both actions with probability 1. Arrows labeled $a / p$ make the transition on action $a$ with probability $p$. States with no label have reward 0. Two states have reward $+1$, obtained when taking an action in that state. There are $k-2$ states between $s_{1}$ and $s_{k}$, with a deterministic transition on any action (so that once you are in $s_{1}$ you are guaranteed to end up in $s_{k}$ in $k-1$ steps).
We are interested in the infinite-horizon discounted values of some states in this MDP. What is $V\left(s_{1}\right)$ as a function of $k$ when $\gamma=1$ ?",1
MIT Fall 2018,5,c,1.857142857,MDPs,Image,"Consider the following MDP with $k+4$ states. There are two actions, $a_{1}$ and $a_{2}$. Arrows with no labels represent a transition for both actions with probability 1. Arrows labeled $a / p$ make the transition on action $a$ with probability $p$. States with no label have reward 0. Two states have reward $+1$, obtained when taking an action in that state. There are $k-2$ states between $s_{1}$ and $s_{k}$, with a deterministic transition on any action (so that once you are in $s_{1}$ you are guaranteed to end up in $s_{k}$ in $k-1$ steps).
We are interested in the infinite-horizon discounted values of some states in this MDP. What is $V\left(s_{1}\right)$ as a function of $k$ when $0<\gamma<1$ ?",$\gamma^{k-1}$
MIT Fall 2018,5,d,1.857142857,MDPs,Image,"Consider the following MDP with $k+4$ states. There are two actions, $a_{1}$ and $a_{2}$. Arrows with no labels represent a transition for both actions with probability 1. Arrows labeled $a / p$ make the transition on action $a$ with probability $p$. States with no label have reward 0. Two states have reward $+1$, obtained when taking an action in that state. There are $k-2$ states between $s_{1}$ and $s_{k}$, with a deterministic transition on any action (so that once you are in $s_{1}$ you are guaranteed to end up in $s_{k}$ in $k-1$ steps).
We are interested in the infinite-horizon discounted values of some states in this MDP. What is $V\left(s_{x}\right)$ when $\gamma=0$ ?",0
MIT Fall 2018,5,e,1.857142857,MDPs,Image,"Consider the following MDP with $k+4$ states. There are two actions, $a_{1}$ and $a_{2}$. Arrows with no labels represent a transition for both actions with probability 1. Arrows labeled $a / p$ make the transition on action $a$ with probability $p$. States with no label have reward 0. Two states have reward $+1$, obtained when taking an action in that state. There are $k-2$ states between $s_{1}$ and $s_{k}$, with a deterministic transition on any action (so that once you are in $s_{1}$ you are guaranteed to end up in $s_{k}$ in $k-1$ steps).
We are interested in the infinite-horizon discounted values of some states in this MDP. What is $V\left(s_{x}\right)$ when $\gamma=1$ ?",1
MIT Fall 2018,5,f,1.857142857,MDPs,Image,"Consider the following MDP with $k+4$ states. There are two actions, $a_{1}$ and $a_{2}$. Arrows with no labels represent a transition for both actions with probability 1. Arrows labeled $a / p$ make the transition on action $a$ with probability $p$. States with no label have reward 0. Two states have reward $+1$, obtained when taking an action in that state. There are $k-2$ states between $s_{1}$ and $s_{k}$, with a deterministic transition on any action (so that once you are in $s_{1}$ you are guaranteed to end up in $s_{k}$ in $k-1$ steps).
We are interested in the infinite-horizon discounted values of some states in this MDP. What is $V\left(s_{x}\right)$ when $0<\gamma<1$ ?",$\frac{\gamma}{2-\gamma}$
MIT Fall 2018,5,g,1.857142857,MDPs,Image,"Consider the following MDP with $k+4$ states. There are two actions, $a_{1}$ and $a_{2}$. Arrows with no labels represent a transition for both actions with probability 1. Arrows labeled $a / p$ make the transition on action $a$ with probability $p$. States with no label have reward 0. Two states have reward $+1$, obtained when taking an action in that state. There are $k-2$ states between $s_{1}$ and $s_{k}$, with a deterministic transition on any action (so that once you are in $s_{1}$ you are guaranteed to end up in $s_{k}$ in $k-1$ steps).
We are interested in the infinite-horizon discounted values of some states in this MDP. Under what conditions on $k$ and $\gamma$ would we prefer to take action $a_{1}$ in state $s_{0}$ ? Write down a specific mathematical relationship.",When $(9 / 10) \gamma^{k-1}>\gamma /(2-\gamma)$.
MIT Fall 2018,6,a-p,12,Regression,Image,"We generated a data set with 5 data-points, with $x$ and $y$ values in $\mathbb{R}$ and applied several regression methods to it.

For each figure below, specify (a) which regression methods could possibly have generated the hypothesis on some data set and (b) given that each hypothesis was actually generated by exactly one of these methods, match each hypothesis to a single method.
A 1-Nearest neighbor
B Regression tree (with constants in the leaves)
C Regression tree (with linear regressors in the leaves)
D Linear regression with no feature transformation
E Linear regression with second-order polynomial features
F Linear regression with fifth-order polynomial features
G Neural network with no hidden layer and sigmoid output non-linearity
H Neural network with one ReLU hidden layer and no output non-linearity",Image matching
MIT Fall 2018,7,a,2,RNNs,Text,"Recall the specification of a standard recurrent neural network (RNN): input $x_{t}$ of dimension $\ell \times 1$, state $s_{t}$ of dimension $m \times 1$, and output $y_{t}$ of dimension $v \times 1$. The weights in the network, then, are
$$
\begin{aligned}
&W^{s x}: m \times \ell \\
&W^{s s}: m \times m \\
&W^{O}: v \times m
\end{aligned}
$$
with activation functions $f_{1}$ and $f_{2}$. Throughout this problem, for simplicity, we will treat all offsets as equal to 0 . Finally, the operation of the RNN is described by
$$
\begin{aligned}
&s_{t}=f_{1}\left(W^{s x} x_{t}+W^{s s} s_{t-1}\right) \\
&y_{t}=f_{2}\left(W^{o} s_{t}\right)
\end{aligned}
$$

Consider an RNN defined by $\ell=1, m=2, v=1, f_{1}=f_{2}=$ the identity function, and
$$
W^{s x}=\left[\begin{array}{l}
5 \\
6
\end{array}\right] \quad W^{s s}=\left[\begin{array}{ll}
1 & 2 \\
3 & 4
\end{array}\right] \quad W^{O}=\left[\begin{array}{ll}
-3 & -2
\end{array}\right]
$$
Assuming the initial state is all 0 , and the input sequence is $[[1],[-1]]$, what is the output sequence?","$$
\begin{aligned}
s_{1} &=[5,6]^{T} \\
y_{1} &=-15-12=-27 \\
s_{2} &=[-5,-6]^{T}+[5+12,15+24]^{T}=[12,33]^{T} \\
y_{2} &=-36-66=-102
\end{aligned}
$$
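As a sanity check, the recurrence can also be run numerically. The following is a minimal sketch in plain Python; the helper names `matvec` and `run_linear_rnn` are illustrative and not part of the original exam, while the weights, initial state, and input sequence are those given in the question.

```python
# Weights from the question; shapes follow the problem statement.
W_sx = [[5], [6]]          # m x l = 2 x 1
W_ss = [[1, 2], [3, 4]]    # m x m = 2 x 2
W_o = [[-3, -2]]           # v x m = 1 x 2


def matvec(M, v):
    # Multiply matrix M (a list of rows) by vector v (a flat list).
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def run_linear_rnn(xs, s0):
    # f1 = f2 = identity and all offsets are 0, as stated in the question.
    s, ys = s0, []
    for x in xs:
        # s_t = W_sx x_t + W_ss s_{t-1}
        s = [a + b for a, b in zip(matvec(W_sx, x), matvec(W_ss, s))]
        # y_t = W_o s_t
        ys.append(matvec(W_o, s))
    return ys


outputs = run_linear_rnn([[1], [-1]], [0, 0])
print(outputs)
```
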
So the answer is $[[-27],[-102]]$. Don't worry about the transpose."
MIT Fall 2018,7,b.i,2,RNNs,Text,"Recall the specification of a standard recurrent neural network (RNN): input $x_{t}$ of dimension $\ell \times 1$, state $s_{t}$ of dimension $m \times 1$, and output $y_{t}$ of dimension $v \times 1$. The weights in the network, then, are
$$
\begin{aligned}
&W^{s x}: m \times \ell \\
&W^{s s}: m \times m \\
&W^{O}: v \times m
\end{aligned}
$$
with activation functions $f_{1}$ and $f_{2}$. Throughout this problem, for simplicity, we will treat all offsets as equal to 0 . Finally, the operation of the RNN is described by
$$
\begin{aligned}
&s_{t}=f_{1}\left(W^{s x} x_{t}+W^{s s} s_{t-1}\right) \\
&y_{t}=f_{2}\left(W^{o} s_{t}\right)
\end{aligned}
$$
We can think of the RNN as mapping input sequences to output sequences. Jody thinks that if we remove $f_{1}$ and $f_{2}$ then the mapping from input sequence to output sequence can be achieved by a hypothesis of the form $Y=W X$. In the case of a length 3 sequence, assuming inputs and outputs are 1-dimensional, $s_{0}=[0], X=\left[x_{1}, x_{2}, x_{3}\right]^{T}, Y=\left[y_{1}, y_{2}, y_{3}\right]^{T}$, and $W$ is $3 \times 3$.
Is Jody right?",Yes
MIT Fall 2018,7,b.ii,2,RNNs,Text,"Recall the specification of a standard recurrent neural network (RNN): input $x_{t}$ of dimension $\ell \times 1$, state $s_{t}$ of dimension $m \times 1$, and output $y_{t}$ of dimension $v \times 1$. The weights in the network, then, are
$$
\begin{aligned}
&W^{s x}: m \times \ell \\
&W^{s s}: m \times m \\
&W^{O}: v \times m
\end{aligned}
$$
with activation functions $f_{1}$ and $f_{2}$. Throughout this problem, for simplicity, we will treat all offsets as equal to 0 . Finally, the operation of the RNN is described by
$$
\begin{aligned}
&s_{t}=f_{1}\left(W^{s x} x_{t}+W^{s s} s_{t-1}\right) \\
&y_{t}=f_{2}\left(W^{o} s_{t}\right)
\end{aligned}
$$
We can think of the RNN as mapping input sequences to output sequences. Jody thinks that if we remove $f_{1}$ and $f_{2}$ then the mapping from input sequence to output sequence can be achieved by a hypothesis of the form $Y=W X$. In the case of a length 3 sequence, assuming inputs and outputs are 1-dimensional, $s_{0}=[0], X=\left[x_{1}, x_{2}, x_{3}\right]^{T}, Y=\left[y_{1}, y_{2}, y_{3}\right]^{T}$, and $W$ is $3 \times 3$.
If Jody is right, provide a definition for $W$ in Jody's model in terms of $W^{s x}, W^{s s}$, and $W^{O}$ of the original RNN that makes them equivalent. If Jody is wrong, explain why.","$$
W=\left[\begin{array}{ccc}
W^{O} W^{s x} & 0 & 0 \\
W^{O} W^{s s} W^{s x} & W^{O} W^{s x} & 0 \\
W^{O} W^{s s} W^{s s} W^{s x} & W^{O} W^{s s} W^{s x} & W^{O} W^{s x}
\end{array}\right]
$$"
MIT Fall 2018,7,c.i,2,RNNs,Text,"Recall the specification of a standard recurrent neural network (RNN): input $x_{t}$ of dimension $\ell \times 1$, state $s_{t}$ of dimension $m \times 1$, and output $y_{t}$ of dimension $v \times 1$. The weights in the network, then, are
$$
\begin{aligned}
&W^{s x}: m \times \ell \\
&W^{s s}: m \times m \\
&W^{O}: v \times m
\end{aligned}
$$
with activation functions $f_{1}$ and $f_{2}$. Throughout this problem, for simplicity, we will treat all offsets as equal to 0. Finally, the operation of the RNN is described by
$$
\begin{aligned}
&s_{t}=f_{1}\left(W^{s x} x_{t}+W^{s s} s_{t-1}\right) \\
&y_{t}=f_{2}\left(W^{O} s_{t}\right)
\end{aligned}
$$
Pat thinks a different RNN model would be good. Its operation is defined by
$$
\begin{aligned}
s_{t}^{(i)} &=f_{1}\left(W_{i}^{s x} x_{t}^{(i)}+W_{i}^{s s} s_{t-1}^{(i)}\right) \\
y_{t} &=f_{2}\left(W^{O} s_{t}\right)
\end{aligned}
$$
where the dimension of the state is $m=k \cdot \ell$, so there are $k$ state dimensions for each input dimension, $s^{(i)}$ is the $i$th group of $k$ dimensions in the state vector, $x^{(i)}$ is the $i$th entry in the input vector, $W_{i}^{s x}$ is $k \times 1$ and $W_{i}^{s s}$ is $k \times k$.
Can this model represent the same set of state machines as a regular RNN?
",No
MIT Fall 2018,7,c.ii,2,RNNs,Text,"Recall the specification of a standard recurrent neural network (RNN): input $x_{t}$ of dimension $\ell \times 1$, state $s_{t}$ of dimension $m \times 1$, and output $y_{t}$ of dimension $v \times 1$. The weights in the network, then, are
$$
\begin{aligned}
&W^{s x}: m \times \ell \\
&W^{s s}: m \times m \\
&W^{O}: v \times m
\end{aligned}
$$
with activation functions $f_{1}$ and $f_{2}$. Throughout this problem, for simplicity, we will treat all offsets as equal to 0. Finally, the operation of the RNN is described by
$$
\begin{aligned}
&s_{t}=f_{1}\left(W^{s x} x_{t}+W^{s s} s_{t-1}\right) \\
&y_{t}=f_{2}\left(W^{O} s_{t}\right)
\end{aligned}
$$
Pat thinks a different RNN model would be good. Its operation is defined by
$$
\begin{aligned}
s_{t}^{(i)} &=f_{1}\left(W_{i}^{s x} x_{t}^{(i)}+W_{i}^{s s} s_{t-1}^{(i)}\right) \\
y_{t} &=f_{2}\left(W^{O} s_{t}\right)
\end{aligned}
$$
where the dimension of the state is $m=k \cdot \ell$, so there are $k$ state dimensions for each input dimension, $s^{(i)}$ is the $i$th group of $k$ dimensions in the state vector, $x^{(i)}$ is the $i$th entry in the input vector, $W_{i}^{s x}$ is $k \times 1$ and $W_{i}^{s s}$ is $k \times k$.
If this model can represent the same set of state machines as a regular RNN, explain how to convert the weights of a regular RNN into weights for Pat's model.
If this model cannot represent the same set of state machines as a regular RNN, describe a concrete input/output relationship (for example, the output $y_{t}$ is the sum of all the inputs $x_{t}^{(1)}, \ldots, x_{t}^{(\ell)}$ ) that can be represented by a regular RNN but cannot be represented by Pat's model, for any value of $k$.",Output a 1 if and only if $x^{(1)}$ and $x^{(2)}$ were simultaneously non-zero.
MIT Fall 2018,8,a,4,Decision Trees,Text,"There are different strategies for pruning decision trees. We assume that we grow 
a decision tree until there is one or a small number of elements in each leaf. Then, we 
prune by deleting individual leaves of the tree until the score of the tree starts to get worse.
The question is how to score each possible pruning of the tree.
 Here is a definition of the score: The score is the percentage correct of the tree on the training set. Explain whether or not it would be a good 
idea and give a reason why or why not.",Not a good idea. The original tree was constructed to maximize performance on the training set. Pruning any part of the tree will reduce performance on the training set.
MIT Fall 2018,8,b,4,Decision Trees,Text,"There are different strategies for pruning decision trees. We assume that we grow 
a decision tree until there is one or a small number of elements in each leaf. Then, we 
prune by deleting individual leaves of the tree until the score of the tree starts to get worse.
The question is how to score each possible pruning of the tree.
 Here is a definition of the score: The score is the percentage correct of the tree on a separate validation set. Explain whether or not it would be a good 
idea and give a reason why or why not.",A good idea. The validation set will be an independent check on whether pruning a node is likely to increase or decrease performance on unseen data.
MIT Fall 2018,8,c,4,Decision Trees,Text,"There are different strategies for pruning decision trees. We assume that we grow 
a decision tree until there is one or a small number of elements in each leaf. Then, we 
prune by deleting individual leaves of the tree until the score of the tree starts to get worse.
The question is how to score each possible pruning of the tree.
 Here is a definition of the score: The score is the percentage correct of the tree, computed using cross validation. Explain whether or not it would be a good idea and give a reason why or why not.","Not a good idea. Cross-validation allows you to evaluate algorithms, not individual
hypotheses. Cross-validation will construct many new hypotheses and average their
performance; this will not tell you whether pruning a node in a particular hypothesis is
worthwhile or not."
MIT Fall 2018,8,d,4,Decision Trees,Text,"There are different strategies for pruning decision trees. We assume that we grow 
a decision tree until there is one or a small number of elements in each leaf. Then, we 
prune by deleting individual leaves of the tree until the score of the tree starts to get worse.
The question is how to score each possible pruning of the tree.
 Here is a definition of the score: The score is the percentage correct of the tree, computed on the training set, minus a
constant C times the number of nodes in the tree. C is chosen in advance by running this algorithm (grow a large tree then prune in order
to maximize percent correct minus C times number of nodes) for many different values
of C, and choosing the value of C that minimizes training-set error. Explain whether or not it would be a good 
idea and give a reason why or why not.","Not a good idea. Running trials to maximize performance on the training set will not
give us an indication of whether this algorithm will produce answers that generalize to
other data sets."
MIT Fall 2018,8,e,4,Decision Trees,Text,"There are different strategies for pruning decision trees. We assume that we grow 
a decision tree until there is one or a small number of elements in each leaf. Then, we 
prune by deleting individual leaves of the tree until the score of the tree starts to get worse.
The question is how to score each possible pruning of the tree.
 Here is a definition of the score: The score is the percentage correct of the tree, computed on the training set, minus a
constant C times the number of nodes in the tree.
C is chosen in advance by running cross-validation trials of this algorithm (grow a
large tree then prune in order to maximize percent correct minus C times number of
nodes) for many different values of C, and choosing the value of C that minimizes
cross-validation error. Explain whether or not it would be a good 
idea and give a reason why or why not.","A good idea when we don't have enough data to hold out a validation set. Choosing
C by cross-validation will hopefully give us an effective general way of penalizing for
complexity of the tree (for this type of data)."
MIT Fall 2018,9,a.i,1.1,MDPs,Text,"Sometimes we can make robust reinforcement-learning algorithms by reducing the problem to supervised learning. Assume:
- The state space is $\mathbb{R}^{d}$, so in general the same state $s$ may not occur more than once in our data set.
- The action space is $\{0,1\}$.
- The space of possible rewards is $\{0,1\}$.
- There is a discount factor $\gamma$.
You are given a data set $\mathcal{D}$ of experience interacting with a domain. It contains $n$ tuples, each of the form $\left(s, a, r, s^{\prime}\right)$. Let $\mathcal{D}_{0}$ be the subset of the data tuples where $a=0$, and similarly $\mathcal{D}_{1}$ be the subset of the data tuples where $a=1$.

Assume you have supervised classification and regression algorithms available to you, so that you can call classify $(X, Y)$ or regress $(X, Y)$ where $X$ is a matrix of input values and $Y$ is a vector of output values, and get out a hypothesis.

In each of the following questions, we will ask you to construct a call to one of these procedures to produce a $Q, V$, or $\pi$ function. In each case, we will ask you to specify:
- Whether it is a regression or classification problem.
- The subset of $\mathcal{D}$ you will use.
- How you will construct a training example $(x, y)$ from an original tuple $\left(s, a, r, s^{\prime}\right)$.
For example, if you wanted to train a neural network to take in a state $s$ and predict the expected next state given that you take action 1, then you might do a regression problem using data $\mathcal{D}_{1}$, by setting $x=s$ and $y=s^{\prime}$.
Assume horizon $h=1$. Construct a supervised learning problem to find $Q^{1}(s, 0)$, that is, the horizon-1 $Q$ value for action 0 , as a function of state $s$.
Would you call classification or regression?",regression
MIT Fall 2018,9,a.ii,1.1,MDPs,Text,"Sometimes we can make robust reinforcement-learning algorithms by reducing the problem to supervised learning. Assume:
- The state space is $\mathbb{R}^{d}$, so in general the same state $s$ may not occur more than once in our data set.
- The action space is $\{0,1\}$.
- The space of possible rewards is $\{0,1\}$.
- There is a discount factor $\gamma$.
You are given a data set $\mathcal{D}$ of experience interacting with a domain. It contains $n$ tuples, each of the form $\left(s, a, r, s^{\prime}\right)$. Let $\mathcal{D}_{0}$ be the subset of the data tuples where $a=0$, and similarly $\mathcal{D}_{1}$ be the subset of the data tuples where $a=1$.

Assume you have supervised classification and regression algorithms available to you, so that you can call classify $(X, Y)$ or regress $(X, Y)$ where $X$ is a matrix of input values and $Y$ is a vector of output values, and get out a hypothesis.

In each of the following questions, we will ask you to construct a call to one of these procedures to produce a $Q, V$, or $\pi$ function. In each case, we will ask you to specify:
- Whether it is a regression or classification problem.
- The subset of $\mathcal{D}$ you will use.
- How you will construct a training example $(x, y)$ from an original tuple $\left(s, a, r, s^{\prime}\right)$.
For example, if you wanted to train a neural network to take in a state $s$ and predict the expected next state given that you take action 1, then you might do a regression problem using data $\mathcal{D}_{1}$, by setting $x=s$ and $y=s^{\prime}$.
Assume horizon $h=1$. Construct a supervised learning problem to find $Q^{1}(s, 0)$, that is, the horizon-1 $Q$ value for action 0 , as a function of state $s$.
Will you use subset D, D0, or D1?",D0
MIT Fall 2018,9,a.iii,1.1,MDPs,Text,"Sometimes we can make robust reinforcement-learning algorithms by reducing the problem to supervised learning. Assume:
- The state space is $\mathbb{R}^{d}$, so in general the same state $s$ may not occur more than once in our data set.
- The action space is $\{0,1\}$.
- The space of possible rewards is $\{0,1\}$.
- There is a discount factor $\gamma$.
You are given a data set $\mathcal{D}$ of experience interacting with a domain. It contains $n$ tuples, each of the form $\left(s, a, r, s^{\prime}\right)$. Let $\mathcal{D}_{0}$ be the subset of the data tuples where $a=0$, and similarly $\mathcal{D}_{1}$ be the subset of the data tuples where $a=1$.

Assume you have supervised classification and regression algorithms available to you, so that you can call classify $(X, Y)$ or regress $(X, Y)$ where $X$ is a matrix of input values and $Y$ is a vector of output values, and get out a hypothesis.

In each of the following questions, we will ask you to construct a call to one of these procedures to produce a $Q, V$, or $\pi$ function. In each case, we will ask you to specify:
- Whether it is a regression or classification problem.
- The subset of $\mathcal{D}$ you will use.
- How you will construct a training example $(x, y)$ from an original tuple $\left(s, a, r, s^{\prime}\right)$.
For example, if you wanted to train a neural network to take in a state $s$ and predict the expected next state given that you take action 1 , then you might do a regression problem using data $D_{1}$, by setting $x=s$ and $y=s^{\prime}$.
Assume horizon $h=1$. Construct a supervised learning problem to find $Q^{1}(s, 0)$, that is, the horizon-1 $Q$ value for action 0 , as a function of state $s$.
How will you construct a training example $(x, y)$ from an original tuple $\left(s, a, r, s^{\prime}\right)$?","x: s, y: r"
MIT Fall 2018,9,b.i,1.1,MDPs,Text,"Sometimes we can make robust reinforcement-learning algorithms by reducing the problem to supervised learning. Assume:
- The state space is $\mathbb{R}^{d}$, so in general the same state $s$ may not occur more than once in our data set.
- The action space is $\{0,1\}$.
- The space of possible rewards is $\{0,1\}$.
- There is a discount factor $\gamma$.
You are given a data set $\mathcal{D}$ of experience interacting with a domain. It contains $n$ tuples, each of the form $\left(s, a, r, s^{\prime}\right)$. Let $\mathcal{D}_{0}$ be the subset of the data tuples where $a=0$, and similarly $\mathcal{D}_{1}$ be the subset of the data tuples where $a=1$.

Assume you have supervised classification and regression algorithms available to you, so that you can call classify $(X, Y)$ or regress $(X, Y)$ where $X$ is a matrix of input values and $Y$ is a vector of output values, and get out a hypothesis.

In each of the following questions, we will ask you to construct a call to one of these procedures to produce a $Q, V$, or $\pi$ function. In each case, we will ask you to specify:
- Whether it is a regression or classification problem.
- The subset of $\mathcal{D}$ you will use.
- How you will construct a training example $(x, y)$ from an original tuple $\left(s, a, r, s^{\prime}\right)$.
For example, if you wanted to train a neural network to take in a state $s$ and predict the expected next state given that you take action 1 , then you might do a regression problem using data $D_{1}$, by setting $x=s$ and $y=s^{\prime}$.
Assuming horizon $h=1$, construct a supervised learning problem to find the optimal policy $\pi^{1}$. Recall that the space of possible rewards is $\{0,1\}$.
Would you call this classification or regression?
",classification
MIT Fall 2018,9,b.ii,1.1,MDPs,Text,"Sometimes we can make robust reinforcement-learning algorithms by reducing the problem to supervised learning. Assume:
- The state space is $\mathbb{R}^{d}$, so in general the same state $s$ may not occur more than once in our data set.
- The action space is $\{0,1\}$.
- The space of possible rewards is $\{0,1\}$.
- There is a discount factor $\gamma$.
You are given a data set $\mathcal{D}$ of experience interacting with a domain. It contains $n$ tuples, each of the form $\left(s, a, r, s^{\prime}\right)$. Let $\mathcal{D}_{0}$ be the subset of the data tuples where $a=0$, and similarly $\mathcal{D}_{1}$ be the subset of the data tuples where $a=1$.

Assume you have supervised classification and regression algorithms available to you, so that you can call classify $(X, Y)$ or regress $(X, Y)$ where $X$ is a matrix of input values and $Y$ is a vector of output values, and get out a hypothesis.

In each of the following questions, we will ask you to construct a call to one of these procedures to produce a $Q, V$, or $\pi$ function. In each case, we will ask you to specify:
- Whether it is a regression or classification problem.
- The subset of $\mathcal{D}$ you will use.
- How you will construct a training example $(x, y)$ from an original tuple $\left(s, a, r, s^{\prime}\right)$.
For example, if you wanted to train a neural network to take in a state $s$ and predict the expected next state given that you take action 1 , then you might do a regression problem using data $D_{1}$, by setting $x=s$ and $y=s^{\prime}$.
Assuming horizon $h=1$, construct a supervised learning problem to find the optimal policy $\pi^{1}$. Recall that the space of possible rewards is $\{0,1\}$.
Will you use subset D, D0, or D1?
",D
MIT Fall 2018,9,b.iii,1.1,MDPs,Text,"Sometimes we can make robust reinforcement-learning algorithms by reducing the problem to supervised learning. Assume:
- The state space is $\mathbb{R}^{d}$, so in general the same state $s$ may not occur more than once in our data set.
- The action space is $\{0,1\}$.
- The space of possible rewards is $\{0,1\}$.
- There is a discount factor $\gamma$.
You are given a data set $\mathcal{D}$ of experience interacting with a domain. It contains $n$ tuples, each of the form $\left(s, a, r, s^{\prime}\right)$. Let $\mathcal{D}_{0}$ be the subset of the data tuples where $a=0$, and similarly $\mathcal{D}_{1}$ be the subset of the data tuples where $a=1$.

Assume you have supervised classification and regression algorithms available to you, so that you can call classify $(X, Y)$ or regress $(X, Y)$ where $X$ is a matrix of input values and $Y$ is a vector of output values, and get out a hypothesis.

In each of the following questions, we will ask you to construct a call to one of these procedures to produce a $Q, V$, or $\pi$ function. In each case, we will ask you to specify:
- Whether it is a regression or classification problem.
- The subset of $\mathcal{D}$ you will use.
- How you will construct a training example $(x, y)$ from an original tuple $\left(s, a, r, s^{\prime}\right)$.
For example, if you wanted to train a neural network to take in a state $s$ and predict the expected next state given that you take action 1 , then you might do a regression problem using data $D_{1}$, by setting $x=s$ and $y=s^{\prime}$.
Assuming horizon $h=1$, construct a supervised learning problem to find the optimal policy $\pi^{1}$. Recall that the space of possible rewards is $\{0,1\}$.

How will you construct a training example $(x, y)$ from an original tuple $\left(s, a, r, s^{\prime}\right)$?
","x: s, y: a if r = 1 else 1 - a"
MIT Fall 2018,9,c.i,1.1,MDPs,Text,"Sometimes we can make robust reinforcement-learning algorithms by reducing the problem to supervised learning. Assume:
- The state space is $\mathbb{R}^{d}$, so in general the same state $s$ may not occur more than once in our data set.
- The action space is $\{0,1\}$.
- The space of possible rewards is $\{0,1\}$.
- There is a discount factor $\gamma$.
You are given a data set $\mathcal{D}$ of experience interacting with a domain. It contains $n$ tuples, each of the form $\left(s, a, r, s^{\prime}\right)$. Let $\mathcal{D}_{0}$ be the subset of the data tuples where $a=0$, and similarly $\mathcal{D}_{1}$ be the subset of the data tuples where $a=1$.

Assume you have supervised classification and regression algorithms available to you, so that you can call classify $(X, Y)$ or regress $(X, Y)$ where $X$ is a matrix of input values and $Y$ is a vector of output values, and get out a hypothesis.

In each of the following questions, we will ask you to construct a call to one of these procedures to produce a $Q, V$, or $\pi$ function. In each case, we will ask you to specify:
- Whether it is a regression or classification problem.
- The subset of $\mathcal{D}$ you will use.
- How you will construct a training example $(x, y)$ from an original tuple $\left(s, a, r, s^{\prime}\right)$.
For example, if you wanted to train a neural network to take in a state $s$ and predict the expected next state given that you take action 1 , then you might do a regression problem using data $D_{1}$, by setting $x=s$ and $y=s^{\prime}$.
Assume that we have already learned $V^{3}(s)$, that is, a function that maps a state $s$ into the optimal horizon-three value.

Construct a supervised learning problem to find the optimal horizon-4 Q function for action 0, $Q^{4}(s, 0)$. You can make calls to $V^{3}$.

Would you call this classification or regression?",regression
MIT Fall 2018,9,c.ii,1.1,MDPs,Text,"Sometimes we can make robust reinforcement-learning algorithms by reducing the problem to supervised learning. Assume:
- The state space is $\mathbb{R}^{d}$, so in general the same state $s$ may not occur more than once in our data set.
- The action space is $\{0,1\}$.
- The space of possible rewards is $\{0,1\}$.
- There is a discount factor $\gamma$.
You are given a data set $\mathcal{D}$ of experience interacting with a domain. It contains $n$ tuples, each of the form $\left(s, a, r, s^{\prime}\right)$. Let $\mathcal{D}_{0}$ be the subset of the data tuples where $a=0$, and similarly $\mathcal{D}_{1}$ be the subset of the data tuples where $a=1$.

Assume you have supervised classification and regression algorithms available to you, so that you can call classify $(X, Y)$ or regress $(X, Y)$ where $X$ is a matrix of input values and $Y$ is a vector of output values, and get out a hypothesis.

In each of the following questions, we will ask you to construct a call to one of these procedures to produce a $Q, V$, or $\pi$ function. In each case, we will ask you to specify:
- Whether it is a regression or classification problem.
- The subset of $\mathcal{D}$ you will use.
- How you will construct a training example $(x, y)$ from an original tuple $\left(s, a, r, s^{\prime}\right)$.
For example, if you wanted to train a neural network to take in a state $s$ and predict the expected next state given that you take action 1 , then you might do a regression problem using data $D_{1}$, by setting $x=s$ and $y=s^{\prime}$.
Assume that we have already learned $V^{3}(s)$, that is, a function that maps a state $s$ into the optimal horizon-three value.

Construct a supervised learning problem to find the optimal horizon-4 Q function for action 0, $Q^{4}(s, 0)$. You can make calls to $V^{3}$.
Will you use subset D, D0, or D1?",D0
MIT Fall 2018,9,c.iii,1.1,MDPs,Text,"Sometimes we can make robust reinforcement-learning algorithms by reducing the problem to supervised learning. Assume:
- The state space is $\mathbb{R}^{d}$, so in general the same state $s$ may not occur more than once in our data set.
- The action space is $\{0,1\}$.
- The space of possible rewards is $\{0,1\}$.
- There is a discount factor $\gamma$.
You are given a data set $\mathcal{D}$ of experience interacting with a domain. It contains $n$ tuples, each of the form $\left(s, a, r, s^{\prime}\right)$. Let $\mathcal{D}_{0}$ be the subset of the data tuples where $a=0$, and similarly $\mathcal{D}_{1}$ be the subset of the data tuples where $a=1$.

Assume you have supervised classification and regression algorithms available to you, so that you can call classify $(X, Y)$ or regress $(X, Y)$ where $X$ is a matrix of input values and $Y$ is a vector of output values, and get out a hypothesis.

In each of the following questions, we will ask you to construct a call to one of these procedures to produce a $Q, V$, or $\pi$ function. In each case, we will ask you to specify:
- Whether it is a regression or classification problem.
- The subset of $\mathcal{D}$ you will use.
- How you will construct a training example $(x, y)$ from an original tuple $\left(s, a, r, s^{\prime}\right)$.
For example, if you wanted to train a neural network to take in a state $s$ and predict the expected next state given that you take action 1 , then you might do a regression problem using data $D_{1}$, by setting $x=s$ and $y=s^{\prime}$.
Assume that we have already learned $V^{3}(s)$, that is, a function that maps a state $s$ into the optimal horizon-three value.

Construct a supervised learning problem to find the optimal horizon-4 Q function for action 0, $Q^{4}(s, 0)$. You can make calls to $V^{3}$.
How will you construct a training example $(x, y)$ from an original tuple $\left(s, a, r, s^{\prime}\right)$?","x: s, $y: r+\gamma V^{3}\left(s^{\prime}\right)$"
MIT Fall 2018,9,d,1.1,MDPs,Text,"Sometimes we can make robust reinforcement-learning algorithms by reducing the problem to supervised learning. Assume:
- The state space is $\mathbb{R}^{d}$, so in general the same state $s$ may not occur more than once in our data set.
- The action space is $\{0,1\}$.
- The space of possible rewards is $\{0,1\}$.
- There is a discount factor $\gamma$.
You are given a data set $\mathcal{D}$ of experience interacting with a domain. It contains $n$ tuples, each of the form $\left(s, a, r, s^{\prime}\right)$. Let $\mathcal{D}_{0}$ be the subset of the data tuples where $a=0$, and similarly $\mathcal{D}_{1}$ be the subset of the data tuples where $a=1$.

Assume you have supervised classification and regression algorithms available to you, so that you can call classify $(X, Y)$ or regress $(X, Y)$ where $X$ is a matrix of input values and $Y$ is a vector of output values, and get out a hypothesis.

In each of the following questions, we will ask you to construct a call to one of these procedures to produce a $Q, V$, or $\pi$ function. In each case, we will ask you to specify:
- Whether it is a regression or classification problem.
- The subset of $\mathcal{D}$ you will use.
- How you will construct a training example $(x, y)$ from an original tuple $\left(s, a, r, s^{\prime}\right)$.
For example, if you wanted to train a neural network to take in a state $s$ and predict the expected next state given that you take action 1 , then you might do a regression problem using data $D_{1}$, by setting $x=s$ and $y=s^{\prime}$.
Because the state space is continuous, it is difficult to train $V^{4}$ without first estimating $Q^{4}$, given only our data set and $V^{3}$. Explain briefly why.","For any given $s$ we only know what happens when we take one of the actions, but not the other; since they don't line up, we don't have a way to take the max over actions."
MIT Fall 2018,10,a,2,Loss Functions,Image,"We can do machine learning experiments in which we hold all parameters constant, vary a single parameter, take the hypotheses we have learned at each point, and plot its error on the training and test sets. For each experiment below, select a plot from above that indicates the generally expected shape of the curve for training and for test error. If the experiment doesn't make any sense, choose ""not sensible"" rather than a curve above.

Don't worry about the fact that we would generally expect curves to bounce around a bit and not be as smooth as these. Also, don't necessarily interpret the plotted $x$ axis as being for the value $y=0$.
Provide a one-sentence justification for each answer. 
$\mathrm{X}$ axis: size of training set
train error:
test error:
","train error:
$\mathbf{B}$
It's easier to get low training error on a small dataset.
test error:
$\mathbf{A}$
As we get more training data, we generalize better to new test data."
MIT Fall 2018,10,b,2,Loss Functions,Image,"We can do machine learning experiments in which we hold all parameters constant, vary a single parameter, take the hypotheses we have learned at each point, and plot its error on the training and test sets. For each experiment below, select a plot from above that indicates the generally expected shape of the curve for training and for test error. If the experiment doesn't make any sense, choose ""not sensible"" rather than a curve above.

Don't worry about the fact that we would generally expect curves to bounce around a bit and not be as smooth as these. Also, don't necessarily interpret the plotted $x$ axis as being for the value $y=0$.
Provide a one-sentence justification for each answer. 
X axis: number of iterations of gradient descent
train error: 
test error:","train error: $\mathbf{A}$
Training error is usually our objective, and generally decreases with iterations.
test error:
$\mathbf{C}$
Early, we have not fit well enough; later we may overfit."
MIT Fall 2018,10,c,2,Loss Functions,Image,"We can do machine learning experiments in which we hold all parameters constant, vary a single parameter, take the hypotheses we have learned at each point, and plot its error on the training and test sets. For each experiment below, select a plot from above that indicates the generally expected shape of the curve for training and for test error. If the experiment doesn't make any sense, choose ""not sensible"" rather than a curve above.

Don't worry about the fact that we would generally expect curves to bounce around a bit and not be as smooth as these. Also, don't necessarily interpret the plotted $x$ axis as being for the value $y=0$.
Provide a one-sentence justification for each answer. 
X axis: gradient-descent step size $\eta$
train error:
test error:","train error: C
Small step size is slow to converge; big step size may diverge.
test error: $\mathrm{C}$
Test error is likely to suffer in the same way as training error."
MIT Fall 2018,10,d,2,Loss Functions,Image,"We can do machine learning experiments in which we hold all parameters constant, vary a single parameter, take the hypotheses we have learned at each point, and plot its error on the training and test sets. For each experiment below, select a plot from above that indicates the generally expected shape of the curve for training and for test error. If the experiment doesn't make any sense, choose ""not sensible"" rather than a curve above.

Don't worry about the fact that we would generally expect curves to bounce around a bit and not be as smooth as these. Also, don't necessarily interpret the plotted $x$ axis as being for the value $y=0$.
Provide a one-sentence justification for each answer.
$\mathrm{X}$ axis: regularization parameter $\lambda$
train error:
test error:
","train error: B
With bigger $\lambda$ we quit caring about training error.
test error: $\mathbf{C}$
With small $\lambda$ we may overfit; with big $\lambda$ we may not fit well enough."
MIT Spring 2019,1,a,2,Classifiers,Text,"Consider the following 2D dataset in (x,y) format: ((1,-1), +1), ((1,1),  +1), ((1,2.5),+1), ((2,-2),-1), ((2,1),+1),((2,3),+1),((5,-1),-1),((5,-2),-1). Break ties in distance by choosing the point with smaller x_1 coordinate, and if still tied, by smaller x_2 coordinate.
Compute the leave-one-out cross validation accuracy (i.e., average 8-fold cross validation accuracy) of the 1-nearest-neighbor learning algorithm on this dataset.","6/8. When left out of the training set, the point at (1,-1) will be misclassified during testing; similarly for the point at (2,-2)."
MIT Spring 2019,1,b,2,Classifiers,Text,"Consider the following 2D dataset in (x,y) format: ((1,-1), +1), ((1,1),  +1), ((1,2.5),+1), ((2,-2),-1), ((2,1),+1),((2,3),+1),((5,-1),-1),((5,-2),-1). Break ties in distance by choosing the point with smaller x_1 coordinate, and if still tied, by smaller x_2 coordinate.
Compute the leave-one-out cross validation accuracy of the 3-nearest-neighbor learning algorithm on this dataset.","7/8. Now only the point at (2,-2) will be misclassified during testing, when left out of the training set."
MIT Spring 2019,1,c,2,Classifiers,Text,"Consider the following 2D dataset in (x,y) format: ((1,-1), +1), ((1,1),  +1), ((1,2.5),+1), ((2,-2),-1), ((2,1),+1),((2,3),+1),((5,-1),-1),((5,-2),-1). Break ties in distance by choosing the point with smaller x_1 coordinate, and if still tied, by smaller x_2 coordinate.
In the case of the 1-nearest-neighbor learning algorithm, is it possible to strictly increase the leave-one-out cross validation accuracy on this dataset by changing the label of a single point in the original dataset? If so, give such a point.","Yes. Change either the point at (2,-2) to +1 or the point at (1,-1) to -1."
MIT Spring 2019,1,d,2,Classifiers,Text,"Consider the following 2D dataset in (x,y) format: ((1,-1), +1), ((1,1),  +1), ((1,2.5),+1), ((2,-2),-1), ((2,1),+1),((2,3),+1),((5,-1),-1),((5,-2),-1). Break ties in distance by choosing the point with smaller x_1 coordinate, and if still tied, by smaller x_2 coordinate.
In the case of the 3-nearest-neighbor algorithm, is it possible to strictly increase the leave-one-out cross validation accuracy on this dataset by changing the label of a single point in the original dataset? If so, give such a point.","No, not possible. If we try to change the point at (2,-2) to +1, then that point will be correctly predicted during cross-validation as +1. Unfortunately, with that change the two points at (5,-1) and (5,-2) will now be misclassified, making our cross-validation accuracy worse."
MIT Spring 2019,2,a,1.333333333,Decision Trees,Text,"Consider the following 2D dataset in (x,y) format: ((1,-1), +1), ((1,1),  +1), ((1,2.5),+1), ((2,-2),-1), ((2,1),+1),((2,3),+1),((5,-1),-1),((5,-2),-1). We will construct a tree using a greedy algorithm that recursively minimizes weighted average entropy. Recall that the weighted average entropy of a split into subsets A and B is: (fraction of points in A) $\cdot H\left(R_{j, s}^{A}\right)$ + (fraction of points in B) $\cdot H\left(R_{j, s}^{B}\right)$, where the entropy $H\left(R_{m}\right)$ of data in a region $R_{m}$ is given by $H\left(R_{m}\right)=-\sum_{k} \hat{P}_{m k} \log _{2} \hat{P}_{m k}$. The $\hat{P}_{m k}$ is the empirical probability, which is in this case the fraction of items in region $m$ that are of class $k$. Some facts that might be useful to you: H(0) = 0, H(3/5) = 0.97, H(3/8) = 0.95, H(3/4) = 0.81, H(5/6) = 0.65, H(1) = 0. 
Draw the decision tree that would be constructed by our tree algorithm for this dataset. Clearly label the test in each node, which case (yes or no) each branch corresponds to, and the prediction that will be made at each leaf. Assume there is no pruning and that the algorithm runs until each leaf has only members of a single class.","x_2 < 0
Yes branch:
    x_1 < 1.5
    Yes branch: +1
    No branch: -1
No branch: +1"
MIT Spring 2019,2,b,1.333333333,Decision Trees,Text,"Consider the following 2D dataset in (x,y) format: ((1,-1), +1), ((1,1),  +1), ((1,2.5),+1), ((2,-2),-1), ((2,1),+1),((2,3),+1),((5,-1),-1),((5,-2),-1). We will construct a tree using a greedy algorithm that recursively minimizes weighted average entropy. Recall that the weighted average entropy of a split into subsets A and B is: (fraction of points in A) $\cdot H\left(R_{j, s}^{A}\right)$ + (fraction of points in B) $\cdot H\left(R_{j, s}^{B}\right)$, where the entropy $H\left(R_{m}\right)$ of data in a region $R_{m}$ is given by $H\left(R_{m}\right)=-\sum_{k} \hat{P}_{m k} \log _{2} \hat{P}_{m k}$. The $\hat{P}_{m k}$ is the empirical probability, which is in this case the fraction of items in region $m$ that are of class $k$. Some facts that might be useful to you: H(0) = 0, H(3/5) = 0.97, H(3/8) = 0.95, H(3/4) = 0.81, H(5/6) = 0.65, H(1) = 0. 
Draw the decision tree boundaries represented by the following decision tree on a plot:
x_2 < 0
Yes branch:
    x_1 < 1.5
    Yes branch: +1
    No branch: -1
No branch: +1","x_2 = 0, x_1 = 1.5 for x_2 <= 0 (https://cdn.mathpix.com/cropped/2022_06_01_4b45961d5bf942e8929cg-05.jpg?height=367&width=896&top_left_y=722&top_left_x=236)"
MIT Spring 2019,2,c,1.333333333,Decision Trees,Text,"Consider the following 2D dataset in (x,y) format: ((1,-1), +1), ((1,1),  +1), ((1,2.5),+1), ((2,-2),-1), ((2,1),+1),((2,3),+1),((5,-1),-1),((5,-2),-1). We will construct a tree using a greedy algorithm that recursively minimizes weighted average entropy. Recall that the weighted average entropy of a split into subsets A and B is: (fraction of points in A) $\cdot H\left(R_{j, s}^{A}\right)$ + (fraction of points in B) $\cdot H\left(R_{j, s}^{B}\right)$, where the entropy $H\left(R_{m}\right)$ of data in a region $R_{m}$ is given by $H\left(R_{m}\right)=-\sum_{k} \hat{P}_{m k} \log _{2} \hat{P}_{m k}$. The $\hat{P}_{m k}$ is the empirical probability, which is in this case the fraction of items in region $m$ that are of class $k$. Some facts that might be useful to you: H(0) = 0, H(3/5) = 0.97, H(3/8) = 0.95, H(3/4) = 0.81, H(5/6) = 0.65, H(1) = 0. 
Given the decision tree below, what class does the decision tree predict for the new point: (1, -2)?:
x_2 < 0
Yes branch:
    x_1 < 1.5
    Yes branch: +1
    No branch: -1
No branch: +1",1
MIT Spring 2019,2,d,1.333333333,Decision Trees,Text,"Consider the following 2D dataset in (x,y) format: ((1,-1), +1), ((1,1),  +1), ((1,2.5),+1), ((2,-2),-1), ((2,1),+1),((2,3),+1),((5,-1),-1),((5,-2),-1). We will construct a tree using a greedy algorithm that recursively minimizes weighted average entropy. Recall that the weighted average entropy of a split into subsets A and B is: (fraction of points in A) $\cdot H\left(R_{j, s}^{A}\right)$ + (fraction of points in B) $\cdot H\left(R_{j, s}^{B}\right)$, where the entropy $H\left(R_{m}\right)$ of data in a region $R_{m}$ is given by $H\left(R_{m}\right)=-\sum_{k} \hat{P}_{m k} \log _{2} \hat{P}_{m k}$. The $\hat{P}_{m k}$ is the empirical probability, which is in this case the fraction of items in region $m$ that are of class $k$. Some facts that might be useful to you: H(0) = 0, H(3/5) = 0.97, H(3/8) = 0.95, H(3/4) = 0.81, H(5/6) = 0.65, H(1) = 0. 
Decision trees built using our greedy algorithm are a good choice of classifiers for images: true or false?",FALSE
MIT Spring 2019,2,e,1.333333333,Decision Trees,Text,"Consider the following 2D dataset in (x,y) format: ((1,-1), +1), ((1,1),  +1), ((1,2.5),+1), ((2,-2),-1), ((2,1),+1),((2,3),+1),((5,-1),-1),((5,-2),-1). We will construct a tree using a greedy algorithm that recursively minimizes weighted average entropy. Recall that the weighted average entropy of a split into subsets A and B is: (fraction of points in A) $\cdot H\left(R_{j, s}^{A}\right)$ + (fraction of points in B) $\cdot H\left(R_{j, s}^{B}\right)$, where the entropy $H\left(R_{m}\right)$ of data in a region $R_{m}$ is given by $H\left(R_{m}\right)=-\sum_{k} \hat{P}_{m k} \log _{2} \hat{P}_{m k}$. The $\hat{P}_{m k}$ is the empirical probability, which is in this case the fraction of items in region $m$ that are of class $k$. Some facts that might be useful to you: H(0) = 0, H(3/5) = 0.97, H(3/8) = 0.95, H(3/4) = 0.81, H(5/6) = 0.65, H(1) = 0. 
For decision trees built using our greedy algorithm, standardizing feature values is important: true or false?",FALSE
MIT Spring 2019,2,f,1.333333333,Decision Trees,Text,"Consider the following 2D dataset in (x,y) format: ((1,-1), +1), ((1,1),  +1), ((1,2.5),+1), ((2,-2),-1), ((2,1),+1),((2,3),+1),((5,-1),-1),((5,-2),-1). We will construct a tree using a greedy algorithm that recursively minimizes weighted average entropy. Recall that the weighted average entropy of a split into subsets A and B is: (fraction of points in A) $\cdot H\left(R_{j, s}^{A}\right)$ + (fraction of points in B) $\cdot H\left(R_{j, s}^{B}\right)$, where the entropy $H\left(R_{m}\right)$ of data in a region $R_{m}$ is given by $H\left(R_{m}\right)=-\sum_{k} \hat{P}_{m k} \log _{2} \hat{P}_{m k}$. The $\hat{P}_{m k}$ is the empirical probability, which is in this case the fraction of items in region $m$ that are of class $k$. Some facts that might be useful to you: H(0) = 0, H(3/5) = 0.97, H(3/8) = 0.95, H(3/4) = 0.81, H(5/6) = 0.65, H(1) = 0. 
A disadvantage of using decision trees for classification is that they can only be used to classify data having two classes: true or false?",FALSE
MIT Spring 2019,3,a.i,0.75,MDPs,Text,"Tic-tac-toe is a paper-and-pencil game for two players, X and O, who take turns marking the spaces in a 3×3 grid. The player who succeeds in placing three of their marks sequentially in a horizontal, vertical, or diagonal row wins the game. In this question, we'll consider a solitaire version of tic-tac-toe, in which we assume:
• We are the X player;
• The O player is a fixed (but possibly stochastic) algorithm;
• The initial state of the board is empty, and X has the first move;
• We can select any of the nine squares on our turn;
• We don't know the strategy of the O player or the reward function used by O.
We place an X in an empty square, then an O appears in some other square, and then it's our turn to play again. We receive a +1 reward for getting three X's in a row, reward -1 if there are three O's in a row, and reward 0 otherwise. If we select a square that already has an X or an O in it, nothing changes and it's still our turn.
We can model this problem as a Markov decision process in several different ways. Here are some possible choices for the state space.
â¢ Jody suggests letting the state space be all possible 3 x 3 grids in which each square contains one of the following: a space, an O, and an X.
â¢ Dana suggests using all possible 3 x 3 grids in which each square contains one of the three options (a space, an O, and an X), and there is an equal number of O's and X's.
â¢ Chris suggests using all 3 x 3 tic-tac-toe game grids which appear in games where the players both employ optimal strategies.
Is Dana's suggestion better or worse for tabular Q learning than Jody's? Explain your answer.","Dana's is better, because some of Jody's states might not be part of any plausible games. Jody's approach covers a much larger state space, including states that cannot arise given the rules of the game."
MIT Spring 2019,3,a.ii,0.75,MDPs,Text,"Tic-tac-toe is a paper-and-pencil game for two players, X and O, who take turns marking the spaces in a 3×3 grid. The player who succeeds in placing three of their marks sequentially in a horizontal, vertical, or diagonal row wins the game. In this question, we'll consider a solitaire version of tic-tac-toe, in which we assume:
• We are the X player;
• The O player is a fixed (but possibly stochastic) algorithm;
• The initial state of the board is empty, and X has the first move;
• We can select any of the nine squares on our turn;
• We don't know the strategy of the O player or the reward function used by O.
We place an X in an empty square, then an O appears in some other square, and then it's our turn to play again. We receive a +1 reward for getting three X's in a row, reward -1 if there are three O's in a row, and reward 0 otherwise. If we select a square that already has an X or an O in it, nothing changes and it's still our turn.
We can model this problem as a Markov decision process in several different ways. Here are some possible choices for the state space.
â¢ Jody suggests letting the state space be all possible 3 x 3 grids in which each square contains one of the following: a space, an O, and an X.
â¢ Dana suggests using all possible 3 x 3 grids in which each square contains one of the three options (a space, an O, and an X), and there is an equal number of O's and X's.
â¢ Chris suggests using all 3 x 3 tic-tac-toe game grids which appear in games where the players both employ optimal strategies.
Is Chris' suggestion better or worse for tabular Q learning than Jody's? Explain your answer.","Worse, since we do not know if O plays optimally. We might not cover all possible states. Also, Chris' suggestion may be infeasible, if we do not know the optimal strategies for both players."
MIT Spring 2019,3,b.i,0.75,MDPs,Text,"Tic-tac-toe is a paper-and-pencil game for two players, X and O, who take turns
marking the spaces in a 3×3 grid. The player who succeeds in placing three of their marks sequentially in a horizontal, vertical, or diagonal row wins the game. In this question, we'll consider a solitaire version of tic-tac-toe, in which we assume:
• We are the X player;
• The O player is a fixed (but possibly stochastic) algorithm;
• The initial state of the board is empty, and X has the first move;
• We can select any of the nine squares on our turn;
• We don't know the strategy of the O player or the reward function used by O.
We place an X in an empty square, then an O appears in some other square, and then it's our turn to play again. We receive a +1 reward for getting three X's in a row, reward -1 if there are three O's in a row, and reward 0 otherwise. If we select a square that already has an X or an O in it, nothing changes and it's still our turn.
We can model this problem as a Markov decision process in several different ways. Here are some possible choices for the state space.
• Jody suggests letting the state space be all possible 3 x 3 grids in which each square contains one of the following: a space, an O, and an X.
• Dana suggests using all possible 3 x 3 grids in which each square contains one of the three options (a space, an O, and an X), and there is an equal number of O's and X's.
• Chris suggests using all 3 x 3 tic-tac-toe game grids which appear in games where the players both employ optimal strategies.
Many states of the game are effectively the same due to symmetry. Draw a pair of such states which are the same due to symmetry:","Horizontal, vertical, two different diagonal symmetries with respect to the line passing through the center; rotations through 90, 180, 270 degrees."
MIT Spring 2019,3,b.ii,0.75,MDPs,Text,"Tic-tac-toe is a paper-and-pencil game for two players, X and O, who take turns
marking the spaces in a 3×3 grid. The player who succeeds in placing three of their marks sequentially in a horizontal, vertical, or diagonal row wins the game. In this question, we'll consider a solitaire version of tic-tac-toe, in which we assume:
• We are the X player;
• The O player is a fixed (but possibly stochastic) algorithm;
• The initial state of the board is empty, and X has the first move;
• We can select any of the nine squares on our turn;
• We don't know the strategy of the O player or the reward function used by O.
We place an X in an empty square, then an O appears in some other square, and then it's our turn to play again. We receive a +1 reward for getting three X's in a row, reward -1 if there are three O's in a row, and reward 0 otherwise. If we select a square that already has an X or an O in it, nothing changes and it's still our turn.
We can model this problem as a Markov decision process in several different ways. Here are some possible choices for the state space.
• Jody suggests letting the state space be all possible 3 x 3 grids in which each square contains one of the following: a space, an O, and an X.
• Dana suggests using all possible 3 x 3 grids in which each square contains one of the three options (a space, an O, and an X), and there is an equal number of O's and X's.
• Chris suggests using all 3 x 3 tic-tac-toe game grids which appear in games where the players both employ optimal strategies.
Jordan suggests using a state-space that includes one state that stands for each set of board games that are equivalent due to symmetry. Would this be better or worse for learning than Jody's representation? Explain your answer.","Better. Jordan's state space representation has fewer states, and should facilitate faster learning."
MIT Spring 2019,3,c,1.5,MDPs,Text,"Tic-tac-toe is a paper-and-pencil game for two players, X and O, who take turns
marking the spaces in a 3×3 grid. The player who succeeds in placing three of their marks sequentially in a horizontal, vertical, or diagonal row wins the game. In this question, we'll consider a solitaire version of tic-tac-toe, in which we assume:
• We are the X player;
• The O player is a fixed (but possibly stochastic) algorithm;
• The initial state of the board is empty, and X has the first move;
• We can select any of the nine squares on our turn;
• We don't know the strategy of the O player or the reward function used by O.
We place an X in an empty square, then an O appears in some other square, and then it's our turn to play again. We receive a +1 reward for getting three X's in a row, reward -1 if there are three O's in a row, and reward 0 otherwise. If we select a square that already has an X or an O in it, nothing changes and it's still our turn.
We can model this problem as a Markov decision process in several different ways. Here are some possible choices for the state space.
• Jody suggests letting the state space be all possible 3 x 3 grids in which each square contains one of the following: a space, an O, and an X.
• Dana suggests using all possible 3 x 3 grids in which each square contains one of the three options (a space, an O, and an X), and there is an equal number of O's and X's.
• Chris suggests using all 3 x 3 tic-tac-toe game grids which appear in games where the players both employ optimal strategies.
What is the action space of the MDP with Dana's state space definition?",Selection of one of the 9 squares.
MIT Spring 2019,3,d,1.5,MDPs,Text,"Tic-tac-toe is a paper-and-pencil game for two players, X and O, who take turns
marking the spaces in a 3×3 grid. The player who succeeds in placing three of their marks sequentially in a horizontal, vertical, or diagonal row wins the game. In this question, we'll consider a solitaire version of tic-tac-toe, in which we assume:
• We are the X player;
• The O player is a fixed (but possibly stochastic) algorithm;
• The initial state of the board is empty, and X has the first move;
• We can select any of the nine squares on our turn;
• We don't know the strategy of the O player or the reward function used by O.
We place an X in an empty square, then an O appears in some other square, and then it's our turn to play again. We receive a +1 reward for getting three X's in a row, reward -1 if there are three O's in a row, and reward 0 otherwise. If we select a square that already has an X or an O in it, nothing changes and it's still our turn.
We can model this problem as a Markov decision process in several different ways. Here are some possible choices for the state space.
• Jody suggests letting the state space be all possible 3 x 3 grids in which each square contains one of the following: a space, an O, and an X.
• Dana suggests using all possible 3 x 3 grids in which each square contains one of the three options (a space, an O, and an X), and there is an equal number of O's and X's.
• Chris suggests using all 3 x 3 tic-tac-toe game grids which appear in games where the players both employ optimal strategies.
You get to sit and watch an expert player (who always makes optimal moves) play this game for a long time, and you observe the sequence of state-action pairs that occur in many games. Which of the following machine-learning problem formulations is most appropriate, for you to learn how to play the game? For the item you select, provide the specified additional information (where not ""none"").
1. supervised regression (describe the loss function)
2. supervised classification (describe the loss function)
3. reinforcement learning of a policy (none)
4. reinforcement learning of a value function (none)
Explain your answer.","supervised classification (loss function). You learn the mapping from input to output (e.g., the position on the grid, where you need to make the next move). The loss function could be the negative log likelihood between the expert's move and your predicted move."
MIT Spring 2019,3,e,1.5,Reinforcement Learning,Text,"Tic-tac-toe is a paper-and-pencil game for two players, X and O, who take turns
marking the spaces in a 3×3 grid. The player who succeeds in placing three of their marks sequentially in a horizontal, vertical, or diagonal row wins the game. In this question, we'll consider a solitaire version of tic-tac-toe, in which we assume:
• We are the X player;
• The O player is a fixed (but possibly stochastic) algorithm;
• The initial state of the board is empty, and X has the first move;
• We can select any of the nine squares on our turn;
• We don't know the strategy of the O player or the reward function used by O.
We place an X in an empty square, then an O appears in some other square, and then it's our turn to play again. We receive a +1 reward for getting three X's in a row, reward -1 if there are three O's in a row, and reward 0 otherwise. If we select a square that already has an X or an O in it, nothing changes and it's still our turn.
We can model this problem as a Markov decision process in several different ways. Here are some possible choices for the state space.
• Jody suggests letting the state space be all possible 3 x 3 grids in which each square contains one of the following: a space, an O, and an X.
• Dana suggests using all possible 3 x 3 grids in which each square contains one of the three options (a space, an O, and an X), and there is an equal number of O's and X's.
• Chris suggests using all 3 x 3 tic-tac-toe game grids which appear in games where the players both employ optimal strategies.
You get to interact with an implementation of this game for many game instances, selecting your actions, observing the results and rewards. Which of the following machine-learning problem formulations is most appropriate, for you to learn how to play the game? For the item you select, provide the specified additional information (where not ""none"").
1. supervised regression (describe the loss function)
2. supervised classification (describe the loss function)
3. reinforcement learning of a policy (none)
4. reinforcement learning of a value function (none)
Explain your answer.",Reinforcement learning of a policy (none).
MIT Spring 2019,3,f.i,1.5,Reinforcement Learning,Text,"Tic-tac-toe is a paper-and-pencil game for two players, X and O, who take turns
marking the spaces in a 3×3 grid. The player who succeeds in placing three of their marks sequentially in a horizontal, vertical, or diagonal row wins the game. In this question, we'll consider a solitaire version of tic-tac-toe, in which we assume:
• We are the X player;
• The O player is a fixed (but possibly stochastic) algorithm;
• The initial state of the board is empty, and X has the first move;
• We can select any of the nine squares on our turn;
• We don't know the strategy of the O player or the reward function used by O.
We place an X in an empty square, then an O appears in some other square, and then it's our turn to play again. We receive a +1 reward for getting three X's in a row, reward -1 if there are three O's in a row, and reward 0 otherwise. If we select a square that already has an X or an O in it, nothing changes and it's still our turn.
We can model this problem as a Markov decision process in several different ways. Here are some possible choices for the state space.
• Jody suggests letting the state space be all possible 3 x 3 grids in which each square contains one of the following: a space, an O, and an X.
• Dana suggests using all possible 3 x 3 grids in which each square contains one of the three options (a space, an O, and an X), and there is an equal number of O's and X's.
• Chris suggests using all 3 x 3 tic-tac-toe game grids which appear in games where the players both employ optimal strategies.
Barney wants to solve a tic-tac-toe problem that is exactly the same as the above game (i.e., three in a row/column/diagonal wins), except that it is played on a 100 x 100 grid. Is it better for Barney to use tabular Q learning or neural-net Q learning? Explain. ",Neural-net Q-learning. A table would be too large.
MIT Spring 2019,3,f.ii,1.5,Reinforcement Learning,Text,"Tic-tac-toe is a paper-and-pencil game for two players, X and O, who take turns
marking the spaces in a 3×3 grid. The player who succeeds in placing three of their marks sequentially in a horizontal, vertical, or diagonal row wins the game. In this question, we'll consider a solitaire version of tic-tac-toe, in which we assume:
• We are the X player;
• The O player is a fixed (but possibly stochastic) algorithm;
• The initial state of the board is empty, and X has the first move;
• We can select any of the nine squares on our turn;
• We don't know the strategy of the O player or the reward function used by O.
We place an X in an empty square, then an O appears in some other square, and then it's our turn to play again. We receive a +1 reward for getting three X's in a row, reward -1 if there are three O's in a row, and reward 0 otherwise. If we select a square that already has an X or an O in it, nothing changes and it's still our turn.
We can model this problem as a Markov decision process in several different ways. Here are some possible choices for the state space.
• Jody suggests letting the state space be all possible 3 x 3 grids in which each square contains one of the following: a space, an O, and an X.
• Dana suggests using all possible 3 x 3 grids in which each square contains one of the three options (a space, an O, and an X), and there is an equal number of O's and X's.
• Chris suggests using all 3 x 3 tic-tac-toe game grids which appear in games where the players both employ optimal strategies.
Barney wants to solve a tic-tac-toe problem that is exactly the same as the above game (i.e., three in a row/column/diagonal wins), except that it is played on a 100 x 100 grid. Suppose Barney were to use neural-net Q learning; would it help for him to start with a convolutional layer? If your answer is yes, describe four 3x3 convolutional filters that would be particularly helpful for this problem.","Yes. A 3x3 filter that detects vertical, horizontal, or diagonal lines can be very useful in detecting local solution (both for X and O)."
MIT Spring 2019,3,g,1.5,Reinforcement Learning,Text,"Tic-tac-toe is a paper-and-pencil game for two players, X and O, who take turns
marking the spaces in a 3×3 grid. The player who succeeds in placing three of their marks sequentially in a horizontal, vertical, or diagonal row wins the game. In this question, we'll consider a solitaire version of tic-tac-toe, in which we assume:
• We are the X player;
• The O player is a fixed (but possibly stochastic) algorithm;
• The initial state of the board is empty, and X has the first move;
• We can select any of the nine squares on our turn;
• We don't know the strategy of the O player or the reward function used by O.
We place an X in an empty square, then an O appears in some other square, and then it's our turn to play again. We receive a +1 reward for getting three X's in a row, reward -1 if there are three O's in a row, and reward 0 otherwise. If we select a square that already has an X or an O in it, nothing changes and it's still our turn.
We can model this problem as a Markov decision process in several different ways. Here are some possible choices for the state space.
• Jody suggests letting the state space be all possible 3 x 3 grids in which each square contains one of the following: a space, an O, and an X.
• Dana suggests using all possible 3 x 3 grids in which each square contains one of the three options (a space, an O, and an X), and there is an equal number of O's and X's.
• Chris suggests using all 3 x 3 tic-tac-toe game grids which appear in games where the players both employ optimal strategies.
Suppose you apply Q-learning to the 3x3 tic-tac-toe problem, and your actions always select an unfilled square. Bert suggests that it is okay to let the discount factor be 1. Is that true? Explain why or why not.",Yes. The game has a finite number of steps.
MIT Spring 2019,4,a,2,CNNs,Text,"Conne von Lucien has many pictures from her trip to Flatland and wants to determine which ones have her in the image. All of the pictures are arrays of size 4x1, with array values of either 0 or 1. Conne looks like the vector [1,0,1] in one dimension, so if a picture contains the pattern [1,0,1] anywhere inside it, it should be classified as a positive example, otherwise as a negative example.
Fortunately, you learned about CNNs and have helped Conne by designing the following network architecture with three layers:
1. A convolutional layer with one filter W that is size 3x1, and stride 1, and a single bias w_0 (where the output pixel corresponds to the input pixel that the filter is centered on). Input values of 0 should be assumed beyond the boundaries of the input.
2. A max-pooling layer P with size 2x1 and stride 2.
3. A fully connected layer $\sigma(\cdot)$ with a single output unit having a sigmoidal activation function.
What is the shape of the output of each layer?","4x1, 2x1, 1x1 (scalar)"
MIT Spring 2019,4,b,2,CNNs,Text,"Conne von Lucien has many pictures from her trip to Flatland and wants to determine which ones have her in the image. All of the pictures are arrays of size 4x1, with array values of either 0 or 1. Conne looks like the vector [1,0,1] in one dimension, so if a picture contains the pattern [1,0,1] anywhere inside it, it should be classified as a positive example, otherwise as a negative example.
Fortunately, you learned about CNNs and have helped Conne by designing the following network architecture with three layers:
1. A convolutional layer with one filter W that is size 3x1, and stride 1, and a single bias w_0 (where the output pixel corresponds to the input pixel that the filter is centered on). Input values of 0 should be assumed beyond the boundaries of the input.
2. A max-pooling layer P with size 2x1 and stride 2.
3. A fully connected layer $\sigma(\cdot)$ with a single output unit having a sigmoidal activation function.
What loss function is most appropriate here, especially if you want your neural network package to be useful with few modifications, to other Flatland visitors (who may appear as longer vectors)? 
A. NLL loss
B. Hinge loss
C. Quadratic loss",NLL loss
MIT Spring 2019,4,c,2,CNNs,Text,"Conne von Lucien has many pictures from her trip to Flatland and wants to determine which ones have her in the image. All of the pictures are arrays of size 4x1, with array values of either 0 or 1. Conne looks like the vector [1,0,1] in one dimension, so if a picture contains the pattern [1,0,1] anywhere inside it, it should be classified as a positive example, otherwise as a negative example.
Fortunately, you learned about CNNs and have helped Conne by designing the following network architecture with three layers:
1. A convolutional layer with one filter W that is size 3x1, and stride 1, and a single bias w_0 (where the output pixel corresponds to the input pixel that the filter is centered on). Input values of 0 should be assumed beyond the boundaries of the input.
2. A max-pooling layer P with size 2x1 and stride 2.
3. A fully connected layer $\sigma(\cdot)$ with a single output unit having a sigmoidal activation function.
We can express the loss function as $L(\sigma(P), y)$ where $P$ is the output from the max pooling layer of the CNN and $y$ is the true label for the input. Given $\frac{d L}{d P}$, derive the update rule for $w_{1}$ if the filter is composed of $W=\left[w_{1}, w_{2}, w_{3}\right]^{T}$ with bias $w_{0}$, and step size is $\eta$.","Consider $Z$ to be the outputs of layer 1, $Z=\left[z_{1}, z_{2}, z_{3}, z_{4}\right]^{T}$.

$$
\begin{aligned}
z_{1} &=w_{1} \cdot 0+w_{2} x_{1}+w_{3} x_{2}+w_{0} \\
z_{2} &=w_{1} x_{1}+w_{2} x_{2}+w_{3} x_{3}+w_{0} \\
z_{3} &=w_{1} x_{2}+w_{2} x_{3}+w_{3} x_{4}+w_{0} \\
z_{4} &=w_{1} x_{3}+w_{2} x_{4}+w_{3} \cdot 0+w_{0} \\
P &=\left[p_{1}, p_{2}\right]^{T} \\
p_{1} &=\max \left(z_{1}, z_{2}\right) \\
\frac{d p_{1}}{d w_{1}} &=0 \text { if } z_{1}>z_{2} \text { else } x_{1} \\
p_{2} &=\max \left(z_{3}, z_{4}\right) \\
\frac{d p_{2}}{d w_{1}} &=x_{2} \text { if } z_{3}>z_{4} \text { else } x_{3} \\
\frac{d P}{d w_{1}} &=\left[\frac{d p_{1}}{d w_{1}}, \frac{d p_{2}}{d w_{1}}\right]^{T} \\
w_{1} &:=w_{1}-\eta\left(\frac{d L}{d P}\right)^{T} \frac{d P}{d w_{1}}
\end{aligned}
$$
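The case analysis above can be checked numerically. The following is a minimal sketch, assuming a plain-Python forward pass for this 4x1 network with zero padding; the function names (conv_forward, pool_forward, dP_dw1, update_w1) are illustrative and not part of the original exam or the 6.036 code.

```python
def conv_forward(x, w, w0):
    # x: list of 4 inputs; w: [w1, w2, w3]; zero padding on both ends.
    padded = [0.0] + list(x) + [0.0]
    return [w[0] * padded[i] + w[1] * padded[i + 1] + w[2] * padded[i + 2] + w0
            for i in range(4)]


def pool_forward(z):
    # Max pooling with size 2 and stride 2.
    return [max(z[0], z[1]), max(z[2], z[3])]


def dP_dw1(x, w, w0):
    # dz_i/dw1 is the input under the first filter tap: [0, x1, x2, x3].
    z = conv_forward(x, w, w0)
    dz = [0.0, x[0], x[1], x[2]]
    dp1 = dz[0] if z[0] > z[1] else dz[1]
    dp2 = dz[2] if z[2] > z[3] else dz[3]
    return [dp1, dp2]


def update_w1(x, w, w0, dL_dP, eta):
    # w1 := w1 - eta * (dL/dP)^T (dP/dw1)
    g = dP_dw1(x, w, w0)
    return w[0] - eta * (dL_dP[0] * g[0] + dL_dP[1] * g[1])
```

Only the input that feeds the winning entry of each pooling window contributes to the gradient, which is why each component of $\frac{d P}{d w_{1}}$ is a single input value (or 0).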
"
MIT Spring 2019,4,d,2,CNNs,Text,"Conne von Lucien has many pictures from her trip to Flatland and wants to determine which ones have her in the image. All of the pictures are arrays of size 4x1, with array values of either 0 or 1. Conne looks like the vector [1,0,1] in one dimension, so if a picture contains the pattern [1,0,1] anywhere inside it, it should be classified as a positive example, otherwise as a negative example.
Fortunately, you learned about CNNs and have helped Conne by designing the following network architecture with three layers:
1. A convolutional layer with one filter W that is size 3x1, and stride 1, and a single bias w_0 (where the output pixel corresponds to the input pixel that the filter is centered on). Input values of 0 should be assumed beyond the boundaries of the input.
2. A max-pooling layer P with size 2x1 and stride 2.
3. A fully connected layer $\sigma(\cdot)$ with a single output unit having a sigmoidal activation function.
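Given $\frac{d L}{d P}$, provide the update rule for $w_{0}$, the bias to the filter.","$w_{0}:=w_{0}-\eta\left(\frac{d L}{d P}\right)^{T} \frac{d P}{d w_{0}}$, where $\frac{d P}{d w_{0}}=[1,1]^{T}$ because $w_{0}$ is added to every $z_{i}$ with coefficient 1."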
MIT Spring 2019,4,e,2,CNNs,Text,"Conne von Lucien has many pictures from her trip to Flatland and wants to determine which ones have her in the image. All of the pictures are arrays of size 4x1, with array values of either 0 or 1. Conne looks like the vector [1,0,1] in one dimension, so if a picture contains the pattern [1,0,1] anywhere inside it, it should be classified as a positive example, otherwise as a negative example.
Fortunately, you learned about CNNs and have helped Conne by designing the following network architecture with three layers:
1. A convolutional layer with one filter W that is size 3x1, and stride 1, and a single bias w_0 (where the output pixel corresponds to the input pixel that the filter is centered on). Input values of 0 should be assumed beyond the boundaries of the input.
2. A max-pooling layer P with size 2x1 and stride 2.
3. A fully connected layer $\sigma(\cdot)$ with a single output unit having a sigmoidal activation function.
Conne decides to use the neural network code as written by a 6.036 student for the 6.036 homework (and that actually was a correct implementation) to train her CNN using SGD. The sgd procedure may be called multiple times from elsewhere (e.g., to implement multiple epochs of SGD). Conne thinks she has a better sgd Python procedure than that given in the package; her code is:
def sgd(nn, X, Y, iters=100, lrate=0.005):
    D, N = X.shape
    sum_loss = 0
    for k in range(iters):
        Xt = X[:, k:k+1]
        Yt = Y[:, k:k+1]
        Ypred = nn.forward(Xt)
        sum_loss += nn.loss.forward(Ypred, Yt)
        err = nn.loss.backward()
        nn.backward(err)
        nn.sgd_step(lrate)
Here, nn is an instance of the Sequential class implementing the CNN. She knows from the unit tests that the nn routines function properly. In particular, nn.forward properly computes the predicted outputs Ypred from input data Xt, nn.loss.forward properly computes the forward loss, nn.loss.backward properly computes the backward loss, nn.backward properly computes the backward gradients, and nn.sgd_step properly applies an SGD update step with the specified learning rate lrate. The input data $X$ ($N$ points of dimension $D$) and the labels $Y$ are known to be correct.
However, Conne's procedure consistently gives poor results (and occasionally throws errors), compared with the 6.036 student's correct SGD routine, when run with identical arguments.
Why? Specify the line(s) which have errors, and describe how the code should be improved to do as well as the correct implementation of the 6.036 student.","Lines 5 and 6. The SGD algorithm needs a random data point to be selected for the gradient computation. Conne's version instead visits the same first iters points, in the same order, on every call. Thus, the Xt and Yt assignments should draw from a randomly chosen $j$, e.g.
for k in range(iters):
    j = np.random.randint(N)
    Xt = X[:, j:j+1]
    Yt = Y[:, j:j+1]
    ...
Note that Conne's code may throw errors when iters $>N$, because the slices X[:, k:k+1] and Y[:, k:k+1] are empty once $k \geq N$."
MIT Spring 2019,5,a,2,MDPs,Image,"Consider the following deterministic Markov Decision Process (MDP), describing a simple robot grid world. Notice that the values of the immediate rewards $r$ for two transitions are written next to them; the other transitions, with no value written next to them, have an immediate reward of $r=0$. Assume the discount factor $\gamma$ is $0.8$.
For states $s \in\{s6, s5, s2\}$, write the value for $V_{\pi^{*}}(s)$, the discounted infinite horizon value of state $s$ using an optimal policy $\pi^{*}$. It is fine to write a numerical expression (you don't have to evaluate it), but it shouldn't contain any variables.","$$
V_{\pi^{*}}(s6)=100
$$
$$
V_{\pi^{*}}(s5)=\gamma V_{\pi^{*}}(s6)=80
$$
$$
V_{\pi^{*}}(s2)=\gamma V_{\pi^{*}}(s5)=64
$$"
MIT Spring 2019,5,c,2,MDPs,Image,"Consider the following deterministic Markov Decision Process (MDP), describing a simple robot grid world. Notice that the values of the immediate rewards $r$ for two transitions are written next to them; the other transitions, with no value written next to them, have an immediate reward of $r=0$. Assume the discount factor $\gamma$ is $0.8$.
For each state in the state diagram below, circle exactly one outgoing arrow, indicating an optimal action $\pi^{*}(s)$ to take from that state. If there is a tie, it is fine to select any action with optimal value.",Image filling
MIT Spring 2019,5,d,2,MDPs,Image,"Consider the following deterministic Markov Decision Process (MDP), describing a simple robot grid world. Notice that the values of the immediate rewards $r$ for two transitions are written next to them; the other transitions, with no value written next to them, have an immediate reward of $r=0$. Assume the discount factor $\gamma$ is $0.8$.
Give a value for $\gamma$ (constrained by $0<\gamma<1$ ) that results in a different optimal policy, and describe the resulting policy by indicating which $\pi^{*}(s)$ values (i.e., which policy actions) change.","A small $\gamma$, such as $\gamma=0.001$, makes it not worthwhile to defer gains for very long. In this problem, if $\gamma^{2} \cdot 100<50$, then it is better to take the 50 reward directly. So valid answers here are $0<\gamma<\frac{\sqrt{2}}{2}$.
Now $\pi^{*}(s2)$ is to go right (east)."
MIT Spring 2019,5,e,2,MDPs,Image,"Consider the following deterministic Markov Decision Process (MDP), describing a simple robot grid world. Notice that the values of the immediate rewards $r$ for two transitions are written next to them; the other transitions, with no value written next to them, have an immediate reward of $r=0$. Assume the discount factor $\gamma$ is $0.8$.
Assume $p=0.75$. For each of the states $s \in\{s2, s5, s6\}$, write the value for $V_{\pi^{*}}(s)$. It is fine to write a numerical expression, but it shouldn't contain any variables.","Solution:
$$
\begin{aligned}
V_{\pi^{*}}(s6) &=100 p+(1-p) \gamma V_{\pi^{*}}(s6) \\
V_{\pi^{*}}(s6)(1-(1-p) \gamma) &=100 p \\
V_{\pi^{*}}(s6) &=\frac{100 p}{1-(1-p) \gamma}=93.75
\end{aligned}
$$
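The closed form for the value of s6 can be sanity-checked by iterating the same one-state Bellman equation. This is a minimal sketch, assuming (as the equation above implies) that the move out of s6 succeeds with probability p and earns reward 100, and otherwise leaves the robot in s6 with reward 0; the function name value_s6 is illustrative.

```python
def value_s6(p, gamma, iters=200):
    # Repeatedly apply V <- 100*p + (1 - p)*gamma*V, starting from V = 0.
    v = 0.0
    for _ in range(iters):
        v = 100.0 * p + (1.0 - p) * gamma * v
    return v


p, gamma = 0.75, 0.8
v6 = value_s6(p, gamma)
closed_form = 100.0 * p / (1.0 - (1.0 - p) * gamma)
v5 = gamma * v6  # s5 is one discounted, zero-reward step away from s6
v2 = gamma * v5  # s2 is one discounted, zero-reward step away from s5
print(v6, closed_form, v5, v2)
```

The iteration contracts by a factor of $(1-p)\gamma=0.2$ per step, so it settles at approximately 93.75, and the values of s5 and s2 follow as approximately 75 and 60.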
Solution:
$$
V_{\pi^{*}}(s5)=\gamma V_{\pi^{*}}(s6)=75
$$
Solution:
$$
V_{\pi^{*}}(s2)=\gamma V_{\pi^{*}}(s5)=60
$$"
MIT Spring 2019,5,f.i,2,MDPs,Image,"Consider the following deterministic Markov Decision Process (MDP), describing a simple robot grid world. Notice that the values of the immediate rewards $r$ for two transitions are written next to them; the other transitions, with no value written next to them, have an immediate reward of $r=0$. Assume the discount factor $\gamma$ is $0.8$.
How bad does the ice have to get before the robot will prefer to completely avoid the ice? Let us answer the question by giving a value for $p$ for which the optimal policy chooses actions that completely avoid the ice, i.e., choosing the action ""go left"" over ""go up"" when the robot is in the state s6. Approach this in four parts. The answer to each of the first three parts can be a numerical expression; the answer to the last part can be an expression involving numbers and $p$.
i. What is the value $V$ of going right in state $s2$?",50
MIT Spring 2019,5,f.ii,2,MDPs,Image,"Consider the following deterministic Markov Decision Process (MDP), describing a simple robot grid world. Notice that the values of the immediate rewards $r$ for two transitions are written next to them; the other transitions, with no value written next to them, have an immediate reward of $r=0$. Assume the discount factor $\gamma$ is $0.8$.
How bad does the ice have to get before the robot will prefer to completely avoid the ice? Let us answer the question by giving a value for $p$ for which the optimal policy chooses actions that completely avoid the ice, i.e., choosing the action ""go left"" over ""go up"" when the robot is in the state s6. Approach this in four parts. The answer to each of the first three parts can be a numerical expression; the answer to the last part can be an expression involving numbers and $p$.
ii. What is the value $V$ of going up in state $s5$, if you're going to go right in state $s2$?",$\gamma \cdot 50=40$
MIT Spring 2019,5,f.iii,2,MDPs,Image,"Consider the following deterministic Markov Decision Process (MDP), describing a simple robot grid world. Notice that the values of the immediate rewards $r$ for two transitions are written next to them; the other transitions, with no value written next to them, have an immediate reward of $r=0$. Assume the discount factor $\gamma$ is $0.8$.
How bad does the ice have to get before the robot will prefer to completely avoid the ice? Let us answer the question by giving a value for $p$ for which the optimal policy chooses actions that completely avoid the ice, i.e., choosing the action ""go left"" over ""go up"" when the robot is in the state s6. Approach this in four parts. The answer to each of the first three parts can be a numerical expression; the answer to the last part can be an expression involving numbers and $p$.
iii. What is the value $V$ of going left in state $s6$, if you're going to go up in state $s5$ and right in state $s2$?",$\gamma^{2} \cdot 50=32$
MIT Spring 2019,5,f.iv,2,MDPs,Image,"Consider the following deterministic Markow Decision Process (MDP), describing a simple robot grid world. Notice that the values of the irnmediate rewards $r$ for two transitions are written next to thern; the other transitions, with no value written next to them, have an immedinte reward of $r=0$. Assume the discount factor $\gamma$ is $0.8$.
How bad does the ice have to get before the robot will prefer to completely avoid the ice? Let us answer the question by giving a value for $p$ for which the optimal policy chooses actions that completely avoid the ice, i.e., choosing the action ""go left"" over ""go up"" when
the robot is in the state a6. Approach this in four parts. The answer to each of the first three parts can be a numerical expression; the answer to the last part can be an expression involving numbers and $p$.
iv. Under what condition on $p$ is it better to go left in state $a 6$ (then up in state $a 5$ and right in state $a 2$) than it is to go up in state $a 6$ ?","$$
\begin{aligned}
\frac{p \cdot 100}{1-(1-p) \cdot 0.8} &<32 \\
p &<\frac{8}{93} \approx 0.086
\end{aligned}
$$"
MIT Spring 2019,6,a,1.75,Classifiers,Text,"After taking 6.036, Bob decides to train a recommender system to predict what ratings different customers will give to different movies. Currently, he knows of three really popular movies, and he knows of two potential customers who have ranked some of these movies. The data matrix currently looks like: $Y=[[2, ?, 3],[4,2, ?]]$ where, as in class, rows correspond to customers and columns correspond to movies, and ? indicates a missing or unknown ranking. He decides to find a low rank factorization of $Y$ using the alternating least squares algorithm implemented in class. Assume for this question that offsets are set to $0 .$
Bob starts out by trying a rank 1 factorization of $Y$ as $U V^{T}$. He initializes $U=[1,2]^{T}$. Assume there is no regularization. In the first iteration of alternating least squares, we will find the best $V$ given the current $U$. What is the objective function $J(V)$ in terms of $V$ ? Write it in terms of $V_{1}, V_{2}, V_{3}$ and specific numerical values from $Y$.",J(V)=\left(1 \cdot V_{1}-2\right)^{2}+\left(1 \cdot V_{3}-3\right)^{2}+\left(2 \cdot V_{1}-4\right)^{2}+\left(2 \cdot V_{2}-2\right)^{2}
MIT Spring 2019,6,b,1.75,Classifiers,Text,"After taking 6.036, Bob decides to train a recommender system to predict what ratings different customers will give to different movies. Currently, he knows of three really popular movies, and he knows of two potential customers who have ranked some of these movies. The data matrix currently looks like: $Y=[[2, ?, 3],[4,2, ?]]$ where, as in class, rows correspond to customers and columns correspond to movies, and ? indicates a missing or unknown ranking. He decides to find a low rank factorization of $Y$ using the alternating least squares algorithm implemented in class. Assume for this question that offsets are set to $0.$
What is the optimal value of $V$?","The optimal value is $V=[2,1,3]^{T}$. We are fortunate in being able to exactly match all of the non-empty $Y$ elements."
MIT Spring 2019,6,c,1.75,Classifiers,Text,"After taking 6.036, Bob decides to train a recommender system to predict what ratings different customers will give to different movies. Currently, he knows of three really popular movies, and he knows of two potential customers who have ranked some of these movies. The data matrix currently looks like: $Y=[[2, ?, 3],[4,2, ?]]$ where, as in class, rows correspond to customers and columns correspond to movies, and ? indicates a missing or unknown ranking. He decides to find a low rank factorization of $Y$ using the alternating least squares algorithm implemented in class. Assume for this question that offsets are set to $0.$
What is the associated overall training error?",The training error is 0.
MIT Spring 2019,6,d,1.75,Classifiers,Text,"After taking 6.036, Bob decides to train a recommender system to predict what ratings different customers will give to different movies. Currently, he knows of three really popular movies, and he knows of two potential customers who have ranked some of these movies. The data matrix currently looks like: $Y=[[2, ?, 3],[4,2, ?]]$ where, as in class, rows correspond to customers and columns correspond to movies, and ? indicates a missing or unknown ranking. He decides to find a low rank factorization of $Y$ using the alternating least squares algorithm implemented in class. Assume for this question that offsets are set to $0.$
Bob is happy about what he has accomplished, until he realizes that there are a bunch of movies and users that he still needs to add to his database! He sees that his database will slowly grow over time, and that it will be time-consuming to train a completely new model every single time he updates his database. If Bob has an $m \times n$ data matrix which he wants to find a rank $k$ factorization for, his analysis indicates that the worst-case run-time (in terms of number of expensive multiplications) of performing alternating least squares for $t$ iterations (where each iteration updates both $U$ and $V$) will be $O(k^{2}*m*n*t)$.
Instead, Bob comes upon the following idea: whenever he gets information about a new movie, he adds an extra row to $V$ but does not alter the existing entries of $U$ or $V$. He then finds the values of the entries in that extra row that minimize the objective function (with no regularization). He performs a similar procedure when he gets a new user, but instead adds an extra row to $U$. Working from the dataset in the first part, say Bob receives a new movie to which his first user has given the rating 4. What is the updated value of $V$ ?","$V=[2,1,3,4]^{T}$"
MIT Spring 2019,6,e,1.75,Classifiers,Text,"After taking 6.036, Bob decides to train a recommender system to predict what ratings different customers will give to different movies. Currently, he knows of three really popular movies, and he knows of two potential customers who have ranked some of these movies. The data matrix currently looks like: $Y=[[2, ?, 3],[4,2, ?]]$ where, as in class, rows correspond to customers and columns correspond to movies, and ? indicates a missing or unknown ranking. He decides to find a low rank factorization of $Y$ using the alternating least squares algorithm implemented in class. Assume for this question that offsets are set to $0.$
Bob is happy about what he has accomplished, until he realizes that there are a bunch of movies and users that he still needs to add to his database! He sees that his database will slowly grow over time, and that it will be time-consuming to train a completely new model every single time he updates his database. If Bob has an $m \times n$ data matrix which he wants to find a rank $k$ factorization for, his analysis indicates that the worst-case run-time (in terms of number of expensive multiplications) of performing alternating least squares for $t$ iterations (where each iteration updates both $U$ and $V$) will be $O(k^{2}*m*n*t)$.
Instead, Bob comes upon the following idea: whenever he gets information about a new movie, he adds an extra row to $V$ but does not alter the existing entries of $U$ or $V$. He then finds the values of the entries in that extra row that minimize the objective function (with no regularization). He performs a similar procedure when he gets a new user, but instead adds an extra row to $U$. Working from the dataset in the first part, say Bob receives a new movie to which his first user has given the rating 4, resulting in an updated $V$ of $V=[2,1,3,4]^{T}$. With this updated $V$, what rating does Bob predict that the second user will give this movie? ",8
MIT Spring 2019,6,f,1.75,Classifiers,Text,"After taking 6.036, Bob decides to train a recommender system to predict what ratings different customers will give to different movies. Currently, he knows of three really popular movies, and he knows of two potential customers who have ranked some of these movies. The data matrix currently looks like: $Y=[[2, ?, 3],[4,2, ?]]$ where, as in class, rows correspond to customers and columns correspond to movies, and ? indicates a missing or unknown ranking. He decides to find a low rank factorization of $Y$ using the alternating least squares algorithm implemented in class. Assume for this question that offsets are set to $0.$
Bob is happy about what he has accomplished, until he realizes that there are a bunch of movies and users that he still needs to add to his database! He sees that his database will slowly grow over time, and that it will be time-consuming to train a completely new model every single time he updates his database. If Bob has an $m \times n$ data matrix which he wants to find a rank $k$ factorization for, his analysis indicates that the worst-case run-time (in terms of number of expensive multiplications) of performing alternating least squares for $t$ iterations (where each iteration updates both $U$ and $V$) will be $O(k^{2}*m*n*t)$.
Instead, Bob comes upon the following idea: whenever he gets information about a new movie, he adds an extra row to $V$ but does not alter the existing entries of $U$ or $V$. He then finds the values of the entries in that extra row that minimize the objective function (with no regularization). He performs a similar procedure when he gets a new user, but instead adds an extra row to $U$. 
Bob continues using this update scheme whenever he adds new movies and users. Does the order in which Bob receives new information affect the final values of $U$ and $V$ that he learns? Explain.","Yes. Let us say Bob gets information about movie $k$ and person $a$ in that order. Based on this new update scheme, the row in $V$ corresponding to movie $k$ will be frozen after the information is received, and will not be updated when the information about person $a$ is received. On the other hand, the learned row in $U$ corresponding to person $a$ will depend in part on the previously updated row in $V$ corresponding to movie $k$.

If the information was received in the opposite order, we would have the opposite result. The row in $U$ corresponding to person $a$ would be frozen after the first piece of information was received, and not be influenced by the information about movie $k$. Meanwhile, the row in $V$ corresponding to movie $k$ would be learned in part based on the information gained about person $a$ previously.

Thus, the order of new information matters a lot in this new scheme, because $U$ and $V$ aren't jointly optimized completely every time new information is received.
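
The order dependence can be checked numerically with a minimal sketch for the rank-1 case. It assumes the frozen entries $U_{1}=1$ and $V_{1}=2$ from the earlier parts; the new movie $k$, the new person $a$, the helper name fit_scalar, and the three ratings are hypothetical values chosen only for illustration.

```python
def fit_scalar(factors, ratings):
    # Closed-form 1-D least squares:
    # argmin over w of sum_j (w * factors[j] - ratings[j]) ** 2
    num = sum(f * r for f, r in zip(factors, ratings))
    den = sum(f * f for f in factors)
    return num / den

# Frozen entries from the earlier parts: first user and first movie.
u_first, v_first = 1.0, 2.0

# Hypothetical ratings (illustrative assumptions, not from the exam):
r_first_k = 4.0   # first user rates new movie k
r_a_first = 6.0   # new person a rates the first movie
r_a_k = 6.0       # new person a rates new movie k

# Order 1: movie k arrives first, then person a.
v_k_order1 = fit_scalar([u_first], [r_first_k])
u_a_order1 = fit_scalar([v_first, v_k_order1], [r_a_first, r_a_k])

# Order 2: person a arrives first, then movie k.
u_a_order2 = fit_scalar([v_first], [r_a_first])
v_k_order2 = fit_scalar([u_first, u_a_order2], [r_first_k, r_a_k])

print('order 1:', v_k_order1, u_a_order1)
print('order 2:', v_k_order2, u_a_order2)
```

Under these assumed ratings the two orders give different learned entries for both movie $k$ and person $a$, because whichever row is added first is fitted without seeing the other and is never revisited.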
"
MIT Spring 2019,6,g,1.75,Classifiers,Text,"After taking 6.036, Bob decides to train a recommender system to predict what ratings different customers will give to different movies. Currently, he knows of three really popular movies, and he knows of two potential customers who have ranked some of these movies. The data matrix currently looks like: $Y=[[2, ?, 3],[4,2, ?]]$ where, as in class, rows correspond to customers and columns correspond to movies, and ? indicates a missing or unknown ranking. He decides to find a low rank factorization of $Y$ using the alternating least squares algorithm implemented in class. Assume for this question that offsets are set to $0.$
Bob is happy about what he has accomplished, until he realizes that there are a bunch of movies and users that he still needs to add to his database! He sees that his database will slowly grow over time, and that it will be time-consuming to train a completely new model every single time he updates his database. If Bob has an $m \times n$ data matrix which he wants to find a rank $k$ factorization for, his analysis indicates that the worst-case run-time (in terms of number of expensive multiplications) of performing alternating least squares for $t$ iterations (where each iteration updates both $U$ and $V$) will be $O(k^{2}*m*n*t)$.
Instead, Bob comes upon the following idea: whenever he gets information about a new movie, he adds an extra row to $V$ but does not alter the existing entries of $U$ or $V$. He then finds the values of the entries in that extra row that minimize the objective function (with no regularization). He performs a similar procedure when he gets a new user, but instead adds an extra row to $U$. 
Bob modifies this procedure so that he still adds new movies and users in this way, but after every 100 new additions, he retrains $U$ and $V$ from scratch using alternating least squares. Would you expect that this method would make better predictions than if we just used Bob's original procedure? Explain.","Yes.

Whenever we retrain $U$ and $V$ from scratch, we are minimizing the objective function over all variables in the problem (all entries of $U$ and $V$) so the minimum of the objective will be lower than we could obtain by just retraining a subset of variables, as we were doing in the previous part to lower computational costs.

Thus, this method will make better predictions."
MIT Spring 2019,6,h,1.75,Classifiers,Text,"After taking 6.036, Bob decides to train a recommender system to predict what ratings different customers will give to different movies. Currently, he knows of three really popular movies, and he knows of two potential customers who have ranked some of these movies. The data matrix currently looks like: $Y=[[2, ?, 3],[4,2, ?]]$ where, as in class, rows correspond to customers and columns correspond to movies, and ? indicates a missing or unknown ranking. He decides to find a low rank factorization of $Y$ using the alternating least squares algorithm implemented in class. Assume for this question that offsets are set to $0.$
Bob is happy about what he has accomplished, until he realizes that there are a bunch of movies and users that he still needs to add to his database! He sees that his database will slowly grow over time, and that it will be time-consuming to train a completely new model every single time he updates his database. If Bob has an $m \times n$ data matrix which he wants to find a rank $k$ factorization for, his analysis indicates that the worst-case run-time (in terms of number of expensive multiplications) of performing alternating least squares for $t$ iterations (where each iteration updates both $U$ and $V$) will be $O(k^{2}*m*n*t)$.
Instead, Bob comes upon the following idea: whenever he gets information about a new movie, he adds an extra row to $V$ but does not alter the existing entries of $U$ or $V$. He then finds the values of the entries in that extra row that minimize the objective function (with no regularization). He performs a similar procedure when he gets a new user, but instead adds an extra row to $U$. 
Bob modifies this procedure so that he still adds new movies and users in this way, but after every 100 new additions, he retrains $U$ and $V$ from scratch using alternating least squares.
After having added a few thousand users and movies to his database, Bob wants to try analyzing the user and movie vectors that he has learned, in order to see whether he can interpret what is causing customers to like certain movies over others. However, some of the numbers in $U$ and $V$ have a very high magnitude, which may lead to problems with numerical precision. How might Bob adjust his training process to fix the problem of high magnitude numbers in $U$ and $V$ ?
","In order to have fewer numbers of large magnitude, Bob can employ regularization of both $U, V$."
MIT Spring 2019,7,a,0.75,Regression,Image,"In this problem, we consider using linear regression with a regularization term. Assume a dataset of $n$ samples $\left\{\left(x^{(i)}, y^{(i)}\right)\right\}$ with $x^{(i)} \in \mathbb{R}^{2}$ and output values $y^{(i)} \in \mathbb{R}$. Recall that the ridge regression objective is defined as follows:
$$
J_{\text {ridge }}(\theta)=J_{\text {data }}(\theta)+J_{\text {reg }}(\theta)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
where $\theta=\left[\theta_{1}, \theta_{2}\right]$ and $\lambda$ is the regularization trade-off parameter.
Chris would like to solve the problem of computing $\theta$ that minimizes the ridge regression objective. He will employ graphical methods to obtain the solution. When plotting just the data error term, $J_{\text {data }}(\theta)$, as a function of $\theta_{1}$ and $\theta_{2}$, the following set of isocontour lines (curves connecting sets of $\theta_{1}, \theta_{2}$ for which the objective value is constant) is obtained, for his dataset:
What is the optimum solution $\theta^{*}$ when you minimize only the data error term, $J_{\text {data }}(\theta)$, i.e., for $\lambda=0$ ? Give an approximate value, for Chris's data.","$[0.5,1.0]$"
MIT Spring 2019,7,b,1.75,Regression,Image,"In this problem, we consider using linear regression with a regularization term. Assume a dataset of $n$ samples $\left\{\left(x^{(i)}, y^{(i)}\right)\right\}$ with $x^{(i)} \in \mathbb{R}^{2}$ and output values $y^{(i)} \in \mathbb{R}$. Recall that the ridge regression objective is defined as follows:
$$
J_{\text {ridge }}(\theta)=J_{\text {data }}(\theta)+J_{\text {reg }}(\theta)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
where $\theta=\left[\theta_{1}, \theta_{2}\right]$ and $\lambda$ is the regularization trade-off parameter.
Chris would like to solve the problem of computing $\theta$ that minimizes the ridge regression objective. He will employ graphical methods to obtain the solution. When plotting just the data error term, $J_{\text {data }}(\theta)$, as a function of $\theta_{1}$ and $\theta_{2}$, the following set of isocontour lines (curves connecting sets of $\theta_{1}, \theta_{2}$ for which the objective value is constant) is obtained, for his dataset:
In general, is the data error term $J_{\text {data }}\left(\theta^{*}\right)$ guaranteed to be zero for the optimal value of $\theta$, for the case when $\lambda=0$ ? Explain.","No. Since we are not likely to perfectly fit all of the data, the data error term is likely to be larger than zero even for the optimal $\theta$ value."
MIT Spring 2019,7,c,1.75,Regression,Image,"In this problem, we consider using linear regression with a regularization term. Assume a dataset of $n$ samples $\left\{\left(x^{(i)}, y^{(i)}\right)\right\}$ with $x^{(i)} \in \mathbb{R}^{2}$ and output values $y^{(i)} \in \mathbb{R}$. Recall that the ridge regression objective is defined as follows:
$$
J_{\text {ridge }}(\theta)=J_{\text {data }}(\theta)+J_{\text {reg }}(\theta)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
where $\theta=\left[\theta_{1}, \theta_{2}\right]$ and $\lambda$ is the regularization trade-off parameter.
Chris would like to solve the problem of computing $\theta$ that minimizes the ridge regression objective. He will employ graphical methods to obtain the solution. When plotting just the data error term, $J_{\text {data }}(\theta)$, as a function of $\theta_{1}$ and $\theta_{2}$, the following set of isocontour lines (curves connecting sets of $\theta_{1}, \theta_{2}$ for which the objective value is constant) is obtained, for his dataset:
Recall that $\nabla J_{\text {data }}(\theta)$ is a vector in 2D. In general, at any parameter vector $\theta$, describe the geometric relationship between $\nabla J_{\text {data }}(\theta)$ and the isocontour line of the data error term $J_{\text {data }}(\theta)$ that passes through $\theta$.","The vector $\nabla J_{\text {data }}(\theta)$ is locally perpendicular to the isocontour line of the data error term $J_{\text {data }}(\theta)$ at $\theta$. The gradient points in the ""uphill"" direction."
MIT Spring 2019,7,d,1,Regression,Image,"In this problem, we consider using linear regression with a regularization term. Assume a dataset of $n$ samples $\left\{\left(x^{(i)}, y^{(i)}\right)\right\}$ with $x^{(i)} \in \mathbb{R}^{2}$ and output values $y^{(i)} \in \mathbb{R}$. Recall that the ridge regression objective is defined as follows:
$$
J_{\text {ridge }}(\theta)=J_{\text {data }}(\theta)+J_{\text {reg }}(\theta)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
where $\theta=\left[\theta_{1}, \theta_{2}\right]$ and $\lambda$ is the regularization trade-off parameter.
Chris would like to solve the problem of computing $\theta$ that minimizes the ridge regression objective. He will employ graphical methods to obtain the solution. When plotting just the data error term, $J_{\text {data }}(\theta)$, as a function of $\theta_{1}$ and $\theta_{2}$, the following set of isocontour lines (curves connecting sets of $\theta_{1}, \theta_{2}$ for which the objective value is constant) is obtained, for his dataset:
What is $\nabla J_{\text {data }}\left(\theta^{*}\right)$ at the optimum $\theta^{*}$, when $\lambda=0$ ?",$\nabla J_{\text {data }}\left(\theta^{*}\right)=0$.
MIT Spring 2019,7,e,1.75,Regression,Image,"In this problem, we consider using linear regression with a regularization term. Assume a dataset of $n$ samples $\left\{\left(x^{(i)}, y^{(i)}\right)\right\}$ with $x^{(i)} \in \mathbb{R}^{2}$ and output values $y^{(i)} \in \mathbb{R}$. Recall that the ridge regression objective is defined as follows:
$$
J_{\text {ridge }}(\theta)=J_{\text {data }}(\theta)+J_{\text {reg }}(\theta)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
where $\theta=\left[\theta_{1}, \theta_{2}\right]$ and $\lambda$ is the regularization trade-off parameter.
Chris would like to solve the problem of computing $\theta$ that minimizes the ridge regression objective. He will employ graphical methods to obtain the solution. When plotting just the data error term, $J_{\text {data }}(\theta)$, as a function of $\theta_{1}$ and $\theta_{2}$, the following set of isocontour lines (curves connecting sets of $\theta_{1}, \theta_{2}$ for which the objective value is constant) is obtained, for his dataset:
Now we consider regularization. Sketch the isocontour lines for just the regularization term, $J_{\text {reg }}(\theta)$. Clearly label the contour line corresponding to the values of $\theta$ for which this term has value 1, when $\lambda=1$.",Image filling
MIT Spring 2019,7,f,1.75,Regression,Image,"In this problem, we consider using linear regression with a regularization term. Assume a dataset of $n$ samples $\left\{\left(x^{(i)}, y^{(i)}\right)\right\}$ with $x^{(i)} \in \mathbb{R}^{2}$ and output values $y^{(i)} \in \mathbb{R}$. Recall that the ridge regression objective is defined as follows:
$$
J_{\text {ridge }}(\theta)=J_{\text {data }}(\theta)+J_{\text {reg }}(\theta)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
where $\theta=\left[\theta_{1}, \theta_{2}\right]$ and $\lambda$ is the regularization trade-off parameter.
Chris would like to solve the problem of computing $\theta$ that minimizes the ridge regression objective. He will employ graphical methods to obtain the solution. When plotting just the data error term, $J_{\text {data }}(\theta)$, as a function of $\theta_{1}$ and $\theta_{2}$, the following set of isocontour lines (curves connecting sets of $\theta_{1}, \theta_{2}$ for which the objective value is constant) is obtained, for his dataset:
What is the effect of the regularization trade-off parameter $\lambda$ on the shape and value of the isocontour lines of the regularization term $J_{\text {reg }}(\theta)$ ?","The shape remains concentric circles centered at the origin. $\lambda$ scales the isocontour value for each radius of these concentric circles. (Note that for a constant isocontour value, the radius decreases as $\lambda$ increases.) Visualizing the shape as a bowl, larger $\lambda$ makes the bowl steeper."
MIT Spring 2019,7,g,1.75,Regression,Image,"In this problem, we consider using linear regression with a regularization term. Assume a dataset of $n$ samples $\left\{\left(x^{(i)}, y^{(i)}\right)\right\}$ with $x^{(i)} \in \mathbb{R}^{2}$ and output values $y^{(i)} \in \mathbb{R}$. Recall that the ridge regression objective is defined as follows:
$$
J_{\text {ridge }}(\theta)=J_{\text {data }}(\theta)+J_{\text {reg }}(\theta)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
where $\theta=\left[\theta_{1}, \theta_{2}\right]$ and $\lambda$ is the regularization trade-off parameter.
Chris would like to solve the problem of computing $\theta$ that minimizes the ridge regression objective. He will employ graphical methods to obtain the solution. When plotting just the data error term, $J_{\text {data }}(\theta)$, as a function of $\theta_{1}$ and $\theta_{2}$, the following set of isocontour lines (curves connecting sets of $\theta_{1}, \theta_{2}$ for which the objective value is constant) is obtained, for his dataset:
Now consider the gradient of the regularization term $\nabla J_{\text {reg }}(\theta)$. Towards what specific point does the $-\nabla J_{\text {reg }}(\theta)$ vector point?","The origin, $(0,0)$."
MIT Spring 2019,7,h,1.75,Regression,Image,"In this problem, we consider using linear regression with a regularization term. Assume a dataset of $n$ samples $\left\{\left(x^{(i)}, y^{(i)}\right)\right\}$ with $x^{(i)} \in \mathbb{R}^{2}$ and output values $y^{(i)} \in \mathbb{R}$. Recall that the ridge regression objective is defined as follows:
$$
J_{\text {ridge }}(\theta)=J_{\text {data }}(\theta)+J_{\text {reg }}(\theta)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
where $\theta=\left[\theta_{1}, \theta_{2}\right]$ and $\lambda$ is the regularization trade-off parameter.
Chris would like to solve the problem of computing $\theta$ that minimizes the ridge regression objective. He will employ graphical methods to obtain the solution. When plotting just the data error term, $J_{\text {data }}(\theta)$, as a function of $\theta_{1}$ and $\theta_{2}$, the following set of isocontour lines (curves connecting sets of $\theta_{1}, \theta_{2}$ for which the objective value is constant) is obtained, for his dataset:
If $\lambda$ is very large, what is the $\theta^{*}$ that minimizes $J_{\text {ridge }}\left(\theta^{*}\right)$ ? What approximate numerical value does $J_{\text {data }}\left(\theta^{*}\right)$ have for Chris's data?","$\lambda$ being very large forces $\theta^{*}$ to be very nearly $[0,0]$. Looking at the plot at the start of the problem, we see that the $J_{\text {data }}$ at the origin is approximately 20."
MIT Spring 2019,7,i,1.75,Regression,Image,"In this problem, we consider using linear regression with a regularization term. Assume a dataset of $n$ samples $\left\{\left(x^{(i)}, y^{(i)}\right)\right\}$ with $x^{(i)} \in \mathbb{R}^{2}$ and output values $y^{(i)} \in \mathbb{R}$. Recall that the ridge regression objective is defined as follows:
$$
J_{\text {ridge }}(\theta)=J_{\text {data }}(\theta)+J_{\text {reg }}(\theta)=\frac{1}{n} \sum_{i=1}^{n}\left(\theta^{T} x^{(i)}-y^{(i)}\right)^{2}+\lambda\|\theta\|^{2}
$$
where $\theta=\left[\theta_{1}, \theta_{2}\right]$ and $\lambda$ is the regularization trade-off parameter.
Chris would like to solve the problem of computing $\theta$ that minimizes the ridge regression objective. He will employ graphical methods to obtain the solution. When plotting just the data error term, $J_{\text {data }}(\theta)$, as a function of $\theta_{1}$ and $\theta_{2}$, the following set of isocontour lines (curves connecting sets of $\theta_{1}, \theta_{2}$ for which the objective value is constant) is obtained, for his dataset:
Given a general optimal solution $\theta^{*}$ for $J_{\text {ridge }}(\theta)$ for a given (finite) $\lambda$, what is the algebraic relationship between $\nabla J_{\text {data }}\left(\theta^{*}\right)$ and $\nabla J_{\text {reg }}\left(\theta^{*}\right)$ ?",We know that $\nabla J_{\text {ridge }}\left(\theta^{*}\right)=0$ at the optimal point. This forces $\nabla J_{\text {data }}\left(\theta^{*}\right)=-\nabla J_{\text {reg }}\left(\theta^{*}\right)$.
MIT Spring 2019,8,a,2.666666667,Neural Networks,Text,"In this problem we will investigate regularization for neural networks.
Kim constructs a fully connected neural network with $L=2$ layers using mean squared error (MSE) loss and ReLU activation functions for the hidden layer, and a linear activation for the output layer. The network is trained with a gradient descent algorithm on a data set of $n$ points $\left\{\left(x^{(1)}, y^{(1)}\right), \ldots,\left(x^{(n)}, y^{(n)}\right)\right\}$.
Recall that the update rule for weights $W^{1}$ can be specified in terms of step size $\eta$ and the gradient of the loss function with respect to weights $W^{1}$. This gradient can be expressed in terms of the activations $A^{l}$, weights $W^{l}$, pre-activations $Z^{l}$, and partials $\frac{\partial L}{\partial A^{2}}$, $\frac{\partial A^{l}}{\partial Z^{l}}$, for $l=1,2$ :
$$
W^{1}:=W^{1}-\eta \sum_{i=1}^{n} \frac{\partial L\left(h\left(x^{(i)} ; W\right), y^{(i)}\right)}{\partial W^{1}}
$$
where $h(\cdot)$ is the input-output mapping implemented by the entire neural network, and
$$
\frac{\partial L}{\partial W^{1}}=\frac{\partial Z^{1}}{\partial W^{1}} \cdot \frac{\partial A^{1}}{\partial Z^{1}} \cdot W^{2} \cdot \frac{\partial A^{2}}{\partial Z^{2}} \cdot \frac{\partial L}{\partial A^{2}}
$$
Derive a new update rule for weights $W^{1}$ which also penalizes the sum of squared values of all individual weights in the network:
$$
L^{\text {new }}=L\left(h\left(x^{(i)} ; W\right), y^{(i)}\right)+\lambda\|W\|^{2}
$$
where $\lambda$ denotes the regularization trade-off parameter. You can express the new update rule as follows:
$$
W^{1}:=\alpha W^{1}-\eta \sum_{i=1}^{n} \frac{\partial L\left(h\left(x^{(i)} ; W\right), y^{(i)}\right)}{\partial W^{1}}
$$
where $L(\cdot)$ represents the previous prediction error loss.
What is the value of $\alpha$ in terms of $\lambda$ and $\eta$ ?","$W^{1}:=(1-2 \lambda \eta) W^{1}-\eta \sum \partial L / \partial W^{1}$
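One way to see this, assuming the penalty $\lambda\|W\|^{2}$ is counted once per update rather than once per data point: its gradient with respect to $W^{1}$ is $2 \lambda W^{1}$, so
$$
W^{1}:=W^{1}-\eta\left(\sum_{i=1}^{n} \frac{\partial L}{\partial W^{1}}+2 \lambda W^{1}\right)=(1-2 \lambda \eta) W^{1}-\eta \sum_{i=1}^{n} \frac{\partial L}{\partial W^{1}}
$$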
Thus $\alpha=1-2 \lambda \eta$"
MIT Spring 2019,8,b,2.666666667,Neural Networks,Text,"In this problem we will investigate regularization for neural networks.
Kim constructs a fully connected neural network with $L=2$ layers using mean squared error (MSE) loss and ReLU activation functions for the hidden layer, and a linear activation for the output layer. The network is trained with a gradient descent algorithm on a data set of $n$ points $\left\{\left(x^{(1)}, y^{(1)}\right), \ldots,\left(x^{(n)}, y^{(n)}\right)\right\}$.
Recall that the update rule for weights $W^{1}$ can be specified in terms of step size $\eta$ and the gradient of the loss function with respect to weights $W^{1}$. This gradient can be expressed in terms of the activations $A^{l}$, weights $W^{l}$, pre-activations $Z^{l}$, and partials $\frac{\partial L}{\partial A^{2}}$, $\frac{\partial A^{l}}{\partial Z^{l}}$, for $l=1,2$ :
$$
W^{1}:=W^{1}-\eta \sum_{i=1}^{n} \frac{\partial L\left(h\left(x^{(i)} ; W\right), y^{(i)}\right)}{\partial W^{1}}
$$
where $h(\cdot)$ is the input-output mapping implemented by the entire neural network, and
$$
\frac{\partial L}{\partial W^{1}}=\frac{\partial Z^{1}}{\partial W^{1}} \cdot \frac{\partial A^{1}}{\partial Z^{1}} \cdot W^{2} \cdot \frac{\partial A^{2}}{\partial Z^{2}} \cdot \frac{\partial L}{\partial A^{2}}
$$
The new update rule for weights $W^{1}$ which also penalizes the sum of squared values of all individual weights in the network:
$$
L^{\text {new }}=L\left(h\left(x^{(i)} ; W\right), y^{(i)}\right)+\lambda\|W\|^{2}
$$
where $\lambda$ denotes the regularization trade-off parameter, is $W^{1}:=(1-2 \lambda \eta) W^{1}-\eta \sum \frac{\partial L}{\partial W^{1}}$, where $\alpha=1-2 \lambda \eta$. Explain how this new update rule helps the neural network reduce overfitting to the data.","For reasonable $\lambda$ and $\eta$, the weights are scaled by a factor less than 1 at each iteration. (If $|1-2 \lambda \eta|>1$, the weights will rapidly grow and diverge.) A value of $|\alpha|<1$ pushes the weights toward zero in general, except those weights that are needed to fit substantial subsets of the data (i.e., those weights that are needed to keep the data loss term $L$ low)."
MIT Spring 2019,8,c,2.666666667,Neural Networks,Text,"In this problem we will investigate regularization for neural networks.
Kim constructs a fully connected neural network with $L=2$ layers using mean squared error (MSE) loss and ReLU activation functions for the hidden layer, and a linear activation for the output layer. The network is trained with a gradient descent algorithm on a data set of $n$ points $\left\{\left(x^{(1)}, y^{(1)}\right), \ldots,\left(x^{(n)}, y^{(n)}\right)\right\}$.
Recall that the update rule for weights $W^{1}$ can be specified in terms of step size $\eta$ and the gradient of the loss function with respect to weights $W^{1}$. This gradient can be expressed in terms of the activations $A^{l}$, weights $W^{l}$, pre-activations $Z^{l}$, and partials $\frac{\partial L}{\partial A^{2}}$, $\frac{\partial A^{l}}{\partial Z^{l}}$, for $l=1,2$ :
$$
W^{1}:=W^{1}-\eta \sum_{i=1}^{n} \frac{\partial L\left(h\left(x^{(i)} ; W\right), y^{(i)}\right)}{\partial W^{1}}
$$
where $h(\cdot)$ is the input-output mapping implemented by the entire neural network, and
$$
\frac{\partial L}{\partial W^{1}}=\frac{\partial Z^{1}}{\partial W^{1}} \cdot \frac{\partial A^{1}}{\partial Z^{1}} \cdot W^{2} \cdot \frac{\partial A^{2}}{\partial Z^{2}} \cdot \frac{\partial L}{\partial A^{2}}
$$
Given that we are training a neural network with gradient descent, what happens when we increase the regularization trade-off parameter $\lambda$ too much, while holding the step size $\eta$ fixed?","With too large a $\lambda$, $\alpha=1-2 \lambda \eta$ may approach zero and the weights would be too strongly penalized and thus tend to zero, preventing the neural network from fitting the available training data. That is to say, the network is pushed towards an overly ""generalized"" constant output based on zero or near-zero weights. With even larger values of $\lambda$, $\alpha$ may become negative, causing oscillations in the weights. With $|\alpha|$ larger than 1, the weights will grow in magnitude without bound."
MIT Spring 2019,9,a,2,RNNs,Text,"We want to make an RNN to translate English to Martian. We have a training set of pairs $\left(e^{(i)}, m^{(i)}\right)$, where $e^{(i)}$ is a sequence of length $J^{(i)}$ of English words and $m^{(i)}$ is a sequence of length $K^{(i)}$ of Martian words. The sequences, even within a pair, do not need to be of the same length, i.e., $J^{(i)}$ need not equal $K^{(i)}$. We are considering two different strategies for turning this into a transduction or sequence-to-sequence learning problem for an RNN.
Method 1: Construct a training-sequence pair $(x, y)$ from an example $(e, m)$ by letting $x=(e_{1}, e_{2}, \ldots, e_{L}, \text{stop})$, $y=(m_{1}, m_{2}, \ldots, m_{L}, \text{stop})$. In Method 1, we assume that if the original $e$ and $m$ had different numbers of words, then the shorter sentence is padded with enough time-wasting words (""ummm"" for English, ""grlork"" for Martian) so that they now have equal length, $L$. Any needed padding words are inserted at the end of $e^{(i)}$ and at the start of $m^{(i)}$.
Method 2: Construct a training-sequence pair $(x, y)$ from an example $(e, m)$ by letting $x=(e_{1}, e_{2}, \ldots, e_{J}, \text{stop}, \text{blank}, \ldots, \text{blank})$, $y=(\text{blank}, \ldots, \text{blank}, m_{1}, m_{2}, \ldots, m_{K}, \text{stop})$. In Method 2, blanks are inserted at the end of $e$ and the start of $m$ such that the lengths of $x$ and $y$ are now both $J+K+1$.
Assume an element-wise loss function $L_{elt}(p, y)$ on predicted versus true Martian words. What is an appropriate sequence loss function for Method 1? Assume that the predicted sequence $p$ has the same length as the target sequence $y$.","$$L_{seq}=\sum_{i=1}^{L+1} L_{e l t}\left(p_{i}, y_{i}\right)$$
The RNN should seek to output the correct Martian words, as well as the stop indicator."
MIT Spring 2019,9,b,2,RNNs,Text,"We want to make an RNN to translate English to Martian. We have a training set of pairs $\left(e^{(i)}, m^{(i)}\right)$, where $e^{(i)}$ is a sequence of length $J^{(i)}$ of English words and $m^{(i)}$ is a sequence of length $K^{(i)}$ of Martian words. The sequences, even within a pair, do not need to be of the same length, i.e., $J^{(i)}$ need not equal $K^{(i)}$. We are considering two different strategies for turning this into a transduction or sequence-to-sequence learning problem for an RNN.
Method 1: Construct a training-sequence pair $(x, y)$ from an example $(e, m)$ by letting $x=(e_{1}, e_{2}, \ldots, e_{L}, \text{stop})$, $y=(m_{1}, m_{2}, \ldots, m_{L}, \text{stop})$. In Method 1, we assume that if the original $e$ and $m$ had different numbers of words, then the shorter sentence is padded with enough time-wasting words (""ummm"" for English, ""grlork"" for Martian) so that they now have equal length, $L$. Any needed padding words are inserted at the end of $e^{(i)}$ and at the start of $m^{(i)}$.
Method 2: Construct a training-sequence pair $(x, y)$ from an example $(e, m)$ by letting $x=(e_{1}, e_{2}, \ldots, e_{J}, \text{stop}, \text{blank}, \ldots, \text{blank})$, $y=(\text{blank}, \ldots, \text{blank}, m_{1}, m_{2}, \ldots, m_{K}, \text{stop})$. In Method 2, blanks are inserted at the end of $e$ and the start of $m$ such that the lengths of $x$ and $y$ are now both $J+K+1$.
Assume an element-wise loss function $L_{elt}(p, y)$ on predicted versus true Martian words. What is an appropriate sequence loss function for Method 2? Assume the predicted sequence $p$ has the same length as the target sequence $y$.","$$L_{seq}=\sum_{i=J+1}^{J+K+1} L_{elt}\left(p_{i}, y_{i}\right)$$
It's really only necessary that the RNN correctly outputs the whole Martian sequence and the final stop indicator. But, it's okay if you sum starting from the first token, $i=1$."
MIT Spring 2019,9,c,2,RNNs,Text,"We want to make an RNN to translate English to Martian. We have a training set of pairs $\left(e^{(i)}, m^{(i)}\right)$, where $e^{(i)}$ is a sequence of length $J^{(i)}$ of English words and $m^{(i)}$ is a sequence of length $K^{(i)}$ of Martian words. The sequences, even within a pair, do not need to be of the same length, i.e., $J^{(i)}$ need not equal $K^{(i)}$. We are considering two different strategies for turning this into a transduction or sequence-to-sequence learning problem for an RNN.
Method 1: Construct a training-sequence pair $(x, y)$ from an example $(e, m)$ by letting $x=(e_{1}, e_{2}, \ldots, e_{L}, \text{stop})$, $y=(m_{1}, m_{2}, \ldots, m_{L}, \text{stop})$. In Method 1, we assume that if the original $e$ and $m$ had different numbers of words, then the shorter sentence is padded with enough time-wasting words (""ummm"" for English, ""grlork"" for Martian) so that they now have equal length, $L$. Any needed padding words are inserted at the end of $e^{(i)}$ and at the start of $m^{(i)}$.
Method 2: Construct a training-sequence pair $(x, y)$ from an example $(e, m)$ by letting $x=(e_{1}, e_{2}, \ldots, e_{J}, \text{stop}, \text{blank}, \ldots, \text{blank})$, $y=(\text{blank}, \ldots, \text{blank}, m_{1}, m_{2}, \ldots, m_{K}, \text{stop})$. In Method 2, blanks are inserted at the end of $e$ and the start of $m$ such that the lengths of $x$ and $y$ are now both $J+K+1$.
Which method is likely to need a higher dimensional state? Explain why.","Method 2 likely needs to have a larger state to hold a representation of the full input sentence $e$, while Method 1 might have a shorter state that enables mapping of individual words or shorter sub-sequences of words to corresponding output words or sub-sequences."
MIT Spring 2019,9,d,2,RNNs,Text,"We want to make an RNN to translate English to Martian. We have a training set of pairs $\left(e^{(i)}, m^{(i)}\right)$, where $e^{(i)}$ is a sequence of length $J^{(i)}$ of English words and $m^{(i)}$ is a sequence of length $K^{(i)}$ of Martian words. The sequences, even within a pair, do not need to be of the same length, i.e., $J^{(i)}$ need not equal $K^{(i)}$. We are considering two different strategies for turning this into a transduction or sequence-to-sequence learning problem for an RNN.
Method 1: Construct a training-sequence pair $(x, y)$ from an example $(e, m)$ by letting $x=(e_{1}, e_{2}, \ldots, e_{L}, \text{stop})$, $y=(m_{1}, m_{2}, \ldots, m_{L}, \text{stop})$. In Method 1, we assume that if the original $e$ and $m$ had different numbers of words, then the shorter sentence is padded with enough time-wasting words (""ummm"" for English, ""grlork"" for Martian) so that they now have equal length, $L$. Any needed padding words are inserted at the end of $e^{(i)}$ and at the start of $m^{(i)}$.
Method 2: Construct a training-sequence pair $(x, y)$ from an example $(e, m)$ by letting $x=(e_{1}, e_{2}, \ldots, e_{J}, \text{stop}, \text{blank}, \ldots, \text{blank})$, $y=(\text{blank}, \ldots, \text{blank}, m_{1}, m_{2}, \ldots, m_{K}, \text{stop})$. In Method 2, blanks are inserted at the end of $e$ and the start of $m$ such that the lengths of $x$ and $y$ are now both $J+K+1$.
Which method is better if English and Martian have very different word order? Explain why.","Method 2 since it can first parse the entire input sentence, and then output in a different word order."
MIT Spring 2019,9,e,2,RNNs,Text,"We want to make an RNN to translate English to Martian. We have a training set of pairs $\left(e^{(i)}, m^{(i)}\right)$, where $e^{(i)}$ is a sequence of length $J^{(i)}$ of English words and $m^{(i)}$ is a sequence of length $K^{(i)}$ of Martian words. The sequences, even within a pair, do not need to be of the same length, i.e., $J^{(i)}$ need not equal $K^{(i)}$. We are considering two different strategies for turning this into a transduction or sequence-to-sequence learning problem for an RNN.
Method 1: Construct a training-sequence pair $(x, y)$ from an example $(e, m)$ by letting $x=(e_{1}, e_{2}, \ldots, e_{L}, \text{stop})$, $y=(m_{1}, m_{2}, \ldots, m_{L}, \text{stop})$. In Method 1, we assume that if the original $e$ and $m$ had different numbers of words, then the shorter sentence is padded with enough time-wasting words (""ummm"" for English, ""grlork"" for Martian) so that they now have equal length, $L$. Any needed padding words are inserted at the end of $e^{(i)}$ and at the start of $m^{(i)}$.
Method 2: Construct a training-sequence pair $(x, y)$ from an example $(e, m)$ by letting $x=(e_{1}, e_{2}, \ldots, e_{J}, \text{stop}, \text{blank}, \ldots, \text{blank})$, $y=(\text{blank}, \ldots, \text{blank}, m_{1}, m_{2}, \ldots, m_{K}, \text{stop})$. In Method 2, blanks are inserted at the end of $e$ and the start of $m$ such that the lengths of $x$ and $y$ are now both $J+K+1$.
Martian linguist Grlymp thinks it is also important to pad the original English and Martian sentences with time-wasting words to be of the same length for Method 2 (i.e., so that $J=K$), but English linguist Chome Nimsky disagrees. Who is correct, and why?","Chome Nimsky is right: Method 2 already has full flexibility in processing the entire sentence $e$ before outputting $m$, so additional time-wasting words would not help (and may hurt) in expressiveness and/or training."
MIT Fall 2019,1,a.i,2,Features,Text,"General Ization is consulting for a shop that sells shoes, and the General is building a model to predict what color of sneakers a given customer will buy, given information about their age and the color of the shoes they're wearing when they enter the store. The shoe shop asks for a classifier, as well as an indication of how well the classifier will perform once deployed.
There are several sneaker colors that customers might wear or buy. The General wishes to train a neural network classifier. What representation is best for the input (the color of shoes a customer is wearing when they enter the store)?",One-hot encoding of shoe color or RGB encoding of color.
MIT Fall 2019,1,a.ii,2,Classifiers,Text,"General Ization is consulting for a shop that sells shoes, and the General is building a model to predict what color of sneakers a given customer will buy, given information about their age and the color of the shoes they're wearing when they enter the store. The shoe shop asks for a classifier, as well as an indication of how well the classifier will perform once deployed.
There are several sneaker colors that customers might wear or buy. The General wishes to train a neural network classifier. What kind of output layer should the General use?",Softmax is most appropriate for multiclass classification output.
MIT Fall 2019,1,b,2,Classifiers,Image,"The store gives the General data from the past year of sales, which she splits into three distinct parts: training data, validation data, and test data. While training the neural network classifier, the General gets the following learning curves. This graph indicates that she should use the classifier resulting from training after fewer than 80 iterations. Unfortunately, she forgot to put the legend in, but luckily you can fix it! Fill in the legend with the appropriate two among training_time, training_loss, validation_loss.",Solid line = validation_loss; Dashed line = training_loss
MIT Fall 2019,1,c,2,Classifiers,Text,"General Ization is consulting for a shop that sells shoes, and the General is building a model to predict what color of sneakers a given customer will buy, given information about their age and the color of the shoes they're wearing when they enter the store. The shoe shop asks for a classifier, as well as an indication of how well the classifier will perform once deployed.
Around how many iterations should the General use to train the classifier she delivers to the shoe shop? Explain why.","Around 40, as this is when the validation loss starts to increase."
MIT Fall 2019,1,d,2,Classifiers,Text,"General Ization is consulting for a shop that sells shoes, and the General is building a model to predict what color of sneakers a given customer will buy, given information about their age and the color of the shoes they're wearing when they enter the store. The shoe shop asks for a classifier, as well as an indication of how well the classifier will perform once deployed.
The General made a grave mistake. It turns out that though she thought she had split the data into three parts, she had only split it into two and used both those splits in training and selecting her classifier. Now, she needs to collect the third split in order to indicate how well her classifier will perform when deployed. Which of the following would be the best to use? Provide a short justification for your choice.
1. Go to a nearby school and ask the students what color sneakers they used to own and note what color sneakers they are currently wearing.
2. Go to a nearby construction site and ask the workers what color shoes they used to own and note what color shoes they are currently wearing.
3. Ask the shoe store to give her more data in two months.
4. Ask a different shoe store for their data.","Either 3 or 4 would be best. 3 would better mirror the distribution they would see in that store (though there is risk of covariate shift over time), but if the store is in a rush to deploy the model, then the delay might not be possible. 4 would be faster but might not match the distribution of the original store as well."
MIT Fall 2019,1,e,2,Classifiers,Text,"General Ization is consulting for a shop that sells shoes, and the General is building a model to predict what color of sneakers a given customer will buy, given information about their age and the color of the shoes they're wearing when they enter the store. The shoe shop asks for a classifier, as well as an indication of how well the classifier will perform once deployed.
The store goes back to the General and says they've discovered a new feature they think might be useful: the color of shoes that a famous celebrity, Keslie Laelbling, is wearing that day. (Due to social media, both the customer and the store know exactly what color of shoes Keslie is wearing each day.) Unfortunately, the General is close to her deadline: she has time to train a new linear model, but not to train another deep neural network like she did before. How might the General produce an augmented model that incorporates this new feature?","Train a linear classifier that uses the neural network plus the color of Keslie's shoes as input, and produces a new prediction for what color of shoes the customer will buy."
MIT Fall 2019,2,a,3.6,Reinforcement Learning,Image,"You have designed a robot curling stone to enter a modified curling contest. In an attempt to get your robot stone to perform well, you have designed a state and action space, a reward structure, and a transition model. The goal of the robot stone is to slide upwards on an ice sheet and stop in a target region. Your robot stone likes to show off; after each state transition, it displays the reward it receives. In addition to your robot stone, there will be a number of opponent stones on the ice, as shown below. For simplicity's sake, we will consider the opponent stones to be fixed.
(image here)
Your model for the state and action spaces is as follows:
$$
\begin{aligned}
&S \in\{t, 0,1,2,3\} \\
&a \in\{\text { ""go"", ""stop"" }\}
\end{aligned}
$$
where the states refer to the robot stone being in either a terminal state (denoted as $t$ ) or within one of the four regions below: (image here)


You design the following reward function and (deterministic) transition model for your robot stone: (table here is an image)
and all other transition probabilities are 0. Here * indicates any state. Note that once the robot stone enters state $t$ the game ends: there is no transition and zero reward out of state $t$ (and hence no action to decide once in state $t$). Together with this reward function and transition model, you specify a discount factor $\gamma=1$.
In order to enable decision making by your robot stone, you need to give it the optimal policy $\pi^{*}(s)$. For your reward and transition structure and discount factor $\gamma=1$, what are the optimal Q-values, $Q^{*}(s, a)$? What is the optimal policy $\pi^{*}(s)$? Fill in the following two tables.
(table here)",Image filling
MIT Fall 2019,2,b,3.6,Reinforcement Learning,Image,"You have designed a robot curling stone to enter a modified curling contest. In an attempt to get your robot stone to perform well, you have designed a state and action space, a reward structure, and a transition model. The goal of the robot stone is to slide upwards on an ice sheet and stop in a target region. Your robot stone likes to show off; after each state transition, it displays the reward it receives. In addition to your robot stone, there will be a number of opponent stones on the ice, as shown below. For simplicity's sake, we will consider the opponent stones to be fixed.
(image here)
Your model for the state and action spaces is as follows:
$$
\begin{aligned}
&S \in\{t, 0,1,2,3\} \\
&a \in\{\text { ""go"", ""stop"" }\}
\end{aligned}
$$
where the states refer to the robot stone being in either a terminal state (denoted as $t$ ) or within one of the four regions below: (image here)


You design the following reward function and (deterministic) transition model for your robot stone: (table here is an image)
and all other transition probabilities are 0. Here * indicates any state. Note that once the robot stone enters state $t$ the game ends: there is no transition and zero reward out of state $t$ (and hence no action to decide once in state $t$). Together with this reward function and transition model, you specify a discount factor $\gamma=1$.
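Unfortunately, your competitor has also designed a robot stone. You do not know your competitor's reward structure $R(s, a)$ or transition model $T\left(s, a, s^{\prime}\right)$; however, you do know they use the same state and action spaces. Instead, you decide to use Q-learning to observe their robot stone and learn from it! For your Q-learning, use discount factor $\gamma=1$ and learning rate $\alpha=0.5$, with a $\mathrm{Q}$ table initialized to zero for all $(s, a)$ pairs.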
Your competitor runs their robot through a first game, exhibiting the following experience:
\begin{tabular}{c|c|c|c|c} 
step # & $s$ & $a$ & $r$ & $s^{\prime}$ \\
\hline 1 & 0 & ""go"" & 1 & 1 \\
2 & 1 & ""stop"" & 0 & $\mathrm{t}$
\end{tabular}
You perform Q-learning updates based on the experience above. After observing steps 1 and 2 (the first game), what is the learned $Q(0$, ""go"" $)$ ?

Solution: We know $Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right)$. So step #1 causes the following update:
$$
Q(0, "" \mathrm{go} "")=0.5 \cdot 0+0.5(1+1 \cdot 0)=0.5
$$
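A minimal sketch of this update in Python, assuming the standard tabular form $Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right)$. The function name `q_update` and the dictionary-based table are illustrative choices, not part of the original problem:

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=1.0):
    # Hypothetical helper: applies one tabular Q-learning update in place.
    # The terminal state 't' contributes no future value.
    if s_next == 't':
        future = 0.0
    else:
        future = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * future)
    return Q[(s, a)]


actions = ('go', 'stop')
Q = defaultdict(float)  # Q table initialized to zero for all (s, a)

# Step 1 of the first game: s=0, a='go', r=1, s'=1
step1 = q_update(Q, 0, 'go', 1, 1, actions)
print(step1)  # 0.5
```

The same call, fed the remaining rows of the experience table in order, would apply the later updates.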
What is the learned $Q(1$, ""stop"" $)$ ?","$$
Q(1, \text { ""stop"" })=0.5 \cdot 0+0.5(0+1 \cdot 0)=0
$$"
MIT Fall 2019,2,c,3.6,Reinforcement Learning,Image,"You have designed a robot curling stone to enter a modified curling contest. In an attempt to get your robot stone to perform well, you have designed a state and action space, a reward structure, and a transition model. The goal of the robot stone is to slide upwards on an ice sheet and stop in a target region. Your robot stone likes to show off; after each state transition, it displays the reward it receives. In addition to your robot stone, there will be a number of opponent stones on the ice, as shown below. For simplicity's sake, we will consider the opponent stones to be fixed.
(image here)
Your model for the state and action spaces is as follows:
$$
\begin{aligned}
&S \in\{t, 0,1,2,3\} \\
&a \in\{\text { ""go"", ""stop"" }\}
\end{aligned}
$$
where the states refer to the robot stone being in either a terminal state (denoted as $t$ ) or within one of the four regions below: (image here)


You design the following reward function and (deterministic) transition model for your robot stone: (table here is an image)
and all other transition probabilities are 0. Here * indicates any state. Note that once the robot stone enters state $t$ the game ends: there is no transition and zero reward out of state $t$ (and hence no action to decide once in state $t$). Together with this reward function and transition model, you specify a discount factor $\gamma=1$.
Unfortunately, your competitor has also designed a robot stone. You do not know your competitor's reward structure $R(s, a)$ or transition model $T\left(s, a, s^{\prime}\right)$; however, you do know they use the same state and action spaces. Instead, you decide to use Q-learning to observe their robot stone and learn from it! For your Q-learning, use discount factor $\gamma=1$ and learning rate $\alpha=0.5$, with a $\mathrm{Q}$ table initialized to zero for all $(s, a)$ pairs.
Your competitor runs their robot through a second game, exhibiting the following additional experience:
\begin{tabular}{c|c|c|c|c} 
step # & $s$ & $a$ & $r$ & $s^{\prime}$ \\
\hline 3 & 0 & ""go"" & 1 & 1 \\
4 & 1 & ""go"" & 1 & 2 \\
5 & 2 & ""go"" & 1 & 3 \\
6 & 3 & ""stop"" & 2 & $t$
\end{tabular}
You perform additional Q-learning updates based on this additional experience. After completion of both games (all six steps), what are the full set of $Q$ values you have learned for their robot? Fill in the following table.
(image here)
",Image filling
MIT Fall 2019,2,d,3.6,Reinforcement Learning,Image,"You have designed a robot curling stone to enter a modified curling contest. In an attempt to get your robot stone to perform well, you have designed a state and action space, a reward structure, and a transition model. The goal of the robot stone is to slide upwards on an ice sheet and stop in a target region. Your robot stone likes to show off; after each state transition, it displays the reward it receives. In addition to your robot stone, there will be a number of opponent stones on the ice, as shown below. For simplicity's sake, we will consider the opponent stones to be fixed.
(image here)
Your model for the state and action spaces is as follows:
$$
\begin{aligned}
&S \in\{t, 0,1,2,3\} \\
&a \in\{\text { ""go"", ""stop"" }\}
\end{aligned}
$$
where the states refer to the robot stone being in either a terminal state (denoted as $t$ ) or within one of the four regions below: (image here)


You design the following reward function and (deterministic) transition model for your robot stone: (table here is an image)
and all other transition probabilities are 0. Here * indicates any state. Note that once the robot stone enters state $t$ the game ends: there is no transition and zero reward out of state $t$ (and hence no action to decide once in state $t$). Together with this reward function and transition model, you specify a discount factor $\gamma=1$.
Unfortunately, your competitor has also designed a robot stone. You do not know your competitor's reward structure $R(s, a)$ or transition model $T\left(s, a, s^{\prime}\right)$; however, you do know they use the same state and action spaces. Instead, you decide to use Q-learning to observe their robot stone and learn from it! For your Q-learning, use discount factor $\gamma=1$ and learning rate $\alpha=0.5$, with a $\mathrm{Q}$ table initialized to zero for all $(s, a)$ pairs.
We can think of learning the Q-value function for a given action as a regression problem with each state $s$ mapped to a one-hot feature vector $x=\phi_{A}(s)$, where $x=\left[\begin{array}{llll}1 & 0 & 0 & 0\end{array}\right]^{T}$ for state 0, $x=\left[\begin{array}{llll}0 & 1 & 0 & 0\end{array}\right]^{T}$ for state 1, etc., and $x=\left[\begin{array}{llll}0 & 0 & 0 & 0\end{array}\right]^{T}$ for state $t$.

We'll focus on the action ""go"". We would like to come up with parameters $\theta, \theta_{0}$ such that $Q(s, \text{""go""})=\theta \cdot \phi_{A}(s)+\theta_{0}=\theta \cdot x+\theta_{0}$. Is there in general (for arbitrary values of our $Q(s, \text{""go""})$) a setting of $\theta, \theta_{0}$ that enables representation of $Q(s, \text{""go""})$ with perfect accuracy? If so, provide the corresponding $\theta$ and $\theta_{0}$. If not, explain why. (Note that we do not need to model $Q(t, a)$, since the game is over once state $t$ has been reached.)","Yes; $\theta_{i}$ (the component of $\theta$ paired with the one-hot entry for state $i$) is simply the value of $Q(s=i, \text{""go""})$, and $\theta_{0}=0$.
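As a quick check of this choice (a worked line added for illustration, writing $\theta_{i}$ for the component of $\theta$ paired with state $i$):
$$
\theta \cdot \phi_{A}(i)+\theta_{0}=\theta_{i} \cdot 1+\sum_{j \neq i} \theta_{j} \cdot 0+0=Q(i, \text{""go""}), \quad i \in\{0,1,2,3\}
$$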
Note: $\theta=\left[\begin{array}{llll}5 & 4 & 3 & 0\end{array}\right]^{T}$ and $\theta_{0}=0$ would work for our optimal $Q^{*}(s, a)$, but we seek a more general $\theta$ corresponding to arbitrary or general $Q(s, a)$."
MIT Fall 2019,2,e,3.6,Reinforcement Learning,Image,"You have designed a robot curling stone to enter a modified curling contest. In an attempt to get your robot stone to perform well, you have designed a state and action space, a reward structure, and a transition model. The goal of the robot stone is to slide upwards on an ice sheet and stop in a target region. Your robot stone likes to show off; after each state transition, it displays the reward it receives. In addition to your robot stone, there will be a number of opponent stones on the ice, as shown below. For simplicity's sake, we will consider the opponent stones to be fixed.
(image here)
Your model for the state and action spaces is as follows:
$$
\begin{aligned}
&S \in\{t, 0,1,2,3\} \\
&a \in\{\text { ""go"", ""stop"" }\}
\end{aligned}
$$
where the states refer to the robot stone being in either a terminal state (denoted as $t$ ) or within one of the four regions below: (image here)


You design the following reward function and (deterministic) transition model for your robot stone: (table here is an image)
and all other transition probabilities are 0. Here * indicates any state. Note that once the robot stone enters state $t$ the game ends: there is no transition and zero reward out of state $t$ (and hence no action to decide once in state $t$). Together with this reward function and transition model, you specify a discount factor $\gamma=1$.
Unfortunately, your competitor has also designed a robot stone. You do not know your competitor's reward structure $R(s, a)$ or transition model $T\left(s, a, s^{\prime}\right)$; however, you do know they use the same state and action spaces. Instead, you decide to use Q-learning to observe their robot stone and learn from it! For your Q-learning, use discount factor $\gamma=1$ and learning rate $\alpha=0.5$, with a $\mathrm{Q}$ table initialized to zero for all $(s, a)$ pairs.
Unfortunately, your robot's GPS system suddenly breaks, and it is no longer able to tell which of the four regions it is in. However, the robot has side cameras which can detect the opponent stones as it travels through the center of the ice, encoded as $[\text{(number of stones to immediate left)} \;\; \text{(number of stones to immediate right)}]^{T}$. You decide to use this information as state, giving the following feature transformation $\phi_{B}$ on your original state:
$$
\begin{aligned}
\phi_{B}(3) &=\left[\begin{array}{ll}
1 & 1
\end{array}\right]^{T} \\
\phi_{B}(2) &=\left[\begin{array}{ll}
0 & 0
\end{array}\right]^{T} \\
\phi_{B}(1) &=\left[\begin{array}{ll}
1 & 0
\end{array}\right]^{T} \\
\phi_{B}(0) &=\left[\begin{array}{ll}
0 & 1
\end{array}\right]^{T}
\end{aligned}
$$
We would still like to come up with parameters $\theta, \theta_{0}$ such that $Q(s, \text{""go""})=\theta \cdot \phi_{B}(s)+\theta_{0}$, for general values of $Q(s, \text{""go""})$. Is there a setting of $\theta, \theta_{0}$ that enables representation of this encoding of $Q(s, \text{""go""})$ with perfect accuracy? If so, provide the corresponding $\theta$ and $\theta_{0}$. If not, explain why this is not possible, and provide a feature transformation $\phi_{C}(\cdot)$ that does enable representation of $Q(s, \text{""go""})=\theta \cdot \phi_{C}\left(\phi_{B}(s)\right)+\theta_{0}$ with perfect accuracy.","No. Let $\left[\begin{array}{ll}x_{1} & x_{2}\end{array}\right]=\phi_{B}(s)$, so $\theta_{1} x_{1}+\theta_{2} x_{2}+\theta_{0}=Q(s, \text{""go""})$. $\phi_{B}(2)$ forces $\theta_{0}=Q(2, \text{""go""})$; $\phi_{B}(1)$ then forces $\theta_{1}$; $\phi_{B}(0)$ forces $\theta_{2}$; and no free parameter remains to match $Q(3, \text{""go""})$ at $\phi_{B}(3)$ in general.
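
To make the obstruction concrete, the four constraints can be written out as follows (a sketch using the same symbols as above, with $q_{s}$ as shorthand for $Q(s, \text{""go""})$, a name introduced here only for illustration):
$$
\begin{aligned}
\phi_{B}(2)=\left[\begin{array}{ll}0 & 0\end{array}\right]^{T} &: \quad \theta_{0}=q_{2} \\
\phi_{B}(1)=\left[\begin{array}{ll}1 & 0\end{array}\right]^{T} &: \quad \theta_{1}+\theta_{0}=q_{1} \\
\phi_{B}(0)=\left[\begin{array}{ll}0 & 1\end{array}\right]^{T} &: \quad \theta_{2}+\theta_{0}=q_{0} \\
\phi_{B}(3)=\left[\begin{array}{ll}1 & 1\end{array}\right]^{T} &: \quad \theta_{1}+\theta_{2}+\theta_{0}=q_{3}
\end{aligned}
$$
The first three equations give $\theta_{0}=q_{2}$, $\theta_{1}=q_{1}-q_{2}$, and $\theta_{2}=q_{0}-q_{2}$, so the fourth can hold only if $q_{3}=q_{1}+q_{0}-q_{2}$, which is not true for arbitrary Q-values.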

We can create $\phi_{C}$ as a one-hot encoding of state such that $\phi_{C}\left(\phi_{B}(s)\right)=\phi_{A}(s)$ to uniquely identify our four states (with corresponding $\theta$ and $\theta_{0}$ as in the previous part) to regain perfect representational power.
"
MIT Fall 2019,3,a,2,Regression,Text,"Mitch Mitdiddle has kept track of ratings of three items from three different users of his new e-commerce website. These items are essential, and so each user has purchased and rated all three of the items. However, Mitch has lost one of the ratings, $r$. The (almost) complete ratings matrix is here:
$$
Y=\left[\begin{array}{ccc}
6 & 8 & 10 \\
9 & 12 & r \\
3 & 4 & 5
\end{array}\right]
$$
Note, users $a$ and items $i$ are indexed from 1, i.e., the first row of $Y$ corresponds to user 1 ($a=1$), and the first column of $Y$ corresponds to item 1 ($i=1$).
Treating $r$ as a missing value, is there a rank-1 representation of $Y$ as $U V^{T}$ (i.e., such that $U V^{T}$ produces a matrix that perfectly matches the non-missing elements of $Y$)? If yes, provide matrices $U$ and $V$ of shape $3 \times 1$ such that $Y=U V^{T}$. If no, explain why not.
","U = [2; 3; 1], V = [3; 4; 5]. Other solutions exist if student scales $U$ by $s$ and $V$ by 1/s."
MIT Fall 2019,3,b,2,Regression,Text,"Mitch Mitdiddle has kept track of ratings of three items from three different users of his new e-commerce website. These items are essential, and so each user has purchased and rated all three of the items. However, Mitch has lost one of the ratings, $r$. The (almost) complete ratings matrix is here:
$$
Y=\left[\begin{array}{ccc}
6 & 8 & 10 \\
9 & 12 & r \\
3 & 4 & 5
\end{array}\right]
$$
Note, users $a$ and items $i$ are indexed from 1, i.e., the first row of $Y$ corresponds to user 1 ($a=1$), and the first column of $Y$ corresponds to item 1 ($i=1$).
An e-commerce expert explains to you that users only care about one particular feature when it comes to rating products, and provides you with the value of this feature for each item, which we set as $V$:
$$
V=\left[\begin{array}{c}
6 \\
8 \\
10
\end{array}\right]
$$
Mitch remembers that we can use the alternating least squares method to solve for $U$, minimizing:
$$
J(U, V)=\frac{1}{2} \sum_{(a, i) \in D}\left(U^{(a)} \cdot V^{(i)}-Y_{a, i}\right)^{2}
$$
where $D$ is the set of all (user $a$, item $i$) rating pairs $(a, i)$. Here $U^{(a)}$ is the $a^{\text{th}}$ row of $U$, and $V^{(i)}$ is the $i^{\text{th}}$ row of $V$. Note that offsets are fixed at $b_{U}=0$ and $b_{V}=0$ in this problem.
Using the same data matrix $Y$ with missing value $r$ and holding $V$ constant, what is the value of $U^{(2)}$ (the second row of $U$ ) that minimizes $J$ ? Identify what $(a, i)$ pairs and $Y_{a, i}$ values matter in this minimization, remembering that $r$ (value of $Y_{2,3}$ ) is not involved.",U^(2) = 1.5
MIT Fall 2019,3,c,2,Regression,Text,"Mitch Mitdiddle has kept track of ratings of three items from three different users of his new e-commerce website. These items are essential, and so each user has purchased and rated all three of the items. However, Mitch has lost one of the ratings, $r$. The (almost) complete ratings matrix is here:
$$
Y=\left[\begin{array}{ccc}
6 & 8 & 10 \\
9 & 12 & r \\
3 & 4 & 5
\end{array}\right]
$$
Note, users $a$ and items $i$ are indexed from 1, i.e., the first row of $Y$ corresponds to user 1 ($a=1$), and the first column of $Y$ corresponds to item 1 ($i=1$).
An e-commerce expert explains to you that users only care about one particular feature when it comes to rating products, and provides you with the value of this feature for each item, which we set as $V$:
$$
V=\left[\begin{array}{c}
6 \\
8 \\
10
\end{array}\right]
$$
Mitch remembers that we can use the alternating least squares method to solve for $U$, minimizing:
$$
J(U, V)=\frac{1}{2} \sum_{(a, i) \in D}\left(U^{(a)} \cdot V^{(i)}-Y_{a, i}\right)^{2}
$$
where $D$ is the set of all (user $a$, item $i$) rating pairs $(a, i)$. Here $U^{(a)}$ is the $a^{\text{th}}$ row of $U$, and $V^{(i)}$ is the $i^{\text{th}}$ row of $V$. Note that offsets are fixed at $b_{U}=0$ and $b_{V}=0$ in this problem.
What is our prediction for $r$, given the $V$ and $U^{(2)}$ ?",r = 15
MIT Fall 2019,3,d,2,Regression,Text,"Mitch Mitdiddle has kept track of ratings of three items from three different users of his new e-commerce website. These items are essential, and so each user has purchased and rated all three of the items. However, Mitch has lost one of the ratings, $r$. The (almost) complete ratings matrix is here:
$$
Y=\left[\begin{array}{ccc}
6 & 8 & 10 \\
9 & 12 & r \\
3 & 4 & 5
\end{array}\right]
$$
Note, users $a$ and items $i$ are indexed from 1, i.e., the first row of $Y$ corresponds to user 1 ($a=1$), and the first column of $Y$ corresponds to item 1 ($i=1$).
An e-commerce expert explains to you that users only care about one particular feature when it comes to rating products, and provides you with the value of this feature for each item, which we set as $V$:
$$
V=\left[\begin{array}{c}
6 \\
8 \\
10
\end{array}\right]
$$
Mitch remembers that we can use the alternating least squares method to solve for $U$, minimizing:
$$
J(U, V)=\frac{1}{2} \sum_{(a, i) \in D}\left(U^{(a)} \cdot V^{(i)}-Y_{a, i}\right)^{2}
$$
where $D$ is the set of all (user $a$, item $i$) rating pairs $(a, i)$. Here $U^{(a)}$ is the $a^{\text{th}}$ row of $U$, and $V^{(i)}$ is the $i^{\text{th}}$ row of $V$. Note that offsets are fixed at $b_{U}=0$ and $b_{V}=0$ in this problem.
Mark all that are true for our $U$, $V$, and $Y$ above:
A. There are infinitely many settings of $U$ and $V$ that minimize $J$.
B. For any constant (non-zero) $V$, $J(U)$ has a unique global minimum.
C. For any constant (non-zero) $V$, there exists a $U$ such that $J(U, V)=0$.
D. For any $m \times n$ matrix $Y$ of rank 1, there exist matrices $U$ and $V$ of sizes $m \times 1$ and $n \times 1$ such that $J(U, V)=0$.","A, B, D true; C false"
MIT Fall 2019,4,a.i,0.5,Neural Networks,Text,"Otto N. Coder is exploring different autoencoder architectures. Consider the following autoencoder with input $x \in \mathbb{R}^{d}$ and output $y^{\text {pred }} \in \mathbb{R}^{d}$. The autoencoder has one hidden layer with $m$ hidden units: $z^{(1)}, a^{(1)} \in \mathbb{R}^{m}$. Assume $x, z^{(2)}$, and $y^{\text {pred }}$ have dimensions $d \times 1$. Also let $z^{(1)}$ and $a^{(1)}$ have dimensions $m \times 1$. What are the dimensions of $W^{(1)}$?",m x d
MIT Fall 2019,4,a.ii,0.5,Neural Networks,Text,"Otto N. Coder is exploring different autoencoder architectures. Consider the following autoencoder with input $x \in \mathbb{R}^{d}$ and output $y^{\text {pred }} \in \mathbb{R}^{d}$. The autoencoder has one hidden layer with $m$ hidden units: $z^{(1)}, a^{(1)} \in \mathbb{R}^{m}$. Assume $x, z^{(2)}$, and $y^{\text {pred }}$ have dimensions $d \times 1$. Also let $z^{(1)}$ and $a^{(1)}$ have dimensions $m \times 1$. What are the dimensions of $b^{(1)}$?",m x 1
MIT Fall 2019,4,a.iii,0.5,Neural Networks,Text,"Otto N. Coder is exploring different autoencoder architectures. Consider the following autoencoder with input $x \in \mathbb{R}^{d}$ and output $y^{\text {pred }} \in \mathbb{R}^{d}$. The autoencoder has one hidden layer with $m$ hidden units: $z^{(1)}, a^{(1)} \in \mathbb{R}^{m}$. Assume $x, z^{(2)}$, and $y^{\text {pred }}$ have dimensions $d \times 1$. Also let $z^{(1)}$ and $a^{(1)}$ have dimensions $m \times 1$. What are the dimensions of $W^{(2)}$?",d x m
MIT Fall 2019,4,a.iv,0.5,Neural Networks,Text,"Otto N. Coder is exploring different autoencoder architectures. Consider the following autoencoder with input $x \in \mathbb{R}^{d}$ and output $y^{\text {pred }} \in \mathbb{R}^{d}$. The autoencoder has one hidden layer with $m$ hidden units: $z^{(1)}, a^{(1)} \in \mathbb{R}^{m}$. Assume $x, z^{(2)}$, and $y^{\text {pred }}$ have dimensions $d \times 1$. Also let $z^{(1)}$ and $a^{(1)}$ have dimensions $m \times 1$. What are the dimensions of $b^{(2)}$?",d x 1
MIT Fall 2019,4,b,1.5,Neural Networks,Text,"Otto N. Coder is exploring different autoencoder architectures. Consider the following autoencoder with input $x \in \mathbb{R}^{d}$ and output $y^{\text {pred }} \in \mathbb{R}^{d}$. The autoencoder has one hidden layer with $m$ hidden units: $z^{(1)}, a^{(1)} \in \mathbb{R}^{m}$. Assume $x, z^{(2)}$, and $y^{\text {pred }}$ have dimensions $d \times 1$. Also let $z^{(1)}$ and $a^{(1)}$ have dimensions $m \times 1$. 
Otto trains the autoencoder with back-propagation. The loss for a given datapoint $x, y$ is:
$$
J(x, y)=\frac{1}{2}\left\|y^{\text {pred }}-y\right\|^{2}=\frac{1}{2}\left(y^{\text {pred }}-y\right)^{T}\left(y^{\text {pred }}-y\right)
$$
Compute the following intermediate partial derivatives. For the following questions, write your answer in terms of $x, y, y^{p r e d}, W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}, f^{(1)}, f^{(2)}$ and any previously computed or provided partial derivative. Also note that:
1. Let $\partial f^{(1)} / \partial z^{(1)}$ be an $m \times 1$ matrix, provided to you.
2. Let $\partial f^{(2)} / \partial z^{(2)}$ be a $d \times 1$ matrix, provided to you.
3. If $A x=y$ where $A$ is a $m \times n$ matrix and $x$ is $n \times 1$ and $y$ is $m \times 1$, then let $\partial y / \partial A=x$.
4. In your answers below, we will assume multiplications are matrix multiplication; to indicate element-wise multiplication, use the symbol *.
Find $\partial J / \partial y^{\text {pred }}$, a $d \times 1$ matrix.",$\partial J / \partial y^{\text{pred}} = y^{\text{pred}} - y$
MIT Fall 2019,4,c,1.5,Neural Networks,Text,"Otto N. Coder is exploring different autoencoder architectures. Consider the following autoencoder with input $x \in \mathbb{R}^{d}$ and output $y^{\text {pred }} \in \mathbb{R}^{d}$. The autoencoder has one hidden layer with $m$ hidden units: $z^{(1)}, a^{(1)} \in \mathbb{R}^{m}$. Assume $x, z^{(2)}$, and $y^{\text {pred }}$ have dimensions $d \times 1$. Also let $z^{(1)}$ and $a^{(1)}$ have dimensions $m \times 1$. 
Otto trains the autoencoder with back-propagation. The loss for a given datapoint $x, y$ is:
$$
J(x, y)=\frac{1}{2}\left\|y^{\text {pred }}-y\right\|^{2}=\frac{1}{2}\left(y^{\text {pred }}-y\right)^{T}\left(y^{\text {pred }}-y\right)
$$
Compute the following intermediate partial derivatives. For the following questions, write your answer in terms of $x, y, y^{p r e d}, W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}, f^{(1)}, f^{(2)}$ and any previously computed or provided partial derivative. Also note that:
1. Let $\partial f^{(1)} / \partial z^{(1)}$ be an $m \times 1$ matrix, provided to you.
2. Let $\partial f^{(2)} / \partial z^{(2)}$ be a $d \times 1$ matrix, provided to you.
3. If $A x=y$ where $A$ is a $m \times n$ matrix and $x$ is $n \times 1$ and $y$ is $m \times 1$, then let $\partial y / \partial A=x$.
4. In your answers below, we will assume multiplications are matrix multiplication; to indicate element-wise multiplication, use the symbol *.
Find $\partial J / \partial z^{(2)}$, a $d \times 1$ matrix. You may use $\partial J / \partial y^{\text {pred }}$ and $*$ for element-wise multiplication.",$\partial J / \partial z^{(2)} = \partial J / \partial y^{\text{pred}} * \partial f^{(2)} / \partial z^{(2)}$
MIT Fall 2019,4,d,1.5,Neural Networks,Text,"Otto N. Coder is exploring different autoencoder architectures. Consider the following autoencoder with input $x \in \mathbb{R}^{d}$ and output $y^{\text {pred }} \in \mathbb{R}^{d}$. The autoencoder has one hidden layer with $m$ hidden units: $z^{(1)}, a^{(1)} \in \mathbb{R}^{m}$. Assume $x, z^{(2)}$, and $y^{\text {pred }}$ have dimensions $d \times 1$. Also let $z^{(1)}$ and $a^{(1)}$ have dimensions $m \times 1$. 
Otto trains the autoencoder with back-propagation. The loss for a given datapoint $x, y$ is:
$$
J(x, y)=\frac{1}{2}\left\|y^{\text {pred }}-y\right\|^{2}=\frac{1}{2}\left(y^{\text {pred }}-y\right)^{T}\left(y^{\text {pred }}-y\right)
$$
Compute the following intermediate partial derivatives. For the following questions, write your answer in terms of $x, y, y^{p r e d}, W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}, f^{(1)}, f^{(2)}$ and any previously computed or provided partial derivative. Also note that:
1. Let $\partial f^{(1)} / \partial z^{(1)}$ be an $m \times 1$ matrix, provided to you.
2. Let $\partial f^{(2)} / \partial z^{(2)}$ be a $d \times 1$ matrix, provided to you.
3. If $A x=y$ where $A$ is a $m \times n$ matrix and $x$ is $n \times 1$ and $y$ is $m \times 1$, then let $\partial y / \partial A=x$.
4. In your answers below, we will assume multiplications are matrix multiplication; to indicate element-wise multiplication, use the symbol *.
Find $\partial J / \partial W^{(2)}$, a $d \times m$ matrix. You may use $\partial J / \partial z^{(2)}$.",$\partial J / \partial W^{(2)} = \partial J / \partial z^{(2)} \cdot f^{(1)}\left(W^{(1)} x+b^{(1)}\right)^{T}$
MIT Fall 2019,4,e,1.5,Neural Networks,Text,"Otto N. Coder is exploring different autoencoder architectures. Consider the following autoencoder with input $x \in \mathbb{R}^{d}$ and output $y^{\text {pred }} \in \mathbb{R}^{d}$. The autoencoder has one hidden layer with $m$ hidden units: $z^{(1)}, a^{(1)} \in \mathbb{R}^{m}$. Assume $x, z^{(2)}$, and $y^{\text {pred }}$ have dimensions $d \times 1$. Also let $z^{(1)}$ and $a^{(1)}$ have dimensions $m \times 1$. 
Otto trains the autoencoder with back-propagation. The loss for a given datapoint $x, y$ is:
$$
J(x, y)=\frac{1}{2}\left\|y^{\text {pred }}-y\right\|^{2}=\frac{1}{2}\left(y^{\text {pred }}-y\right)^{T}\left(y^{\text {pred }}-y\right)
$$
Compute the following intermediate partial derivatives. For the following questions, write your answer in terms of $x, y, y^{p r e d}, W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}, f^{(1)}, f^{(2)}$ and any previously computed or provided partial derivative. Also note that:
1. Let $\partial f^{(1)} / \partial z^{(1)}$ be an $m \times 1$ matrix, provided to you.
2. Let $\partial f^{(2)} / \partial z^{(2)}$ be a $d \times 1$ matrix, provided to you.
3. If $A x=y$ where $A$ is a $m \times n$ matrix and $x$ is $n \times 1$ and $y$ is $m \times 1$, then let $\partial y / \partial A=x$.
4. In your answers below, we will assume multiplications are matrix multiplication; to indicate element-wise multiplication, use the symbol *.
Write the gradient descent update step for just $W^{(2)}$ for one datapoint $(x, y)$ given learning rate $\eta$ and $\partial J / \partial W^{(2)}$.","$W^{(2)}:=W^{(2)}-\eta \, \partial J(x, y) / \partial W^{(2)}$"
MIT Fall 2019,4,f,1.5,Neural Networks,Text,"Otto N. Coder is exploring different autoencoder architectures. Consider the following autoencoder with input $x \in \mathbb{R}^{d}$ and output $y^{\text {pred }} \in \mathbb{R}^{d}$. The autoencoder has one hidden layer with $m$ hidden units: $z^{(1)}, a^{(1)} \in \mathbb{R}^{m}$. Assume $x, z^{(2)}$, and $y^{\text {pred }}$ have dimensions $d \times 1$. Also let $z^{(1)}$ and $a^{(1)}$ have dimensions $m \times 1$. 
Otto trains the autoencoder with back-propagation. The loss for a given datapoint $x, y$ is:
$$
J(x, y)=\frac{1}{2}\left\|y^{\text {pred }}-y\right\|^{2}=\frac{1}{2}\left(y^{\text {pred }}-y\right)^{T}\left(y^{\text {pred }}-y\right)
$$
Compute the following intermediate partial derivatives. For the following questions, write your answer in terms of $x, y, y^{p r e d}, W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}, f^{(1)}, f^{(2)}$ and any previously computed or provided partial derivative. Also note that:
1. Let $\partial f^{(1)} / \partial z^{(1)}$ be an $m \times 1$ matrix, provided to you.
2. Let $\partial f^{(2)} / \partial z^{(2)}$ be a $d \times 1$ matrix, provided to you.
3. If $A x=y$ where $A$ is a $m \times n$ matrix and $x$ is $n \times 1$ and $y$ is $m \times 1$, then let $\partial y / \partial A=x$.
4. In your answers below, we will assume multiplications are matrix multiplication; to indicate element-wise multiplication, use the symbol *.
Otto's friend Bigsby believes that bigger is better. He takes a look at Otto's neural network and tells Otto that he should make the number of hidden units $m$ in the hidden layer very large: $m=10 d$. (Recall that $z^{(1)}$ has dimensions $m \times 1$.) Is Bigsby correct? What would you expect to see with training and test accuracy using Bigsby's approach?","No; training accuracy might be high, but this would likely be due to overfitting and lead to worse test accuracy."
MIT Fall 2019,4,g,1.5,Neural Networks,Text,"Otto N. Coder is exploring different autoencoder architectures. Consider the following autoencoder with input $x \in \mathbb{R}^{d}$ and output $y^{\text {pred }} \in \mathbb{R}^{d}$. The autoencoder has one hidden layer with $m$ hidden units: $z^{(1)}, a^{(1)} \in \mathbb{R}^{m}$. Assume $x, z^{(2)}$, and $y^{\text {pred }}$ have dimensions $d \times 1$. Also let $z^{(1)}$ and $a^{(1)}$ have dimensions $m \times 1$. 
Otto trains the autoencoder with back-propagation. The loss for a given datapoint $x, y$ is:
$$
J(x, y)=\frac{1}{2}\left\|y^{\text {pred }}-y\right\|^{2}=\frac{1}{2}\left(y^{\text {pred }}-y\right)^{T}\left(y^{\text {pred }}-y\right)
$$
Compute the following intermediate partial derivatives. For the following questions, write your answer in terms of $x, y, y^{p r e d}, W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}, f^{(1)}, f^{(2)}$ and any previously computed or provided partial derivative. Also note that:
1. Let $\partial f^{(1)} / \partial z^{(1)}$ be an $m \times 1$ matrix, provided to you.
2. Let $\partial f^{(2)} / \partial z^{(2)}$ be a $d \times 1$ matrix, provided to you.
3. If $A x=y$ where $A$ is a $m \times n$ matrix and $x$ is $n \times 1$ and $y$ is $m \times 1$, then let $\partial y / \partial A=x$.
4. In your answers below, we will assume multiplications are matrix multiplication; to indicate element-wise multiplication, use the symbol *.
Leila says having more layers is better. Let $m$ be much smaller than $d$. Leila adds 10 more hidden layers all with linear activation before Otto's current hidden layer (which has sigmoid activation function $f^{(1)}$) such that each hidden layer has $m$ units. What would you expect to see with your training and test accuracy, compared to just having one hidden layer with activation $f^{(1)}$?","The intermediary hidden layers do not add any expressivity to the network, and we would expect similar training and test accuracy as compared to the single $f^{(1)}$ hidden layer network. This may, however, require a different number of training iterations with the same available data, in order to achieve similar accuracy."
MIT Fall 2019,4,h,1.5,Neural Networks,Text,"Otto N. Coder is exploring different autoencoder architectures. Consider the following autoencoder with input $x \in \mathbb{R}^{d}$ and output $y^{\text {pred }} \in \mathbb{R}^{d}$. The autoencoder has one hidden layer with $m$ hidden units: $z^{(1)}, a^{(1)} \in \mathbb{R}^{m}$. Assume $x, z^{(2)}$, and $y^{\text {pred }}$ have dimensions $d \times 1$. Also let $z^{(1)}$ and $a^{(1)}$ have dimensions $m \times 1$. 
Otto trains the autoencoder with back-propagation. The loss for a given datapoint $x, y$ is:
$$
J(x, y)=\frac{1}{2}\left\|y^{\text {pred }}-y\right\|^{2}=\frac{1}{2}\left(y^{\text {pred }}-y\right)^{T}\left(y^{\text {pred }}-y\right)
$$
Compute the following intermediate partial derivatives. For the following questions, write your answer in terms of $x, y, y^{p r e d}, W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}, f^{(1)}, f^{(2)}$ and any previously computed or provided partial derivative. Also note that:
1. Let $\partial f^{(1)} / \partial z^{(1)}$ be an $m \times 1$ matrix, provided to you.
2. Let $\partial f^{(2)} / \partial z^{(2)}$ be a $d \times 1$ matrix, provided to you.
3. If $A x=y$ where $A$ is a $m \times n$ matrix and $x$ is $n \times 1$ and $y$ is $m \times 1$, then let $\partial y / \partial A=x$.
4. In your answers below, we will assume multiplications are matrix multiplication; to indicate element-wise multiplication, use the symbol *.
Otto's other friend Neil suggests having several layers with non-linear activation functions. He says Otto should regularize the number of active hidden units. Loosely speaking, we consider the average activation of a hidden unit $j$ in our hidden layer 1 (which has sigmoid activation function $f^{(1)}$) to be the average of the activation of $a_{j}^{(1)}$ over the points $x_{i}$ in our training dataset of size $N$:
$$
\hat{p}_{j}=\frac{1}{N} \sum_{i=1}^{N} a_{j}^{(1)}\left(x_{i}\right)
$$
Assume we would like to enforce the constraint that the average activation for each hidden unit $\hat{p}_{j}$ is close to some hyperparameter $p$. Usually, $p$ is very small (say $p<0.05$ ).
What is the best format for a regularization penalty given hyperparameter $p$ and the average activation for all our hidden units: $\hat{p}_{j}$ ? Select one of the following:
A. Hinge loss: $\sum_{j} \max \left(0,\left(1-\hat{p}_{j}\right) p\right)$
B. NLL: $\sum_{j}\left(-p \log \frac{p}{\hat{p}_{j}}-(1-p) \log \frac{(1-p)}{\left(1-\hat{p}_{j}\right)}\right)$
C. Squared loss: $\sum_{j}\left(\hat{p}_{j}-p\right)^{2}$
D. l2 norm: $\sum_{j}\left(\hat{p}_{j}\right)^{2}$","Either NLL or squared loss should work, encouraging $p$ and $\hat{p}_{j}$ to be close. NLL loss might better handle a wide range in the magnitudes of $\hat{p}_{j}$."
MIT Fall 2019,4,i,1.5,Neural Networks,Text,"Otto N. Coder is exploring different autoencoder architectures. Consider the following autoencoder with input $x \in \mathbb{R}^{d}$ and output $y^{\text {pred }} \in \mathbb{R}^{d}$. The autoencoder has one hidden layer with $m$ hidden units: $z^{(1)}, a^{(1)} \in \mathbb{R}^{m}$. Assume $x, z^{(2)}$, and $y^{\text {pred }}$ have dimensions $d \times 1$. Also let $z^{(1)}$ and $a^{(1)}$ have dimensions $m \times 1$. 
Otto trains the autoencoder with back-propagation. The loss for a given datapoint $x, y$ is:
$$
J(x, y)=\frac{1}{2}\left\|y^{\text {pred }}-y\right\|^{2}=\frac{1}{2}\left(y^{\text {pred }}-y\right)^{T}\left(y^{\text {pred }}-y\right)
$$
Compute the following intermediate partial derivatives. For the following questions, write your answer in terms of $x, y, y^{p r e d}, W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}, f^{(1)}, f^{(2)}$ and any previously computed or provided partial derivative. Also note that:
1. Let $\partial f^{(1)} / \partial z^{(1)}$ be an $m \times 1$ matrix, provided to you.
2. Let $\partial f^{(2)} / \partial z^{(2)}$ be a $d \times 1$ matrix, provided to you.
3. If $A x=y$ where $A$ is a $m \times n$ matrix and $x$ is $n \times 1$ and $y$ is $m \times 1$, then let $\partial y / \partial A=x$.
4. In your answers below, we will assume multiplications are matrix multiplication; to indicate element-wise multiplication, use the symbol *.
Which pass should Otto compute $\hat{p}_{j}$ on? Select one of the following:
1. Forwards pass
2. Backwards pass
3. Gradient descent step (weight update) pass  ",Forwards pass
MIT Fall 2019,5,a,2,MDPs,Image,"Consider the following simple MDP: Positive Reward
First consider the case where the MDP has positive reward. In this scenario, there is only one action $($ next $)$; we name this decision policy $\pi_{A}$ with $\pi_{A}(s)=n e x t$ for all $s$. The reward is $R(s, n e x t)=0$ for all states $s$, except for state $s_{k}$ where reward is $R\left(s_{k}\right.$, next $)=10$. We always start at state $s_{1}$ and each arrow indicates a deterministic transition probability $p=1$. There is no transition out of the end state $E N D$, and 0 reward for any action from the end state. 
Calculate $V_{\pi}(s)$ for each state in the finite-horizon case with horizon $h=1, k=4$, and discount factor $\gamma=1$.","$$
\begin{aligned}
&V_{\pi}^{1}\left(s_{4}\right)=10 \\
&V_{\pi}^{1}\left(s_{3}\right)=0 \\
&V_{\pi}^{1}\left(s_{2}\right)=0 \\
&V_{\pi}^{1}\left(s_{1}\right)=0
\end{aligned}
$$"
MIT Fall 2019,5,b,2,MDPs,Image,"Consider the following simple MDP: Positive Reward
First consider the case where the MDP has positive reward. In this scenario, there is only one action $($ next $)$; we name this decision policy $\pi_{A}$ with $\pi_{A}(s)=n e x t$ for all $s$. The reward is $R(s, n e x t)=0$ for all states $s$, except for state $s_{k}$ where reward is $R\left(s_{k}\right.$, next $)=10$. We always start at state $s_{1}$ and each arrow indicates a deterministic transition probability $p=1$. There is no transition out of the end state $E N D$, and 0 reward for any action from the end state. 
Calculate $V_{\pi}(s)$ for each state in the infinite horizon case with $k=4$ and discount factor $\gamma=0.9$","$$
\begin{aligned}
&V_{\pi}\left(s_{4}\right)=10 \\
&V_{\pi}\left(s_{3}\right)=0+\gamma * 10=0.9 * 10=9 \\
&V_{\pi}\left(s_{2}\right)=0.9 * 9=8.1 \\
&V_{\pi}\left(s_{1}\right)=0.9 * 8.1=7.29
\end{aligned}
$$"
MIT Fall 2019,5,c,2,MDPs,Image,"Consider the following simple MDP: Positive Reward
First consider the case where the MDP has positive reward. In this scenario, there is only one action $($ next $)$; we name this decision policy $\pi_{A}$ with $\pi_{A}(s)=n e x t$ for all $s$. The reward is $R(s, n e x t)=0$ for all states $s$, except for state $s_{k}$ where reward is $R\left(s_{k}\right.$, next $)=10$. We always start at state $s_{1}$ and each arrow indicates a deterministic transition probability $p=1$. There is no transition out of the end state $E N D$, and 0 reward for any action from the end state. 
Derive a formula for $V_{\pi}\left(s_{1}\right)$ that works for any value of (is expressed as a function of) $k$ and $\gamma$ for the above positive reward MDP, in the infinite horizon case.","At each step, we receive a reward of 0 , except after the $k^{\text {th }}$ step, when we get a reward of 10 . Therefore, the summation is
$$
\sum_{i=0}^{k-2} 0 * \gamma^{i}+10 * \gamma^{k-1}=0 * \gamma^{0}+0 * \gamma^{1}+0 * \gamma^{2}+0 * \gamma^{3}+\ldots+10 * \gamma^{k-1}=10 \gamma^{k-1}
$$"
MIT Fall 2019,5,d,2,MDPs,Image,"Consider the following simple MDP: Positive Reward
First consider the case where the MDP has positive reward. In this scenario, there is only one action $($ next $)$; we name this decision policy $\pi_{A}$ with $\pi_{A}(s)=n e x t$ for all $s$. The reward is $R(s, n e x t)=0$ for all states $s$, except for state $s_{k}$ where reward is $R\left(s_{k}\right.$, next $)=10$. We always start at state $s_{1}$ and each arrow indicates a deterministic transition probability $p=1$. There is no transition out of the end state $E N D$, and 0 reward for any action from the end state.
Now consider the case where this MDP has negative reward. In this scenario, the reward is $R(s, n e x t)=-1$ for all states, except for state $s_{k}$ where the reward is $R\left(s_{k}, n e x t\right)=0$. Again, there is only one action, next, and the decision policy remains $\pi_{A}(s)=n e x t$ for all $s$. We always start at state $s_{1}$ and each arrow has a deterministic transition probability $p=1$. There is no transition out of the end state $E N D$, and zero reward for any action from the end state, i.e., $R(E N D, n e x t)=0$.
Calculate $V_{\pi}(s)$ for each state in the finite-horizon case with horizon $h=1, k=4$, and discount factor $\gamma=1$.","$$
\begin{aligned}
&V_{\pi}^{1}\left(s_{4}\right)=0 \\
&V_{\pi}^{1}\left(s_{3}\right)=-1 \\
&V_{\pi}^{1}\left(s_{2}\right)=-1 \\
&V_{\pi}^{1}\left(s_{1}\right)=-1
\end{aligned}
$$"
MIT Fall 2019,5,e,2,MDPs,Image,"Consider the following simple MDP: Positive Reward
First consider the case where the MDP has positive reward. In this scenario, there is only one action $($ next $)$; we name this decision policy $\pi_{A}$ with $\pi_{A}(s)=n e x t$ for all $s$. The reward is $R(s, n e x t)=0$ for all states $s$, except for state $s_{k}$ where reward is $R\left(s_{k}\right.$, next $)=10$. We always start at state $s_{1}$ and each arrow indicates a deterministic transition probability $p=1$. There is no transition out of the end state $E N D$, and 0 reward for any action from the end state.
Now consider the case where this MDP has negative reward. In this scenario, the reward is $R(s, n e x t)=-1$ for all states, except for state $s_{k}$ where the reward is $R\left(s_{k}, n e x t\right)=0$. Again, there is only one action, next, and the decision policy remains $\pi_{A}(s)=n e x t$ for all $s$. We always start at state $s_{1}$ and each arrow has a deterministic transition probability $p=1$. There is no transition out of the end state $E N D$, and zero reward for any action from the end state, i.e., $R(E N D, n e x t)=0$.
Calculate $V_{\pi}(s)$ for each state in the infinite horizon case with $k=4$ and discount factor $\gamma=0.9$","$$
\begin{aligned}
&V_{\pi}\left(s_{4}\right)=0 \\
&V_{\pi}\left(s_{3}\right)=-1+\gamma * 0=-1 \\
&V_{\pi}\left(s_{2}\right)=-1+0.9(-1)=-1.9 \\
&V_{\pi}\left(s_{1}\right)=-1+0.9(-1.9)=-2.71
\end{aligned}
$$"
MIT Fall 2019,5,f,3,MDPs,Image,"Consider the following simple MDP: Positive Reward
First consider the case where the MDP has positive reward. In this scenario, there is only one action $($ next $)$; we name this decision policy $\pi_{A}$ with $\pi_{A}(s)=n e x t$ for all $s$. The reward is $R(s, n e x t)=0$ for all states $s$, except for state $s_{k}$ where reward is $R\left(s_{k}\right.$, next $)=10$. We always start at state $s_{1}$ and each arrow indicates a deterministic transition probability $p=1$. There is no transition out of the end state $E N D$, and 0 reward for any action from the end state.
Now consider the case where this MDP has negative reward. In this scenario, the reward is $R(s, n e x t)=-1$ for all states, except for state $s_{k}$ where the reward is $R\left(s_{k}, n e x t\right)=0$. Again, there is only one action, next, and the decision policy remains $\pi_{A}(s)=n e x t$ for all $s$. We always start at state $s_{1}$ and each arrow has a deterministic transition probability $p=1$. There is no transition out of the end state $E N D$, and zero reward for any action from the end state, i.e., $R(E N D, n e x t)=0$.
Derive a formula for $V_{\pi}\left(s_{1}\right)$ that works for any value of (is expressed as a function of) $k$ and $\gamma$ for this negative reward MDP with infinite horizon. Recall that $\sum_{i=0}^{n} \gamma^{i}=\frac{\left(1-\gamma^{n+1}\right)}{(1-\gamma)}$.","At every step, we receive a reward of $-1$, except for the $k^{\text {th }}$ step, where we receive a reward of 0. Therefore, the summation is
$$
\sum_{i=0}^{k-2}-1 * \gamma^{i}+0 * \gamma^{k-1}=-1 * \gamma^{0}-1 * \gamma^{1}-1 * \gamma^{2}-\ldots-1 * \gamma^{k-2}+0 * \gamma^{k-1}=-\frac{1-\gamma^{k-1}}{1-\gamma}
$$"
MIT Fall 2019,5,g,3,MDPs,Image,"Consider the following simple MDP: Positive Reward
First consider the case where the MDP has positive reward. In this scenario, there is only one action $($ next $)$; we name this decision policy $\pi_{A}$ with $\pi_{A}(s)=n e x t$ for all $s$. The reward is $R(s, n e x t)=0$ for all states $s$, except for state $s_{k}$ where reward is $R\left(s_{k}\right.$, next $)=10$. We always start at state $s_{1}$ and each arrow indicates a deterministic transition probability $p=1$. There is no transition out of the end state $E N D$, and 0 reward for any action from the end state.
Consider the MDP below with negative rewards for some $R(s, a)$ and positive rewards for others. Now there are two actions, next and stop. The solid arrows show the probabilities of state transitions under action next; the dashed arrows show the probability of state transitions under action stop. (If there is no dashed arrow from a state, that indicates a probability $p=0$ of transitioning out of that state under action stop.) The corresponding rewards $R\left(s_{i}, a\right)$ are also indicated on the figure below. Note that the rewards are $R\left(s_{i}, n e x t\right)=-1$ for all $s_{i}$, except for state $s_{4}$, where the reward is $R\left(s_{4}, n e x t\right)=10$. Finally, under action stop, we have reward $R\left(s_{1}, s t o p\right)=r$ (some unknown value $r$), and $R(s, s t o p)=0$ for all other states. As before, we always start in state $s_{1}$. There is no transition out of the end state $E N D$, and zero reward for any action from the end state, i.e., $R(E N D, n e x t)=R(E N D, s t o p)=0$. Assume discount factor $\gamma$ and infinite horizon.
We consider two possible policies: $\pi_{A}(s)=n e x t$ for all $s$, and $\pi_{B}(s)=s t o p$ for all $s$. Your goal is to maximize your reward. When you start at $s_{1}$, you have reward 0 before taking any actions. Determine what $r$ should be, so that it is best to run this MDP under policy $\pi_{B}$ rather than policy $\pi_{A}$. Give your answer as an expression for $r$ involving $p$ and $\gamma$.","Under policy $\pi_{A}$ :
$$
\begin{aligned}
&V_{\pi}\left(s_{4}\right)=10 \\
&V_{\pi}\left(s_{3}\right)=-1+p \gamma V_{\pi}\left(s_{4}\right)+(1-p) \gamma V_{\pi}(\text { end })=-1+p \gamma \cdot 10 \\
&V_{\pi}\left(s_{2}\right)=-1+p \gamma V_{\pi}\left(s_{3}\right)=-1-p \gamma+(p \gamma)^{2} \cdot 10 \\
&V_{\pi}\left(s_{1}\right)=-1+p \gamma V_{\pi}\left(s_{2}\right)=-1-p \gamma-(p \gamma)^{2}+(p \gamma)^{3} \cdot 10
\end{aligned}
$$
Under policy $\pi_{B}$, we simply have $V_{\pi}\left(s_{1}\right)=r$. So we should choose policy $\pi_{B}$ when
$$
r>-1-p \gamma-(p \gamma)^{2}+(p \gamma)^{3} \cdot 10
$$
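The recursion above can be checked numerically. The following is a minimal sketch, assuming the four-state chain described in the question; the function name `value_next_policy` and its default arguments are illustrative and not part of the original solution.

```python
def value_next_policy(p, gamma, k=4, final_reward=10.0, step_reward=-1.0):
    # Backward recursion under pi_A (always take action next):
    #   V(s_k) = final_reward
    #   V(s_i) = step_reward + p * gamma * V(s_{i+1})   for i < k
    # With probability 1 - p the chain moves to END, whose value is 0,
    # so that branch contributes nothing.
    v = final_reward
    for _ in range(k - 1):
        v = step_reward + p * gamma * v
    return v


# Policy pi_B (stop immediately) is preferable when r exceeds this value.
threshold = value_next_policy(p=0.9, gamma=1.0)
print(round(threshold, 2))
```
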
As an example, for $\gamma=1$ and $p=0.9$, the threshold is $r>4.58$."
MIT Fall 2019,6,a,2,Decision Trees,Text,"Consider the following 2D dataset in the (x,y) format: ((0,1), +1), ((1,1),+1), ((-1,-2),-1), ((0,-1),-1), ((1,-1),+1), ((2,-1),-1). Consider the following splits: Split A: x2 >= 0
Split B: x1 >= 0.5
Split C: x1 >= -0.5
Paul Bunyan works to construct trees using the algorithm discussed in the lecture notes, i.e., a greedy algorithm that recursively minimizes weighted average entropy, considering only combinations of the three splits mentioned above. He wants the output of the tree for any input $\left(x_{1}, x_{2}\right)$ to be the probability that the input is a positive $(+1)$ example.
Recall that the weighted average entropy $\bar{H}$ of a split into subsets $R_{1}$ and $R_{2}$ is
$$
\bar{H}(\text { split })=\left(\text { fraction of points in } R_{1}\right) \cdot H\left(R_{1}\right)+\left(\text { fraction of points in } R_{2}\right) \cdot H\left(R_{2}\right)
$$
where the entropy $H\left(R_{m}\right)$ of data in a region $R_{m}$ is given by
$$
H\left(R_{m}\right)=-\sum_{k} \hat{P}_{m k} \log _{2} \hat{P}_{m k}
$$
Here $\hat{P}_{m k}$ is the empirical probability, which is in this case the fraction of items in region $m$ that are of class $k$.
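A minimal Python sketch of these two formulas, assuming each region is given as a list of class labels; the function names are illustrative and not taken from the lecture notes.

```python
from collections import Counter
from math import log2


def entropy(labels):
    # H(R_m) = -sum_k P_mk * log2(P_mk), with P_mk the class fractions in the region.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def weighted_average_entropy(region_1, region_2):
    # Weight each region's entropy by its share of the points; skip empty regions.
    n = len(region_1) + len(region_2)
    return sum(len(r) / n * entropy(r) for r in (region_1, region_2) if r)


# Hypothetical example: one pure region and one evenly mixed region.
print(weighted_average_entropy([+1, +1], [+1, -1]))
```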
Considering the entire data set, Paul finds that the best first split of these three is Split A, with $\bar{H}(A)=0.54$, compared to $\bar{H}(B)=0.92$ and $\bar{H}(C)=0.81$, resulting in a region $R_{A^{+}}$ with all positive examples, and a region $R_{A^{-}}$ with mixed positive and negative examples. Given Split A, however, Paul is not sure which is the next split to include in his tree. Calculate the weighted average entropy of Split B for region $R_{A^{-}}$, $\bar{H}\left(B \mid R_{A^{-}}\right)$, versus Split C for the same region, $\bar{H}\left(C \mid R_{A^{-}}\right)$, and identify which of Split B or Split C Paul should choose for his second split.","Split B, since $\bar{H}\left(B \mid R_{A^{-}}\right)=0.5$ is lower than $\bar{H}\left(C \mid R_{A^{-}}\right) \approx 0.69$."
MIT Fall 2019,6,b,2,Decision Trees,Image,"Draw the decision tree boundaries represented by this decision tree (with two splits) on the data plot figure below.",Draw on image
MIT Fall 2019,6,c,1,Decision Trees,Image,"Draw the decision tree corresponding to this tree with two splits. Clearly label the test in each node, which case (yes or no) each branch corresponds to, and the output at a leaf node represented as a probability of having a positive label, +1.",Draw on image
MIT Fall 2019,6,d,1,Decision Trees,Text,"Consider the following 2D dataset in the (x,y) format: ((0,1), +1), ((1,1),+1), ((-1,-2),-1), ((0,-1),-1), ((1,-1),+1), ((2,-1),-1). Consider the following splits: Split A: x2 >= 0
Split B: x1 >= 0.5
Split C: x1 >= -0.5
Paul Bunyan works to construct trees using the algorithm discussed in the lecture notes, i.e., a greedy algorithm that recursively minimizes weighted average entropy, considering only combinations of the three splits mentioned above. He wants the output of the tree for any input $\left(x_{1}, x_{2}\right)$ to be the probability that the input is a positive $(+1)$ example.
Recall that the weighted average entropy $\bar{H}$ of a split into subsets $R_{1}$ and $R_{2}$ is
$$
\bar{H}(\text { split })=\left(\text { fraction of points in } R_{1}\right) \cdot H\left(R_{1}\right)+\left(\text { fraction of points in } R_{2}\right) \cdot H\left(R_{2}\right)
$$
where the entropy $H\left(R_{m}\right)$ of data in a region $R_{m}$ is given by
$$
H\left(R_{m}\right)=-\sum_{k} \hat{P}_{m k} \log _{2} \hat{P}_{m k}
$$
Here $\hat{P}_{m k}$ is the empirical probability, which is in this case the fraction of items in region $m$ that are of class $k$.
What probability of being a positive example does Paul's decision tree using Split B return for the new point (-1, 1)?",1
MIT Fall 2019,6,e,2,Decision Trees,Text,"Consider the following 2D dataset in the (x,y) format: ((0,1), +1), ((1,1),+1), ((-1,-2),-1), ((0,-1),-1), ((1,-1),+1), ((2,-1),-1). Consider the following splits: Split A: x2 >= 0
Split B: x1 >= 0.5
Split C: x1 >= -0.5
Paul Bunyan works to construct trees using the algorithm discussed in the lecture notes, i.e., a greedy algorithm that recursively minimizes weighted average entropy, considering only combinations of the three splits mentioned above. He wants the output of the tree for any input $\left(x_{1}, x_{2}\right)$ to be the probability that the input is a positive $(+1)$ example.
Recall that the weighted average entropy $\bar{H}$ of a split into subsets $R_{1}$ and $R_{2}$ is
$$
\bar{H}(\text { split })=\left(\text { fraction of points in } R_{1}\right) \cdot H\left(R_{1}\right)+\left(\text { fraction of points in } R_{2}\right) \cdot H\left(R_{2}\right)
$$
where the entropy $H\left(R_{m}\right)$ of data in a region $R_{m}$ is given by
$$
H\left(R_{m}\right)=-\sum_{k} \hat{P}_{m k} \log _{2} \hat{P}_{m k}
$$
Here $\hat{P}_{m k}$ is the empirical probability, which is in this case the fraction of items in region $m$ that are of class $k$.
What probability of being a positive example does Paul's decision tree using Split B return for the new point (1, -2)?",0.5
MIT Fall 2019,6,f,2,Decision Trees,Text,"Consider the following 2D dataset in the (x,y) format: ((0,1), +1), ((1,1),+1), ((-1,-2),-1), ((0,-1),-1), ((1,-1),+1), ((2,-1),-1). Consider the following splits: Split A: x2 >= 0
Split B: x1 >= 0.5
Split C: x1 >= -0.5
Paul Bunyan works to construct trees using the algorithm discussed in the lecture notes, i.e., a greedy algorithm that recursively minimizes weighted average entropy, considering only combinations of the three splits mentioned above. He wants the output of the tree for any input $\left(x_{1}, x_{2}\right)$ to be the probability that the input is a positive $(+1)$ example.
Recall that the weighted average entropy $\bar{H}$ of a split into subsets $R_{1}$ and $R_{2}$ is
$$
\bar{H}(\text { split })=\left(\text { fraction of points in } R_{1}\right) \cdot H\left(R_{1}\right)+\left(\text { fraction of points in } R_{2}\right) \cdot H\left(R_{2}\right)
$$
where the entropy $H\left(R_{m}\right)$ of data in a region $R_{m}$ is given by
$$
H\left(R_{m}\right)=-\sum_{k} \hat{P}_{m k} \log _{2} \hat{P}_{m k}
$$
Here $\hat{P}_{m k}$ is the empirical probability, which is in this case the fraction of items in region $m$ that are of class $k$.
Paul decides to consider a particular type of ""random forest,"" which is an ensemble or collection of decision trees, where each tree might only have a subset of split features. Paul restricts his trees to only use Splits A, B, C, or some combination of these splits. The final output of the random forest is the average of the output across the collection of $n$ trees (i.e., with equal weight $1 / n$ for each tree in the random forest). Paul's random forest consists of three trees:
- The tree consisting of the best single split using feature $x_{2}$ only.
- The tree consisting of the best single split using feature $x_{1}$ only.
- The tree consisting of the best two splits (in total) using both features $x_{1}$ and $x_{2}$ (this is the tree from part (a) in this problem).
For this random forest, what is the output for the probability that an input point at $(-1,1)$ is a positive $(+1)$ example? Note: Paul's calculations in part (a) may be of help.","The first tree corresponds to just Split A on $x_{2}$ from Paul's original tree; this tree gives $p=1.0$ for the point being a positive example. As noted in part (a), the best tree splitting only on $x_{1}$ uses Split C, since $\bar{H}(C)=0.81$ is less than $\bar{H}(B)=0.92$; this tree gives $p=0.0$ for the point $(-1,1)$ being a positive example. Finally, the two-split tree derived in part (a) gives $p=1.0$. Thus the aggregate (average) probability that $(-1,1)$ is a positive example is $p=2 / 3$."
MIT Fall 2019,6,g,2,Decision Trees,Text,"Consider the following 2D dataset in the (x,y) format: ((0,1), +1), ((1,1),+1), ((-1,-2),-1), ((0,-1),-1), ((1,-1),+1), ((2,-1),-1). Consider the following splits: Split A: x2 >= 0
Split B: x1 >= 0.5
Split C: x1 >= -0.5
Paul Bunyan works to construct trees using the algorithm discussed in the lecture notes, i.e., a greedy algorithm that recursively minimizes weighted average entropy, considering only combinations of the three splits mentioned above. He wants the output of the tree for any input $\left(x_{1}, x_{2}\right)$ to be the probability that the input is a positive $(+1)$ example.
Recall that the weighted average entropy $\bar{H}$ of a split into subsets $R_{1}$ and $R_{2}$ is
$$
\bar{H}(\text { split })=\left(\text { fraction of points in } R_{1}\right) \cdot H\left(R_{1}\right)+\left(\text { fraction of points in } R_{2}\right) \cdot H\left(R_{2}\right)
$$
where the entropy $H\left(R_{m}\right)$ of data in a region $R_{m}$ is given by
$$
H\left(R_{m}\right)=-\sum_{k} \hat{P}_{m k} \log _{2} \hat{P}_{m k}
$$
Here $\hat{P}_{m k}$ is the empirical probability, which is in this case the fraction of items in region $m$ that are of class $k$.
Would you expect the accuracy for Paul's random forest generated decision to be better, or for the decision made by Paul's single two-split decision tree from part (a) to be better, when evaluated against held-out test data? Explain.","We would expect that the random forest generated decision will generalize better. Using all the features available to us can lead to over-fitting. For random forests, although each individual decision tree can have a higher error rate on the training data, the averaging effect (or majority vote for classification trees) can serve as a filter on noise vs. true signal."
MIT Fall 2019,7,a,2,RNNs,Text,"We have seen in class recurrent neural networks (RNNs) that are structured as:
$$
\begin{aligned}
z_{t}^{1} &=W^{s s} s_{t-1}+W^{s x} x_{t} \\
s_{t} &=f_{1}\left(z_{t}^{1}\right) \\
z_{t}^{2} &=W^{o} s_{t} \\
p_{t} &=f_{2}\left(z_{t}^{2}\right)
\end{aligned}
$$
where we have set biases to zero. Here $x_{t}$ is the input and $y_{t}$ the actual output for $\left(x_{t}, y_{t}\right)$ sequences used for training, with $p_{t}$ as the RNN output (during or after training).
Assume our first RNN, call it RNN-A, has $s_{t}, x_{t}, p_{t}$ all being vectors of shape $2 \times 1$. In addition, the activation functions are simply $f_{1}(z)=z$ and $f_{2}(z)=z$.
For RNN-A, give the dimensions of the weights $W^{s s}$, $W^{s x}$, and $W^{o}$.","$W^{s s}$ is $2 \times 2$, $W^{s x}$ is $2 \times 2$, and $W^{o}$ is $2 \times 2$."
MIT Fall 2019,7,b,2,RNNs,Text,"We have seen in class recurrent neural networks (RNNs) that are structured as:
$$
\begin{aligned}
z_{t}^{1} &=W^{s s} s_{t-1}+W^{s x} x_{t} \\
s_{t} &=f_{1}\left(z_{t}^{1}\right) \\
z_{t}^{2} &=W^{o} s_{t} \\
p_{t} &=f_{2}\left(z_{t}^{2}\right)
\end{aligned}
$$
where we have set biases to zero. Here $x_{t}$ is the input and $y_{t}$ the actual output for $\left(x_{t}, y_{t}\right)$ sequences used for training, with $p_{t}$ as the RNN output (during or after training).
Assume our first RNN, call it RNN-A, has $s_{t}, x_{t}, p_{t}$ all being vectors of shape $2 \times 1$. In addition, the activation functions are simply $f_{1}(z)=z$ and $f_{2}(z)=z$.
We have finished training RNN-A, using some overall loss $J=\sum_{t} \operatorname{Loss}\left(y_{t}, p_{t}\right)$ given the per-element loss function $\operatorname{Loss}\left(y_{t}, p_{t}\right)$. We are now interested in the derivative of the overall loss with respect to $x_{t}$; for example, we might want to know how sensitive the loss is to a particular input (perhaps to identify an outlier input). What is the derivative of overall loss at time $t$ with respect to $x_{t}$, $\partial J / \partial x_{t}$, with dimensions $2 \times 1$, in terms of the weights $W^{s s}, W^{s x}, W^{o}$ and the input $x_{t}$? Assume we have $\partial \operatorname{Loss} / \partial z_{t}^{2}$, with dimensions $2 \times 1$. Use $*$ to indicate element-wise multiplication.","$\frac{\partial J}{\partial x_{t}}=\left(W^{s x}\right)^{T}\left(W^{o}\right)^{T} \frac{\partial \operatorname{Loss}}{\partial z_{t}^{2}}$"
MIT Fall 2019,7,c,2,RNNs,Text,"We have seen in class recurrent neural networks (RNNs) that are structured as:
$$
\begin{aligned}
z_{t}^{1} &=W^{s s} s_{t-1}+W^{s x} x_{t} \\
s_{t} &=f_{1}\left(z_{t}^{1}\right) \\
z_{t}^{2} &=W^{o} s_{t} \\
p_{t} &=f_{2}\left(z_{t}^{2}\right)
\end{aligned}
$$
where we have set biases to zero. Here $x_{t}$ is the input and $y_{t}$ the actual output for $\left(x_{t}, y_{t}\right)$ sequences used for training, with $p_{t}$ as the RNN output (during or after training).
Assume our first RNN, call it RNN-A, has $s_{t}, x_{t}, p_{t}$ all being vectors of shape $2 \times 1$. In addition, the activation functions are simply $f_{1}(z)=z$ and $f_{2}(z)=z$.
Now consider a modified RNN, call it RNN-B, that does the following:
$$
\begin{aligned}
z_{t}^{1} &=W^{s s x}\left[\begin{array}{c}
s_{t-1} \\
x_{t}
\end{array}\right] \\
s_{t} &=z_{t}^{1} \\
z_{t}^{2} &=W^{o x}\left[\begin{array}{l}
s_{t} \\
x_{t}
\end{array}\right] \\
p_{t} &=f_{2}\left(z_{t}^{2}\right)
\end{aligned}
$$
where $s_{t}, x_{t}, p_{t}$ are all vectors of shape $2 \times 1,\left[\begin{array}{c}s_{t-1} \\ x_{t}\end{array}\right]$ and $\left[\begin{array}{l}s_{t} \\ x_{t}\end{array}\right]$ are vectors of shape $4 \times 1$.
For RNN-B, give the dimensions of the weights $W^{s s x}$ and $W^{o x}$.
","$W^{s s x}$ is $2 \times 4$, $W^{o x}$ is $2 \times 4$"
MIT Fall 2019,7,d,2,RNNs,Text,"We have seen in class recurrent neural networks (RNNs) that are structured as:
$$
\begin{aligned}
z_{t}^{1} &=W^{s s} s_{t-1}+W^{s x} x_{t} \\
s_{t} &=f_{1}\left(z_{t}^{1}\right) \\
z_{t}^{2} &=W^{o} s_{t} \\
p_{t} &=f_{2}\left(z_{t}^{2}\right)
\end{aligned}
$$
where we have set biases to zero. Here $x_{t}$ is the input and $y_{t}$ the actual output for $\left(x_{t}, y_{t}\right)$ sequences used for training, with $p_{t}$ as the RNN output (during or after training).
Assume our first RNN, call it RNN-A, has $s_{t}, x_{t}, p_{t}$ all being vectors of shape $2 \times 1$. In addition, the activation functions are simply $f_{1}(z)=z$ and $f_{2}(z)=z$.
Now consider a modified RNN, call it RNN-B, that does the following:
$$
\begin{aligned}
z_{t}^{1} &=W^{s s x}\left[\begin{array}{c}
s_{t-1} \\
x_{t}
\end{array}\right] \\
s_{t} &=z_{t}^{1} \\
z_{t}^{2} &=W^{o x}\left[\begin{array}{l}
s_{t} \\
x_{t}
\end{array}\right] \\
p_{t} &=f_{2}\left(z_{t}^{2}\right)
\end{aligned}
$$
where $s_{t}, x_{t}, p_{t}$ are all vectors of shape $2 \times 1,\left[\begin{array}{c}s_{t-1} \\ x_{t}\end{array}\right]$ and $\left[\begin{array}{l}s_{t} \\ x_{t}\end{array}\right]$ are vectors of shape $4 \times 1$.
Imagine we are using RNN-B to generate a description sentence given an input word, as in language modeling. The input is a single $2 \times 1$ vector embedding, $x_{1}$, that encodes the input word. The output will be a sequence of words $p_{1}, p_{2}, \ldots, p_{n}$ that provide a description of that word. In this setting, what would be an appropriate activation function $f_{2}$ ?
",Softmax to select a best next word.
MIT Fall 2019,7,e,2,RNNs,Text,"We have seen in class recurrent neural networks (RNNs) that are structured as:
$$
\begin{aligned}
z_{t}^{1} &=W^{s s} s_{t-1}+W^{s x} x_{t} \\
s_{t} &=f_{1}\left(z_{t}^{1}\right) \\
z_{t}^{2} &=W^{o} s_{t} \\
p_{t} &=f_{2}\left(z_{t}^{2}\right)
\end{aligned}
$$
where we have set biases to zero. Here $x_{t}$ is the input and $y_{t}$ the actual output for $\left(x_{t}, y_{t}\right)$ sequences used for training, with $p_{t}$ as the RNN output (during or after training).
Assume our first RNN, call it RNN-A, has $s_{t}, x_{t}, p_{t}$ all being vectors of shape $2 \times 1$. In addition, the activation functions are simply $f_{1}(z)=z$ and $f_{2}(z)=z$.
Now consider a modified RNN, call it RNN-B, that does the following:
$$
\begin{aligned}
z_{t}^{1} &=W^{s s x}\left[\begin{array}{c}
s_{t-1} \\
x_{t}
\end{array}\right] \\
s_{t} &=z_{t}^{1} \\
z_{t}^{2} &=W^{o x}\left[\begin{array}{l}
s_{t} \\
x_{t}
\end{array}\right] \\
p_{t} &=f_{2}\left(z_{t}^{2}\right)
\end{aligned}
$$
where $s_{t}, x_{t}, p_{t}$ are all vectors of shape $2 \times 1,\left[\begin{array}{c}s_{t-1} \\ x_{t}\end{array}\right]$ and $\left[\begin{array}{l}s_{t} \\ x_{t}\end{array}\right]$ are vectors of shape $4 \times 1$.
Continuing with RNN-B for one-to-many description generation using our language modeling approach, we calculate $p_1$ in a forward pass. How do we calculate $x_2$ (what is $x_2$ equal to)?
",$x_2 = p_1$
MIT Fall 2019,7,f.i,0.5,RNNs,Text,"We have seen in class recurrent neural networks (RNNs) that are structured as:
$$
\begin{aligned}
z_{t}^{1} &=W^{s s} s_{t-1}+W^{s x} x_{t} \\
s_{t} &=f_{1}\left(z_{t}^{1}\right) \\
z_{t}^{2} &=W^{o} s_{t} \\
p_{t} &=f_{2}\left(z_{t}^{2}\right)
\end{aligned}
$$
where we have set biases to zero. Here $x_{t}$ is the input and $y_{t}$ the actual output for $\left(x_{t}, y_{t}\right)$ sequences used for training, with $p_{t}$ as the RNN output (during or after training).
Assume our first RNN, call it RNN-A, has $s_{t}, x_{t}, p_{t}$ all being vectors of shape $2 \times 1$. In addition, the activation functions are simply $f_{1}(z)=z$ and $f_{2}(z)=z$.
Now consider a modified RNN, call it RNN-B, that does the following:
$$
\begin{aligned}
z_{t}^{1} &=W^{s s x}\left[\begin{array}{c}
s_{t-1} \\
x_{t}
\end{array}\right] \\
s_{t} &=z_{t}^{1} \\
z_{t}^{2} &=W^{o x}\left[\begin{array}{l}
s_{t} \\
x_{t}
\end{array}\right] \\
p_{t} &=f_{2}\left(z_{t}^{2}\right)
\end{aligned}
$$
where $s_{t}, x_{t}, p_{t}$ are all vectors of shape $2 \times 1,\left[\begin{array}{c}s_{t-1} \\ x_{t}\end{array}\right]$ and $\left[\begin{array}{l}s_{t} \\ x_{t}\end{array}\right]$ are vectors of shape $4 \times 1$.
For RNN-B, we are also interested in the derivative of the loss at time $t$ with respect to $x_{t}$, $\partial \operatorname{Loss} / \partial x_{t}$. Does $\partial \operatorname{Loss} / \partial x_{t}$ depend on $W^{o x}$? Indicate true or false.
",TRUE
MIT Fall 2019,7,f.ii,0.5,RNNs,Text,"We have seen in class recurrent neural networks (RNNs) that are structured as:
$$
\begin{aligned}
z_{t}^{1} &=W^{s s} s_{t-1}+W^{s x} x_{t} \\
s_{t} &=f_{1}\left(z_{t}^{1}\right) \\
z_{t}^{2} &=W^{o} s_{t} \\
p_{t} &=f_{2}\left(z_{t}^{2}\right)
\end{aligned}
$$
where we have set biases to zero. Here $x_{t}$ is the input and $y_{t}$ the actual output for $\left(x_{t}, y_{t}\right)$ sequences used for training, with $p_{t}$ as the RNN output (during or after training).
Assume our first RNN, call it RNN-A, has $s_{t}, x_{t}, p_{t}$ all being vectors of shape $2 \times 1$. In addition, the activation functions are simply $f_{1}(z)=z$ and $f_{2}(z)=z$.
Now consider a modified RNN, call it RNN-B, that does the following:
$$
\begin{aligned}
z_{t}^{1} &=W^{s s x}\left[\begin{array}{c}
s_{t-1} \\
x_{t}
\end{array}\right] \\
s_{t} &=z_{t}^{1} \\
z_{t}^{2} &=W^{o x}\left[\begin{array}{l}
s_{t} \\
x_{t}
\end{array}\right] \\
p_{t} &=f_{2}\left(z_{t}^{2}\right)
\end{aligned}
$$
where $s_{t}, x_{t}, p_{t}$ are all vectors of shape $2 \times 1,\left[\begin{array}{c}s_{t-1} \\ x_{t}\end{array}\right]$ and $\left[\begin{array}{l}s_{t} \\ x_{t}\end{array}\right]$ are vectors of shape $4 \times 1$.
For RNN-B, we are also interested in the derivative of the loss at time $t$ with respect to $x_{t}$, $\partial \operatorname{Loss} / \partial x_{t}$. Does $\partial \operatorname{Loss} / \partial x_{t}$ depend on all elements of $W^{o x}$? Indicate true or false.
",TRUE
MIT Fall 2019,7,f.iii,0.5,RNNs,Text,"We have seen in class recurrent neural networks (RNNs) that are structured as:
$$
\begin{aligned}
z_{t}^{1} &=W^{s s} s_{t-1}+W^{s x} x_{t} \\
s_{t} &=f_{1}\left(z_{t}^{1}\right) \\
z_{t}^{2} &=W^{o} s_{t} \\
p_{t} &=f_{2}\left(z_{t}^{2}\right)
\end{aligned}
$$
where we have set biases to zero. Here $x_{t}$ is the input and $y_{t}$ the actual output for $\left(x_{t}, y_{t}\right)$ sequences used for training, with $p_{t}$ as the RNN output (during or after training).
Assume our first RNN, call it RNN-A, has $s_{t}, x_{t}, p_{t}$ all being vectors of shape $2 \times 1$. In addition, the activation functions are simply $f_{1}(z)=z$ and $f_{2}(z)=z$.
Now consider a modified RNN, call it RNN-B, that does the following:
$$
\begin{aligned}
z_{t}^{1} &=W^{s s x}\left[\begin{array}{c}
s_{t-1} \\
x_{t}
\end{array}\right] \\
s_{t} &=z_{t}^{1} \\
z_{t}^{2} &=W^{o x}\left[\begin{array}{l}
s_{t} \\
x_{t}
\end{array}\right] \\
p_{t} &=f_{2}\left(z_{t}^{2}\right)
\end{aligned}
$$
where $s_{t}, x_{t}, p_{t}$ are all vectors of shape $2 \times 1,\left[\begin{array}{c}s_{t-1} \\ x_{t}\end{array}\right]$ and $\left[\begin{array}{l}s_{t} \\ x_{t}\end{array}\right]$ are vectors of shape $4 \times 1$.
For RNN-B, we are also interested in the derivative of the loss at time $t$ with respect to $x_{t}$, $\partial \operatorname{Loss} / \partial x_{t}$. Does $\partial \operatorname{Loss} / \partial x_{t}$ depend on $W^{s s x}$? Indicate true or false.
",TRUE
MIT Fall 2019,7,f.iv,0.5,RNNs,Text,"We have seen in class recurrent neural networks (RNNs) that are structured as:
$$
\begin{aligned}
z_{t}^{1} &=W^{s s} s_{t-1}+W^{s x} x_{t} \\
s_{t} &=f_{1}\left(z_{t}^{1}\right) \\
z_{t}^{2} &=W^{o} s_{t} \\
p_{t} &=f_{2}\left(z_{t}^{2}\right)
\end{aligned}
$$
where we have set biases to zero. Here $x_{t}$ is the input and $y_{t}$ the actual output for $\left(x_{t}, y_{t}\right)$ sequences used for training, with $p_{t}$ as the RNN output (during or after training).
Assume our first RNN, call it RNN-A, has $s_{t}, x_{t}, p_{t}$ all being vectors of shape $2 \times 1$. In addition, the activation functions are simply $f_{1}(z)=z$ and $f_{2}(z)=z$.
Now consider a modified RNN, call it RNN-B, that does the following:
$$
\begin{aligned}
z_{t}^{1} &=W^{s s x}\left[\begin{array}{c}
s_{t-1} \\
x_{t}
\end{array}\right] \\
s_{t} &=z_{t}^{1} \\
z_{t}^{2} &=W^{o x}\left[\begin{array}{l}
s_{t} \\
x_{t}
\end{array}\right] \\
p_{t} &=f_{2}\left(z_{t}^{2}\right)
\end{aligned}
$$
where $s_{t}, x_{t}, p_{t}$ are all vectors of shape $2 \times 1,\left[\begin{array}{c}s_{t-1} \\ x_{t}\end{array}\right]$ and $\left[\begin{array}{l}s_{t} \\ x_{t}\end{array}\right]$ are vectors of shape $4 \times 1$.
For RNN-B, we are also interested in the derivative of the loss at time $t$ with respect to $x_{t}$, $\partial \operatorname{Loss} / \partial x_{t}$. Does $\partial \operatorname{Loss} / \partial x_{t}$ depend on all elements of $W^{s s x}$? Indicate true or false.
",FALSE
MIT Fall 2019,8,a,2,Regression,Text,"We previously examined ridge regression, where a regularizer term $R_{\lambda}(\theta)$ is added to a sum of squares loss to form the $J_{1}$ objective function as below. Throughout this problem, we will assume zero offset $\theta_{0}=0$ and linear models of output $y$ as a function of input $x$. $$ \begin{aligned} J_{1}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\lambda}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\frac{\lambda}{2}\|\theta\|^{2} \end{aligned} $$ Laz Zo prefers an alternative approach (called ""lasso"" regularization), where a different regularizer $R_{\alpha}(\theta)$ is added to the sum of squares loss: $$ \begin{aligned} J_{2}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\alpha}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\alpha \sum_{j=1}^{k}\left|\theta_{j}\right| \end{aligned} $$ Consider the two-dimensional case, $k=2$, so that our vector $\theta$ has just two components, $\theta_{1}$ and $\theta_{2}$. Suppose also that $\theta_{1}>\theta_{2}$ and both are positive $\left(\theta_{1}, \theta_{2}>0\right)$. We are interested in the behavior of $R_{\alpha}(\theta)$ and $R_{\lambda}(\theta)$. Assume both $\lambda$ and $\alpha$ are positive. First consider the lasso regularizer for this specific case: $$ R_{\alpha}(\theta)=\alpha \sum_{j=1}^{k}\left|\theta_{j}\right|=\alpha\left(\theta_{1}+\theta_{2}\right) $$ where $R_{\alpha}(\theta)=\alpha\left(\theta_{1}+\theta_{2}\right)$ in this case since both $\theta_{1}$ and $\theta_{2}$ are positive. We consider reducing $\theta_{1}$ by a small $\delta$, where $\delta>0$, versus reducing $\theta_{2}$ by $\delta$. (You can assume $\delta$ is smaller than $\theta_{1}$ and $\theta_{2}$.) What is true, if our goal is to minimize $R_{\alpha}(\theta)$? Choose one of the following options:
It is better to reduce $\theta_{1}$ by $\delta$ 
It is better to reduce $\theta_{2}$ by $\delta$ 
It is equally beneficial to reduce $\theta_{1}$ or $\theta_{2}$ by $\delta$.",It is equally beneficial to reduce $\theta_{1}$ or $\theta_{2}$ by $\delta$.
MIT Fall 2019,8,b,1.5,Regression,Text,"We previously examined ridge regression, where a regularizer term $R_{\lambda}(\theta)$ is added to a sum of squares loss to form the $J_{1}$ objective function as below. Throughout this problem, we will assume zero offset $\theta_{0}=0$ and linear models of output $y$ as a function of input $x$. $$ \begin{aligned} J_{1}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\lambda}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\frac{\lambda}{2}\|\theta\|^{2} \end{aligned} $$ Laz Zo prefers an alternative approach (called ""lasso"" regularization), where a different regularizer $R_{\alpha}(\theta)$ is added to the sum of squares loss: $$ \begin{aligned} J_{2}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\alpha}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\alpha \sum_{j=1}^{k}\left|\theta_{j}\right| \end{aligned} $$ Consider the two-dimensional case, $k=2$, so that our vector $\theta$ has just two components, $\theta_{1}$ and $\theta_{2}$. Suppose also that $\theta_{1}>\theta_{2}$ and both are positive $\left(\theta_{1}, \theta_{2}>0\right)$. We are interested in the behavior of $R_{\alpha}(\theta)$ and $R_{\lambda}(\theta)$. Assume both $\lambda$ and $\alpha$ are positive. Now we are interested in the behavior of $R_{\lambda}(\theta)$ for this specific case: $$ R_{\lambda}(\theta)=\frac{\lambda}{2}\|\theta\|^{2}=\frac{\lambda}{2}\left(\theta_{1}^{2}+\theta_{2}^{2}\right) . $$ We consider reducing $\theta_{1}$ by a small $\delta$, where $\delta>0$, versus reducing $\theta_{2}$ by $\delta$. (You can assume $\delta$ is smaller than $\theta_{1}$ and $\theta_{2}$.) What is true, if our goal is to minimize $R_{\lambda}(\theta)$ ? Choose one from the following options
It is better to reduce $\theta_{1}$ by $\delta$ 
It is better to reduce $\theta_{2}$ by $\delta$
It is equally beneficial to reduce $\theta_{1}$ or $\theta_{2}$ by $\delta$",It is better to reduce $\theta_{1}$ by $\delta$
MIT Fall 2019,8,c.i,0.5,Regression,Text,"We previously examined ridge regression, where a regularizer term $R_{\lambda}(\theta)$ is added to a sum of squares loss to form the $J_{1}$ objective function as below. Throughout this problem, we will assume zero offset $\theta_{0}=0$ and linear models of output $y$ as a function of input $x$. $$ \begin{aligned} J_{1}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\lambda}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\frac{\lambda}{2}\|\theta\|^{2} \end{aligned} $$ Laz Zo prefers an alternative approach (called ""lasso"" regularization), where a different regularizer $R_{\alpha}(\theta)$ is added to the sum of squares loss: $$ \begin{aligned} J_{2}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\alpha}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\alpha \sum_{j=1}^{k}\left|\theta_{j}\right| \end{aligned} $$ Consider the two-dimensional case, $k=2$, so that our vector $\theta$ has just two components, $\theta_{1}$ and $\theta_{2}$. Suppose also that $\theta_{1}>\theta_{2}$ and both are positive $\left(\theta_{1}, \theta_{2}>0\right)$. We are interested in the behavior of $R_{\alpha}(\theta)$ and $R_{\lambda}(\theta)$. Assume both $\lambda$ and $\alpha$ are positive. Rega Lizer is interested in the behavior of these two regularizers, when used to fit a linear model by minimizing $J_{1}$ and $J_{2}$. We compare the ridge regularizer $R_{\lambda}$ and the lasso regularizer $R_{\alpha}$, for general $k$. Assume $\alpha$ and $\lambda$ are positive. Indicate True or False about the following $R_{\lambda}$ when minimizing $J_{1}$ (with sum of squares loss and $R_{\lambda}(\theta)$ terms): $R_{\lambda}$ pushes $\theta$ to have smaller magnitude $\theta_{i}$ ",TRUE
MIT Fall 2019,8,c.ii,0.5,Regression,Text,"We previously examined ridge regression, where a regularizer term $R_{\lambda}(\theta)$ is added to a sum of squares loss to form the $J_{1}$ objective function as below. Throughout this problem, we will assume zero offset $\theta_{0}=0$ and linear models of output $y$ as a function of input $x$. $$ \begin{aligned} J_{1}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\lambda}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\frac{\lambda}{2}\|\theta\|^{2} \end{aligned} $$ Laz Zo prefers an alternative approach (called ""lasso"" regularization), where a different regularizer $R_{\alpha}(\theta)$ is added to the sum of squares loss: $$ \begin{aligned} J_{2}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\alpha}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\alpha \sum_{j=1}^{k}\left|\theta_{j}\right| \end{aligned} $$ Consider the two-dimensional case, $k=2$, so that our vector $\theta$ has just two components, $\theta_{1}$ and $\theta_{2}$. Suppose also that $\theta_{1}>\theta_{2}$ and both are positive $\left(\theta_{1}, \theta_{2}>0\right)$. We are interested in the behavior of $R_{\alpha}(\theta)$ and $R_{\lambda}(\theta)$. Assume both $\lambda$ and $\alpha$ are positive. Rega Lizer is interested in the behavior of these two regularizers, when used to fit a linear model by minimizing $J_{1}$ and $J_{2}$. We compare the ridge regularizer $R_{\lambda}$ and the lasso regularizer $R_{\alpha}$, for general $k$. Assume $\alpha$ and $\lambda$ are positive. Indicate True or False about the following $R_{\lambda}$ when minimizing $J_{1}$ (with sum of squares loss and $R_{\lambda}(\theta)$ terms): $R_{\lambda}$ favors reducing the magnitude of the largest magnitude $\theta_{i}$ over reducing the magnitude of smaller magnitude $\theta_{i}$",TRUE
MIT Fall 2019,8,c.iii,0.5,Regression,Text,"We previously examined ridge regression, where a regularizer term $R_{\lambda}(\theta)$ is added to a sum of squares loss to form the $J_{1}$ objective function as below. Throughout this problem, we will assume zero offset $\theta_{0}=0$ and linear models of output $y$ as a function of input $x$. $$ \begin{aligned} J_{1}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\lambda}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\frac{\lambda}{2}\|\theta\|^{2} \end{aligned} $$ Laz Zo prefers an alternative approach (called ""lasso"" regularization), where a different regularizer $R_{\alpha}(\theta)$ is added to the sum of squares loss: $$ \begin{aligned} J_{2}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\alpha}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\alpha \sum_{j=1}^{k}\left|\theta_{j}\right| \end{aligned} $$ Consider the two-dimensional case, $k=2$, so that our vector $\theta$ has just two components, $\theta_{1}$ and $\theta_{2}$. Suppose also that $\theta_{1}>\theta_{2}$ and both are positive $\left(\theta_{1}, \theta_{2}>0\right)$. We are interested in the behavior of $R_{\alpha}(\theta)$ and $R_{\lambda}(\theta)$. Assume both $\lambda$ and $\alpha$ are positive. Rega Lizer is interested in the behavior of these two regularizers, when used to fit a linear model by minimizing $J_{1}$ and $J_{2}$. We compare the ridge regularizer $R_{\lambda}$ and the lasso regularizer $R_{\alpha}$, for general $k$. Assume $\alpha$ and $\lambda$ are positive. Indicate True or False about the following $R_{\lambda}$ when minimizing $J_{1}$ (with sum of squares loss and $R_{\lambda}(\theta)$ terms):  $R_{\lambda}$ inhibits sparsity (i.e., disfavors finding $\theta$ such that some $\theta_{i}$ are zero) for $\theta$ with equivalent sum of squares loss",TRUE
MIT Fall 2019,8,d.i,0.5,Regression,Text,"We previously examined ridge regression, where a regularizer term $R_{\lambda}(\theta)$ is added to a sum of squares loss to form the $J_{1}$ objective function as below. Throughout this problem, we will assume zero offset $\theta_{0}=0$ and linear models of output $y$ as a function of input $x$. $$ \begin{aligned} J_{1}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\lambda}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\frac{\lambda}{2}\|\theta\|^{2} \end{aligned} $$ Laz Zo prefers an alternative approach (called ""lasso"" regularization), where a different regularizer $R_{\alpha}(\theta)$ is added to the sum of squares loss: $$ \begin{aligned} J_{2}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\alpha}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\alpha \sum_{j=1}^{k}\left|\theta_{j}\right| \end{aligned} $$ Consider the two-dimensional case, $k=2$, so that our vector $\theta$ has just two components, $\theta_{1}$ and $\theta_{2}$. Suppose also that $\theta_{1}>\theta_{2}$ and both are positive $\left(\theta_{1}, \theta_{2}>0\right)$. We are interested in the behavior of $R_{\alpha}(\theta)$ and $R_{\lambda}(\theta)$. Assume both $\lambda$ and $\alpha$ are positive. Rega Lizer is interested in the behavior of these two regularizers, when used to fit a linear model by minimizing $J_{1}$ and $J_{2}$. We compare the ridge regularizer $R_{\lambda}$ and the lasso regularizer $R_{\alpha}$, for general $k$. Assume $\alpha$ and $\lambda$ are positive. Indicate True or False about the following for $R_{\alpha}$ when minimizing $J_{2}$ (with sum of squares loss and $R_{\alpha}(\theta)$ terms): $R_{\alpha}$ pushes $\theta$ to have smaller magnitude $\theta_{i}$",TRUE
MIT Fall 2019,8,d.ii,0.5,Regression,Text,"We previously examined ridge regression, where a regularizer term $R_{\lambda}(\theta)$ is added to a sum of squares loss to form the $J_{1}$ objective function as below. Throughout this problem, we will assume zero offset $\theta_{0}=0$ and linear models of output $y$ as a function of input $x$. $$ \begin{aligned} J_{1}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\lambda}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\frac{\lambda}{2}\|\theta\|^{2} \end{aligned} $$ Laz Zo prefers an alternative approach (called ""lasso"" regularization), where a different regularizer $R_{\alpha}(\theta)$ is added to the sum of squares loss: $$ \begin{aligned} J_{2}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\alpha}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\alpha \sum_{j=1}^{k}\left|\theta_{j}\right| \end{aligned} $$ Consider the two-dimensional case, $k=2$, so that our vector $\theta$ has just two components, $\theta_{1}$ and $\theta_{2}$. Suppose also that $\theta_{1}>\theta_{2}$ and both are positive $\left(\theta_{1}, \theta_{2}>0\right)$. We are interested in the behavior of $R_{\alpha}(\theta)$ and $R_{\lambda}(\theta)$. Assume both $\lambda$ and $\alpha$ are positive. Rega Lizer is interested in the behavior of these two regularizers, when used to fit a linear model by minimizing $J_{1}$ and $J_{2}$. We compare the ridge regularizer $R_{\lambda}$ and the lasso regularizer $R_{\alpha}$, for general $k$. Assume $\alpha$ and $\lambda$ are positive. Indicate True or False about the following for $R_{\alpha}$ when minimizing $J_{2}$ (with sum of squares loss and $R_{\alpha}(\theta)$ terms): $R_{\alpha}$ favors reducing the magnitude of the largest magnitude $\theta_{i}$ over reducing the magnitude of smaller magnitude $\theta_{i}$",FALSE
MIT Fall 2019,8,d.iii,0.5,Regression,Text,"We previously examined ridge regression, where a regularizer term $R_{\lambda}(\theta)$ is added to a sum of squares loss to form the $J_{1}$ objective function as below. Throughout this problem, we will assume zero offset $\theta_{0}=0$ and linear models of output $y$ as a function of input $x$. $$ \begin{aligned} J_{1}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\lambda}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\frac{\lambda}{2}\|\theta\|^{2} \end{aligned} $$ Laz Zo prefers an alternative approach (called ""lasso"" regularization), where a different regularizer $R_{\alpha}(\theta)$ is added to the sum of squares loss: $$ \begin{aligned} J_{2}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\alpha}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\alpha \sum_{j=1}^{k}\left|\theta_{j}\right| \end{aligned} $$ Consider the two-dimensional case, $k=2$, so that our vector $\theta$ has just two components, $\theta_{1}$ and $\theta_{2}$. Suppose also that $\theta_{1}>\theta_{2}$ and both are positive $\left(\theta_{1}, \theta_{2}>0\right)$. We are interested in the behavior of $R_{\alpha}(\theta)$ and $R_{\lambda}(\theta)$. Assume both $\lambda$ and $\alpha$ are positive. Rega Lizer is interested in the behavior of these two regularizers, when used to fit a linear model by minimizing $J_{1}$ and $J_{2}$. We compare the ridge regularizer $R_{\lambda}$ and the lasso regularizer $R_{\alpha}$, for general $k$. Assume $\alpha$ and $\lambda$ are positive. Indicate True or False about the following for $R_{\alpha}$ when minimizing $J_{2}$ (with sum of squares loss and $R_{\alpha}(\theta)$ terms): $R_{\alpha}$ inhibits sparsity (i.e., disfavors finding $\theta$ such that some $\theta_{i}$ are zero) for $\theta$ with equivalent sum of squares loss",FALSE
MIT Fall 2019,8,e.i,0.5,Regression,Text,"We previously examined ridge regression, where a regularizer term $R_{\lambda}(\theta)$ is added to a sum of squares loss to form the $J_{1}$ objective function as below. Throughout this problem, we will assume zero offset $\theta_{0}=0$ and linear models of output $y$ as a function of input $x$. $$ \begin{aligned} J_{1}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\lambda}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\frac{\lambda}{2}\|\theta\|^{2} \end{aligned} $$ Laz Zo prefers an alternative approach (called ""lasso"" regularization), where a different regularizer $R_{\alpha}(\theta)$ is added to the sum of squares loss: $$ \begin{aligned} J_{2}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\alpha}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\alpha \sum_{j=1}^{k}\left|\theta_{j}\right| \end{aligned} $$ Consider the two-dimensional case, $k=2$, so that our vector $\theta$ has just two components, $\theta_{1}$ and $\theta_{2}$. Suppose also that $\theta_{1}>\theta_{2}$ and both are positive $\left(\theta_{1}, \theta_{2}>0\right)$. We are interested in the behavior of $R_{\alpha}(\theta)$ and $R_{\lambda}(\theta)$. Assume both $\lambda$ and $\alpha$ are positive. Rega Lizer is interested in the behavior of these two regularizers, when used to fit a linear model by minimizing $J_{1}$ and $J_{2}$. We compare the ridge regularizer $R_{\lambda}$ and the lasso regularizer $R_{\alpha}$, for general $k$. Assume $\alpha$ and $\lambda$ are positive. 
Rega proposes combining the two regularizers with a sum of squares loss to form the $J_{3}$ objective: $$ \begin{aligned} J_{3}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\alpha}(\theta)+R_{\lambda}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\alpha \sum_{j=1}^{k}\left|\theta_{j}\right|+\frac{\lambda}{2}\|\theta\|^{2} \end{aligned} $$ Indicate True or False about using both of these regularizers when minimizing $J_{3}$: This is a bad idea, as the two regularizers will compete against each other.",FALSE
MIT Fall 2019,8,e.ii,0.5,Regression,Text,"We previously examined ridge regression, where a regularizer term $R_{\lambda}(\theta)$ is added to a sum of squares loss to form the $J_{1}$ objective function as below. Throughout this problem, we will assume zero offset $\theta_{0}=0$ and linear models of output $y$ as a function of input $x$. $$ \begin{aligned} J_{1}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\lambda}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\frac{\lambda}{2}\|\theta\|^{2} \end{aligned} $$ Laz Zo prefers an alternative approach (called ""lasso"" regularization), where a different regularizer $R_{\alpha}(\theta)$ is added to the sum of squares loss: $$ \begin{aligned} J_{2}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\alpha}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\alpha \sum_{j=1}^{k}\left|\theta_{j}\right| \end{aligned} $$ Consider the two-dimensional case, $k=2$, so that our vector $\theta$ has just two components, $\theta_{1}$ and $\theta_{2}$. Suppose also that $\theta_{1}>\theta_{2}$ and both are positive $\left(\theta_{1}, \theta_{2}>0\right)$. We are interested in the behavior of $R_{\alpha}(\theta)$ and $R_{\lambda}(\theta)$. Assume both $\lambda$ and $\alpha$ are positive. Rega Lizer is interested in the behavior of these two regularizers, when used to fit a linear model by minimizing $J_{1}$ and $J_{2}$. We compare the ridge regularizer $R_{\lambda}$ and the lasso regularizer $R_{\alpha}$, for general $k$. Assume $\alpha$ and $\lambda$ are positive. 
Rega proposes combining the two regularizers with a sum of squares loss to form the $J_{3}$ objective: $$ \begin{aligned} J_{3}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\alpha}(\theta)+R_{\lambda}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\alpha \sum_{j=1}^{k}\left|\theta_{j}\right|+\frac{\lambda}{2}\|\theta\|^{2} \end{aligned} $$ Indicate True or False about using both of these regularizers when minimizing $J_{3}$: This is a reasonable idea, to achieve some controllable mixture of the behavior of the two regularizers based on the two hyperparameters, $\alpha$ and $\lambda$.",TRUE
MIT Fall 2019,8,e.iii,0.5,Regression,Text,"We previously examined ridge regression, where a regularizer term $R_{\lambda}(\theta)$ is added to a sum of squares loss to form the $J_{1}$ objective function as below. Throughout this problem, we will assume zero offset $\theta_{0}=0$ and linear models of output $y$ as a function of input $x$. $$ \begin{aligned} J_{1}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\lambda}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\frac{\lambda}{2}\|\theta\|^{2} \end{aligned} $$ Laz Zo prefers an alternative approach (called ""lasso"" regularization), where a different regularizer $R_{\alpha}(\theta)$ is added to the sum of squares loss: $$ \begin{aligned} J_{2}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\alpha}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\alpha \sum_{j=1}^{k}\left|\theta_{j}\right| \end{aligned} $$ Consider the two-dimensional case, $k=2$, so that our vector $\theta$ has just two components, $\theta_{1}$ and $\theta_{2}$. Suppose also that $\theta_{1}>\theta_{2}$ and both are positive $\left(\theta_{1}, \theta_{2}>0\right)$. We are interested in the behavior of $R_{\alpha}(\theta)$ and $R_{\lambda}(\theta)$. Assume both $\lambda$ and $\alpha$ are positive. Rega Lizer is interested in the behavior of these two regularizers, when used to fit a linear model by minimizing $J_{1}$ and $J_{2}$. We compare the ridge regularizer $R_{\lambda}$ and the lasso regularizer $R_{\alpha}$, for general $k$. Assume $\alpha$ and $\lambda$ are positive. 
Rega proposes combining the two regularizers with a sum of squares loss to form the $J_{3}$ objective: $$ \begin{aligned} J_{3}(\theta) &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+R_{\alpha}(\theta)+R_{\lambda}(\theta) \\ &=\frac{1}{2} \sum_{i=1}^{n}\left(y_{i}-\theta \cdot x_{i}\right)^{2}+\alpha \sum_{j=1}^{k}\left|\theta_{j}\right|+\frac{\lambda}{2}\|\theta\|^{2} \end{aligned} $$ Indicate True or False about using both of these regularizers when minimizing $J_{3}$: This is a bad idea, as the two regularizers are redundant, and only add complexity in training because now there are two hyperparameters, $\alpha$ and $\lambda$, that need to be decided.",FALSE
MIT Spring 2021,1,a,1,Name,Image,Write down your name,Write down your name
MIT Spring 2021,2,a,4,Features,Image,"For each of the datasets below, find a transformation from the original data into a single new feature $\phi\left(\left(x_{1}, x_{2}\right)\right)$ such that the data is linearly separable in the new space, and specify the parameters $\theta$ and $\theta_{0}$ of the separator in the transformed space.
(image here)","$\phi\left(\left(x_{1}, x_{2}\right)\right)=x_{1}^{2}+x_{2}^{2}$
$\theta=[-1]$
$\theta_{0}=-4$ (or any value between $-2$ and $-8$ )"
MIT Spring 2021,2,b,4,Features,Image,"For each of the datasets below, find a transformation from the original data into a single new feature $\phi\left(\left(x_{1}, x_{2}\right)\right)$ such that the data is linearly separable in the new space, and specify the parameters $\theta$ and $\theta_{0}$ of the separator in the transformed space.
(image here)","$\phi\left(\left(x_{1}, x_{2}\right)\right)=\left(x_{1}-x_{2}\right)^{2}$
$\theta=[-1]$
$\theta_{0}=-2$ (or any value between 0 and $-4$ )"
MIT Spring 2021,3,a,2,Features,Text,"You are trying to predict whether start-up companies will succeed or fail, based on the name of the company. You have the following dataset in the format (x, y): (Aardvarkia, +1), (Fro, +1), (Rodotopo, -1), (Whoodo, -1).  You have a one hot encoding for each of the names (Aardvarkia, Fro, Rodotopo, Whoodo). You consider two different encodings of the features:
• One-hot encoding, with the first feature corresponding to “Aardvarkia,” the second to “Fro,” third to “Rodotopo,” and fourth to “Whoodo.”
• Numerical encoding, using the numerical place of the first letter of the name in the English alphabet (so “A” is 0 and “Z” is 25).
Provide parameters of a 0-error linear separator using one-hot encoding.","All that matters is that the first two components of $\theta$ are positive and the last two are negative (for example, with $\theta_0 = 0$)."
MIT Spring 2021,3,b,2,Features,Text,"You are trying to predict whether start-up companies will succeed or fail, based on the name of the company. You have the following dataset in the format (x, y): (Aardvarkia, +1), (Fro, +1), (Rodotopo, -1), (Whoodo, -1).  You have a one hot encoding for each of the names (Aardvarkia, Fro, Rodotopo, Whoodo). You consider two different encodings of the features:
• One-hot encoding, with the first feature corresponding to “Aardvarkia,” the second to “Fro,” third to “Rodotopo,” and fourth to “Whoodo.”
• Numerical encoding, using the numerical place of the first letter of the name in the English alphabet (so “A” is 0 and “Z” is 25). Provide parameters of a 0-error linear separator using the numerical encoding.","$\theta = [-1]^T$ and $\theta_0 = a$ for any $5 < a < 17$"
MIT Spring 2021,3,c,2,Features,Text,"You are trying to predict whether start-up companies will succeed or fail, based on the name of the company. You have the following dataset in the format (x, y): (Aardvarkia, +1), (Fro, +1), (Rodotopo, -1), (Whoodo, -1).  You have a one hot encoding for each of the names (Aardvarkia, Fro, Rodotopo, Whoodo). You consider two different encodings of the features:
• One-hot encoding, with the first feature corresponding to “Aardvarkia,” the second to “Fro,” third to “Rodotopo,” and fourth to “Whoodo.”
• Numerical encoding, using the numerical place of the first letter of the name in the English alphabet (so “A” is 0 and “Z” is 25). You add a new company with name “Zzyyzygy” and class +1. If you extend the one-hot encoding to add another feature corresponding to this company name, will this new data set be linearly separable using the one-hot encoding? Explain briefly.","Yes. With the one-hot encoding, there’s a dimension for each point $(x^{(i)}, y^{(i)})$, with $y^{(i)} \in \{-1, 1\}$, so we can always pick $\theta = [y^{(1)}, \ldots, y^{(n)}]$ and $\theta_0 = 0$."
MIT Spring 2021,3,d,2,Features,Text,"You are trying to predict whether start-up companies will succeed or fail, based on the name of the company. You have the following dataset in the format (x, y): (Aardvarkia, +1), (Fro, +1), (Rodotopo, -1), (Whoodo, -1).  You have a one hot encoding for each of the names (Aardvarkia, Fro, Rodotopo, Whoodo). You consider two different encodings of the features:
• One-hot encoding, with the first feature corresponding to “Aardvarkia,” the second to “Fro,” third to “Rodotopo,” and fourth to “Whoodo.”
• Numerical encoding, using the numerical place of the first letter of the name in the English alphabet (so “A” is 0 and “Z” is 25).
If you add the company ""Zzyyzygy"" to your data set but use the numeric encoding, is the new data set linearly separable? Explain briefly.","No. The encoding remains one-dimensional, and the data is no longer linearly separable: there are positive points on both sides of the negative points."
MIT Spring 2021,4,a,0.5,Logistic Regression,Text,"It is common in classification problems for the cost of a false positive (predicting positive when the true answer is negative) to be different from the cost of a false negative (predicting negative when the true answer is positive). This might happen, for example, when the task is to predict the presence of a serious disease.
Let’s say that the cost of a correct classification is 0, the cost of a false positive is 1, and the cost of a false negative (that is, you predict 0 when the correct answer was +1) is $\alpha$. Recall that the usual logistic regression loss is: \begin{equation*}
\mathcal{L}_\text{nll}(g, y) = 
-\left(y \cdot \log g + (1 - y)\cdot\log (1 -
 g)\right) \;\;.
\end{equation*}
What is the usual loss, as a function of guess g, when the true label y = 0?","$\mathcal{L}_\text{nll}(g, y=0) = -\log(1-g)$"
MIT Spring 2021,4,b,0.5,Logistic Regression,Text,"It is common in classification problems for the cost of a false positive (predicting positive when the true answer is negative) to be different from the cost of a false negative (predicting negative when the true answer is positive). This might happen, for example, when the task is to predict the presence of a serious disease.
Let’s say that the cost of a correct classification is 0, the cost of a false positive is 1, and the cost of a false negative (that is, you predict 0 when the correct answer was +1) is $\alpha$. Recall that the usual logistic regression loss is: \begin{equation*}
\mathcal{L}_\text{nll}(g, y) = 
-\left(y \cdot \log g + (1 - y)\cdot\log (1 -
 g)\right) \;\;.
\end{equation*}
What is the usual loss, as a function of guess g, when the true label y = 1?","$\mathcal{L}_\text{nll}(g, y=1) = -\log(g)$"
MIT Spring 2021,4,c,1,Logistic Regression,Text,"It is common in classification problems for the cost of a false positive (predicting positive when the true answer is negative) to be different from the cost of a false negative (predicting negative when the true answer is positive). This might happen, for example, when the task is to predict the presence of a serious disease.
Let’s say that the cost of a correct classification is 0, the cost of a false positive is 1, and the cost of a false negative (that is, you predict 0 when the correct answer was +1) is $\alpha$. Recall that the usual logistic regression loss is: \begin{equation*}
\mathcal{L}_\text{nll}(g, y) = 
-\left(y \cdot \log g + (1 - y)\cdot\log (1 -
 g)\right) \;\;.
\end{equation*}
Write down a loss function that penalizes false negatives $\alpha$ times more than false positives.","$\mathcal{L}(g, y) = -\left(\alpha\, y \log g + (1-y) \log(1-g)\right)$"
MIT Spring 2021,4,d,2,Logistic Regression,Text,"It is common in classification problems for the cost of a false positive (predicting positive when the true answer is negative) to be different from the cost of a false negative (predicting negative when the true answer is positive). This might happen, for example, when the task is to predict the presence of a serious disease.
Let’s say that the cost of a correct classification is 0, the cost of a false positive is 1, and the cost of a false negative (that is, you predict 0 when the correct answer was +1) is $\alpha$. Recall that the usual logistic regression loss is: \begin{equation*}
\mathcal{L}_\text{nll}(g, y) = 
-\left(y \cdot \log g + (1 - y)\cdot\log (1 -
 g)\right) \;\;.
\end{equation*}
Jun proposes that we can find a classifier that optimizes the classification cost without changing our logistic regression loss function, by rebalancing the training data, that is, adding multiple copies of each of the points in one of the classes. For $\alpha = 3$, explain briefly how you would change the data.","For each data point whose true label is positive (i.e., y = 1), add the point two more times, so that each positively labeled data point appears three times in the dataset."
MIT Spring 2021,4,e,2,Logistic Regression,Text,"It is common in classification problems for the cost of a false positive (predicting positive when the true answer is negative) to be different from the cost of a false negative (predicting negative when the true answer is positive). This might happen, for example, when the task is to predict the presence of a serious disease.
Let’s say that the cost of a correct classification is 0, the cost of a false positive is 1, and the cost of a false negative (that is, you predict 0 when the correct answer was +1) is $\alpha$. Recall that the usual logistic regression loss is: \begin{equation*}
\mathcal{L}_\text{nll}(g, y) = 
-\left(y \cdot \log g + (1 - y)\cdot\log (1 -
 g)\right) \;\;.
\end{equation*}
Jun proposes that we can find a classifier that optimizes the classification cost without changing our logistic regression loss function, by rebalancing the training data, that is, adding multiple copies of each of the points in one of the classes. Would Jun’s approach result in a classifier that optimizes the classification cost when using the Perceptron algorithm when the data are linearly separable? Explain briefly why or why not.","For the separable case, repeating existing data points keeps the dataset separable (for the perceptron), so the classification error should remain at 0. This should be similar to Jun’s approach."
MIT Spring 2021,4,f,2,Logistic Regression,Text,"It is common in classification problems for the cost of a false positive (predicting positive when the true answer is negative) to be different from the cost of a false negative (predicting negative when the true answer is positive). This might happen, for example, when the task is to predict the presence of a serious disease.
Let’s say that the cost of a correct classification is 0, the cost of a false positive is 1, and the cost of a false negative (that is, you predict 0 when the correct answer was +1) is $\alpha$. Recall that the usual logistic regression loss is: \begin{equation*}
\mathcal{L}_\text{nll}(g, y) = 
-\left(y \cdot \log g + (1 - y)\cdot\log (1 -
 g)\right) \;\;.
\end{equation*}
Jun proposes that we can find a classifier that optimizes the classification cost without changing our logistic regression loss function, by rebalancing the training data, that is, adding multiple copies of each of the points in one of the classes. Would Jun’s approach result in a classifier that optimizes the classification cost when using the Perceptron algorithm when the data are linearly separable? Would Jun’s approach result in a classifier that optimizes the classification cost when using the Perceptron algorithm when the data are not linearly separable?","For the non-separable case, the answer depends on how long we let the perceptron run (because it will never converge). But roughly, since there are three times more points of one label, the perceptron’s separator should have a similar effect compared to Jun’s approach (but they may not always match exactly, depending on the order of iteration and the number of iterations)."
MIT Spring 2021,4,g,1,Logistic Regression,Text,"It is common in classification problems for the cost of a false positive (predicting positive when the true answer is negative) to be different from the cost of a false negative (predicting negative when the true answer is positive). This might happen, for example, when the task is to predict the presence of a serious disease.
Let’s say that the cost of a correct classification is 0, the cost of a false positive is 1, and the cost of a false negative (that is, you predict 0 when the correct answer was +1) is $\alpha$. Recall that the usual logistic regression loss is: \begin{equation*}
\mathcal{L}_\text{nll}(g, y) = 
-\left(y \cdot \log g + (1 - y)\cdot\log (1 -
 g)\right) \;\;.
\end{equation*}
Usually, in logistic regression, we predict class +1 when a > 0.5 and -1 otherwise. Jin proposes that we can use the standard logistic regression loss function and the same data set, but change the threshold of 0.5 that we use to select a prediction. Would you increase or decrease the threshold when $\alpha = 3$?",decrease
MIT Spring 2021,4,h,1,Logistic Regression,Text,"It is common in classification problems for the cost of a false positive (predicting positive when the true answer is negative) to be different from the cost of a false negative (predicting negative when the true answer is positive). This might happen, for example, when the task is to predict the presence of a serious disease.
Let’s say that the cost of a correct classification is 0, the cost of a false positive is 1, and the cost of a false negative (that is, you predict 0 when the correct answer was +1) is $\alpha$. Recall that the usual logistic regression loss is: \begin{equation*}
\mathcal{L}_\text{nll}(g, y) = 
-\left(y \cdot \log g + (1 - y)\cdot\log (1 -
 g)\right) \;\;.
\end{equation*}
Suggest a strategy that Jin can use for picking a new threshold that minimizes our average asymmetric cost of classification.",Try several values and find the one that minimizes the training loss.
MIT Spring 2021,5,a,1,Loss Functions,Image,"We have looked at many machine-learning algorithms with hyper-parameters. Varying each of them has an effect on the loss on both the training data and on unseen testing data. What plot would describe the most typical behavior for the training error dependency on the weight on the regularization term in logistic regression: monotonically decreasing function, convex parabola, monotonically increasing function, monotonically decreasing step function, monotonically increasing step function? If none of them is appropriate, explain.","monotonically increasing, monotonically increasing step"
MIT Spring 2021,5,a,1,Loss Functions,Image,"We have looked at many machine-learning algorithms with hyper-parameters. Varying each of them has an effect on the loss on both the training data and on unseen testing data. What plot would describe the most typical behavior for the testing error dependency on the weight on the regularization term in logistic regression: monotonically decreasing function, convex parabola, monotonically increasing function, monotonically decreasing step function, monotonically increasing step function? If none of them is appropriate, explain.",convex parabola
MIT Spring 2021,5,b,1,Loss Functions,Image,"We have looked at many machine-learning algorithms with hyper-parameters. Varying each of them has an effect on the loss on both the training data and on unseen testing data. What plot would describe the most typical behavior for the training error dependency on the step-size in gradient descent for neural networks (assuming a fixed number of iterations): monotonically decreasing function, convex parabola, monotonically increasing function, monotonically decreasing step function, monotonically increasing step function? If none of them is appropriate, explain.",convex parabola
MIT Spring 2021,5,b,1,Loss Functions,Image,"We have looked at many machine-learning algorithms with hyper-parameters. Varying each of them has an effect on the loss on both the training data and on unseen testing data. What plot would describe the most typical behavior for the testing error dependency on the step-size in gradient descent for neural networks (assuming a fixed number of iterations): monotonically decreasing function, convex parabola, monotonically increasing function, monotonically decreasing step function, monotonically increasing step function? If none of them is appropriate, explain.",convex parabola
MIT Spring 2021,5,c,1,Loss Functions,Image,"We have looked at many machine-learning algorithms with hyper-parameters. Varying each of them has an effect on the loss on both the training data and on unseen testing data. What plot would describe the most typical behavior for the training error dependency on the maximum depth of a decision tree: monotonically decreasing function, convex parabola, monotonically increasing function, monotonically decreasing step function, monotonically increasing step function? If none of them is appropriate, explain.",monotonically decreasing
MIT Spring 2021,5,c,1,Loss Functions,Image,"We have looked at many machine-learning algorithms with hyper-parameters. Varying each of them has an effect on the loss on both the training data and on unseen testing data. What plot would describe the most typical behavior for the testing error dependency on the maximum depth of a decision tree: monotonically decreasing function, convex parabola, monotonically increasing function, monotonically decreasing step function, monotonically increasing step function? If none of them is appropriate, explain.",convex parabola
MIT Spring 2021,5,d,1,Loss Functions,Image,"We have looked at many machine-learning algorithms with hyper-parameters. Varying each of them has an effect on the loss on both the training data and on unseen testing data. What plot would describe the most typical behavior for the training error dependency on the number of neighbors in nearest-neighbor classification: monotonically decreasing function, convex parabola, monotonically increasing function, monotonically decreasing step function, monotonically increasing step function? If none of them is appropriate, explain.","monotonically increasing, monotonically increasing step"
MIT Spring 2021,5,d,1,Loss Functions,Image,"We have looked at many machine-learning algorithms with hyper-parameters. Varying each of them has an effect on the loss on both the training data and on unseen testing data. What plot would describe the most typical behavior for the testing error dependency on the number of neighbors in nearest-neighbor classification: monotonically decreasing function, convex parabola, monotonically increasing function, monotonically decreasing step function, monotonically increasing step function? If none of them is appropriate, explain.",convex parabola
MIT Spring 2021,5,e,1,Loss Functions,Image,"We have looked at many machine-learning algorithms with hyper-parameters. Varying each of them has an effect on the loss on both the training data and on unseen testing data. What plot would describe the most typical behavior for the training error dependency on the number of epochs of gradient-descent to perform: monotonically decreasing function, convex parabola, monotonically increasing function, monotonically decreasing step function, monotonically increasing step function? If none of them is appropriate, explain.",monotonically decreasing
MIT Spring 2021,5,e,1,Loss Functions,Image,"We have looked at many machine-learning algorithms with hyper-parameters. Varying each of them has an effect on the loss on both the training data and on unseen testing data. What plot would describe the most typical behavior for the testing error dependency on the number of epochs of gradient-descent to perform: monotonically decreasing function, convex parabola, monotonically increasing function, monotonically decreasing step function, monotonically increasing step function? If none of them is appropriate, explain.",convex parabola
MIT Spring 2021,6,a,1,CNNs,Text,"PairNet definition: The PairNet is parameterized by (a) the weights $W$ of a small network $\textrm{NN}$ and (b) one more scalar $w_0$. If there are a total of $d$ features in the input, it has the final form of:
\[y = \sigma(w_0 + \sum_{j \in \{1, \ldots, d\}, k \in \{1, \ldots, d\}, j \neq k} \textrm{NN}([x_j, x_k]; W))\]
The small neural network, parameterized by weights $W$, takes a two-dimensional vector as input and generates a scalar output.   We write it as $\text{NN}([x_i, x_j]; W)$.   If this smaller neural network has multiple layers, then $W$ includes all the weights of all the layers, including offsets.  We apply PairNet to an image by letting $x_i$ and $x_j$ be pairs of pixel values drawn from throughout the input image, and $d$ is the total number of pixels.
HairNet definition:  A hairnet has hairy layers and max pooling layers.  A hairy layer is a lot like a convolutional layer, but it uses a different set of weights on each image patch.
Define the local 3 x 3 region of the (zero-padded) input image $I$ around pixel $i, j$:
    \begin{align*}
    R(I,i,j) = (&I_{i-1, j-1}, I_{i-1, j}, I_{i-1, j+1}, \\
                 &I_{i, j-1}, I_{i, j}, I_{i, j+1}, \\
                 &I_{i+1, j-1}, I_{i+1, j}, I_{i+1, j+1})
    \end{align*}
where $I_{i,j}$ is the pixel $i, j$ of $I$.
Pixel $i,j$ of the output image is computed as the dot product of $R(I,i,j)$ and a weight vector and offset for each image location, $W^{i,j}$ and $W^{i,j}_0$.  So output pixel $O_{i,j} = {W^{i,j}}^T  R(I, i, j) + W^{i,j}_0$.
Consider the number of parameters in a HairNet. Is it bigger or smaller than a fully connected network on an image of size 100 x 100? Explain briefly.",Smaller. A fully connected network has N params per output pixel (one per input pixel; N = 10000 here) while a HairNet has 9 params per output pixel.
MIT Spring 2021,6,b,1,CNNs,Text,"PairNet definition: The PairNet is parameterized by (a) the weights $W$ of a small network $\textrm{NN}$ and (b) one more scalar $w_0$. If there are a total of $d$ features in the input, it has the final form of:
\[y = \sigma(w_0 + \sum_{j \in \{1, \ldots, d\}, k \in \{1, \ldots, d\}, j \neq k} \textrm{NN}([x_j, x_k]; W))\]
The small neural network, parameterized by weights $W$, takes a two-dimensional vector as input and generates a scalar output.   We write it as $\text{NN}([x_i, x_j]; W)$.   If this smaller neural network has multiple layers, then $W$ includes all the weights of all the layers, including offsets.  We apply PairNet to an image by letting $x_i$ and $x_j$ be pairs of pixel values drawn from throughout the input image, and $d$ is the total number of pixels.
HairNet definition:  A hairnet has hairy layers and max pooling layers.  A hairy layer is a lot like a convolutional layer, but it uses a different set of weights on each image patch.
Define the local 3 x 3 region of the (zero-padded) input image $I$ around pixel $i, j$:
    \begin{align*}
    R(I,i,j) = (&I_{i-1, j-1}, I_{i-1, j}, I_{i-1, j+1}, \\
                 &I_{i, j-1}, I_{i, j}, I_{i, j+1}, \\
                 &I_{i+1, j-1}, I_{i+1, j}, I_{i+1, j+1})
    \end{align*}
where $I_{i,j}$ is the pixel $i, j$ of $I$.
Pixel $i,j$ of the output image is computed as the dot product of $R(I,i,j)$ and a weight vector and offset for each image location, $W^{i,j}$ and $W^{i,j}_0$.  So output pixel $O_{i,j} = {W^{i,j}}^T  R(I, i, j) + W^{i,j}_0$.
For a 100 x 100 image, is the number of parameters in a HairNet bigger or smaller than a CNN with a single convolutional layer with a 3 x 3 filter? Explain briefly.","Bigger.
A CNN with a single convolutional layer (3x3 filter) has 9 params in total while HairNet has 9 params per output pixel."
MIT Spring 2021,6,c,1,CNNs,Text,"PairNet definition: The PairNet is parameterized by (a) the weights $W$ of a small network $\textrm{NN}$ and (b) one more scalar $w_0$. If there are a total of $d$ features in the input, it has the final form of:
\[y = \sigma(w_0 + \sum_{j \in \{1, \ldots, d\}, k \in \{1, \ldots, d\}, j \neq k} \textrm{NN}([x_j, x_k]; W))\]
The small neural network, parameterized by weights $W$, takes a two-dimensional vector as input and generates a scalar output.   We write it as $\text{NN}([x_i, x_j]; W)$.   If this smaller neural network has multiple layers, then $W$ includes all the weights of all the layers, including offsets.  We apply PairNet to an image by letting $x_i$ and $x_j$ be pairs of pixel values drawn from throughout the input image, and $d$ is the total number of pixels.
HairNet definition:  A hairnet has hairy layers and max pooling layers.  A hairy layer is a lot like a convolutional layer, but it uses a different set of weights on each image patch.
Define the local 3 x 3 region of the (zero-padded) input image $I$ around pixel $i, j$:
    \begin{align*}
    R(I,i,j) = (&I_{i-1, j-1}, I_{i-1, j}, I_{i-1, j+1}, \\
                 &I_{i, j-1}, I_{i, j}, I_{i, j+1}, \\
                 &I_{i+1, j-1}, I_{i+1, j}, I_{i+1, j+1})
    \end{align*}
where $I_{i,j}$ is the pixel $i, j$ of $I$.
Pixel $i,j$ of the output image is computed as the dot product of $R(I,i,j)$ and a weight vector and offset for each image location, $W^{i,j}$ and $W^{i,j}_0$.  So output pixel $O_{i,j} = {W^{i,j}}^T  R(I, i, j) + W^{i,j}_0$.
CNNs are often described as exploiting spatial locality and translation invariance. Does PairNet exploit spatial locality, translation invariance, both, or neither?",Translation Invariance
MIT Spring 2021,6,d,1,CNNs,Text,"PairNet definition: The PairNet is parameterized by (a) the weights $W$ of a small network $\textrm{NN}$ and (b) one more scalar $w_0$. If there are a total of $d$ features in the input, it has the final form of:
\[y = \sigma(w_0 + \sum_{j \in \{1, \ldots, d\}, k \in \{1, \ldots, d\}, j \neq k} \textrm{NN}([x_j, x_k]; W))\]
The small neural network, parameterized by weights $W$, takes a two-dimensional vector as input and generates a scalar output.   We write it as $\text{NN}([x_i, x_j]; W)$.   If this smaller neural network has multiple layers, then $W$ includes all the weights of all the layers, including offsets.  We apply PairNet to an image by letting $x_i$ and $x_j$ be pairs of pixel values drawn from throughout the input image, and $d$ is the total number of pixels.
HairNet definition:  A hairnet has hairy layers and max pooling layers.  A hairy layer is a lot like a convolutional layer, but it uses a different set of weights on each image patch.
Define the local 3 x 3 region of the (zero-padded) input image $I$ around pixel $i, j$:
    \begin{align*}
    R(I,i,j) = (&I_{i-1, j-1}, I_{i-1, j}, I_{i-1, j+1}, \\
                 &I_{i, j-1}, I_{i, j}, I_{i, j+1}, \\
                 &I_{i+1, j-1}, I_{i+1, j}, I_{i+1, j+1})
    \end{align*}
where $I_{i,j}$ is the pixel $i, j$ of $I$.
Pixel $i,j$ of the output image is computed as the dot product of $R(I,i,j)$ and a weight vector and offset for each image location, $W^{i,j}$ and $W^{i,j}_0$.  So output pixel $O_{i,j} = {W^{i,j}}^T  R(I, i, j) + W^{i,j}_0$. CNNs are often described as exploiting spatial locality and translation invariance. Does HairNet exploit spatial locality, translation invariance, both, or neither?",Spatial locality
MIT Spring 2021,6,e,1,CNNs,Text,"PairNet definition: The PairNet is parameterized by (a) the weights $W$ of a small network $\textrm{NN}$ and (b) one more scalar $w_0$. If there are a total of $d$ features in the input, it has the final form of:
\[y = \sigma(w_0 + \sum_{j \in \{1, \ldots, d\}, k \in \{1, \ldots, d\}, j \neq k} \textrm{NN}([x_j, x_k]; W))\]
The small neural network, parameterized by weights $W$, takes a two-dimensional vector as input and generates a scalar output.   We write it as $\text{NN}([x_i, x_j]; W)$.   If this smaller neural network has multiple layers, then $W$ includes all the weights of all the layers, including offsets.  We apply PairNet to an image by letting $x_i$ and $x_j$ be pairs of pixel values drawn from throughout the input image, and $d$ is the total number of pixels.
HairNet definition:  A hairnet has hairy layers and max pooling layers.  A hairy layer is a lot like a convolutional layer, but it uses a different set of weights on each image patch.
Define the local 3 x 3 region of the (zero-padded) input image $I$ around pixel $i, j$:
    \begin{align*}
    R(I,i,j) = (&I_{i-1, j-1}, I_{i-1, j}, I_{i-1, j+1}, \\
                 &I_{i, j-1}, I_{i, j}, I_{i, j+1}, \\
                 &I_{i+1, j-1}, I_{i+1, j}, I_{i+1, j+1})
    \end{align*}
where $I_{i,j}$ is the pixel $i, j$ of $I$.
Pixel $i,j$ of the output image is computed as the dot product of $R(I,i,j)$ and a weight vector and offset for each image location, $W^{i,j}$ and $W^{i,j}_0$.  So output pixel $O_{i,j} = {W^{i,j}}^T  R(I, i, j) + W^{i,j}_0$.
The parameters of a CNN trained on images of one size can often be applied successfully to images of another size. Is this true of PairNet?","Yes and no:
â¢ Yes: because the same weights are applied to every pair of inputs, that part of
the network is insensitive to the total number of inputs and can be applied to
images of different sizes.
â¢ No: the threshold at the last âlayerâ might need to vary depending on the total
number of outputs of the pair network being combined.
PairNet only depends on pairs of pixels. Images of another size simply control the
number of pairs and not the count of pairs.
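Both halves of this answer can be illustrated with a minimal Python sketch. The one-hidden-unit `small_nn`, the weight values, and the constant-valued inputs are hypothetical choices made only for illustration; they are not part of the original problem.

```python
import math

def small_nn(xj, xk, W):
    # Hypothetical small network: one ReLU hidden unit followed by a linear output.
    w1, w2, b, v, c = W
    return v * max(0.0, w1 * xj + w2 * xk + b) + c

def pair_sum(x, W):
    # Sum the shared small network over all ordered pairs (j, k) with j != k.
    d = len(x)
    return sum(small_nn(x[j], x[k], W)
               for j in range(d) for k in range(d) if j != k)

def pairnet(x, W, w0):
    # Final PairNet output: sigmoid of the offset plus the pair sum.
    return 1.0 / (1.0 + math.exp(-(w0 + pair_sum(x, W))))

def num_pairs(d):
    # Number of ordered pairs with j != k among d inputs.
    return d * (d - 1)

# Assumed parameters, reused unchanged for both input sizes.
W = (1.0, 1.0, 0.0, 0.1, 0.0)
w0 = -1.0

small = [0.5] * 4    # stands in for a 2 x 2 image
large = [0.5] * 16   # stands in for a 4 x 4 image

y_small = pairnet(small, W, w0)
y_large = pairnet(large, W, w0)
```

With these assumed values the same `W` and `w0` run unchanged on 4 and on 16 inputs, but the pair sum grows from about 1.2 to about 24 as the pair count goes from 12 to 240, which is why `w0` may need to change with image size.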
"
MIT Spring 2021,6,f,1,CNNs,Text,"PairNet definition: The PairNet is parameterized by (a) the weights $W$ of a small network $\textrm{NN}$ and (b) one more scalar $w_0$. If there are a total of $d$ features in the input, it has the final form of:
\[y = \sigma(w_0 + \sum_{j \in \{1, \ldots, d\}, k \in \{1, \ldots, d\}, j \neq k} \textrm{NN}([x_j, x_k]; W))\]
The small neural network, parameterized by weights $W$, takes a two-dimensional vector as input and generates a scalar output.   We write it as $\text{NN}([x_i, x_j]; W)$.   If this smaller neural network has multiple layers, then $W$ includes all the weights of all the layers, including offsets.  We apply PairNet to an image by letting $x_i$ and $x_j$ be pairs of pixel values drawn from throughout the input image, and $d$ is the total number of pixels.
HairNet definition:  A hairnet has hairy layers and max pooling layers.  A hairy layer is a lot like a convolutional layer, but it uses a different set of weights on each image patch.
Define the local 3 x 3 region of the (zero-padded) input image $I$ around pixel $i, j$:
    \begin{align*}
    R(I,i,j) = (&I_{i-1, j-1}, I_{i-1, j}, I_{i-1, j+1}, \\
                 &I_{i, j-1}, I_{i, j}, I_{i, j+1}, \\
                 &I_{i+1, j-1}, I_{i+1, j}, I_{i+1, j+1})
    \end{align*}
where $I_{i,j}$ is the pixel $i, j$ of $I$.
Pixel $i,j$ of the output image is computed as the dot product of $R(I,i,j)$ and a weight vector and offset for each image location, $W^{i,j}$ and $W^{i,j}_0$.  So output pixel $O_{i,j} = {W^{i,j}}^T  R(I, i, j) + W^{i,j}_0$.
The parameters of a CNN trained on images of one size can often be applied successfully to images of another size. Is this true of HairNet?",No. A HairNet's parameters depend on the size of the image.
MIT Spring 2021,7,a.i,0.2,MDPs,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i.
What is the optimal horizon 1 policy in s(1) climb or quit?",quit
MIT Spring 2021,7,a.ii,0.2,MDPs,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i.
What is the optimal horizon 1 policy in s(2) climb or quit?",quit
MIT Spring 2021,7,a.iii,0.2,MDPs,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i.
What is the optimal horizon 1 policy in s(4) climb or quit?",quit
MIT Spring 2021,7,a.iv,0.2,MDPs,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i.
What is the optimal horizon 1 policy in s(5) climb or quit?",quit
MIT Spring 2021,7,a.v,0.2,MDPs,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i.
What is the optimal horizon 1 policy in s(7) climb or quit?",quit
MIT Spring 2021,7,b,2,MDPs,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i. If you initialize the Q values of all the states to 0, and do one iteration of undiscounted (γ = 1) value iteration, what is the resulting Q value function?","Q(s, quit) = 1, 2, 4, 5, 7
Q(s, climb) = 0, 0, 0, 0, 0"
MIT Spring 2021,7,c.i,0.4,MDPs,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i. What is the optimal infinite horizon policy in s(1) with no discounting climb or quit?",climb
MIT Spring 2021,7,c.ii,0.4,MDPs,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i. What is the optimal infinite horizon policy in s(2) with no discounting climb or quit?",climb
MIT Spring 2021,7,c.iii,0.4,MDPs,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i. What is the optimal infinite horizon policy in s(4) with no discounting climb or quit?",climb
MIT Spring 2021,7,c.iv,0.4,MDPs,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i. What is the optimal infinite horizon policy in s(5) with no discounting climb or quit?",climb
MIT Spring 2021,7,c.v,0.4,MDPs,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i. What is the optimal infinite horizon policy in s(5) with no discounting climb or quit?",climb
MIT Spring 2021,7,d,3,MDPs,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i. Now let's consider discounting. State an inequality involving numeric values, γ, Q(s2, climb), and Q(s7, climb), specifying the condition under which the optimal action in s5 is to quit.","\[ 5 > \frac{1}{2}\gamma Q(s_2, \textbf{climb}) + \frac{7}{2}\gamma\;\;.\]"
MIT Spring 2021,8,a,1,Reinforcement Learning,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i. Assume we know the rewards, but not the transition model. During learning, after executing the quit action and getting a reward, a new learning episode is initialized, after we go back to square 1. We will use a discount factor of γ = 1.
Why is value iteration not a good choice of algorithm for this problem?",Because we don't know the transition model!
MIT Spring 2021,8,b,1,Reinforcement Learning,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i. Assume we know the rewards, but not the transition model. During learning, after executing the quit action and getting a reward, a new learning episode is initialized, after we go back to square 1. We will use a discount factor of γ = 1.
If we do purely greedy action selection during Q-learning (that is $\epsilon = 0$), starting from all 0's in our Q table and where ties are broken in favor of the climb action, what (roughly) will the Q function be after 1000 steps?",0
MIT Spring 2021,8,c,1,Reinforcement Learning,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i. Assume we know the rewards, but not the transition model. During learning, after executing the quit action and getting a reward, a new learning episode is initialized, after we go back to square 1. We will use a discount factor of γ = 1. If we do purely greedy action selection during Q-learning (that is $\epsilon = 0$), starting from all 0's in our Q table and where ties are broken in favor of the quit action, what (roughly) will the Q function be after 1000 steps?","It will be all 0 except Q(s1, quit) = 1"
MIT Spring 2021,8,d.i,1.666666667,Reinforcement Learning,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i. Assume we know the rewards, but not the transition model. During learning, after executing the quit action and getting a reward, a new learning episode is initialized, after we go back to square 1. We will use a discount factor of γ = 1.
Assume we start with a tabular Q function on states s1, s2, s4, s5 and s7, initialized to all 0's. Then we get the following experience, consisting of three trajectories, shown below as (state, action, reward) tuples.
Using learning rate α = 0.5, what is the Q function after each of these trajectories? (You only need to fill in the non-zero entries). Trajectory: ((s(1), climb, 0),(s(2), climb, 0),(s(5), climb, 0),(s(7), quit, 7))","Q(s(7), quit) = 3.5"
MIT Spring 2021,8,d.ii,1.666666667,Reinforcement Learning,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i. Assume we know the rewards, but not the transition model. During learning, after executing the quit action and getting a reward, a new learning episode is initialized, after we go back to square 1. We will use a discount factor of γ = 1.
Assume we start with a tabular Q function on states s1, s2, s4, s5 and s7, initialized to all 0's. Then we get the following experience, consisting of three trajectories, shown below as (state, action, reward) tuples.
Using learning rate α = 0.5, what is the Q function after each of these trajectories? (You only need to fill in the non-zero entries). Trajectory: ((s(1), quit, 1))","Q(s(1), quit) = 0.5, Q(s(7), quit) = 3.5"
MIT Spring 2021,8,d.iii,1.666666667,Reinforcement Learning,Text,"You are moving up a 1-dimensional track with squares s(1), s(2), s(4), s(5), s(7).
The following transitions happen with 100% probability:
s(3) to s(5)
s(9) to s(1)
s(8) to s(1)
s(6) to s(2)
You have two actions: climb and quit. If you climb from state s(i) then with probability 0.5 you go up one square, and with probability 0.5 you go up two squares. So, for example, in our case, if you start in state s(5) and climb there is a .5 chance you'll end up in square s(2) (because you move up one but transition to s(2)) and a .5 chance you'll end up in square s(7). If you climb from s(7) then you will go back to square s(1) with probability 1.0. If you quit then the game is over and you get to take no further actions. Each new episode starts in state s(1). The reward for choosing climb in any state is 0. The reward for choosing quit in state s(i) is i. Assume we know the rewards, but not the transition model. During learning, after executing the quit action and getting a reward, a new learning episode is initialized, after we go back to square 1. We will use a discount factor of γ = 1.
Assume we start with a tabular Q function on states s1, s2, s4, s5 and s7, initialized to all 0's. Then we get the following experience, consisting of three trajectories, shown below as (state, action, reward) tuples.
Using learning rate α = 0.5, what is the Q function after each of these trajectories? (You only need to fill in the non-zero entries). Trajectory: ((s(1), climb, 0),(s(2), climb, 0),(s(5), climb, 0),(s(7), quit, 7))","Q(s(1), quit) = 0.5, Q(s(7), quit) = 5.25, Q(s(5), climb) = 1.75"
MIT Spring 2021,9,a.i,1,Reinforcement Learning,Text,"Kim is running Q learning on a simple 2D grid-world problem and visualizes the current Q value estimates and greedy policy with respect to the current Q value estimates at each location. States correspond to squares in the grid, actions are north, south, east, and west, and there is a single state with non-zero reward. Assume the transition model is deterministic and action north moves you one square up, etc., except at the boundaries of the domain. Diagonal moves are not possible. 
Define the following in terms of the current estimated action-value function, Q: The greedy policy with respect to Q for state s.","greedy = argmax_a Q(s, a)"
MIT Spring 2021,9,a.ii,1,Reinforcement Learning,Text,"Kim is running Q learning on a simple 2D grid-world problem and visualizes the current Q value estimates and greedy policy with respect to the current Q value estimates at each location. States correspond to squares in the grid, actions are north, south, east, and west, and there is a single state with non-zero reward. Assume the transition model is deterministic and action north moves you one square up, etc., except at the boundaries of the domain. Diagonal moves are not possible. 
Define the following in terms of the current estimated action-value function, Q: The estimated value of state s.","value = max_a Q(s, a)"
MIT Spring 2021,9,b,1,Reinforcement Learning,Image,"Kim is running $Q$ learning on a simple $2 D$ grid-world problem and visualizes the current $Q$ value estimates and greedy policy with respect to the current $Q$ value estimates at each location. States correspond to squares in the grid, actions are north, south, east, and west, and there is a single state with non-zero reward. Assume the transition model is deterministic and action north moves you one square up, etc., except at the boundaries of the domain. Diagonal moves are not possible. 
Kim sees the situation below while their algorithm is running. The numbers in the boxes correspond to the estimated $\hat{V}$ values for the states neighboring state $s$, and the arrow indicates the greedy action with respect to $\hat{Q}$ for state $s$. All of the states shown have 0 reward values.
Explain briefly why this situation might be concerning.","The situation is potentially concerning because the greedy action is to move north, but the neighboring state with the highest estimated value is to the south."
MIT Spring 2021,9,c,2,Reinforcement Learning,Image,"Kim is running $Q$ learning on a simple $2 D$ grid-world problem and visualizes the current $Q$ value estimates and greedy policy with respect to the current $Q$ value estimates at each location. States correspond to squares in the grid, actions are north, south, east, and west, and there is a single state with non-zero reward. Assume the transition model is deterministic and action north moves you one square up, etc., except at the boundaries of the domain. Diagonal moves are not possible. 
Does this situation mean that there is a bug in Kim's Q-learning implementation? Explain briefly why or why not.","This is not necessarily a bug. The value of the state to the north, $s_{\text {north }}$, depends on the values $Q\left(s_{\text {north }}, a\right)$, and the policy at $s$ depends on the values $Q(s, a)$. During learning, before convergence, it is entirely possible for them to disagree in this way."
MIT Spring 2021,10,a,2,Classifiers,Text,"Given a set of data $\mathcal{D}_\text{train} = \{(\ex{x}{i}, \ex{y}{i})\}$, a weighted nearest neighbor regressor  has the form 
\[h(x , \theta) = \frac{\sum_{(\ex{x}{i}, \ex{y}{i}) \in \mathcal{D}_\text{train}} f(x, \ex{x}{i} , \theta) \ex{y}{i}}
{\sum_{(\ex{x}{i}, \ex{y}{i}) \in \mathcal{D}_\text{train}} f(x, \ex{x}{i} , \theta) }
\,\,.\]
A typical choice for $f$ is
\[f(x, x', \theta) = e^{-\theta \|x - x'\|^2}
\]
where $\theta$ is a scalar and $\|x-x'\|^2 = \sum_{j=1}^d (x_j - x'_j)^2$. Assume our training data $\mathcal{D}_\text{train} = ((1, 1), (2, 2), (3, 6))$. What is $h(10, 0)$?  That is, letting $\theta=0$, what is our prediction for $x = 10$?",3
MIT Spring 2021,10,b,2,Classifiers,Text,"Given a set of data $\mathcal{D}_\text{train} = \{(\ex{x}{i}, \ex{y}{i})\}$, a weighted nearest neighbor regressor  has the form 
\[h(x , \theta) = \frac{\sum_{(\ex{x}{i}, \ex{y}{i}) \in \mathcal{D}_\text{train}} f(x, \ex{x}{i} , \theta) \ex{y}{i}}
{\sum_{(\ex{x}{i}, \ex{y}{i}) \in \mathcal{D}_\text{train}} f(x, \ex{x}{i} , \theta) }
\,\,.\]
A typical choice for $f$ is
\[f(x, x', \theta) = e^{-\theta \|x - x'\|^2}
\]
where $\theta$ is a scalar and $\|x-x'\|^2 = \sum_{j=1}^d (x_j - x'_j)^2$. Assume our training data $\mathcal{D}_\text{train} = ((1, 1), (2, 2), (3, 6))$. Approximately what is $h(10, 1)$? 
",Approximately 6
MIT Spring 2021,10,c,2,Classifiers,Text,"Given a set of data $\mathcal{D}_\text{train} = \{(\ex{x}{i}, \ex{y}{i})\}$, a weighted nearest neighbor regressor  has the form 
\[h(x , \theta) = \frac{\sum_{(\ex{x}{i}, \ex{y}{i}) \in \mathcal{D}_\text{train}} f(x, \ex{x}{i} , \theta) \ex{y}{i}}
{\sum_{(\ex{x}{i}, \ex{y}{i}) \in \mathcal{D}_\text{train}} f(x, \ex{x}{i} , \theta) }
\,\,.\]
A typical choice for $f$ is
\[f(x, x', \theta) = e^{-\theta \|x - x'\|^2}
\]
where $\theta$ is a scalar and $\|x-x'\|^2 = \sum_{j=1}^d (x_j - x'_j)^2$.
How does a weighted nearest neighbor approach compare to linear regression for the same data? Why might we prefer one over the other?
","A linear regression model would fit a straight line through the training data and allow extrapolation. It would predict h(10) to be much larger because that is the trend in the training data (y grows as x grows).
The weighted nearest neighbor approach will keep the predictions within the limits of the training data labels (it is a weighted average of the training data points). This would be preferred if we do not want to extrapolate beyond the training data."
MIT Spring 2021,10,d,2,Classifiers,Text,"Given a set of data $\mathcal{D}_\text{train} = \{(\ex{x}{i}, \ex{y}{i})\}$, a weighted nearest neighbor regressor  has the form 
\[h(x , \theta) = \frac{\sum_{(\ex{x}{i}, \ex{y}{i}) \in \mathcal{D}_\text{train}} f(x, \ex{x}{i} , \theta) \ex{y}{i}}
{\sum_{(\ex{x}{i}, \ex{y}{i}) \in \mathcal{D}_\text{train}} f(x, \ex{x}{i} , \theta) }
\,\,.\]
A typical choice for $f$ is
\[f(x, x', \theta) = e^{-\theta \|x - x'\|^2}
\]
where $\theta$ is a scalar and $\|x-x'\|^2 = \sum_{j=1}^d (x_j - x'_j)^2$.
How does a weighted nearest neighbor approach compare to linear regression for the same data? Why might we prefer one over the other? If we were only ever going to have to make predictions on the training data, what value of $\theta$ would tend to minimize our prediction error?",Use a very large theta
MIT Spring 2021,10,e,2,Classifiers,Text,"Given a set of data $\mathcal{D}_\text{train} = \{(\ex{x}{i}, \ex{y}{i})\}$, a weighted nearest neighbor regressor  has the form 
\[h(x , \theta) = \frac{\sum_{(\ex{x}{i}, \ex{y}{i}) \in \mathcal{D}_\text{train}} f(x, \ex{x}{i} , \theta) \ex{y}{i}}
{\sum_{(\ex{x}{i}, \ex{y}{i}) \in \mathcal{D}_\text{train}} f(x, \ex{x}{i} , \theta) }
\,\,.\]
A typical choice for $f$ is
\[f(x, x', \theta) = e^{-\theta \|x - x'\|^2}
\]
where $\theta$ is a scalar and $\|x-x'\|^2 = \sum_{j=1}^d (x_j - x'_j)^2$. How does a weighted nearest neighbor approach compare to linear regression for the same data? Why might we prefer one over the other? If we were only ever going to have to make predictions on the training data, what value of $\theta$ would tend to minimize our prediction error? Dino thinks the denominator in the definition of h is not useful and it would be fine to remove it. Is Dino right?","No. The denominator is needed for normalization (to keep the prediction in the same range of y's as the training data)."
MIT Spring 2021,11,a,3,RNNs,Text,"Ronnie makes a simple RNN with state dimension 1 and a {\em step} function for $f_1$, so that  \[s_t = \text{step}( w_1 x_t + w_2 s_{t-1} + b) \] where $\text{step}(z) = 1$ if $z > 0.0$ and equals $0$ otherwise, and where the output  \[y_t = s_t\,\,.\]. Assuming $s_0 = 0$, what values of $w_1$, $w_2$ and $b$ would generate output sequence  \[[0, 0, 0,  1, 1, 1, 1]\] given input sequence \[ [0, 0, 0, 1, 0, 1, 0]\] ","Since x_t = 1 and s_{t-1} = 0 produces s_t = 1, we have that w1 + b > 0, for example w1 = 1 if b = 0
Since x_t = 0 and s_{t-1} = 1 produces s_t = 1, we have that w2 + b > 0, for example w2 = 1 if b = 0
Since x_t = 0 and s_{t-1} = 0 produces s_t = 0, we have that b ≤ 0."
MIT Spring 2021,11,b.i,0.5,RNNs,Text,"Ronnie makes a simple RNN with state dimension 1 and a {\em step} function for $f_1$, so that \[s_t = \text{step}( w_1 x_t + w_2 s_{t-1} + b) \] where $\text{step}(z) = 1$ if $z > 0.0$ and equals $0$ otherwise, and where the output  \[y_t = s_t\;\;.\]. Assuming $s_0 = 1$, we want to generate output sequence  \[[1, 1, 1, 0, 0, 0, 1, 1]\] given input sequence \[ [0, 0, 0, 1, 0, 0, 1, 0]\]. What is the value of s_t for x_t = 0 and s_{t-1} = 0?",0
MIT Spring 2021,11,b.ii,0.5,RNNs,Text,"Ronnie makes a simple RNN with state dimension 1 and a {\em step} function for $f_1$, so that \[s_t = \text{step}( w_1 x_t + w_2 s_{t-1} + b) \] where $\text{step}(z) = 1$ if $z > 0.0$ and equals $0$ otherwise, and where the output  \[y_t = s_t\;\;.\]. Assuming $s_0 = 1$, we want to generate output sequence   \[[1, 1, 1, 0, 0, 0, 1, 1]\] given input sequence \[ [0, 0, 0, 1, 0, 0, 1, 0]\]. What is the value of s_t for x_t = 0 and s_{t-1} = 1?",1
MIT Spring 2021,11,b.iii,0.5,RNNs,Text,"Ronnie makes a simple RNN with state dimension 1 and a {\em step} function for $f_1$, so that  \[s_t = \text{step}( w_1 x_t + w_2 s_{t-1} + b) \] where $\text{step}(z) = 1$ if $z > 0.0$ and equals $0$ otherwise, and where the output \[y_t = s_t\;\;.\]. Assuming $s_0 = 1$, we want to generate output sequence  \[[1, 1, 1, 0, 0, 0, 1, 1]\] given input sequence \[ [0, 0, 0, 1, 0, 0, 1, 0]\]. What is the value of s_t for x_t = 1 and s_{t-1} = 0?",1
MIT Spring 2021,11,b.iv,0.5,RNNs,Text,"Ronnie makes a simple RNN with state dimension 1 and a {\em step} function for $f_1$, so that \[s_t = \text{step}( w_1 x_t + w_2 s_{t-1} + b) \] where $\text{step}(z) = 1$ if $z > 0.0$ and equals $0$ otherwise, and where the output \[y_t = s_t\;\;.\] Assuming $s_0 = 1$, we want to generate output sequence  \[[1, 1, 1, 0, 0, 0, 1, 1]\] given input sequence \[ [0, 0, 0, 1, 0, 0, 1, 0]\]. What is the value of s_t for x_t = 1 and s_{t-1} = 1?",0
MIT Spring 2021,11,c,3,RNNs,Text,"Ronnie makes a simple RNN with state dimension 1 and a {\em step} function for $f_1$, so that \[s_t = \text{step}( w_1 x_t + w_2 s_{t-1} + b) \] where $\text{step}(z) = 1$ if $z > 0.0$ and equals $0$ otherwise, and where the output \[y_t = s_t\;\;.\] Assuming $s_0 = 1$, we want to generate output sequence  \[[1, 1, 1, 0, 0, 0, 1, 1]\] given input sequence \[ [0, 0, 0, 1, 0, 0, 1, 0]\]. Rennie thinks this is not possible using Ronnie's architecture. Rennie makes an argument based on the relationships in the table above. Is Rennie right?",Rennie is right
MIT Spring 2021,12,a,1,Decision Trees,Image,"Here is a standard regression tree of a fixed size. It has 5 scalar parameters $\left(s_{1}, s_{2}, v_{1}, v_{2}, v_{3}\right)$ and two discrete choices of feature to split on, denoted by integers $j$ and $k$.
We are given a training data set $\mathcal{D}_{\operatorname{train}}=\left\{\left(x^{(j)}, y^{(j)}\right)\right\}$ where the dimension of $x^{(j)}$ is $d$. 
Explain briefly why we cannot use gradient descent on a squared loss to optimize all the parameters of this predictor.",The gradients are zero or do not exist.
MIT Spring 2021,12,b,7,Decision Trees,Image,"Here is a standard regression tree of a fixed size. It has 5 scalar parameters $\left(s_{1}, s_{2}, v_{1}, v_{2}, v_{3}\right)$ and two discrete choices of feature to split on, denoted by integers $j$ and $k$.
We are given a training data set $\mathcal{D}_{\operatorname{train}}=\left\{\left(x^{(j)}, y^{(j)}\right)\right\}$ where the dimension of $x^{(j)}$ is $d$.
Terry would like to make a ""smoother"" tree by replacing the tests at the nodes with neural-network logistic classifiers and by combining predictions from the branches, so that we can think of the tree as a parametric model and optimize the parameters using gradient descent. More concretely, at each internal node, the test will be replaced by $\mathrm{NN}(x ; \theta)$, a neural network that takes an entire input vector $x$, of dimension $d$, as input and generates an output in the range $[0,1]$ by using a sigmoid unit on the output.
You can think of any node $T_{i}$ of a tree as producing an output value as follows:
- If $T_{\mathrm{i}}$ is a leaf, then the output on input $x, T_{\mathrm{i}}(x)$, is a constant $v_{\mathrm{i}}$.
- If $T_{\mathrm{i}}$ is an internal node with children $T_{\mathrm{no}}$ (corresponding to the ""no"" branch) and $T_{\mathrm{yes}}$ (corresponding to the ""yes"" branch), then the output on input $x$ is
$$
T_{i}(x)=\left(1-\mathrm{NN}\left(x ; \theta^{(i)}\right)\right) T_{\mathrm{no}}(x)+\mathrm{NN}\left(x ; \theta^{(i)}\right) T_{\mathrm{yes}}(x) .
$$
That is, it is a weighted combination of the results of the children, where the neural network at the parent node, with parameters $\theta^{(i)}$, modulates the combination of the results of the children.

We will consider the specific case where NN is a single unit with a sigmoidal activation function, so that
$$
\mathrm{NN}\left(x ; W^{(i)}, W_{0}^{(\mathrm{i})}\right)=\sigma\left(W^{(i)^{T}} x+W_{0}^{(i)}\right)
$$
where $W^{(i)}$ is a vector of length $d$ and $W_{0}^{(i)}$ is a scalar and $\sigma$ is the sigmoid function.

Consider the dataset shown in the plot below right, where $d=2$. Each integer value on the plot (one of $5,-2$, or 8 ) corresponds to a datapoint whose input $x$ features are the coordinates of the point on the plot and whose output $y$ value is the printed number.
Provide the parameters of a tree-predictor, corresponding to the model shown above left, that make accurate predictions on the dataset.","$W^{(1)}=[100,100]^{T}$
$W_{0}^{(1)}=0$
$W^{(2)}=[-100,100]^{T}$, or $W^{(2)}=[100,-100]^{T}$
 $W_{0}^{(2)}=100$, or $W_{0}^{(2)}=-100$ (should match with the answer above).
$v_{1}=-2$, or $v_{1}=5$ (depends on the answer above).
 $v_{2}=5$ or $v_{2}=-2$ (depends on the answer above).
$v_{3}=8$"
MIT Spring 2021,12,c,3,Decision Trees,Image,"Here is a standard regression tree of a fixed size. It has 5 scalar parameters $\left(s_{1}, s_{2}, v_{1}, v_{2}, v_{3}\right)$ and two discrete choices of feature to split on, denoted by integers $j$ and $k$.
We are given a training data set $\mathcal{D}_{\operatorname{train}}=\left\{\left(x^{(j)}, y^{(j)}\right)\right\}$ where the dimension of $x^{(j)}$ is $d$.
Terry would like to make a ""smoother"" tree by replacing the tests at the nodes with neural-network logistic classifiers and by combining predictions from the branches, so that we can think of the tree as a parametric model and optimize the parameters using gradient descent. More concretely, at each internal node, the test will be replaced by $\mathrm{NN}(x ; \theta)$, a neural network that takes an entire input vector $x$, of dimension $d$, as input and generates an output in the range $[0,1]$ by using a sigmoid unit on the output.
You can think of any node $T_{i}$ of a tree as producing an output value as follows:
- If $T_{\mathrm{i}}$ is a leaf, then the output on input $x, T_{\mathrm{i}}(x)$, is a constant $v_{\mathrm{i}}$.
- If $T_{\mathrm{i}}$ is an internal node with children $T_{\mathrm{no}}$ (corresponding to the ""no"" branch) and $T_{\mathrm{yes}}$ (corresponding to the ""yes"" branch), then the output on input $x$ is
$$
T_{i}(x)=\left(1-\mathrm{NN}\left(x ; \theta^{(i)}\right)\right) T_{\mathrm{no}}(x)+\mathrm{NN}\left(x ; \theta^{(i)}\right) T_{\mathrm{yes}}(x) .
$$
That is, it is a weighted combination of the results of the children, where the neural network at the parent node, with parameters $\theta^{(i)}$, modulates the combination of the results of the children.

We will consider the specific case where NN is a single unit with a sigmoidal activation function, so that
$$
\mathrm{NN}\left(x ; W^{(i)}, W_{0}^{(\mathrm{i})}\right)=\sigma\left(W^{(i)^{T}} x+W_{0}^{(i)}\right)
$$
where $W^{(i)}$ is a vector of length $d$ and $W_{0}^{(i)}$ is a scalar and $\sigma$ is the sigmoid function.
What is $\partial T_{1}(x) / \partial W^{(1)}$ in this particular model? Please use the following shorthand:
- $T=T_{1}(x)$
- $O=\mathrm{NN}\left(x ; W^{(1)}, W_{0}^{(1)}\right)$
- $T_{\text {no }}=$ the output of the ""no"" branch of $T_{1}$
- $T_{\text {yes }}=$ the output of the ""yes"" branch of $T_{1}$
Express your answer in terms of these quantities, $x$, and parameters $\left(W^{(1)}, W^{(2)}, W_{0}^{(1)}, W_{0}^{(2)}, v_{1}, v_{2}, v_{3}\right)$, as needed, but do not leave any derivatives in it.","Using shorthands:
$$
T=(1-O) T_{\text {no }}+O T_{\text {yes }}
$$
Only $O$ is a function of $W^{(1)}$. Also recall that the derivative of the sigmoid can be simplified as: $\frac{d}{d w} \sigma(g(w))=\sigma(g(w))(1-\sigma(g(w))) g^{\prime}(w)$. Therefore, $\partial T / \partial W^{(1)}=\left(T_{\text {yes }}-T_{\text {no }}\right) O(1-O) x$."
MIT Spring 2021,12,d,1,Decision Trees,Image,"Here is a standard regression tree of a fixed size. It has 5 scalar parameters $\left(s_{1}, s_{2}, v_{1}, v_{2}, v_{3}\right)$ and two discrete choices of feature to split on, denoted by integers $j$ and $k$.
We are given a training data set $\mathcal{D}_{\operatorname{train}}=\left\{\left(x^{(j)}, y^{(j)}\right)\right\}$ where the dimension of $x^{(j)}$ is $d$. 
Tori thinks that since regression trees have repeated structure, similar to a CNN, that we should use the same weight vector $W$ and offset $W_{0}$ at all the internal nodes. Explain the hypothesis class that results.","This is still a regression tree, but with a single linear split."
MIT Spring 2021,13,a,1,Neural Networks,Text,"Sam wants to build a neural network that takes in a scalar value $x$ in the range $[0, 1]$ and generates a one-hot output vector $y$ of dimension $K$, where, for $k \in \{0, 1, \ldots, K-1\}$,  $y_k = 1$ if and only if $k/K < x \leq (k+1)/K$;  that is, it discretizes the interval into $K$ equally sized sequential ranges. They choose an architecture with a single linear layer with weights $W$ and $W_0$ and a softmax activation function, so that the output  
\[a = \text{softmax}(z)\]
where 
\[z = W^T x + W_0\;\;.\]
Assume that, for prediction purposes,  we are going to take the output of the network, $a$, and convert it into a $K$-dimensional one-hot vector $(y_0, \ldots, y_{K-1})$ where
\begin{align*}
    y_i = \begin{cases} 1 & \text{if $i = \text{arg} \max_j a_j$}\\
    0 & \text{otherwise}
    \end{cases}
\end{align*}. That is, it has a value of $1$ at the index corresponding to the maximal element of $a$ and value $0$ everywhere else. How many trainable weights does this network have when $K = 10$?",20
MIT Spring 2021,13,b,2,Neural Networks,Image,"Sam wants to build a neural network that takes in a scalar value $x$ in the range $[0,1]$ and generates a one-hot output vector $y$ of dimension $K$, where, for $k \in\{0,1, \ldots, K-1\}$, $y_{k}=1$ if and only if $k / K<x \leq(k+1) / K$; that is, it discretizes the interval into $K$ equally sized sequential ranges. Please don't worry about precisely what the output is at the boundaries of the intervals.

They choose an architecture with a single linear layer with weights $W$ and $W_{0}$ and a softmax activation function, so that the output
$$
a=\operatorname{softmax}(z)
$$
where
$$
z=W^{T} x+W_{0}
$$
Assume that, for prediction purposes, we are going to take the output of the network, $a$, and convert it into a $K$-dimensional one-hot vector $\left(y_{0}, \ldots, y_{K-1}\right)$ where
$$
y_{i}= \begin{cases}1 & \text { if } i=\arg \max _{j} a_{j} \\ 0 & \text { otherwise }\end{cases}
$$
That is, it has a value of 1 at the index corresponding to the maximal element of $a$ and value 0 everywhere else. 
Let's consider the case of $K=3$. On the axes below, draw the three components of the $z$ vector, $z_{0}, z_{1}$, and $z_{2}$, as a function of $x$ so that the resulting $y$ will provide a correct discretization of the interval into three equal regions. (There are many correct solutions.)",Drawing image
MIT Spring 2021,13,c,3,Neural Networks,Image,"Sam wants to build a neural network that takes in a scalar value $x$ in the range $[0,1]$ and generates a one-hot output vector $y$ of dimension $K$, where, for $k \in\{0,1, \ldots, K-1\}$, $y_{k}=1$ if and only if $k / K<x \leq(k+1) / K$; that is, it discretizes the interval into $K$ equally sized sequential ranges. Please don't worry about precisely what the output is at the boundaries of the intervals.

They choose an architecture with a single linear layer with weights $W$ and $W_{0}$ and a softmax activation function, so that the output
$$
a=\operatorname{softmax}(z)
$$
where
$$
z=W^{T} x+W_{0}
$$
Assume that, for prediction purposes, we are going to take the output of the network, $a$, and convert it into a $K$-dimensional one-hot vector $\left(y_{0}, \ldots, y_{K-1}\right)$ where
$$
y_{i}= \begin{cases}1 & \text { if } i=\arg \max _{j} a_{j} \\ 0 & \text { otherwise }\end{cases}
$$
That is, it has a value of 1 at the index corresponding to the maximal element of $a$ and value 0 everywhere else. 
Provide a set of weight values that will discretize the unit interval into 3 equal parts, with output predictions $y=[1,0,0]$ for $x \in[0,1 / 3], y=[0,1,0]$ for $x \in[1 / 3,2 / 3]$, and $y=[0,0,1]$ for $x \in[2 / 3,1]$. Please don't worry about exactly what happens at the boundaries!!!","$$
W_{0}=[1 / 3,0,-2 / 3]^{T}
$$
$$
W=[-1,0,1]^{T}
$$"
MIT Fall 2021,1,a.i,0.4,Features,Text,"Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
What is the best way to encode the app characteristic 'Genre (Game, Productivity, Education, Information, Social)' as a feature for an input to the neural network? Choose from among the following: multiple unary features (one-hot encoding), multiple binary features (thermometer encoding), an integer or real-valued feature. Also give the exact function that maps each input to its corresponding feature(s).","One-hot, with a bit for each possible genre: Game → 10000, Productivity → 01000, Education → 00100, Information → 00010, Social → 00001"
MIT Fall 2021,1,a.ii,0.4,Features,Text,"Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
What is the best way to encode the app characteristic 'Suitable for people ages (2-4, 5-10, 11-15, 16 and over)' as a feature for an input to the neural network? Choose from among the following: multiple unary features (one-hot encoding), multiple binary features (thermometer encoding), an integer or real-valued feature. Also give the exact function that maps each input to its corresponding feature(s).","Thermometer, because order should be preserved: 2-4: 1000; 5-10: 1100; 11-15: 1110; 16 and over: 1111"
MIT Fall 2021,1,a.iii,0.4,Features,Text,"Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
What is the best way to encode the app characteristic 'Was it banned in any previous quarter (True, False)' as a feature for an input to the neural network? Choose from among the following: multiple unary features (one-hot encoding), multiple binary features (thermometer encoding), an integer or real-valued feature. Also give the exact function that maps each input to its corresponding feature(s).","Single binary feature, True: 1, False: 0. We also accepted a True/False encoding since Python correctly does arithmetic with it."
MIT Fall 2021,1,a.iv,0.4,Features,Text,"Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
What is the best way to encode the app characteristic 'Price of the app (positive number)' as a feature for an input to the neural network? Choose from among the following: multiple unary features (one-hot encoding), multiple binary features (thermometer encoding), an integer or real-valued feature. Also give the exact function that maps each input to its corresponding feature(s).","Real-valued; may standardize it using (x - µ)/σ, where µ is the mean and σ the standard deviation"
MIT Fall 2021,1,a.v,0.4,Features,Text,"Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
What is the best way to encode the app characteristic 'Does it have in-game advertising (True, False)' as a feature for an input to the neural network? Choose from among the following: multiple unary features (one-hot encoding), multiple binary features (thermometer encoding), an integer or real-valued feature. Also give the exact function that maps each input to its corresponding feature(s).","Single binary feature, True: 1, False: 0. We also accepted a True/False encoding since Python correctly does arithmetic with it.
"
MIT Fall 2021,1,b.i,0.333333333,Neural Networks,Text,"Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac wants to predict the sales volume (how many times someone will purchase the app each month) for his new app. The sales volume can be negative if many people returned the app for a refund in a given month. What should Mac choose for the number of units in the output layer?",One unit
MIT Fall 2021,1,b.ii,0.333333333,Neural Networks,Text,"Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac wants to predict the sales volume (how many times someone will purchase the app each month) for his new app. The sales volume can be negative if many people returned the app for a refund in a given month. What should Mac choose for the activation function(s) in the output layer? Choose either Linear, ReLU or sigmoid.",Linear
MIT Fall 2021,1,b.iii,0.333333333,Neural Networks,Text,"Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac wants to predict the sales volume (how many times someone will purchase the app each month) for his new app. The sales volume can be negative if many people returned the app for a refund in a given month. What should Mac choose for the loss function? Choose from either negative log likelihood or quadratic.",Quadratic
MIT Fall 2021,1,c.i,0.333333333,Neural Networks,Text,"Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac has data on three other properties: whether an app was featured on the front page, whether it got a favorable review on the Coolest Apps Evar web site, and whether Orange Computer offered to pay to port the app to their site. He would like to train a new neural network to predict these three properties. For this new prediction task, what should Mac choose for the number of units in the output layer?",3 units
MIT Fall 2021,1,c.ii,0.333333333,Neural Networks,Text,"Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac has data on three other properties: whether an app was featured on the front page, whether it got a favorable review on the Coolest Apps Evar web site, and whether Orange Computer offered to pay to port the app to their site. He would like to train a new neural network to predict these three properties. For this new prediction task, what should Mac choose for the activation function in the output layer? Choose from linear, ReLU, sigmoid or softmax.",Sigmoid
MIT Fall 2021,1,c.iii,0.333333333,Neural Networks,Text,"Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac has data on three other properties: whether an app was featured on the front page, whether it got a favorable review on the Coolest Apps Evar web site, and whether Orange Computer offered to pay to port the app to their site. He would like to train a new neural network to predict these three properties. For this new prediction task, what should Mac choose for the loss function? Choose from negative log likelihood or quadratic.",Negative log likelihood
MIT Fall 2021,1,d.i,1,Neural Networks,Text,"Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac's first attempt at machine learning to predict the sales volume (setup of (b)) uses all customer data from 2020. He randomly partitions the data into train (80%) and validation (20%), and uses one unit, linear activation function, and quadratic loss function. To prevent overfitting, he uses ridge regularization of the weights W, minimizing the optimization objective $J(W; \lambda) = \sum_{i=1}^n \mathcal{L}(h(x^{(i)}; W), y^{(i)}) + \lambda \|W\|^2$ where $\|W\|^{2}$ is the sum over the square of all output units' weights. Mac discovers that it's possible to find a value of W such that J(W; λ) = 0 even when λ is very large, nearing ∞. Mac suspects that he might have an error in the code that he wrote to derive the labels (i.e., the monthly sales volumes). Let's see why. What can Mac conclude about W from this finding?",Every element of W equals 0.
MIT Fall 2021,1,d.ii,1,Neural Networks,Text,"Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac's first attempt at machine learning to predict the sales volume (setup of (b)) uses all customer data from 2020. He randomly partitions the data into train (80%) and validation (20%), and uses one unit, linear activation function, and quadratic loss function. To prevent overfitting, he uses ridge regularization of the weights W, minimizing the optimization objective $J(W; \lambda) = \sum_{i=1}^n \mathcal{L}(h(x^{(i)}; W), y^{(i)}) + \lambda \|W\|^2$ where $\|W\|^{2}$ is the sum of the squares of all the output units' weights. Mac discovers that it's possible to find a value of W such that $J(W; \lambda) = 0$ even when $\lambda$ is very large, nearing $\infty$. Mac suspects that he might have an error in the code that he wrote to derive the labels (i.e., the monthly sales volumes). If every element of W equals 0, what does this imply about the labels?","When W has all entries equal to 0, the prediction at every data point is a constant (the offset). The only way for the squared error to be 0 is for the label of every data point to equal that offset. It seems unlikely that every data label would be exactly the same in this data set, which we assume ranges over a wide range of apps."
MIT Fall 2021,1,e,1,Neural Networks,Image,"Mac found and fixed the error. Now, to choose the regularization constant $\lambda$, Mac tried values of $1, 10, 100$, and $1000$, creating the plot below. Unfortunately, he forgot to label the legend! Help Mac by filling in the legend using two of the following: 'Training error', 'Validation error', 'Training time'.",Image filling
MIT Fall 2021,1,f,1,Neural Networks,Image,"Continuing the scenario of (e), which value of $\lambda$ (out of $1,10,100$, and 1000 ) should Mac choose to obtain the neural network that he will deploy on the app store, and why?","$\lambda=100$, because the validation error is lowest at this value."
MIT Fall 2021,1,g,2,Neural Networks,Text,"Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
When Mac wakes up the next day, he decides to re-run learning using $\lambda = 100$, now with a different partition of the data into train and validation sets (since he had previously forgotten to set the random seed). He finds that he gets a very different validation error! To obtain a more stable estimate, Mac decides to split the data into 5 disjoint chunks of 20% of the data. For each chunk, he evaluates on it after training on the union of the other 4 chunks. He gets the following results for the average error within each chunk: 0.15, 0.3, 0.1, 0.2, 0.25. What can Mac conclude is an estimate of the test error of the neural network?",0.2 (the average). This is cross-validation.
MIT Fall 2021,1,h,2,Neural Networks,Image,"The initial results look promising. Mac now wants to add in data from additional, earlier, years. (He is confident his customers have been behaving similarly over many years, so the earlier data is relevant.)

Before curating the older data, Mac decides to use the training data that he has to get a sense of whether more data would help. He creates a learning curve where on the horizontal axis he varies the amount of training data used and on the vertical axis he shows the validation error, using a fixed validation set across all settings considered. He experiments with $\lambda=1,10,100$, but again forgot to include a legend. Fill in the below legend by labeling the curves with the value of $\lambda$ that each corresponds to:",Image filling
MIT Fall 2021,1,i,1,Neural Networks,Image,Based on these plots does it seem likely that even more data will improve validation error (possibly for a different value of $\lambda$ )? Explain why or why not.,"Yes, because the validation error continues to decline as the amount of regularization decreases and amount of data increases. With more data and $\lambda=0$, it is conceivable that the validation error will be even smaller."
MIT Fall 2021,1,j,1,Neural Networks,Text,"Mac O'Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they're able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning. Mac experiments with even more training data and additional values of $\lambda$, but finds that he cannot decrease the validation error further. Are there changes to the neural network architecture that Mac could make to try to improve prediction performance? Explain.",Mac could add hidden layers with nonlinear activation functions to the neural network.
MIT Fall 2021,2,a,2,Clustering,Image,"Assume that the number of clusters $k=2$ for all of the following questions. 
Walk through each step of the $k$-means algorithm, beginning with the initialization shown in the plot in the top left of the box below. Dots show the observed data. In each plot (go left to right, top to bottom), mark with two '$x$' symbols where the cluster centers are in that iteration of $k$-means. These are already shown in the initial state. Once the $k$-means algorithm has converged, you can leave all subsequent plots unmarked.",Image filling
MIT Fall 2021,2,b,1,Clustering,Image,"Assume that the number of clusters $k=2$ for all of the following questions. 
What is the numerical value of the $k$-means objective for the clustering found in (a), after the algorithm has finished running?",$8 \cdot\left(1^{2}+0.5^{2}\right)=8 \cdot(1.25)=10$
MIT Fall 2021,2,c,1,Clustering,Image,"Assume that the number of clusters $k=2$ for all of the following questions. 
Just as in (a), walk through each step of the $k$-means algorithm, beginning with the initialization shown in the plot in the top left. In each plot (go left to right, top to bottom), mark with two '$x$' symbols where the cluster centers are in that iteration of $k$-means. Once the $k$-means algorithm has converged, you can leave all subsequent plots unmarked.",Image filling
MIT Fall 2021,2,d,1,Clustering,Image,"Assume that the number of clusters $k=2$ for all of the following questions. 
What is the numerical value of the $k$-means objective for the clustering found in (c), after the algorithm has finished running?",$4 \cdot\left(2^{2}\right)+4 \cdot\left(3^{2}\right)=16+36=52$.
MIT Fall 2021,2,e,1,Clustering,Image,"Assume that the number of clusters $k=2$ for all of the following questions. 
According to the $k$-means objective of the learned clusters, which initialization was better?",Initialization (a)
MIT Fall 2021,2,f,2,Clustering,Image,"Assume that the number of clusters $k=2$ for all of the following questions. 
Consider the data in black dots shown in the plot below. We drew one cluster center with an $x$ symbol at $(2,4)$. Draw the second cluster center to satisfy the following property. When we initialize the cluster centers at the two $x$'s and run the $k$-means algorithm to convergence, the final state will be such that one cluster will have all the data points assigned to it, and the other cluster will have no data points assigned to it.","There are two correct answers, either $(0,0)$ or $(0,8)$."
MIT Fall 2021,2,g,2,Clustering,Text,"Assume that the number of clusters k = 2. Christy thinks she came up with a compelling new initialization method for the $k$-means algorithm. Looking at her code below, explain why it is unlikely to give good results.
\begin{lstlisting}[language=Python]
import numpy as np

def kmeans_init(X, n_clusters):
    centers = []
    for i in range(n_clusters):
        centers.append(X[:, X.shape[1]-1-i])
    return np.asarray(centers).T
\end{lstlisting}","Christy's method selects the last n\_clusters data points as the cluster centers. These points may be very close to each other, leading the $k$-means algorithm to a poor local optimum of the $k$-means objective."
MIT Fall 2021,2,h,2,Clustering,Image,"Assume that the number of clusters $k=2$ for all of the following questions. 
Each of the following five data sets has two ground truth clusters, whose points are denoted as 'å' and ' $o$ '. For which of these would the clustering with the smallest $k$-means objective value not recover the ground truth? Assume $k=2$. (Select all that apply.)","(II), (III), (IV), (V)"
MIT Fall 2021,3,a,1,Decision Trees,Text,"We seek to learn a classifier on the following data set given in (point, class) format: ((-3,6),-1),((-1,6),-1),((2,6),+1),((4,6),-1),((-3,5),-1),((-1,5),-1),((2,5),+1),((4,5),-1),((2,3),+1),((4,3),-1),((2,2),+1),((4,2),-1),((-1,1),+1). 
We first learn a linear logistic classifier with offset on this data set, with no regularization. Will it obtain zero training error? Write 'yes' or 'no' and explain your answer.","No, this data set is not linearly separable.
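One way to see this, as a minimal sketch rather than part of the original solution: the +1 point (2,6) lies on the segment between the -1 points (-1,6) and (4,6), and any linear classifier that labels both endpoints -1 must also label every point between them -1. The names t, combo and is_between below are illustrative only.

```python
# Three training points from the data set that share x2 = 6.
neg_left = (-1.0, 6.0)   # labeled -1
pos_mid = (2.0, 6.0)     # labeled +1
neg_right = (4.0, 6.0)   # labeled -1

# Solve pos_mid = t * neg_left + (1 - t) * neg_right for t using the x1 coordinate.
t = (pos_mid[0] - neg_right[0]) / (neg_left[0] - neg_right[0])

# Rebuild the point from t to confirm that both coordinates match.
combo = tuple(t * a + (1 - t) * b for a, b in zip(neg_left, neg_right))
is_between = 0.0 <= t <= 1.0 and all(abs(c - p) < 1e-9 for c, p in zip(combo, pos_mid))

print(t, is_between)
```

If is_between is true, the +1 point is a convex combination of two -1 points, so no line can separate the classes.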
"
MIT Fall 2021,3,b,2,Decision Trees,Image,"We seek to learn a classifier on the data set shown on the left with 13 data points labeled $+1$ or $-1$. For your convenience, we include some helpful calculations in the table to the right. 
We now learn a depth-2 decision tree, with min_samples_split=2. We give you a partially completed tree below, where the first split is $x_{1} \geq 3$. Complete the rest of the tree by filling in the boxes with the splits on the second level and the classifications (either $+1$ or $-1$) at the leaves. Use the entropy criterion to choose the splits, and leave empty any boxes that are unused. As a reminder, min_samples_split is the minimum number of data points required to split an internal node.",Image filling
MIT Fall 2021,3,c,2,Decision Trees,Image,"We seek to learn a classifier on the data set shown on the left with 13 data points labeled $+1$ or $-1$. For your convenience, we include some helpful calculations in the table to the right. 
Now suppose that we set min_samples_split=10. Again, complete the rest of the tree by filling in the boxes.

Leave empty any boxes that are unused.",Image filling
MIT Fall 2021,3,d,1,Decision Trees,Text,"In decision trees, what is the purpose of increasing the minimum samples in each split?","It improves generalization (i.e., prevents overfitting to the training data) by requiring more samples to split a node, resulting in smaller tree depth."
MIT Fall 2021,3,e,2,Decision Trees,Image,"We seek to learn a classifier on the data set shown on the left with 13 data points labeled $+1$ or $-1$. For your convenience, we include some helpful calculations in the table to the right. 
What is the training error of the trees learned in parts (b) and (c)?","Tree (b): 1/13.
Tree (c): 4/13."
MIT Fall 2021,3,f,2,Decision Trees,Image,"We seek to learn a classifier on the data set shown on the left with 13 data points labeled $+1$ or $-1$. For your convenience, we include some helpful calculations in the table to the right. 
With min_samples_split=2, if we were to continue building the tree without any restriction to its depth, what would be the training error of the resulting tree?",0
MIT Fall 2021,3,g,2,Decision Trees,Image,"We seek to learn a classifier on the data set shown on the left with 13 data points labeled $+1$ or $-1$. For your convenience, we include some helpful calculations in the table to the right. 
Suppose we give as new features $x_{i}^{3}$, using these in addition to the original features $x_{i}$. Draw the new depth-2 tree that would be learned. Assume the features are organized $x_{1}, x_{2}, x_{1}^{3}, x_{2}^{3}$ and if two features are equally good for the split according to the entropy criterion, then we choose the first one in this order. As in part (b), assume min_samples_split=2.",Image filling
MIT Fall 2021,4,a,3,Classifiers,Image,"This question asks about learning nearest neighbor (NN) classifiers. Assume that we are using Euclidean distance squared as the distance metric, i.e. $d\left(x, x^{\prime}\right)=\left\|x-x^{\prime}\right\|^{2}$. 
Draw on the below figure the decision boundary for a 1-NN classifier on this data set. In each region, denote whether the classification of any point (any point, not just the training data) in that region would be $+1$ or $-1$. (Note, all data points are assumed to be on integer coordinates.)",Image filling
MIT Fall 2021,4,b,2,Classifiers,Image,"This question asks about learning nearest neighbor (NN) classifiers. Assume that we are using Euclidean distance squared as the distance metric, i.e. $d\left(x, x^{\prime}\right)=\left\|x-x^{\prime}\right\|^{2}$. 
 Which training data points, if any, could you remove and keep the decision boundary identical? Answer using their $\left(x_{1}, x_{2}\right)$ coordinates.",Image filling
MIT Fall 2021,4,c,2,Classifiers,Image,"This question asks about learning nearest neighbor (NN) classifiers. Assume that we are using Euclidean distance squared as the distance metric, i.e. $d\left(x, x^{\prime}\right)=\left\|x-x^{\prime}\right\|^{2}$. 
You perform leave-one-out cross-validation of the 1-NN and 3-NN classifiers on this data set, i.e. you use cross-validation with a chunk size of 1 data point. Assume ties go to the $+1$ region. What cross-validation errors do you obtain?",Image filling
MIT Fall 2021,4,d,2,Classifiers,Image,"This question asks about learning nearest neighbor (NN) classifiers. Assume that we are using Euclidean distance squared as the distance metric, i.e. $d\left(x, x^{\prime}\right)=\left\|x-x^{\prime}\right\|^{2}$. 
Suppose we now use the following feature transformation, $\phi\left(x_{1}, x_{2}\right)=x_{1} x_{2}$, and seek to learn a nearest neighbor classifier in the transformed space. This is equivalent to using a different distance metric, $d\left(x, x^{\prime}\right)=\left\|\phi(x)-\phi\left(x^{\prime}\right)\right\|^{2}$. What is the average leave-one-out cross-validation error of a 3-NN classifier using this new distance metric? Which points would be misclassified (specified using their $\left(x_{1}, x_{2}\right)$ coordinates)?","3-NN: $1 / 13$. Misclassified points: $(-1,1)$"
MIT Fall 2021,4,e,3,Classifiers,Image,"This question asks about learning nearest neighbor (NN) classifiers. Assume that we are using Euclidean distance squared as the distance metric, i.e. $d\left(x, x^{\prime}\right)=\left\|x-x^{\prime}\right\|^{2}$. 
The plots below, labeled (I) through (IV), show the decision boundaries as predicted by a $k$-NN classifier for four different values of $k$: $1, 5, 20, 40$. Map each plot to the corresponding value of $k$.",Image filling
MIT Fall 2021,5,a.i,0.666666667,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and a 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and a 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and a 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and a 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and a 50% chance of being partially fit.
What action should Ena choose when fully fit with a horizon of 1?",play
MIT Fall 2021,5,a.ii,0.666666667,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and a 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and a 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and a 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and a 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and a 50% chance of being partially fit.
What action should Ena choose when partially fit with a horizon of 1?",play
MIT Fall 2021,5,a.iii,0.666666667,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and a 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and a 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and a 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and a 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and a 50% chance of being partially fit.
What action should Ena take when injured with a horizon of 1?",break
MIT Fall 2021,5,b,3,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and a 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and a 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and a 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and a 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and a 50% chance of being partially fit.
When the horizon is 2 and Ena is partially fit what is the expected reward for taking the best action when partially fit?","Train, 58"
MIT Fall 2021,5,c,2,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and a 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and a 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and a 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and a 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and a 50% chance of being partially fit. When the horizon is 2 and Ena is partially fit, what is the expected reward for taking the best action with a discount factor of 0.5?","Play, 25"
MIT Fall 2021,5,d.i,0.666666667,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and a 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and a 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and a 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and a 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and a 50% chance of being partially fit.
When Ena is fully fit, what is the infinite-horizon optimal policy?",play
MIT Fall 2021,5,d.ii,0.666666667,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and a 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and a 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and a 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and a 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and a 50% chance of being partially fit.
When Ena is partially fit, what is the infinite-horizon optimal policy?",train
MIT Fall 2021,5,d.iii,0.666666667,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and a 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and a 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and a 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and a 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and a 50% chance of being partially fit.
When Ena is injured, what is the infinite-horizon optimal policy?",break
MIT Fall 2021,5,e,2,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and a 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and a 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and a 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and a 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and a 50% chance of being partially fit.
Is there any policy which maximizes the expected reward in the infinite horizon under which Ena should play if injured? Explain.","No. The two actions other than break (play and train) both have a negative reward, and both keep Ena in the injured state."
MIT Fall 2021,5,f,3,MDPs,Text,"Ena can be in one of three states: fully fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and a 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and a 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and a 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and a 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and a 50% chance of being partially fit. 
Djo Ko is another athlete who plays the same sport. Djo Ko has the exact same MDP as Ena's, except Djo's team has forgotten the reward for playing when in the fully fit state. Djo's team also remembers that the horizon-2 best action to take in the partially fit state is exactly the same as that for Ena (determined in part b). Given this information, what is the range of possible values for R(fully fit, play) for Djo Ko? Assume a discount factor of 1.","R(fully fit, play) > 32 * 10/6 = 53.33."
MIT Fall 2021,6,a,3,Neural Networks,Text,"A neural network takes in an input $x = (x_1, x_2)$ and outputs $\hat{y} = a x_1 + b x_2$. The loss function is given as $L(\hat{y}, y) = \left(y-\hat{y}\right)^2$, and the true labels are $y = x_1 + x_2$. Suppose $a_0$ and $b_0$ are the initial values of the weights, and $a_k$ and $b_k$ are the weights at iteration $k$. Give equations for the updated weights $a_{k+1}$, $b_{k+1}$ in terms of the current iteration's weights $a_{k}$, $b_{k}$, the step size parameter $\eta$, and the inputs $x_1$, $x_2$.","a_{k+1} = a_k − η*dL/da = a_k − 2η*[(a_k − 1)*x_1^2 + (b_k − 1)*x_1*x_2]
b_{k+1} = b_k − η*dL/db = b_k − 2η*[(b_k − 1)*x_2^2 + (a_k − 1)*x_1*x_2]
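A minimal Python sketch of these updates, assuming (as the solution does) that the true labels are y = x_1 + x_2. The names gd_step and run are illustrative and not part of the original exam.

```python
# Sketch only: single-point gradient descent for y_hat = a*x1 + b*x2,
# assuming true labels y = x1 + x2 and squared loss L = (y - y_hat)**2.

def gd_step(a, b, x1, x2, eta):
    # One gradient-descent update of (a, b) on the data point (x1, x2).
    y = x1 + x2
    y_hat = a * x1 + b * x2
    err = y_hat - y            # equals (a - 1)*x1 + (b - 1)*x2
    grad_a = 2 * err * x1      # dL/da
    grad_b = 2 * err * x2      # dL/db
    return a - eta * grad_a, b - eta * grad_b


def run(a, b, x1, x2, eta, steps=10):
    # Return the list of (a, b) pairs visited, starting with the initial weights.
    history = [(a, b)]
    for _ in range(steps):
        a, b = gd_step(a, b, x1, x2, eta)
        history.append((a, b))
    return history
```

With x1 = x2 = 1 and a = b = 2, one call to gd_step with eta = 0.5 moves both weights to 0, and a second call returns them to 2.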
"
MIT Fall 2021,6,b,2,Neural Networks,Image,"Years ago, MIT student Itu Nes learned about neural networks and how to train them, from taking 6.036. Now Itu is an engineer at Orange Computer, a hot tech company employing machine learning to revolutionize music. Looking back at her notes, Itu realizes that she once wrote down exactly what she now needs to do in her job, but unfortunately some key details are lost. Can you help her figure things out?
Specifically, Itu wants to train this simple single-node neural network:
The network accepts two inputs $x_{1}$ and $x_{2}$, and outputs a prediction $\hat{y}$ based on weights $a$ and $b$. Itu's dataset has points $(x, y)$ where $x=\left(x_{1}, x_{2}\right)$, and $y$ are the true labels. Itu employs the squared error loss function
$$
L(\hat{y}, y)=(y-\hat{y})^{2}
$$
In her notes, Itu wrote about using gradient descent to obtain the optimal weights for the network, by minimizing this loss. Moreover, for each run of the gradient descent, she used a single data point to train the weights. Afterwards, Itu learns that the true labels are $y=x_{1}+x_{2}$. 
Suppose $a_{0}$ and $b_{0}$ are the initial values of the weights, and $a_{k}$ and $b_{k}$ are the weights at iteration $k$. Give equations for the updated weights $a_{k+1}, b_{k+1}$ in terms of the current iteration's weights $a_{k}, b_{k}$, the step size parameter $\eta$, and the inputs $x_{1}, x_{2}$.",Latex
MIT Fall 2021,6,c,2,Neural Networks,Image,"Years ago, MIT student Itu Nes learned about neural networks and how to train them, from taking 6.036. Now Itu is an engineer at Orange Computer, a hot tech company employing machine learning to revolutionize music. Looking back at her notes, Itu realizes that she once wrote down exactly what she now needs to do in her job, but unfortunately some key details are lost. Can you help her figure things out?
Specifically, Itu wants to train this simple single-node neural network:
The network accepts two inputs $x_{1}$ and $x_{2}$, and outputs a prediction $\hat{y}$ based on weights $a$ and $b$. Itu's dataset has points $(x, y)$ where $x=\left(x_{1}, x_{2}\right)$, and $y$ are the true labels. Itu employs the squared error loss function
$$
L(\hat{y}, y)=(y-\hat{y})^{2}
$$
In her notes, Itu wrote about using gradient descent to obtain the optimal weights for the network, by minimizing this loss. Moreover, for each run of the gradient descent, she used a single data point to train the weights. Afterwards, Itu learns that the true labels are $y=x_{1}+x_{2}$. 
Itu sees that when she fixed $x_{1}=1, x_{2}=1$ and ran 10 iterations of gradient descent starting with $a_{0}=2, b_{0}=2$, she recorded that the two weights oscillated back and forth, as captured in this plot pasted into her notebook:

Note that in this plot, the $a$ and $b$ points lie on top of each other. Unfortunately, Itu did not write down her code or the value of $\eta$ used to generate this plot. Help her figure out: was this plot a mistake (and explain why), or if not, what value of $\eta$ could have generated it?","This oscillation happens when $\eta=1 / 2$, because $d L / d a=d L / d b=+4$ at $a=b=2$ and $-4$ at $a=b=0$, so each step moves both weights by 2 and they alternate between 2 and 0."
MIT Fall 2021,6,d,3,Neural Networks,Image,"Years ago, MIT student Itu Nes learned about neural networks and how to train them, from taking 6.036. Now Itu is an engineer at Orange Computer, a hot tech company employing machine learning to revolutionize music. Looking back at her notes, Itu realizes that she once wrote down exactly what she now needs to do in her job, but unfortunately some key details are lost. Can you help her figure things out?
Specifically, Itu wants to train this simple single-node neural network:
The network accepts two inputs $x_{1}$ and $x_{2}$, and outputs a prediction $\hat{y}$ based on weights $a$ and $b$. Itu's dataset has points $(x, y)$ where $x=\left(x_{1}, x_{2}\right)$, and $y$ are the true labels. Itu employs the squared error loss function
$$
L(\hat{y}, y)=(y-\hat{y})^{2}
$$
In her notes, Itu wrote about using gradient descent to obtain the optimal weights for the network, by minimizing this loss. Moreover, for each run of the gradient descent, she used a single data point to train the weights. Afterwards, Itu learns that the true labels are $y=x_{1}+x_{2}$. 
Itu sees that when she fixed $x_{1}=1, x_{2}=1$ and ran 10 iterations of gradient descent starting with $a_{0}=2, b_{0}=0$, she recorded that the two weights remained unchanged, as captured in this plot pasted into her notebook:

Again: was this plot a mistake (and explain why), or if not, what value of $\eta$ could have generated it?","Any $\eta$, e.g. $\eta=5$, because $d L / d a=0$ and $d L / d b=0$ for these parameters. Alternatively $\eta=0$ will also leave $a$ and $b$ at their initial values."
MIT Fall 2021,6,e,2,Neural Networks,Image,"Years ago, MIT student Itu Nes learned about neural networks and how to train them, from taking 6.036. Now Itu is an engineer at Orange Computer, a hot tech company employing machine learning to revolutionize music. Looking back at her notes, Itu realizes that she once wrote down exactly what she now needs to do in her job, but unfortunately some key details are lost. Can you help her figure things out?
Specifically, Itu wants to train this simple single-node neural network:
The network accepts two inputs $x_{1}$ and $x_{2}$, and outputs a prediction $\hat{y}$ based on weights $a$ and $b$. Itu's dataset has points $(x, y)$ where $x=\left(x_{1}, x_{2}\right)$, and $y$ are the true labels. Itu employs the squared error loss function
$$
L(\hat{y}, y)=(y-\hat{y})^{2}
$$
In her notes, Itu wrote about using gradient descent to obtain the optimal weights for the network, by minimizing this loss. Moreover, for each run of the gradient descent, she used a single data point to train the weights. Afterwards, Itu learns that the true labels are $y=x_{1}+x_{2}$. 
Itu sees that when she fixed $x_{1}$ and $x_{2}$ and ran 10 iterations of gradient descent with $\eta=0.01$ starting with $a_{0}=b_{0}=2$, she recorded that $b$ stayed unchanged, but $a$ decayed to 1, as captured in this plot pasted into her notebook:

Again: was this plot a mistake (and explain why), or if not, what values of $x_{1}, x_{2}$ could have generated it?","This results from choosing $x_{1}=4, x_{2}=0$ (other nonzero, positive values of $x_{1}$ also work)."
MIT Fall 2021,7,a.i,1,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). Specifically, the input $X$ is a $4 \times 1$ column vector, $\hat{y}$ is a $1\times 1$ scalar. $W^2$ is a $3 \times 1$ vector. We also know that, $Z^1 = (W^1)^T X$ and $Z^2 = (W^2)^T A^1$. What are the dimensions of the matrix $W^1$?",4x3
MIT Fall 2021,7,a.ii,1,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). Specifically, the input $X$ is a $4 \times 1$ column vector, $\hat{y}$ is a $1\times 1$ scalar. $W^2$ is a $3 \times 1$ vector. We also know that, $Z^1 = (W^1)^T X$ and $Z^2 = (W^2)^T A^1$. What are the dimensions of $Z^2$?",1x1
MIT Fall 2021,7,b.i,1,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). There is only one data point which is: $X = [1, 1, 1, 1]^T$ and $y = [1]$. If $W^1$ and $W^2$ are both matrices/vectors of all ones, what is the resulting loss, where the loss is $(y - \hat{y})^2$?",121
MIT Fall 2021,7,b.ii,1,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). There is only one data point which is: $X = [1, 1, 1, 1]^T$ and $y = [1]$. If $W^1$ is a matrix of all $-1$'s (all negative ones) and $W^2$ is a vector of all $1$'s (positive ones), what is the resulting loss, where the loss is $(y - \hat{y})^2$?",1
MIT Fall 2021,7,c.i,2,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). Determine the expression for $\frac{\partial L}{\partial W^1}$. You may leave your expression in terms of $X, y, \hat{y}, W^2$ and $\frac{\partial A^1}{\partial Z^1}$.",∂L/∂W^1 = −2X(∂A^1/∂Z^1 * W^2 * (y − ŷ))^T
MIT Fall 2021,7,c.ii,2,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). Specifically, the input $X$ is a $4 \times 1$ column vector, $\hat{y}$ is a $1\times 1$ scalar. $W^2$ is a $3 \times 1$ vector. We also know that, $Z^1 = (W^1)^T X$ and $Z^2 = (W^2)^T A^1$. What are the dimensions of $\frac{\partial L}{\partial W^1}$",4x3
MIT Fall 2021,7,d.i,1,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). We now use back-propagation to update the weights during each iteration. Assume that we only have one data point (X, y) available to use, and the stepsize parameter is 0.01. Assume $X = [1, 1, 1, 1]^T, y = [1]$. Further assume that we start with $W^1$ as a matrix of $-1$'s (negative ones) while $W^2$ is a vector of $1$'s (positive ones). How many components of $W^1$ will get updated (i.e. have their value changed) after one iteration of backprop?",Zero components
MIT Fall 2021,7,d.ii,1,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). We now use back-propagation to update the weights during each iteration. Assume that we only have one data point (X, y) available to use, and the stepsize
parameter is 0.01. Assume $X = [0, 0, 0, 0]^T, y = [0]$. Further assume that we start off with $W^1$ and $W^2$ as matrices/vectors of all ones. How many components of $W^1$ will get updated (i.e. have their value changed) after one iteration of back-propagation?",Zero Components
MIT Fall 2021,7,d.iii,1,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). We now use back-propagation to update the weights during each iteration. Assume that we only have one data point (X, y) available to use, and the stepsize
parameter is 0.01. Assume $X = [1, 1, 1, 1]^T, y = [1]$. Further assume that we start off with $W^1$ and $W^2$ as matrices/vectors of all ones. How many components of $W^1$ will get updated (i.e. have their value changed) after one iteration of backprop?",All components (12)
MIT Fall 2021,7,d.iv,1,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). We now use back-propagation to update the weights during each iteration. Assume that we only have one data point (X, y) available to use, and the stepsize parameter is 0.01. Assume $X = [1, 1, 1, 1]^T, y = [1]$. Further assume that we start off with $W^1$ as a matrix of all ones and $W^2 = [0, 1, 0]^T$. How many components of $W^1$ will get updated (i.e. have their value changed) after one iteration of backprop?",4 components
MIT Fall 2021,8,a.i,0.166666667,CNNs,Text,"MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel 3 × 3 images of 2D tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one 2 × 2 filter. Let's help Rec win this competition.
What are the dimensions of the output image if a 2x2 filter is convolved with a 3x3 image for padding of 0 and stride of 1?",2x2
MIT Fall 2021,8,a.ii,0.166666667,CNNs,Text,"MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel 3 × 3 images of 2D tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one 2 × 2 filter. Let's help Rec win this competition.
What are the dimensions of the output image if a 2x2 filter is convolved with a 3x3 image for padding of 1 and stride of 1?",4x4
MIT Fall 2021,8,a.iii,0.166666667,CNNs,Text,"MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel 3 × 3 images of 2D tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one 2 × 2 filter. Let's help Rec win this competition.
What are the dimensions of the output image if a 2x2 filter is convolved with a 3x3 image for padding of 2 and stride of 1?",6x6
MIT Fall 2021,8,a.iv,0.166666667,CNNs,Text,"MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel 3 × 3 images of 2D tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one 2 × 2 filter. Let's help Rec win this competition.
What are the dimensions of the output image if a 2x2 filter is convolved with a 3x3 image for padding of 0 and stride of 2?",1x1
MIT Fall 2021,8,a.v,0.166666667,CNNs,Text,"MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel 3 × 3 images of 2D tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one 2 × 2 filter. Let's help Rec win this competition.
What are the dimensions of the output image if a 2x2 filter is convolved with a 3x3 image for padding of 1 and stride of 2?",2x2
MIT Fall 2021,8,a.vi,0.166666667,CNNs,Text,"MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel 3 × 3 images of 2D tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one 2 × 2 filter. Let's help Rec win this competition.
What are the dimensions of the output image if a 2x2 filter is convolved with a 3x3 image for padding of 2 and stride of 2?",3x3
MIT Fall 2021,8,b.i,0.5,CNNs,Text,When performing binary classification what activation function should one use in the final output layer?,sigmoid
MIT Fall 2021,8,b.ii,0.5,CNNs,Text,When performing binary classification what loss function should one use?,negative log likelihood loss
MIT Fall 2021,8,c.i,0.5,CNNs,Text,"If Rec wants to allow for more than two classes when performing classification, which activation function should they use in the final output layer?",softmax
MIT Fall 2021,8,c.ii,0.5,CNNs,Text,"If Rec wants to allow for more than two classes when performing classification, what loss function should one use?",cross entropy
MIT Fall 2021,8,d.i,0.25,CNNs,Text,w denotes the weights of the classifier network. What are the dimensions of w for binary classification?,"w = [1,1]"
MIT Fall 2021,8,d.ii,0.25,CNNs,Text,b denotes the bias of the classifier network. What are the dimensions of b for binary classification?,b = 1
MIT Fall 2021,8,d.iii,0.25,CNNs,Text,w denotes the weights of the classifier network. What are the dimensions of w for k-class classification?,"w = [1, k]"
MIT Fall 2021,8,d.iv,0.25,CNNs,Text,b denotes the bias of the classifier network. What are the dimensions of b for k-class classification?,b = k
MIT Fall 2021,8,e,2,CNNs,Image,"MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel $3 \times 3$ images of 2D tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one $2 \times 2$ filter. Let's help Rec win this competition. 
What are the spatial dimensions of the output image if a $2 \times 2$ filter is convolved with a $3 \times 3$ image for paddings of 0, 1, and 2, and strides of 1 and 2? Fill in the dimensions below:","Stride 1: $2 \times 2$, $4 \times 4$, $6 \times 6$; stride 2: $1 \times 1$, $2 \times 2$, $3 \times 3$ (for paddings 0, 1, 2 respectively)"
MIT Fall 2021,8,f,1,CNNs,Image,"MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel $3 \times 3$ images of 2D tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one $2 \times 2$ filter. Let's help Rec win this competition. 
Rec writes a bit of Python code to implement their tiny CNN classifier for images of 2D tetris pieces, following examples they have seen in 6.036. They include in the comments the dimensions of the numpy arrays, where known.

For performing binary classification, what activation function should Rec use for final_act and which loss function should Rec use?","Sigmoid + Negative Log Likelihood Loss
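A small Python sketch of this pairing for a scalar pre-activation z and a label y in {0, 1}. The helper names are illustrative assumptions; the last function uses the standard identity that the derivative of this loss with respect to z is g - y, where g = sigmoid(z).

```python
import math


def sigmoid(z):
    # Squashes a real-valued pre-activation into a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def nll_loss(g, y):
    # Negative log likelihood of label y (0 or 1) under predicted probability g.
    return -(y * math.log(g) + (1 - y) * math.log(1 - g))


def dloss_dz(z, y):
    # Derivative of nll_loss(sigmoid(z), y) with respect to z.
    return sigmoid(z) - y
```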
"
MIT Fall 2021,8,g,1,CNNs,Image,"MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel $3 \times 3$ images of 2D tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one $2 \times 2$ filter. Let's help Rec win this competition. 
If Rec wants to allow for more than two classes, which activation function should they use for final_act and which loss function?",Softmax + Cross Entropy
MIT Fall 2021,8,h,1,CNNs,Image,"MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel $3 \times 3$ images of 2D tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one $2 \times 2$ filter. Let's help Rec win this competition. 
What are the dimensions of $w$ and $b$ for i) binary classification vs. ii) $k$-class classification?","For binary classification $w$ is: $[1,1] \quad$ and $b$ is: $[1]$
For $k$-class classification $w$ is: $[1, k] \quad$ and $b$ is: $[k]$"
MIT Fall 2021,8,i,1,CNNs,Text,"Write an expression for the derivative of the binary classification loss with respect to z2, where z = conv2d(x, fcoef, padding=0, stride=1), a = ReLU(z), a_sum = a.sum(dim=-1).sum(dim=-1), z2 = w.T @ a_sum + b.
You may express your answer using g for the output of final_act and y for the example label.",g - y
MIT Fall 2021,8,j.i,0.5,CNNs,Text,"Using ∂L/∂b = (g − y), z = conv2d(x, fcoef, padding=0, stride=1), a = ReLU(z),
a_sum = a.sum(dim=-1).sum(dim=-1), z2 = w.T @ a_sum + b, write an expression for the gradient of the loss with respect to w of the output layer when the loss is the negative log likelihood of predicted output g and actual output y. You may express your answer in terms of a_sum.",dL/dw = a_sum * (g - y)
MIT Fall 2021,8,j.ii,0.5,CNNs,Text,"Using ∂L/∂b = (g − y), z = conv2d(x, fcoef, padding=0, stride=1), a = ReLU(z),
a_sum = a.sum(dim=-1).sum(dim=-1), z2 = w.T @ a_sum + b, write an expression for the gradient of the loss with respect to b of the output layer when the loss is the negative log likelihood of predicted output g and actual output y. You may express your answer in terms of a_sum.",dL/db = (g - y)
MIT Fall 2021,8,k,1,CNNs,Image,"Assume we apply a filter with weights $[[f_1, f_2],[f_3, f_4]]$ to this $3 \times 3$ image:
with stride 1 and padding 0 and perform back propagation. Which filter weights may have non-zero gradients? Why? Under what conditions will those gradients be non-zero?",$f_1$ and $f_3$ are the only weights that will receive gradients because only those weights get multiplied by non-zero features. The gradients to those weights will be non-zero if $2(f_1+f_3)>0$ because of the ReLU activation function.
MIT Spring 2022,1,a,2,Neural Networks,Text,"Consider the simplest of all neural networks, consisting of a single unit with a sigmoid activation function: $h(x;w) = \sigma(w_0 + w_1x)$ where $\sigma(z) = (1 + \exp(-z))^{-1}$. Let's start with a classifier defined by $w_0 = 0$ and $w_1 = 1$. Which range of input values x are classified as positive? Which as negative?",Positive if x > 0; negative otherwise.
MIT Spring 2022,1,b,2,Neural Networks,Text,"Consider the simplest of all neural networks, consisting of a single unit with a sigmoid activation function: $h(x;w) = \sigma(w_0 + w_1x)$ where $\sigma(z) = (1 + \exp(-z))^{-1}$. Let's start with a classifier defined by $w_0 = 0$ and $w_1 = 2$. Which range of input values x are classified as positive? Which as negative?",Positive if x > 0; negative otherwise.
MIT Spring 2022,1,c,2,Neural Networks,Text,"Consider the simplest of all neural networks, consisting of a single unit with a sigmoid activation function: $h(x;w) = \sigma(w_0 + w_1x)$ where $\sigma(z) = (1 + \exp(-z))^{-1}$. Let's start with a classifier defined by $w_0 = -1$ and $w_1 = 1$. Which range of input values x are classified as positive? Which as negative?",Positive if x > 1; negative otherwise.
MIT Spring 2022,2,a,2,Neural Networks,Image,"A recent paper gives us reason to think that, rather than regularizing the weights used in a deep network that is used for classification, it is good to regularize the $z$ values, before the activation function is applied, in the last layer.

Consider a neural network with one hidden layer and a single output unit where the activation function $f$ is a sigmoid, as shown below.
We can specify the DARC objective function $J(\theta, \lambda)$, with parameters $\theta=\left(w_{11}, w_{12}, w_{21}, w_{22}, v_{1}, v_{2}\right)$, which depends on the data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{N}$ as
$$
J(\theta, \lambda)=\sum_{i} \mathcal{L}_{n l l}\left(f\left(z^{(i)}\right), y^{(i)}\right)+\lambda \sum_{i}\left(z^{(i)}\right)^{2}
$$
where $z^{(i)}$ is the value of the output unit on example $i$ before it goes into the activation function. 
What is the partial derivative of this unusual regularization term with respect to the weight $w_{11}$, for a single $(x, y)$ training point?
$$
\frac{\partial}{\partial w_{11}} \lambda(z)^{2}
$$
Write it in terms of $x, y, z_{1}, z_{2}, z, w$ and $v$ values. You can use $f^{\prime}$ for derivative of $f$.",$2 \lambda z v_{1} x_{1} f^{\prime}\left(z_{1}\right)$
MIT Spring 2022,2,b,2,Neural Networks,Image,"A recent paper gives us reason to think that, rather than regularizing the weights used in a deep network that is used for classification, it is good to regularize the $z$ values, before the activation function is applied, in the last layer.

Consider a neural network with one hidden layer and a single output unit where the activation function $f$ is a sigmoid, as shown below.
We can specify the DARC objective function $J(\theta, \lambda)$, with parameters $\theta=\left(w_{11}, w_{12}, w_{21}, w_{22}, v_{1}, v_{2}\right)$, which depends on the data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{N}$ as
$$
J(\theta, \lambda)=\sum_{i} \mathcal{L}_{n l l}\left(f\left(z^{(i)}\right), y^{(i)}\right)+\lambda \sum_{i}\left(z^{(i)}\right)^{2}
$$
where $z^{(i)}$ is the value of the output unit on example $i$ before it goes into the activation function. 
What is the derivative with respect to $w_{11}$ of the typical regularization term, which penalizes the squares of the weights? How do these two regularizers differ?
",$2 \lambda w_{11}$. The gradient of the DARC regularizer depends on the input $x$ while the gradient of the weight penalty depends only on $w_{11}$.
MIT Spring 2022,2,c,2,Neural Networks,Image,"A recent paper gives us reason to think that, rather than regularizing the weights used in a deep network that is used for classification, it is good to regularize the $z$ values, before the activation function is applied, in the last layer.

Consider a neural network with one hidden layer and a single output unit where the activation function $f$ is a sigmoid, as shown below.
We can specify the DARC objective function $J(\theta, \lambda)$, with parameters $\theta=\left(w_{11}, w_{12}, w_{21}, w_{22}, v_{1}, v_{2}\right)$, which depends on the data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{N}$ as
$$
J(\theta, \lambda)=\sum_{i} \mathcal{L}_{n l l}\left(f\left(z^{(i)}\right), y^{(i)}\right)+\lambda \sum_{i}\left(z^{(i)}\right)^{2}
$$
where $z^{(i)}$ is the value of the output unit on example $i$ before it goes into the activation function. 
 Describe a situation in which it is possible for $w_{11}$ to be extremely large, but for $z$ to have small magnitude.",Maybe $v_{1}$ is very small.
MIT Spring 2022,2,d,2,Neural Networks,Image,"A recent paper gives us reason to think that, rather than regularizing the weights used in a deep network that is used for classification, it is good to regularize the $z$ values, before the activation function is applied, in the last layer.

Consider a neural network with one hidden layer and a single output unit where the activation function $f$ is a sigmoid, as shown below.
We can specify the DARC objective function $J(\theta, \lambda)$, with parameters $\theta=\left(w_{11}, w_{12}, w_{21}, w_{22}, v_{1}, v_{2}\right)$, which depends on the data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{N}$ as
$$
J(\theta, \lambda)=\sum_{i} \mathcal{L}_{n l l}\left(f\left(z^{(i)}\right), y^{(i)}\right)+\lambda \sum_{i}\left(z^{(i)}\right)^{2}
$$
where $z^{(i)}$ is the value of the output unit on example $i$ before it goes into the activation function. 
Would the DARC strategy of regularizing $z$ be good if we were, instead, doing regression and $f(x)=x$ ? Explain why or why not.","No, because we need the output to be able to attain its target value, which will be made impossible by penalizing the magnitude of the output."
MIT Spring 2022,3,a,2,Classifiers,Image,"Consider the following data. Clearly, the points are not linearly separable, so we will try three alternative approaches to solving the problem.
Approach 1: Nested linear classifiers
Let $w=\left[\begin{array}{l}+1 \\ -1\end{array}\right]$ and
$$
\begin{aligned}
a_{1} &=\operatorname{sign}\left(w^{T} x+4\right) \\
a_{2} &=\operatorname{sign}\left(w^{T} x-4\right) \\
h\left(x ; v_{0}, v_{1}, v_{2}\right) &=\operatorname{sign}\left(v_{1} a_{1}+v_{2} a_{2}+v_{0}\right)
\end{aligned}
$$
where
$$
\operatorname{sign}(x)= \begin{cases}+1 & \text { if } x>0 \\ -1 & \text { otherwise }\end{cases}
$$ Draw the classifiers corresponding to $a_{1}$ and $a_{2}$ on the axes above. Label them clearly, including their normal vectors.",Image filling
MIT Spring 2022,3,b.i,1.666666667,Classifiers,Text,"Consider the following data: Positive = (-4, 4), (4, -4), Negative = (-1, 1), (1, -1). Clearly, the points are not linearly separable, so we will try three alternative approaches to solving the problem. Let $w=\left[\begin{array}{l}+1 \\ -1\end{array}\right]$ and $$ \begin{aligned} a_{1} &=\operatorname{sign}\left(w^{T} x+4\right) \\ a_{2} &=\operatorname{sign}\left(w^{T} x-4\right) \\ h\left(x ; v_{0}, v_{1}, v_{2}\right) &=\operatorname{sign}\left(v_{1} a_{1}+v_{2} a_{2}+v_{0}\right) \end{aligned} $$ where $$ \operatorname{sign}(x)= \begin{cases}+1 & \text { if } x>0 \\ -1 & \text { otherwise }\end{cases} $$ Select values of the v1 so that the nested classifier correctly predicts the values in the data set.",-1
MIT Spring 2022,3,b.ii,1.666666667,Classifiers,Text,"Consider the following data: Positive = (-4, 4), (4, -4), Negative = (-1, 1), (1, -1). Clearly, the points are not linearly separable, so we will try three alternative approaches to solving the problem. Let $w=\left[\begin{array}{l}+1 \\ -1\end{array}\right]$ and $$ \begin{aligned} a_{1} &=\operatorname{sign}\left(w^{T} x+4\right) \\ a_{2} &=\operatorname{sign}\left(w^{T} x-4\right) \\ h\left(x ; v_{0}, v_{1}, v_{2}\right) &=\operatorname{sign}\left(v_{1} a_{1}+v_{2} a_{2}+v_{0}\right) \end{aligned} $$ where $$ \operatorname{sign}(x)= \begin{cases}+1 & \text { if } x>0 \\ -1 & \text { otherwise }\end{cases} $$ Select values of the v2 so that the nested classifier correctly predicts the values in the data set.",1
MIT Spring 2022,3,b.iii,1.666666667,Classifiers,Text,"Consider the following data: Positive = (-4, 4), (4, -4), Negative = (-1, 1), (1, -1). Clearly, the points are not linearly separable, so we will try three alternative approaches to solving the problem. Let $w=\left[\begin{array}{l}+1 \\ -1\end{array}\right]$ and $$ \begin{aligned} a_{1} &=\operatorname{sign}\left(w^{T} x+4\right) \\ a_{2} &=\operatorname{sign}\left(w^{T} x-4\right) \\ h\left(x ; v_{0}, v_{1}, v_{2}\right) &=\operatorname{sign}\left(v_{1} a_{1}+v_{2} a_{2}+v_{0}\right) \end{aligned} $$ where $$ \operatorname{sign}(x)= \begin{cases}+1 & \text { if } x>0 \\ -1 & \text { otherwise }\end{cases} $$ Select a value of $v_{0}$ so that the nested classifier correctly predicts the values in the data set.",0.5
MIT Spring 2022,3,c.i,2,Classifiers,Text,"Consider the following data: Positive = (-4, 4), (4, -4), Negative = (-1, 1), (1, -1). Clearly, the points are not linearly separable, so we will try three alternative approaches to solving the problem. Let $w=\left[\begin{array}{l}+1 \\ -1\end{array}\right]$ and $$ \begin{aligned} a_{1} &=\operatorname{sign}\left(w^{T} x+4\right) \\ a_{2} &=\operatorname{sign}\left(w^{T} x-4\right) \\ h\left(x ; v_{0}, v_{1}, v_{2}\right) &=\operatorname{sign}\left(v_{1} a_{1}+v_{2} a_{2}+v_{0}\right) \end{aligned} $$ where $$ \operatorname{sign}(x)= \begin{cases}+1 & \text { if } x>0 \\ -1 & \text { otherwise }\end{cases} $$ We'll define a new feature transformation $\phi$ that maps a point $x \in \mathbb{R}^{2}$ into a four-dimensional vector: $$ (K(x,(-4,4)), K(x,(-1,-1)), K(x,(1,1)), K(x,(4,-4))) $$ where $K$ is a function of two points: $K\left(x, x^{\prime}\right)=e^{-\left\|x-x^{\prime}\right\|^{2}}$. Intuitively, feature $i$ of $\phi(x)$ has value 1 if $x$ is equal to point $p_{i}$ and its value decreases as $x$ moves away from $p_{i}$. (c) We find a classifier in the transformed space with parameters $\theta=(1,-1,-1,1)$ $$ h(x ; \theta)=\operatorname{sign}\left(\theta^{T} \phi(x)\right) $$ What fraction of the training data does this classifier predict correctly?",100%
MIT Spring 2022,3,c.ii,2,Classifiers,Text,"Consider the following data: Positive = (-4, 4), (4, -4), Negative = (-1, 1), (1, -1). Clearly, the points are not linearly separable, so we will try three alternative approaches to solving the problem. Let $w=\left[\begin{array}{l}+1 \\ -1\end{array}\right]$ and $$ \begin{aligned} a_{1} &=\operatorname{sign}\left(w^{T} x+4\right) \\ a_{2} &=\operatorname{sign}\left(w^{T} x-4\right) \\ h\left(x ; v_{0}, v_{1}, v_{2}\right) &=\operatorname{sign}\left(v_{1} a_{1}+v_{2} a_{2}+v_{0}\right) \end{aligned} $$ where $$ \operatorname{sign}(x)= \begin{cases}+1 & \text { if } x>0 \\ -1 & \text { otherwise }\end{cases} $$ We'll define a new feature transformation $\phi$ that maps a point $x \in \mathbb{R}^{2}$ into a four-dimensional vector: $$ (K(x,(-4,4)), K(x,(-1,-1)), K(x,(1,1)), K(x,(4,-4))) $$ where $K$ is a function of two points: $K\left(x, x^{\prime}\right)=e^{-\left\|x-x^{\prime}\right\|^{2}}$. Intuitively, feature $i$ of $\phi(x)$ has value 1 if $x$ is equal to point $p_{i}$ and its value decreases as $x$ moves away from $p_{i}$. (c) We find a classifier in the transformed space with parameters $\theta=(1,-1,-1,1)$ $$ h(x ; \theta)=\operatorname{sign}\left(\theta^{T} \phi(x)\right) $$ What prediction does it make for point $(0,0)$?",-1
MIT Spring 2022,3,d,4,Classifiers,Image,We can classify the points correctly if $f$ (in both layers) is sigmoid. Provide the weights so this network will correctly classify the given points.,Image filling
MIT Spring 2022,4,a,1.5,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the analytical regression (ordinary least-squares) formula, we can compute
 $$
 W_{o l s}=\left(X X^{T}\right)^{-1} X Y^{T}
 $$
 Using ridge regression, we can compute
 $$
 W_{\text {ridge }}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T}
 $$
 We decide to try to use these methods to initialize a single-unit 'neural network' with a linear activation function and no offset:
 $$
 h(x ; W)=W^{T} x .
 $$
 Assume that $X X^{T}$ is invertible and not equal to the identity matrix, and that neither $W_{o l s}$ nor $W_{\text {ridge }}$ is equal to $(0,0, \ldots, 0)$. Note also that we are not using an explicit offset/bias term.
 If we initialized our unit with $W_{\text {ols }}$ and did batch gradient descent (summing the error over all the data points) with squared loss, no regularization, and a fixed small step size, which of the following would most typically happen?
A. The weights would change substantially at the beginning, but then converge back to the values we initialized with.
B. The weights would not change.
C. The weights would make small oscillations around the initial weights.
D. The weights would converge to a different value.
E. Something else would happen.",B. The weights would not change
MIT Spring 2022,4,b,1.5,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the analytical regression (ordinary least-squares) formula, we can compute
 $$
 W_{o l s}=\left(X X^{T}\right)^{-1} X Y^{T}
 $$
 Using ridge regression, we can compute
 $$
 W_{\text {ridge }}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T}
 $$
 We decide to try to use these methods to initialize a single-unit 'neural network' with a linear activation function and no offset:
 $$
 h(x ; W)=W^{T} x .
 $$
 Assume that $X X^{T}$ is invertible and not equal to the identity matrix, and that neither $W_{o l s}$ nor $W_{\text {ridge }}$ is equal to $(0,0, \ldots, 0)$. Note also that we are not using an explicit offset/bias term.
 If we initialized our unit with $W_{\text {ols }}$ and did batch gradient descent (summing the error over all the data points) with squared loss, no regularization, and a fixed small step size, explain why the weights would not change.",These weights are an optimum of the objective and the gradient will be (nearly) zero.
MIT Spring 2022,4,c.i,1,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the analytical regression (ordinary least-squares) formula, we can compute
 $$
 W_{o l s}=\left(X X^{T}\right)^{-1} X Y^{T}
 $$
 Using ridge regression, we can compute
 $$
 W_{\text {ridge }}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T}
 $$
 We decide to try to use these methods to initialize a single-unit 'neural network' with a linear activation function and no offset:
 $$
 h(x ; W)=W^{T} x .
 $$
 Assume that $X X^{T}$ is invertible and not equal to the identity matrix, and that neither $W_{o l s}$ nor $W_{\text {ridge }}$ is equal to $(0,0, \ldots, 0)$. Note also that we are not using an explicit offset/bias term. 
If we initialized our unit with $W_{o l s}$ and did stochastic gradient descent (one data point at a time) with squared loss, no regularization, and a fixed small step size, many different things could happen. Explain briefly the circumstances in which the weights would not change.","If the OLS solution had 0 error on all training examples, then SGD will not result in any changes."
MIT Spring 2022,4,c.ii,1,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the analytical regression (ordinary least-squares) formula, we can compute
 $$
 W_{o l s}=\left(X X^{T}\right)^{-1} X Y^{T}
 $$
 Using ridge regression, we can compute
 $$
 W_{\text {ridge }}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T}
 $$
 We decide to try to use these methods to initialize a single-unit 'neural network' with a linear activation function and no offset:
 $$
 h(x ; W)=W^{T} x .
 $$
 Assume that $X X^{T}$ is invertible and not equal to the identity matrix, and that neither $W_{o l s}$ nor $W_{\text {ridge }}$ is equal to $(0,0, \ldots, 0)$. Note also that we are not using an explicit offset/bias term. 
If we initialized our unit with $W_{o l s}$ and did stochastic gradient descent (one data point at a time) with squared loss, no regularization, and a fixed small step size, many different things could happen. Explain briefly the circumstances in which the weights would make small oscillations around the initial weights.","If there was error, and the gradients are not too big, then in expectation the steps should be small motions around the optimum."
MIT Spring 2022,4,c.iii,1,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the analytical regression (ordinary least-squares) formula, we can compute
 $$
 W_{o l s}=\left(X X^{T}\right)^{-1} X Y^{T}
 $$
 Using ridge regression, we can compute
 $$
 W_{\text {ridge }}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T}
 $$
 We decide to try to use these methods to initialize a single-unit 'neural network' with a linear activation function and no offset:
 $$
 h(x ; W)=W^{T} x .
 $$
 Assume that $X X^{T}$ is invertible and not equal to the identity matrix, and that neither $W_{o l s}$ nor $W_{\text {ridge }}$ is equal to $(0,0, \ldots, 0)$. Note also that we are not using an explicit offset/bias term. 
If we initialized our unit with $W_{o l s}$ and did stochastic gradient descent (one data point at a time) with squared loss, no regularization, and a fixed small step size, many different things could happen. Explain briefly the circumstances in which the weights would converge to a different value.",It's possible that it will bounce out of the current optimum and end up in another one.
MIT Spring 2022,4,d,2,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the analytical regression (ordinary least-squares) formula, we can compute
 $$
 W_{o l s}=\left(X X^{T}\right)^{-1} X Y^{T}
 $$
 Using ridge regression, we can compute
 $$
 W_{\text {ridge }}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T}
 $$
 We decide to try to use these methods to initialize a single-unit 'neural network' with a linear activation function and no offset:
 $$
 h(x ; W)=W^{T} x .
 $$
 Assume that $X X^{T}$ is invertible and not equal to the identity matrix, and that neither $W_{o l s}$ nor $W_{\text {ridge }}$ is equal to $(0,0, \ldots, 0)$. Note also that we are not using an explicit offset/bias term. 
Consider a neural-network unit initialized with $W_{\text {ridge }}$. Provide an objective function $J(W)$ that depends on the data, such that batch gradient descent to minimize $J$ will have no effect on the weights.",$J(W)=\left\|W^{T} X-Y\right\|^{2}+\lambda\|W\|^{2}$
MIT Spring 2022,4,e,2,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the analytical regression (ordinary least-squares) formula, we can compute
 $$
 W_{o l s}=\left(X X^{T}\right)^{-1} X Y^{T}
 $$
 Using ridge regression, we can compute
 $$
 W_{\text {ridge }}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T}
 $$
 We decide to try to use these methods to initialize a single-unit 'neural network' with a linear activation function and no offset:
 $$
 h(x ; W)=W^{T} x .
 $$
 Assume that $X X^{T}$ is invertible and not equal to the identity matrix, and that neither $W_{o l s}$ nor $W_{\text {ridge }}$ is equal to $(0,0, \ldots, 0)$. Note also that we are not using an explicit offset/bias term. 
Rory has solved many problems from this particular domain before and the solution has typically been close to $W^{*}=(1, \ldots, 1)^{T}$. Define an objective function $J(W)$ that we could minimize in order to obtain good estimates for Rory's next problem, even with very little data.",$J(W)=\left\|W^{T} X-Y\right\|^{2}+\lambda\|W-\mathbf{1}\|^{2}$
MIT Spring 2022,4,f,2,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the analytical regression (ordinary least-squares) formula, we can compute
 $$
 W_{o l s}=\left(X X^{T}\right)^{-1} X Y^{T}
 $$
 Using ridge regression, we can compute
 $$
 W_{\text {ridge }}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T}
 $$
 We decide to try to use these methods to initialize a single-unit 'neural network' with a linear activation function and no offset:
 $$
 h(x ; W)=W^{T} x .
 $$
 Assume that $X X^{T}$ is invertible and not equal to the identity matrix, and that neither $W_{o l s}$ nor $W_{\text {ridge }}$ is equal to $(0,0, \ldots, 0)$. Note also that we are not using an explicit offset/bias term. 
Ryo thinks they can get a better hypothesis by using knowledge about neural networks, and considers the hypothesis class $$ g=w_{a} \sigma\left(w_{b} x+w_{c}\right)+w_{d} $$ Assume that the inputs $x$ are 1-dimensional and recall $\sigma(z)=1 /\left(1+e^{-z}\right)$. Provide a data set with 3 points for which Ryo's hypothesis class can reach a lower MSE than the original OLS solution or argue that one does not exist. ","(0, 0), (1, 1), (2, 1)"
MIT Spring 2022,4,g,2,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the analytical regression (ordinary least-squares) formula, we can compute
 $$
 W_{o l s}=\left(X X^{T}\right)^{-1} X Y^{T}
 $$
 Using ridge regression, we can compute
 $$
 W_{\text {ridge }}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T}
 $$
 We decide to try to use these methods to initialize a single-unit 'neural network' with a linear activation function and no offset:
 $$
 h(x ; W)=W^{T} x .
 $$
 Assume that $X X^{T}$ is invertible and not equal to the identity matrix, and that neither $W_{o l s}$ nor $W_{\text {ridge }}$ is equal to $(0,0, \ldots, 0)$. Note also that we are not using an explicit offset/bias term. 
Provide a data set with 3 points for which the original OLS hypothesis class can reach a substantially lower MSE than Ryo's hypothesis class or argue that one does not exist.",Does not exist: You can stretch out the sigmoid so that the linear part of it is pretty linear and goes wherever you want it to.
MIT Spring 2022,4,h,2,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the analytical regression (ordinary least-squares) formula, we can compute
 $$
 W_{o l s}=\left(X X^{T}\right)^{-1} X Y^{T}
 $$
 Using ridge regression, we can compute
 $$
 W_{\text {ridge }}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T}
 $$
 We decide to try to use these methods to initialize a single-unit 'neural network' with a linear activation function and no offset:
 $$
 h(x ; W)=W^{T} x .
 $$
 Assume that $X X^{T}$ is invertible and not equal to the identity matrix, and that neither $W_{o l s}$ nor $W_{\text {ridge }}$ is equal to $(0,0, \ldots, 0)$. Note also that we are not using an explicit offset/bias term. Ryu thinks getting initial parameters from Ryo's hypothesis might be a good way to initialize a two-layer neural network. Consider the case where we have a simple neural network with - Two units in the hidden layer - Sigmoid activation function in the hidden layer - Linear activation function on the output unit So, the hypothesis is: $$ g=w_{a} \sigma\left(w_{b} x+w_{c}\right)+w_{e} \sigma\left(w_{f} x+w_{g}\right)+w_{d} $$ If Ryu first trained Ryo's hypothesis to reach a local optimum using gradient descent to obtain $w_{a}, w_{b}, w_{c}, w_{d}$, then initialized the more complex network above using $w_{e}=w_{a}$, $w_{f}=w_{b}$, and $w_{g}=w_{c}$, with $w_{d}$ as before, and did batch gradient descent with squared loss and a fixed small step size, what would most typically happen?",The weights would converge to a different value with the same loss
MIT Spring 2022,4,i,1,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the analytical regression (ordinary least-squares) formula, we can compute
 $$
 W_{o l s}=\left(X X^{T}\right)^{-1} X Y^{T}
 $$
 Using ridge regression, we can compute
 $$
 W_{\text {ridge }}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T}
 $$
 We decide to try to use these methods to initialize a single-unit 'neural network' with a linear activation function and no offset:
 $$
 h(x ; W)=W^{T} x .
 $$
 Assume that $X X^{T}$ is invertible and not equal to the identity matrix, and that neither $W_{o l s}$ nor $W_{\text {ridge }}$ is equal to $(0,0, \ldots, 0)$. Note also that we are not using an explicit offset/bias term. Ryu thinks getting initial parameters from Ryo's hypothesis might be a good way to initialize a two-layer neural network. Consider the case where we have a simple neural network with - Two units in the hidden layer - Sigmoid activation function in the hidden layer - Linear activation function on the output unit So, the hypothesis is: $$ g=w_{a} \sigma\left(w_{b} x+w_{c}\right)+w_{e} \sigma\left(w_{f} x+w_{g}\right)+w_{d} $$ If Ryu first trained Ryo's hypothesis to reach a local optimum using gradient descent to obtain $w_{a}, w_{b}, w_{c}, w_{d}$, then initialized the more complex network above using $w_{e}=w_{a}$, $w_{f}=w_{b}$, and $w_{g}=w_{c}$, with $w_{d}$ as before, and did batch gradient descent with squared loss and a fixed small step size, explain why the weights would converge to a different value.","Because the two units are initialized exactly the same, the gradients for
both of them will be the same. So, it is as if we had a single linear unit, ran it through
a sigmoid, and then added an offset."
MIT Spring 2022,4,j,1,Regression,Text,"We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the analytical regression (ordinary least-squares) formula, we can compute
 $$
 W_{o l s}=\left(X X^{T}\right)^{-1} X Y^{T}
 $$
 Using ridge regression, we can compute
 $$
 W_{\text {ridge }}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T}
 $$
 We decide to try to use these methods to initialize a single-unit 'neural network' with a linear activation function and no offset:
 $$
 h(x ; W)=W^{T} x .
 $$
 Assume that $X X^{T}$ is invertible and not equal to the identity matrix, and that neither $W_{o l s}$ nor $W_{\text {ridge }}$ is equal to $(0,0, \ldots, 0)$. Note also that we are not using an explicit offset/bias term. Ryu thinks getting initial parameters from Ryo's hypothesis might be a good way to initialize a two-layer neural network. Consider the case where we have a simple neural network with - Two units in the hidden layer - Sigmoid activation function in the hidden layer - Linear activation function on the output unit So, the hypothesis is: $$ g=w_{a} \sigma\left(w_{b} x+w_{c}\right)+w_{e} \sigma\left(w_{f} x+w_{g}\right)+w_{d} $$ If Ryu first trained Ryo's hypothesis to reach a local optimum using gradient descent to obtain $w_{a}, w_{b}, w_{c}, w_{d}$, then initialized the more complex network above, setting $w_{e}$, $w_{f}$, and $w_{g}$ randomly, would we expect a lower loss?","Yes, with more freedom now we would expect a lower loss"
MIT Spring 2022,5,a.i,1,Features,Text,"You are hired by an event-planning company to build some machine-learning predictors. One of them, for predicting the number of attendees of a concert, takes as input x, a vector of aspects of the event (day of the week, start time, city size, music genre (classical, folk, rock, pop, rap), number of attendees at last concert, if known). What would be a good encoding strategy for music genre? Choose between numeric, one hot, discretized numeric (meaning discretized into bins), other.",One Hot
MIT Spring 2022,5,a.ii,1,Features,Text,"You are hired by an event-planning company to build some machine-learning predictors. One of them, for predicting the number of attendees of a concert, takes as input x, a vector of aspects of the event (day of the week, start time, city size, music genre (classical, folk, rock, pop, rap), number of attendees at last concert, if known). What would be a good encoding strategy for number of attendees at last concert? Choose between numeric, one hot, discretized numeric (meaning discretized into bins), other.",numeric
MIT Spring 2022,5,a.iii,1,Features,Text,"You are hired by an event-planning company to build some machine-learning predictors. One of them, for predicting the number of attendees of a concert, takes as input x, a vector of aspects of the event (day of the week, start time, city size, music genre (classical, folk, rock, pop, rap), number of attendees at last concert, if known). What would be a good encoding strategy for start time? Choose between numeric, one hot, discretized numeric (meaning discretized into bins), other.",discretized numeric (or numeric)
MIT Spring 2022,5,b,2,Features,Text,"You are hired by an event-planning company to build some machine-learning predictors. One of them, for predicting the number of attendees of a concert, takes as input x, a vector of aspects of the event (day of the week, start time, city size, music genre (classical, folk, rock, pop, rap), number of attendees at last concert, if known). If you didn't know anything more about this problem, what would be a reasonable loss function to use?",Squared Loss
MIT Spring 2022,5,c,3,Features,Text,"You are hired by an event-planning company to build some machine-learning predictors. One of them, for predicting the number of attendees of a concert, takes as input x, a vector of aspects of the event (day of the week, start time, city size, music genre (classical, folk, rock, pop, rap), number of attendees at last concert, if known). Cody decides to use a loss function of the following form
 $$
 \mathcal{L}_{\text {cody }}(g, a)= \begin{cases}\lambda(g-a)^{2} & \text { if } g>a \\ (g-a)^{2} & \text { otherwise }\end{cases}
 $$
 where $g$ is the guessed value and $a$ is the actual value, and $\lambda$ is an adjustable parameter. Jody decides to use a loss function of the following form
 $$
 \mathcal{L}_{\text {jody }}(g, a)= \begin{cases}\lambda_1(g-a)^{2} & \text { if } g>a \\ \lambda_2(g-a)^{2} & \text { otherwise }\end{cases}
 $$
 where $g$ is the guessed value and $a$ is the actual value, and $\lambda_1$ and $\lambda_2$ are adjustable parameters. Is Jody's loss function actually able to capture a larger class of losses?",No
MIT Spring 2022,5,d,3,Features,Text,"Cody decides to use a loss function of the following form
 $$
 \mathcal{L}_{\text {cody }}(g, a)= \begin{cases}\lambda(g-a)^{2} & \text { if } g>a \\ (g-a)^{2} & \text { otherwise }\end{cases}
 $$
 where $g$ is the guessed value and $a$ is the actual value, and $\lambda$ is an adjustable parameter. We talk to Bodie, who really knows the concert business and says a better model for the total loss is:
 $$
 \mathcal{L}_{\text {bodie }}(g, a)=\alpha g+\beta \max (0, a-g) .
 $$
 This loss has two terms. The first part of the loss comes from the cost of renting a venue: if we guess that there will be $g$ attendees, we have to rent a place with $g$ seats, and we assume that such a rental costs $\alpha$ per seat. The second part of the loss comes from the loss of potential ticket sales: if $a$ people really wanted to attend but we can only seat $g$, then we lose $\beta$ for each of the $a-g$ people we have to turn away. Note that it is safe to assume that $\alpha<\beta$ (otherwise, we should not bother holding the concert!). We would like to train a linear regression model to minimize this loss. Let $g=\theta^{T} x$ be the prediction given input $x$ and parameters $\theta$, and let $y$ be the target training value for that $x$. Provide an expression for $\partial \mathcal{L}_{\text {bodie }}(g, y) / \partial \theta$.
 Note that this loss is not everywhere differentiable, which we have seen before with ReLU units. Don't worry about what the value should be at that one point.","$$
\begin{gathered}
\frac{\partial \mathcal{L}_{\text {bodie }}(g, y)}{\partial \theta}=\frac{\partial \mathcal{L}_{\text {bodie }}(g, y)}{\partial g} \frac{\partial g}{\partial \theta} \\
=\left(\alpha+\beta \begin{cases}0 & \text { if } y<\theta^{T} x \\ -1 & \text { otherwise }\end{cases}\right) x
\end{gathered}
$$"
MIT Spring 2022,5,e.i,1.5,Features,Text,"You are hired by an event-planning company to build some machine-learning predictors. One of them, for predicting the number of attendees of a concert, takes as input x, a vector of aspects of the event (day of the week, start time, city size, music genre (classical, folk, rock, pop, rap), number of attendees at last concert, if known). We talk to Bodie, who really knows the concert business and says a better model for the total loss is:
 $$
 \mathcal{L}_{\text {bodie }}(g, a)=\alpha g+\beta \max (0, a-g) .
 $$
 This loss has two terms. The first part of the loss comes from the cost of renting a venue: if we guess that there will be $g$ attendees, we have to rent a place with $g$ seats, and we assume that such a rental costs $\alpha$ per seat. The second part of the loss comes from the loss of potential ticket sales: if $a$ people really wanted to attend but we can only seat $g$, then we lose $\beta$ for each of the $a-g$ people we have to turn away. Note that it is safe to assume that $\alpha<\beta$ (otherwise, we should not bother holding the concert!). But Bodie isn't really sure how to set the $\alpha$ and $\beta$ parameters, so we still have a problem! However, they are able to find a set of data of the form $(s, p, l)$ where $s$ describes the number of seats in the venue rented, $p$ describes the actual number of people who attempted to attend (including the number of people who were turned away) and $l$ describes the actual loss value. (e) Bodie wants to use this data to estimate $\alpha$ and $\beta$ in $\mathcal{L}_{\text {bodie }}$ by finding values of these parameters that predict the loss the most accurately in the mean-squared-error sense. Describe how to use the $(s, p, l)$ data to formulate a linear regression problem that will recover good estimates of $\alpha$ and $\beta$. What are the inputs, $x$?","Vectors of $(s, \max (0, p-s))$"
MIT Spring 2022,5,e.ii,1.5,Features,Text,"You are hired by an event-planning company to build some machine-learning predictors. One of them, for predicting the number of attendees of a concert, takes as input x, a vector of aspects of the event (day of the week, start time, city size, music genre (classical, folk, rock, pop, rap), number of attendees at last concert, if known). We talk to Bodie, who really knows the concert business and says a better model for the total loss is:
 $$
 \mathcal{L}_{\text {bodie }}(g, a)=\alpha g+\beta \max (0, a-g) .
 $$
 This loss has two terms. The first part of the loss comes from the cost of renting a venue: if we guess that there will be $g$ attendees, we have to rent a place with $g$ seats, and we assume that such a rental costs $\alpha$ per seat. The second part of the loss comes from the loss of potential ticket sales: if $a$ people really wanted to attend but we can only seat $g$, then we lose $\beta$ for each of the $a-g$ people we have to turn away. Note that it is safe to assume that $\alpha<\beta$ (otherwise, we should not bother holding the concert!). But Bodie isn't really sure how to set the $\alpha$ and $\beta$ parameters, so we still have a problem! However, they are able to find a set of data of the form $(s, p, l)$ where $s$ describes the number of seats in the venue rented, $p$ describes the actual number of people who attempted to attend (including the number of people who were turned away) and $l$ describes the actual loss value. (e) Bodie wants to use this data to estimate $\alpha$ and $\beta$ in $\mathcal{L}_{\text {bodie }}$ by finding values of these parameters that predict the loss the most accurately in the mean-squared-error sense. Describe how to use the $(s, p, l)$ data to formulate a linear regression problem that will recover good estimates of $\alpha$ and $\beta$. What are the target outputs, $y$?",l
MIT Spring 2022,5,f,4,Features,Image,"Given a true loss function $\mathcal{L}_{\text {true }}$, which is not differentiable, how could you use it to find a good value of $\lambda$ so that you can use $\mathcal{L}_{\text {cody }}$ with that $\lambda$ to construct a good predictive hypothesis? Assume you have a dataset $\mathcal{D}$ and that you are given a set lambdas of plausible values for $\lambda$. Let's write out a strategy in very abstract pseudo-code, using the following basic procedures:
- train(data, lossfn) : trains a regression model to minimize lossfn on data, returns parameters theta
- subpart(data, j, K) : divides data into $K$ equal parts and returns the jth subpart
- allbutsubpart(data, j, K) : divides data into $K$ equal parts and returns all except the jth subpart
- eval(theta, data, lossfn) : returns average loss of hypothesis with weights theta on data according to lossfn
- L_true : the true loss function that maps a guess and an actual value into a cost
- L_cody(lambda) : returns $\mathcal{L}_{\text {cody }}$ for this value of lambda, which is itself a loss function that maps a guess and an actual value into a cost

Fill in the blanks in the code below, for a process in which we perform 10-fold cross-validation to find the best lambda value.
best_lambda = None; best_loss = None
for lambda in lambdas do
    for k in range(____) do
        hypoth = train(allbutsubpart(data, k, 10), L_cody(lambda))
        loss = eval(hypoth, subpart(data, k, 10), ____)
    if best_lambda is None or loss < best_loss then
        best_lambda = lambda
        best_loss = loss
return best_lambda
",Image filling
MIT Spring 2022,6,a,2,MDPs,Image,"Consider the MDP shown above. It has states $S_{0}, \ldots, S_{6}$ and actions $A, B$. Each arrow is labeled with one or more actions, and a probability value: this means that if any of those actions is chosen from the state at the start of the arrow, then it will make a transition to the state at the end of the arrow with the associated probability.

Rewards are associated with states, and independent, in this example, from the action that is taken in that state. Remember that with horizon $H=1$, the agent can collect the reward associated with the state it is in, and then terminates. 
Consider the policy $\pi$ that takes action $B$ in $S_{0}$ and action $A$ in $S_{2}$. If the system starts in $S_{0}$ or $S_{2}$, then under that policy, only those two states ($S_{0}$ and $S_{2}$) are reachable. Recall that, for a fixed policy $\pi$,
$$
V_{\pi}(s)=R(s, \pi(s))+\gamma \sum_{s^{\prime}} P\left(S_{t+1}=s^{\prime} \mid S_{t}=s, A_{t}=\pi(s)\right) V_{\pi}\left(s^{\prime}\right)
$$
Assuming the discount factor $\gamma=0.8$, what are the infinite-horizon values $V_{\pi}\left(S_{0}\right)$ and $V_{\pi}\left(S_{2}\right)$ ? It is sufficient to write out a small system of linear equations involving just those two variables; you do not have to take the time to solve them numerically.","$$
\begin{aligned}
&V_{\pi}\left(S_{0}\right)=0+0.8 \cdot V_{\pi}\left(S_{2}\right) \\
&V_{\pi}\left(S_{2}\right)=1+0.8 \cdot\left(0.9 V_{\pi}\left(S_{2}\right)+0.1 V_{\pi}\left(S_{0}\right)\right)
\end{aligned}
$$"
MIT Spring 2022,6,b,1.5,MDPs,Image,"Consider the MDP shown above. It has states $S_{0}, \ldots, S_{6}$ and actions $A, B$. Each arrow is labeled with one or more actions, and a probability value: this means that if any of those actions is chosen from the state at the start of the arrow, then it will make a transition to the state at the end of the arrow with the associated probability.

Rewards are associated with states, and independent, in this example, from the action that is taken in that state. Remember that with horizon $H=1$, the agent can collect the reward associated with the state it is in, and then terminates. 
What is the optimal value $V_{h=1}(s)=\max _{a} Q_{h=1}(s, a)$ for each state for horizon $H=1$ with no discounting?","i. $V_{h=1}\left(S_{0}\right)$ [1]
ii. $V_{h=1}\left(S_{1}\right)$ 0
iii. $V_{h=1}\left(S_{2}\right)$"
MIT Spring 2022,6,c,1.5,MDPs,Image,"Consider the MDP shown above. It has states $S_{0}, \ldots, S_{6}$ and actions $A, B$. Each arrow is labeled with one or more actions, and a probability value: this means that if any of those actions is chosen from the state at the start of the arrow, then it will make a transition to the state at the end of the arrow with the associated probability.

Rewards are associated with states, and independent, in this example, from the action that is taken in that state. Remember that with horizon $H=1$, the agent can collect the reward associated with the state it is in, and then terminates. 
What is the optimal action and value $V_{h=2}(s)$ for each state for horizon $H=2$ with no discounting? (If the actions are tied in value, list both).","i. $S_{0}$: A: B, $V_{h=2}$:
ii. $S_{1}$: A: B, $V_{h=2}$: 5
iii. $S_{2}$: A: A, $V_{h=2}$: $1+.9$"
MIT Spring 2022,6,d,3,MDPs,Image,"Consider the MDP shown above. It has states $S_{0}, \ldots, S_{6}$ and actions $A, B$. Each arrow is labeled with one or more actions, and a probability value: this means that if any of those actions is chosen from the state at the start of the arrow, then it will make a transition to the state at the end of the arrow with the associated probability.

Rewards are associated with states, and independent, in this example, from the action that is taken in that state. Remember that with horizon $H=1$, the agent can collect the reward associated with the state it is in, and then terminates. 
What is the optimal action and value $V(s)$ for each state for horizon $H=3$ with no discounting? (If the actions are tied in value, list both).","i. $S_{0}$: A: A, $V_{h=3}$: 5
ii. $S_{1}$: A: B, $V_{h=3}$: 5
iii. $S_{2}$: A: A, $V_{h=3}$: $1+.9 \cdot 1.9+.1 \cdot 1$"
MIT Spring 2022,6,e,2,MDPs,Image,"Consider the MDP shown above. It has states $S_{0}, \ldots, S_{6}$ and actions $A, B$. Each arrow is labeled with one or more actions, and a probability value: this means that if any of those actions is chosen from the state at the start of the arrow, then it will make a transition to the state at the end of the arrow with the associated probability.

Rewards are associated with states, and independent, in this example, from the action that is taken in that state. Remember that with horizon $H=1$, the agent can collect the reward associated with the state it is in, and then terminates. 
If we increase the horizon beyond 3, will the optimal action in state $S_{0}$ ever change? Explain.","Yes. With a longer horizon, it's worth taking action $\mathrm{B}$ in $S_{0}$ and going around and around that loop."
MIT Spring 2022,7,a.i,0.4,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
Provide the q-learning value for Q(A, Move).",0
MIT Spring 2022,7,a.ii,0.4,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
B, Move, 0 
Provide the q-learning value for Q(B, Move).",0
MIT Spring 2022,7,a.iii,0.4,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
B, Move, 0 
C, Move, 1 
With Q-values all starting at 0, we run the Q-learning algorithm on that state-action sequence. Provide the q-learning value for Q(C, Move)",1
MIT Spring 2022,7,a.iv,0.4,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
B, Move, 0 
C, Move, 1 
A, Move, 0 
With Q-values all starting at 0, we run the Q-learning algorithm on that state-action sequence. Provide the q-learning value for Q(A, move).",0
MIT Spring 2022,7,a.v,0.4,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
B, Move, 0 
C, Move, 1 
A, Move, 0 
B, Move, 0. 
With Q-values all starting at 0, we run the Q-learning algorithm on that state-action sequence. Provide the q-learning value for Q(B, Move).",0.9
MIT Spring 2022,7,b,2,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
B, Move, 0 
C, Move, 1 
A, Move, 0 
B, Move, 0. With Q-values all starting at 0, we run the Q-learning algorithm on that state-action sequence. Characterize the weakness of Q-learning demonstrated by this example, which would be worse if there were a long sequence of states $B_{1}, \ldots, B_{100}$ between A and C. Very briefly describe a strategy for overcoming this weakness. ",It doesn't propagate the value all the way back along the chain. Do the updates backward along the trajectory; or save your experience and replay it.
MIT Spring 2022,7,c.i,1,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
B, Move, 0 
C, Move, 1 
A, Move, 0 
B, Move, 0. 
A, Move, 0
B, Move, 0
With Q-values all starting at 0, we run the Q-learning algorithm on that state-action sequence. Provide the q-learning value for Q(A, move).","Q(A, move) = .81"
MIT Spring 2022,7,c.ii,1,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
B, Move, 0 
C, Move, 1 
A, Move, 0 
B, Move, 0. 
A, Move, 0
B, Move, 0
With Q-values all starting at 0, we run the Q-learning algorithm on that state-action sequence. Provide the q-learning value for Q(B, move).","Q(B, move) = 0"
MIT Spring 2022,7,d,2,Reinforcement Learning,Text,"Consider an MDP with four states, called $A, B, C$, and $D$, and with two actions called Move and Stay. The discount factor $\gamma=0.9$. Here is a reminder of the Q-learning update formula, based on experience tuple $\left(s, a, r, s^{\prime}\right)$ : $$ Q(s, a):=(1-\alpha) Q(s, a)+\alpha\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\right) $$ Let $\alpha=1$. Assume we see the following state-action-reward sequence: 
A, Move, 0 
B, Move, 0 
C, Move, 1 
A, Move, 0 
B, Move, 0. 
A, Move, 0
B, Move, 0
With Q-values all starting at 0, we run the Q-learning algorithm on that state-action sequence. What problem with our algorithm is revealed by this example? Very briefly explain a small change to the method or parameters we are using that will solve this problem.",Use a smaller learning rate
MIT Spring 2022,8,a,1,CNNs,Text,"Network A is a neural network with a single convolution of size 3. It has a single max pooling layer with size d so the output of the network is sigmoid(max(z1, ..., zd)). Network B is a neural network with a single convolution of size 3. It has a single min pooling layer with size d so the output of the network is sigmoid(min(z1, ..., zd)). For which network does a high output value correspond, qualitatively, to “every location in x corresponds to an instance of the desired pattern”? Choose A, B, or none.",B
MIT Spring 2022,8,b,1,CNNs,Text,"Network A is a neural network with a single convolution of size 3. It has a single max pooling layer with size d so the output of the network is sigmoid(max(z1, ..., zd)). Network B is a neural network with a single convolution of size 3. It has a single min pooling layer with size d so the output of the network is sigmoid(min(z1, ..., zd)). For which network does a high output value correspond, qualitatively, to “at least half of the locations in x correspond to an instance of the desired pattern”? Choose A, B, or none.",None
MIT Spring 2022,8,c,1,CNNs,Text,"Network A is a neural network with a single convolution of size 3. It has a single max pooling layer with size d so the output of the network is sigmoid(max(z1, ..., zd)). Network B is a neural network with a single convolution of size 3. It has a single min pooling layer with size d so the output of the network is sigmoid(min(z1, ..., zd)). For which network does a high output value correspond, qualitatively, to “there is at least one instance of the desired pattern in this image”? Choose A, B, or none.",A
MIT Spring 2022,8,d,3,CNNs,Text,"Network A is a neural network with a single convolution of size 3. It has a single max pooling layer with size d so the output of the network is sigmoid(max(z1, ..., zd)). Network B is a neural network with a single convolution of size 3. It has a single min pooling layer with size d so the output of the network is sigmoid(min(z1, ..., zd)). What is $\partial g / \partial z_{i}$ for network A? Feel free to make use of the fact that $\partial \sigma(z) / \partial z=\sigma(z)(1-\sigma(z))$.","$\sigma\left(z_{i}\right)\left(1-\sigma\left(z_{i}\right)\right)$ if $z_{i}=\max \left(z_{1}, \ldots, z_{d}\right)$, and 0 otherwise."
MIT Spring 2022,8,e.i,1,CNNs,Text,"We are going to consider two different simple convolutional networks over one-dimensional (vector) inputs. Each network has a single convolutional layer with a single filter of size 3 and stride 1. Let $\left(z_{1}, \ldots, z_{d}\right)$ be the output of this convolutional layer, i.e., $\left(z_{1}, \ldots, z_{d}\right)$ represents the feature map constructed from the input vector $\left(x_{1}, \ldots, x_{d}\right)$. For simplicity, you can think of $z_{j}$ just as a linear map $z_{j}=\left[x_{j-1} ; x_{j} ; x_{j+1}\right]^{T} w$ where $w$ are the filter parameters. Our two networks differ in terms of how the feature map values are pooled to a single output value.

Network A has a single max-pooling layer with input size $d$, so that the output of the network $\hat{y}=\sigma\left(\max \left(z_{1}, \ldots, z_{d}\right)\right)$ where $\sigma(\cdot)$ is the sigmoid function.
Network B has a single min-pooling layer with input size $d$, so that the output of the network
$$
\hat{y}=\sigma\left(\min \left(z_{1}, \ldots, z_{d}\right)\right)
$$
When the filter's output value is high it represents a positive detection of some pattern of interest. Now, suppose we are just given a single training pair $(x, y)$ where the target $y$ is binary $0 / 1$. The loss that we are minimizing is again just
$$
\mathrm{NLL}(y, \hat{y})=-y \log \hat{y}-(1-y) \log (1-\hat{y})
$$
which is minimized when $\hat{y}$ matches the target $y$. We are interested in understanding qualitatively how the filter parameters $w$ get updated in the two networks if we use simple gradient descent to minimize $\operatorname{NLL}(y, \hat{y})$. Specify whether the behavior would occur in:
- Which network (A, B, or it doesn't matter)
- Target $y$ (1, 0, or it doesn't matter). 
The behavior is: 
After each step of gradient descent, the filter weights $w$ change so that their dot product with the values of one particular sub-region $\left[x_{j-1} ; x_{j} ; x_{j+1}\right]$ of the image increases.","A, 1"
MIT Spring 2022,8,e.ii,1,CNNs,Text,"We are going to consider two different simple convolutional networks over one-dimensional (vector) inputs. Each network has a single convolutional layer with a single filter of size 3 and stride 1. Let $\left(z_{1}, \ldots, z_{d}\right)$ be the output of this convolutional layer, i.e., $\left(z_{1}, \ldots, z_{d}\right)$ represents the feature map constructed from the input vector $\left(x_{1}, \ldots, x_{d}\right)$. For simplicity, you can think of $z_{j}$ just as a linear map $z_{j}=\left[x_{j-1} ; x_{j} ; x_{j+1}\right]^{T} w$ where $w$ are the filter parameters. Our two networks differ in terms of how the feature map values are pooled to a single output value.

Network A has a single max-pooling layer with input size $d$, so that the output of the network $\hat{y}=\sigma\left(\max \left(z_{1}, \ldots, z_{d}\right)\right)$ where $\sigma(\cdot)$ is the sigmoid function.
Network B has a single min-pooling layer with input size $d$, so that the output of the network
$$
\hat{y}=\sigma\left(\min \left(z_{1}, \ldots, z_{d}\right)\right)
$$
When the filter's output value is high it represents a positive detection of some pattern of interest. Now, suppose we are just given a single training pair $(x, y)$ where the target $y$ is binary $0 / 1$. The loss that we are minimizing is again just
$$
\mathrm{NLL}(y, \hat{y})=-y \log \hat{y}-(1-y) \log (1-\hat{y})
$$
which is minimized when $\hat{y}$ matches the target $y$. We are interested in understanding qualitatively how the filter parameters $w$ get updated in the two networks if we use simple gradient descent to minimize $\operatorname{NLL}(y, \hat{y})$. Specify whether the behavior would occur in:
- Which network (A, B, or it doesn't matter)
- Target $y$ (1, 0, or it doesn't matter). 
The behavior is: 
After each step of gradient descent, the filter weights $w$ change so that their dot product with the values of one particular sub-region $\left[x_{j-1} ; x_{j} ; x_{j+1}\right]$ of the image decreases.","B, 0"
MIT Spring 2022,8,e.iii,1,CNNs,Text,"We are going to consider two different simple convolutional networks over one-dimensional (vector) inputs. Each network has a single convolutional layer with a single filter of size 3 and stride 1. Let $\left(z_{1}, \ldots, z_{d}\right)$ be the output of this convolutional layer, i.e., $\left(z_{1}, \ldots, z_{d}\right)$ represents the feature map constructed from the input vector $\left(x_{1}, \ldots, x_{d}\right)$. For simplicity, you can think of $z_{j}$ just as a linear map $z_{j}=\left[x_{j-1} ; x_{j} ; x_{j+1}\right]^{T} w$ where $w$ are the filter parameters. Our two networks differ in terms of how the feature map values are pooled to a single output value.

Network A has a single max-pooling layer with input size $d$, so that the output of the network $\hat{y}=\sigma\left(\max \left(z_{1}, \ldots, z_{d}\right)\right)$ where $\sigma(\cdot)$ is the sigmoid function.
Network B has a single min-pooling layer with input size $d$, so that the output of the network
$$
\hat{y}=\sigma\left(\min \left(z_{1}, \ldots, z_{d}\right)\right)
$$
When the filter's output value is high it represents a positive detection of some pattern of interest. Now, suppose we are just given a single training pair $(x, y)$ where the target $y$ is binary $0 / 1$. The loss that we are minimizing is again just
$$
\mathrm{NLL}(y, \hat{y})=-y \log \hat{y}-(1-y) \log (1-\hat{y})
$$
which is minimized when $\hat{y}$ matches the target $y$. We are interested in understanding qualitatively how the filter parameters $w$ get updated in the two networks if we use simple gradient descent to minimize $\operatorname{NLL}(y, \hat{y})$. Specify whether the behavior would occur in:
- Which network (A, B, or it doesn't matter)
- Target $y$ (1, 0, or it doesn't matter). 
The behavior is:
After each step of gradient descent, the filter weights $w$ change so that their dot product with the values of some image sub-region increases, but the specific region may change from one step to another.","B, 1"
MIT Spring 2022,8,e.iv,1,CNNs,Text,"We are going to consider two different simple convolutional networks over one-dimensional (vector) inputs. Each network has a single convolutional layer with a single filter of size 3 and stride 1. Let $\left(z_{1}, \ldots, z_{d}\right)$ be the output of this convolutional layer, i.e., $\left(z_{1}, \ldots, z_{d}\right)$ represents the feature map constructed from the input vector $\left(x_{1}, \ldots, x_{d}\right)$. For simplicity, you can think of $z_{j}$ just as a linear map $z_{j}=\left[x_{j-1} ; x_{j} ; x_{j+1}\right]^{T} w$ where $w$ are the filter parameters. Our two networks differ in terms of how the feature map values are pooled to a single output value.

Network A has a single max-pooling layer with input size $d$, so that the output of the network $\hat{y}=\sigma\left(\max \left(z_{1}, \ldots, z_{d}\right)\right)$ where $\sigma(\cdot)$ is the sigmoid function.
Network B has a single min-pooling layer with input size $d$, so that the output of the network
$$
\hat{y}=\sigma\left(\min \left(z_{1}, \ldots, z_{d}\right)\right)
$$
When the filter's output value is high it represents a positive detection of some pattern of interest. Now, suppose we are just given a single training pair $(x, y)$ where the target $y$ is binary $0 / 1$. The loss that we are minimizing is again just
$$
\mathrm{NLL}(y, \hat{y})=-y \log \hat{y}-(1-y) \log (1-\hat{y})
$$
which is minimized when $\hat{y}$ matches the target $y$. We are interested in understanding qualitatively how the filter parameters $w$ get updated in the two networks if we use simple gradient descent to minimize $\operatorname{NLL}(y, \hat{y})$. Specify whether the behavior would occur in:
- Which network (A, B, or it doesn't matter)
- Target $y$ (1, 0, or it doesn't matter). 
The behavior is:
After each step of gradient descent, the filter weights $w$ change so that their dot product with the values of some image sub-region decreases, but the specific region may change from one step to another.","A, 0"
MIT Spring 2022,9,a,4,RNNs,Text,"Consider three RNN variants:
1. The basic RNN architecture we studied was
$$
\begin{aligned}
&s_{t}=f\left(W^{s s} s_{t-1}+W^{s x} x_{t}\right) \\
&y_{t}=W^{o} s_{t}
\end{aligned}
$$
where $W^{s s}$ is $m \times m, W^{s x}$ is $m \times d$, and $f$ is an activation function to be specified later. We omit the offset parameters for simplicity (set them to zero).
2. Ranndy thinks the basic RNN is representationally weak, and it would be better not to decompose the state update in this way. Ranndy's proposal is to instead
$$
\begin{aligned}
&s_{t}=f\left(W^{s s x} \operatorname{concat}\left(s_{t-1}, x_{t}\right)\right) \\
&y_{t}=W^{o} s_{t}
\end{aligned}
$$
where $\operatorname{concat}\left(s_{t-1}, x_{t}\right)$ is a vector of length $m+d$ obtained by concatenating $s_{t-1}$ and $x_{t}$, so $W^{s s x}$ has dimensions $m \times(m+d)$.
3. Orenn wants to try yet another model, of the form:
$$
\begin{aligned}
&s_{t}=f\left(W^{s s} s_{t-1}\right)+f\left(W^{s x} x_{t}\right) \\
&y_{t}=W^{o} s_{t}
\end{aligned}
$$
Lec Surer insists on understanding these models a bit better, and how they might relate.
(a) Select the correct claim and answer the associated question.
(1) Claim: The three models are all equivalent when $f(z)=z$. In this case, define $W^{s s x}$
(2) Claim: The three models are not all equivalent when $f(z)=z$. In this case, assume $m=d=1$ and provide one setting of $W^{s s x}$ in Ranndy's model such that $W^{s s}$ and $W^{s x}$ cannot be chosen to make the basic and Orenn's models the same as Ranndy's.","Claim 1: $W^{s s x}=\operatorname{hstack}\left(W^{s s}, W^{s x}\right)$"
MIT Spring 2022,9,b,6,RNNs,Image,"Here is Rina's model again:
$$
\begin{aligned}
&s_{t}=f\left(W^{s s} s_{t-1}\right)+f\left(W^{s x} x_{t}\right) \\
&y_{t}=W^{o} s_{t}
\end{aligned}
$$
Something interesting might happen with this model when $f(z)$ is not the identity. Specifically, it supposedly corresponds to the architecture shown in the figure below, which includes an additional hidden layer. Specify what $W, W^{\prime}$, and $m^{\prime}$ are so that this architecture indeed corresponds to Rina's model. Specify your answers in terms of $m$, $W^{s s}$, $W^{s x}$, and $W^{o}$.","i. $m^{\prime}=2m$
ii. $W$
Solution: A block-diagonal matrix of the form
$$
\left[\begin{array}{cc}
W^{s s} & 0 \\
0 & W^{s x}
\end{array}\right]
$$
iii. $W^{\prime}$
Solution: hstack $(I(m) ; I(m))$"
Cornell Fall 2018,1,1,3,Bonus,I filled out the course evaluation for CS4780?,,
Cornell Fall 2018,1,2,2,Model Selection,"(T/F) The fewer assumptions an algorithm makes, the better it is. In practice the best algorithm is Generic Programming which makes no assumptions at all.","False, all algorithms make assumptions",
Cornell Fall 2018,1,3,1,Decision Trees,(T/F) With Random Forests there is no need to perform a training/validation split.,True.,
Cornell Fall 2018,1,4,2,Logistic Regression,"(T/F) MLE is great to learn the parameters of a binomial distribution, but it cannot be used to learn the parameters of a separating hyper-plane.","False, the logistic loss in Logistic Regression is derived through MLE to learn the best separating hyperplane.",
Cornell Fall 2018,1,5,2,Classifiers,(T/F) The Naive Bayes classifier assumes that all features are independent.,"False, It assumes all features are conditionally independent - given the label.",
Cornell Fall 2018,1,6,2,Logistic Regression,"(T/F) Logistic Regression converges whenever a separating hyper-plane exists, otherwise it may run forever.",False. Logistic regression solves a convex optimization problem and always converges.,
Cornell Fall 2018,1,7,2,Classifiers,(T/F) The set of Support Vectors are all the training data points an SVM cannot classify correctly.,"False, they also include all training points with a margin of $\leq 1$.",
Cornell Fall 2018,1,8,1,Classifiers,(T/F) A learned kernel SVM model (with RBF kernel) requires you to store some of the training data.,True (the support vectors),
Cornell Fall 2018,1,9,1,Classifiers,(T/F) The decision boundary of a dual SVM classifier with linear kernel is identical to that of a primal SVM classifier.,True.,
Cornell Fall 2018,1,10,1,Loss Functions,(T/F) The l1 regularizer encourages sparse solutions.,True.,
Cornell Fall 2018,1,11,2,Classifiers,(T/F) In SVMs l2 regularization minimizes the squared bias term $b^2$.,"False, the bias term is not regularized.",
Cornell Fall 2018,1,12,2,Classifiers,(T/F) Linear classifiers have as parameters the hyper-plane normal $\mathbf{w}$ and a bias term $b$. Reducing this bias term $b$ will often increase the variance of the classifier.,"False, the bias term is different from the bias/variance trade-off.",
Cornell Fall 2018,1,13,1,Regression,(T/F) The conditional distribution $P(y|x)$ of Gaussian Process Regression is itself a Gaussian distribution.,True.,
Cornell Fall 2018,1,14,1,Regression,(T/F) Kernelized linear regression (with RBF kernel) is a non-parametric algorithm.,True.,
Cornell Fall 2018,1,15,1,Decision Trees,"(T/F) CART trees, if learned to full depth, are non-parametric algorithms.",True.,
Cornell Fall 2018,1,16,2,Ensemble Methods,"(T/F) In bagging, each classifier in the ensemble is trained on a data set that is independently and identically distributed.","False, the data is not independently sampled.",
Cornell Fall 2018,1,17,1,Ensemble Methods,(T/F) One advantage of bagging is that all ensemble members (i.e. classifiers) can be trained in parallel.,True.,
Cornell Fall 2018,1,18,2,Ensemble Methods,(T/F) AdaBoost with decision trees (depth 3) is non-parametric.,"False, the set of parameters is not a function of the number of training instances, $n$.",
Cornell Fall 2018,1,19,2,Ensemble Methods,(T/F) AdaBoost terminates the moment it reaches $0\%$ training error.,"False, as long as there is a weak learner with $< 0.5$ weighted training error, AdaBoost keeps boosting.",
Cornell Fall 2018,1,20,1,Decision Trees,(T/F) One advantage of Random Forests is that you obtain meaningful probability estimates as your output predictions $P(y|x)$.,True.,
Cornell Fall 2018,1,21,1,Neural Networks,(T/F) Deep convolutional neural networks are particularly well suited for image classification tasks.,True.,
Cornell Fall 2018,1,22,2,Neural Networks,(T/F) The optimization of deep neural networks is a convex minimization problem.,"False, it is non-convex because of the non-linear transition functions.",
Cornell Fall 2018,2,1,3,Model Selection,Your Decision Tree classifier has a training error of $0\%$ and a testing error of $87\%$. What can you say about the bias/variance trade-off (assuming the data is not noisy)? Name two possible interventions to reduce the testing error.,"High Variance, Low Bias. You could prune the tree, or use bagging.",
Cornell Fall 2018,2,2,3,Model Selection,"For k-fold cross validation, describe the positive and negative effects as $k \rightarrow n$. When would you be most inclined to use $k = n$?",The error decreases (as you have more training data) but as $k \rightarrow n$ the validation procedure also becomes much slower. You would use $k = n$ if you have very little training data (e.g. $n = 20$).,
Cornell Fall 2018,2,3,3,Model Selection,The expected regression error decomposes into three terms. Write down the mathematical decomposition and label each term.,"$$
\underbrace{E_{\mathbf{x}, y, D}\left[\left(h_{D}(\mathbf{x})-y\right)^{2}\right]}_{\text{Expected Test Error}}=\underbrace{E_{\mathbf{x}, D}\left[\left(h_{D}(\mathbf{x})-\bar{h}(\mathbf{x})\right)^{2}\right]}_{\text{Variance}}+\underbrace{E_{\mathbf{x}, y}\left[(\bar{y}(\mathbf{x})-y)^{2}\right]}_{\text{Noise}}+\underbrace{E_{\mathbf{x}}\left[(\bar{h}(\mathbf{x})-\bar{y}(\mathbf{x}))^{2}\right]}_{\text{Bias}^{2}}
$$",
Cornell Fall 2018,2,4,3,Model Selection,Explain why adding more training data does not always help reduce your testing error below a desired threshold $\epsilon>0$. Describe such a scenario.,The training error is a lower bound on the testing error. Adding more data increases the training error. If your training error is already too high $(>\epsilon)$ adding more data will not help bring the testing error below $\epsilon$ as it is bounded by the training error.,
Cornell Fall 2018,2,5a,1,Model Selection,"Consider the following algorithms and highlighted hyper-parameters. Decide whether increasing these parameters could help reduce overfitting. Answer with ""Yes"" or ""No"". The number of hidden units in the Neural Network.",No.,
Cornell Fall 2018,2,5b,1,Model Selection,"Consider the following algorithms and highlighted hyper-parameters. Decide whether increasing these parameters could help reduce overfitting. Answer with ""Yes"" or ""No"". The maximum depth in Decision Trees.",No.,
Cornell Fall 2018,2,5c,1,Model Selection,"Consider the following algorithms and highlighted hyper-parameters. Decide whether increasing these parameters could help reduce overfitting. Answer with ""Yes"" or ""No"". $\lambda$ in Logistic Regression, trained with a $\lambda \sum_{j} w_{j}^{2}$ penalty in the objective.",Yes.,
Cornell Fall 2018,2,5d,1,Model Selection,"Consider the following algorithms and highlighted hyper-parameters. Decide whether increasing these parameters could help reduce overfitting. Answer with ""Yes"" or ""No"". The number of iterations $T$ in Boosting.",No.,
Cornell Fall 2018,3,1,2,Classifiers,Name one condition that is necessary and sufficient for a matrix $\mathbf{K}$ to be positive semi-definite.,"$\forall \mathbf{q}, \mathbf{q}^{\top} \mathbf{K} \mathbf{q} \geq 0$ or $K=L^{\top} L$ for some real matrix $L$, or $K$ only has non-negative eigenvalues.",
Cornell Fall 2018,3,2,3,Classifiers,"Which of the following algorithms can be kernelized: a) Decision Trees, b) Linear Regression, c) Gaussian Processes. Justify your answer.","b) and c) not a). b) and c) access data points only through inner-products, whereas a) splits on feature values and needs the feature realization of the data.",
Cornell Fall 2018,3,3,4,Classifiers,Consider the following data set. Draw the decision boundary you would obtain with a hard margin linear SVM. Circle all the support vectors!,Solution,
Cornell Fall 2018,3,4,3,Classifiers,Add two blue points (\#1 and \#2) such that \#1 would and \#2 would not affect the decision boundary if the SVM was re-trained.,Solution,
Cornell Fall 2018,3,5,4,Classifiers,Let $m$ be the number of support vectors of an SVM trained on $n$ data points (with RBF kernel). For a fixed $n$ imagine you increase the dimensionality $d$ of the data until it becomes very large. How would you expect the ratio $\frac{m}{n}$ to change as $d \gg 0$ ?,It approaches 1 because of the curse of dimensionality. All training points will be very far away from each other and close to the decision boundary.,
Cornell Fall 2018,3,6,2,Classifiers,Describe a scenario in which you may want to use a kernel SVM with linear kernel instead of a standard linear (primal) SVM.,"If your dimensionality is very large, once the kernel is computed the computational complexity of kernel SVMs is independent of $d$.",
Cornell Fall 2018,3,7,2,Classifiers,Consider the following data set. Draw a plausible decision boundary for a hard-margin SVM with polynomial kernel.,Solution,
Cornell Fall 2018,3,8,2,Classifiers,You are given a non-linear regression data set. You are deciding between training a Gaussian Process or kernelized linear regression (both with $\mathrm{RBF}$ Kernel). Which one will have lower testing / training error?,They are identical.,
Cornell Fall 2018,4,1,4,Decision Trees,Name two advantages of decision trees over nearest neighbor algorithms.,"Any two of: (1) Once the tree is constructed, the training data does not need to be stored. Instead, we can simply store how many points of each label ended up in each leaf; typically the leaves are pure, so we only have to store one label per leaf. (2) Decision trees are very fast at test time, as a test input simply traverses the tree down to a leaf, and the prediction is the majority label of that leaf. (3) Decision trees require no metric because the splits are based on feature thresholds and not distances.",
Cornell Fall 2018,4,2,2,Decision Trees,Name the CART stopping criteria (with unlimited depth).,all labels are identical or all features are identical,
Cornell Fall 2018,4,3a,4,Decision Trees,"Consider the classification dataset $S$ with $|S|=9$ visualized in the following figure and table: \begin{tabular}{lll}
\hline
$\mathrm{i}$ & $\mathbf{x}_{i}$ & $y_{i}$ \\
\hline
1 & $(1,1)$ & $+1$ \\
2 & $(1,2)$ & $-1$ \\
3 & $(1,3)$ & $-1$ \\
4 & $(2,1)$ & $-1$ \\
5 & $(2,2)$ & $-1$ \\
6 & $(2,3)$ & $-1$ \\
7 & $(3,1)$ & $-1$ \\
8 & $(3,2)$ & $+1$ \\
9 & $(3,3)$ & $+1$ \\
\hline
\end{tabular}
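Compute the Gini impurity for this dataset before any split.",Three of the nine points are labeled $+1$ and six are labeled $-1$ so the class proportions are $\frac{1}{3}$ and $\frac{2}{3}$. Gini impurity: $I_{G}(S)=\frac{1}{3} \cdot \frac{2}{3}+\frac{2}{3} \cdot \frac{1}{3}=\frac{4}{9}$.,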
Cornell Fall 2018,4,3b,4,Decision Trees,"Consider the classification dataset $S$ with $|S|=9$ visualized in the following figure and table: \begin{tabular}{lll}
\hline
$\mathrm{i}$ & $\mathbf{x}_{i}$ & $y_{i}$ \\
\hline
1 & $(1,1)$ & $+1$ \\
2 & $(1,2)$ & $-1$ \\
3 & $(1,3)$ & $-1$ \\
4 & $(2,1)$ & $-1$ \\
5 & $(2,2)$ & $-1$ \\
6 & $(2,3)$ & $-1$ \\
7 & $(3,1)$ & $-1$ \\
8 & $(3,2)$ & $+1$ \\
9 & $(3,3)$ & $+1$ \\
\hline
\end{tabular}
Perform the CART algorithm with Gini impurity on $S$. Please draw a resulting tree (with splitting values and features) and also draw the corresponding hyper-planes in the previous figure.",Solution,
Cornell Fall 2018,5,1,3,Ensemble Methods,What loss function does AdaBoost minimize? (Write down the precise mathematical form.),The exponential loss $\frac{1}{n} \sum_{i=1}^{n} e^{-y_{i} H\left(x_{i}\right)}$ (the $\frac{1}{n}$ is optional).,
Cornell Fall 2018,5,2,2,Ensemble Methods,Imagine $10 \%$ of your binary training data (all points unique) are accidentally mislabeled. What is the training error that AdaBoost will converge to after sufficient rounds of boosting?,$0 \%$,
Cornell Fall 2018,5,3,2,Ensemble Methods,Describe a data scenario in which AdaBoost is not a good choice. Justify your answer.,When the training data contains label noise. The exponential loss will ensure that the mislabeled data points are eventually fit as well and the algorithm will overfit (badly).,
Cornell Fall 2018,5,4a,2,Ensemble Methods,"Given a distribution $P$ you can sample a training set $D$ and obtain a classifier $h$. Imagine you train $m$ such classifiers $h_{1}, \ldots, h_{m}$ on $m$ data sets $D_{1}, \ldots, D_{m}$, each drawn i.i.d. from the data distribution $P$. As you increase $m$ from $m=1$ to $m \gg 0$, show how you can use these models to obtain a low variance classifier $\hat{h}$.",You average them: $\hat{h}=\frac{1}{m} \sum_{i=1}^{m} h_{i}$.,
Cornell Fall 2018,5,4b,2,Ensemble Methods,"Given a distribution $P$ you can sample a training set $D$ and obtain a classifier $h$. Imagine you train $m$ such classifiers $h_{1}, \ldots, h_{m}$ on $m$ data sets $D_{1}, \ldots, D_{m}$, each drawn i.i.d. from the data distribution $P$. As you increase $m$ from $m=1$ to $m \gg 0$, what happens to the variance of $\hat{h}$ in the limit, $m \gg 0$ ?","By the weak law of large numbers the average $\hat{h}$ approaches the expected classifier $\bar{h}$ as $m \rightarrow \infty$, so the variance $E_{\mathbf{x}, D}\left[\left(\hat{h}(\mathbf{x})-\bar{h}(\mathbf{x})\right)^{2}\right] \rightarrow 0$.",
Cornell Fall 2018,5,4c,2,Ensemble Methods,"Given a distribution $P$ you can sample a training set $D$ and obtain a classifier $h$. Imagine you train $m$ such classifiers $h_{1}, \ldots, h_{m}$ on $m$ data sets $D_{1}, \ldots, D_{m}$, each drawn i.i.d. from the data distribution $P$. As you increase $m$ from $m=1$ to $m \gg 0$, how does the bias of $\hat{h}$ compare to the bias of $h$ ?","The bias is unaffected, i.e. the bias of $\hat{h}$ is identical to the bias of $h$, because $E[\hat{h}]=E[h]$.",
Cornell Fall 2018,5,5,4,Ensemble Methods,"After two iterations of AdaBoost, with step sizes $\alpha_{1}, \alpha_{2}$ respectively and weak learners $h_{1}, h_{2}$, what are all possible weights that could potentially be assigned to a training data point (ignore normalization).","$e^{-\alpha_{1}-\alpha_{2}}, e^{-\alpha_{1}+\alpha_{2}}, e^{+\alpha_{1}-\alpha_{2}}, e^{\alpha_{1}+\alpha_{2}}$",
Cornell Fall 2018,5,6,4,Ensemble Methods,"Robin is trying to use AdaBoost on full CART trees without depth limit (all training points are distinct). Although the code seems correct, it crashes in the very first round. What do you think is the problem?","The CART tree has zero classification error, yielding an infinite step-size $\alpha=\frac{1}{2} \ln \left(\frac{1-\epsilon}{\epsilon}\right)$ and a division by zero.",
Cornell Fall 2018,6,1,2,Neural Networks,Name two reasons why Newton's Method typically is not used to train deep neural networks.,1. There are too many parameters to store the Hessian; 2. it converges quickly to the closest local minimum / saddle point and not to a wide minimum.,
Cornell Fall 2018,6,2,2,Loss Functions,Let the loss function be $\ell(\mathbf{w})=\frac{1}{2 n} \sum_{i=1}^{n}\left(\mathbf{x}_{i}^{\top} \mathbf{w}-y_{i}\right)^{2}$. Write down the update for Stochastic Gradient Descent and Gradient Descent.,The GD update uses the full gradient $G_{G D}=\frac{1}{n} \sum_{i=1}^{n}\left(\mathbf{x}_{i}^{\top} \mathbf{w}-y_{i}\right) \mathbf{x}_{i}$ whereas the SGD update is $G_{S G D}=\frac{1}{m} \sum_{i=1}^{m}\left(\mathbf{x}_{s_{i}}^{\top} \mathbf{w}-y_{s_{i}}\right) \mathbf{x}_{s_{i}}$ for randomly picked $s_{i} \in[n]$.,
Cornell Fall 2018,6,3,2,CNNs,"Suppose you have a convolutional filter of size $k \times k$. When you apply this filter to a $n \times n$ input image, what is the dimension of the output feature map with no padding?",$(n-k+1) \times(n-k+1)$,
Cornell Fall 2018,6,4,2,CNNs,"Suppose you have a $3 \times 3$ matrix $I$ from one patch of an image. Each matrix value corresponds to a pixel. 
$$
I=\left[\begin{array}{lll}
3 & 1 & 1 \\
3 & 0 & 2 \\
4 & 4 & 0
\end{array}\right]
$$
and filter kernel
$$
k=\left[\begin{array}{ll}
1 & 0 \\
1 & 1
\end{array}\right]
$$
What is the output matrix after convolving the input $I$ with $k$ (no flipping of the kernel in case you learned that in your computer vision/signal processing class)? We don't consider the padding and stride here. The output should be a $2 \times 2$ matrix","$$
\left[\begin{array}{cc}
6 & 3 \\
11 & 4
\end{array}\right]
$$",
Cornell Fall 2018,6,5,4,Neural Networks,"Consider you have the following neural network: 
\begin{itemize}
\item Input layer: 80 units

\item First hidden layer: 20 hidden units

\item Second hidden layer: 60 hidden units

\item Third hidden layer: 20 hidden units

\item Output layer: 80 units

\item Sigmoidal activation for each hidden layer and the output

\item Loss function: logistic loss

\end{itemize}
Each layer has a bias. How many parameters does this neural network have? You can leave your answer as an expression.","$$
\# \text { params }=80 \cdot 20+20 \cdot 60+60 \cdot 20+20 \cdot 80+20+60+20+80=5780
$$",
,,,,,,,
Cornell Spring 2017,1,1,1,Decision Tree,(T/F) Random forests is one of the few machine learning algorithms that makes no assumptions on the data.,"False, every machine learning algorithm makes assumptions. RF assumes that similar inputs have similar labels.",
Cornell Spring 2017,1,2,1,Optimization,"(T/F) One implication of the curse of dimensionality is that if you sample $n$ data points uniformly at random within a hyper cube of dimensionality $d$, all pairwise distances converge to 0 as $n \rightarrow \infty$.","False, they become concentrated around the average distance as $d\to\infty$.",
Cornell Spring 2017,1,3,1,Neural Networks,"(T/F) During training, in a linearly separable data set, the perceptron algorithm never misclassifies the same input twice.","False, it can iterate many times over the data set and get the same points wrong repeatedly.",
Cornell Spring 2017,1,4,1,Optimization,"(T/F) You have a biased coin and toss it $n$ times. The MAP estimate with $+1$ smoothing of the probability of getting ""head"" is $\frac{n_{H}+1}{n+1}$, where $n_{H}$ is the number of occurrences of ""head"" amongst your $n$ throws.","False, it is $\frac{n_{H}+1}{n+2}$.",
Cornell Spring 2017,1,5,1,Classifiers,(T/F) The multinomial Naive Bayes algorithm is a linear classifier.,True.,
Cornell Spring 2017,1,6,1,Optimization,"(T/F) MAP inference maximizes $P(\mathbf{w} \mid D a t a)$ whereas MLE maximizes $P($ Data; $\mathbf{w})$, where $\mathbf{w}$ represents the model parameters.",True.,
Cornell Spring 2017,1,7,1,Optimization,(T/F) Newton's Method diverges only if the Hessian matrix is not invertible.,"False, it can also diverge with invertible Hessian matrices.",
Cornell Spring 2017,1,8,1,Regression,"(T/F) Linear (ordinary least squares) regression can be solved in closed form, although sometimes that is computationally impractical or even infeasible.",True.,
Cornell Spring 2017,1,9,1,Classifiers,(T/F) SVMs maximize the margin between the training and testing data.,"False, they maximize the margin between the training data and the separating hyperplane.",
Cornell Spring 2017,1,10,1,Optimization,"(T/F) In order for gradient descent to converge, the loss function has to be convex and differentiable everywhere.","False, if it is not convex it will still converge, but to a local minimum.",
Cornell Spring 2017,1,11,1,Model Selection,"(T/F) The bias variance trade-off decomposes the error obtained by a classifier into (squared) bias, variance, and noise. The noise term cannot possibly be addressed, even by changing the feature representation of the data.","False, changing the feature representation of the data will affect the noise. For example, if all features are removed the error is only noise (which would be very large).",
Cornell Spring 2017,1,12,1,Model Selection,"(T/F) In a setting of high bias, a great remedy is to add more training data.","False, more training data does not help with bias.",
Cornell Spring 2017,1,13,1,Ensemble Methods,(T/F) Bagging reduces variance.,True.,
Cornell Spring 2017,1,14,1,Ensemble Methods,(T/F) Boosting reduces noise.,"False, it reduces bias (and sometimes even variance a little).",
Cornell Spring 2017,1,15,1,Classifiers,"(T/F) Learning with kernels is expensive, because the data is mapped into a very high dimensional space and therefore storing the transformed data consumes a lot of storage.","False, the mapping is performed implicitly.",
Cornell Spring 2017,1,16,1,Regression,(T/F) The mean prediction of Gaussian processes is identical to kernelized linear regression.,True.,
Cornell Spring 2017,1,17,1,Optimization,(T/F) One popular application of Gaussian Processes is to find hyper-parameters of machine learning algorithms.,True.,
Cornell Spring 2017,1,18,1,Optimization,(T/F) Ball-Trees are a data structure to speed up the perceptron algorithm.,"False, they can speed up nearest neighbor searches, but that is never performed in the Perceptron.",
Cornell Spring 2017,1,19,1,Decision Tree,(T/F) Decision Trees stop splitting when the impurity function can no longer be improved with a single split.,"False, e.g. in the XOR data set the first split does not improve the impurity function. The splitting stops if the maximum depth (or number of nodes) is reached, or all inputs are identical.",
Cornell Spring 2017,1,20,1,Decision Tree,(T/F) Random Forests are bagged decision trees with one additional modification: Each splitting dimension is chosen completely uniformly at random.,"False, the best splitting dimension is selected amongst $k$ random dimensions.",
Cornell Spring 2017,1,21,1,Ensemble Methods,"(T/F) Provided each weak learner can classify a weighted version of the training data set with better than 0.5 accuracy, in AdaBoost the training error reduces exponentially.",True.,
Cornell Spring 2017,1,22,1,Neural Networks,"(T/F) Deep neural networks are great on many data sets, but do not work competitively on image classification tasks.","False, they are particularly good at image classification tasks.",
Cornell Spring 2017,1,23,1,Neural Networks,(T/F) The optimization of deep neural networks is a convex minimization problem.,"False, it is non-convex because of the non-linear transition functions.",
Cornell Spring 2017,2,1,3,Model Selection,"Write down the bias (squared), variance, noise decomposition of the expected test error $\mathbb{E}_{x, y, D}\left[\left(h_{D}(x)-y\right)^{2}\right]$.","Variance: $E_{x, D}\left[\left(h_{D}(x)-\bar{h}(x)\right)^{2}\right]$ Bias squared: $E_{x}\left[(\bar{h}(x)-\bar{y}(x))^{2}\right]$ Noise: $E_{x, y}\left[(\bar{y}(x)-y(x))^{2}\right]$",
Cornell Spring 2017,2,2,5,Model Selection,Describe how to detect settings with high bias and provide three approaches that could help reduce the bias.,"Detect high bias if the training error is above the goal error (plot training and testing error vs. number of data points). Reduce bias by, e.g., increasing model complexity or using boosting. Rubric: 2 points for correct detection; 3x1 point for correct remedies.",
Cornell Spring 2017,2,3a.,4.333333333,Model Selection,"For each of the following scenarios, determine if the model has low/high bias and variance. Explain your choice. Logistic regression on linearly separable data and non-linearly separable data.","Linearly separable: low bias, low variance. Non-linearly separable: high bias, low variance.",
Cornell Spring 2017,2,3b.,4.333333333,Model Selection,"For each of the following scenarios, determine if the model has low/high bias and variance. Explain your choice. kNN with small k and large k.","Small k: low bias, high variance. Large k: high bias, low variance.",
Cornell Spring 2017,2,3c.,4.333333333,Model Selection,"For each of the following scenarios, determine if the model has low/high bias and variance. Explain your choice. Uniform random labeling.","low bias, high variance",
Cornell Spring 2017,3,1a.,5,Classifiers,"Suppose you are using a kernel SVM with the $\mathrm{RBF}$ kernel $k(\mathbf{x}, \mathbf{z})=\exp \left(-\frac{\|\mathbf{x}-\mathbf{z}\|_{2}^{2}}{\sigma^{2}}\right)$ to do classification. Recall that the kernel SVM is trained by solving the dual optimization problem: $$
\begin{aligned}
\min _{\alpha_{1}, \ldots, \alpha_{n}} & \frac{1}{2} \sum_{i, j} \alpha_{i} \alpha_{j} y_{i} y_{j} \mathbf{K}_{i j}-\sum_{i} \alpha_{i} \\
\text { s.t. } & 0 \leq \alpha_{i} \leq C \\
& \sum_{i} \alpha_{i} y_{i}=0
\end{aligned}
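$$ Assume you can either set $C$ and $\sigma^{2}$ to a very large value $(\gg 0)$ or a very small value $(\epsilon)$. Provide a setting with high bias and one with high variance. Briefly explain your answers.",Small $\sigma^{2}$ and large $C$ lead to high variance: the kernel is very narrow and margin violations are heavily penalized so the decision boundary wraps tightly around individual training points. Large $\sigma^{2}$ and small $C$ lead to high bias: the kernel is nearly flat and margin violations are cheap so the decision boundary is very smooth.,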
Cornell Spring 2017,3,1b.,3,Classifiers,$\mathbf{x}_{1}$ turns out to be a support vector. What can you say about its corresponding optimal value $\alpha_{i}^{*}$ and the margin between the hyperplane and $\mathbf{x}_{1} ?$,$\alpha_{i}^{*} > 0$ and the normalized margin must be 1,
Cornell Spring 2017,3,1c.,5,Classifiers,"In order to apply the classifier to a test point, we need the hyper-plane bias $b$. Show how $b$ can be recovered from $\alpha_{1}^{*}, \ldots, \alpha_{n}^{*}$ with the help of the support vector $\mathbf{x}_{i}$ and label $y_{i} \in\{-1,+1\}$.","For a support vector $\mathbf{x}_{i}$ with $0<\alpha_{i}^{*}<C$ the margin constraint is tight, i.e. $y_{i}\left(\sum_{j} \alpha_{j}^{*} y_{j} \mathbf{K}_{j i}+b\right)=1$, so $b=y_{i}-\sum_{j} \alpha_{j}^{*} y_{j} \mathbf{K}_{j i}$. In practice the bias is retrieved by averaging this difference between the label and the kernelized prediction over all such support vectors.",
Cornell Spring 2017,3,2,8,Classifiers,"For this question, you will find the following rules about recursively building kernels helpful. Given kernels $k_{1}(\mathbf{x}, \mathbf{z})$ and $k_{2}(\mathbf{x}, \mathbf{z})$, the following are well-defined kernels:

$$
\begin{aligned}
k(\mathbf{x}, \mathbf{z}) &=\mathbf{x}^{\top} A \mathbf{z}, A \succeq 0 \\
k(\mathbf{x}, \mathbf{z}) &=c k_{1}(\mathbf{x}, \mathbf{z}) \\
k(\mathbf{x}, \mathbf{z}) &=\exp \left(k_{1}(\mathbf{x}, \mathbf{z})\right) \\
k(\mathbf{x}, \mathbf{z}) &=f(\mathbf{x}) k_{1}(\mathbf{x}, \mathbf{z}) f(\mathbf{z}) \\
k(\mathbf{x}, \mathbf{z}) &=k_{1}(\mathbf{x}, \mathbf{z})+k_{2}(\mathbf{x}, \mathbf{z})
\end{aligned}
$$

Suppose that $\mathbf{x}, \mathbf{z} \in \mathbb{R}^{2}$. Let $[\mathbf{x}]_{1}$ and $[\mathbf{x}]_{2}$ denote the first and second coordinates of $\mathbf{x}$, respectively. Show that

$$
k(\mathbf{x}, \mathbf{z})=\exp \left(-\frac{\left\|[\mathbf{x}]_{1}-[\mathbf{z}]_{1}\right\|_{2}^{2}}{\sigma^{2}}\right)+\exp \left(-\frac{\left\|[\mathbf{x}]_{2}-[\mathbf{z}]_{2}\right\|_{2}^{2}}{\sigma^{2}}\right)
$$

is a kernel.

Hint: You may find the following two matrices helpful:

$$
A_{1}=\left[\begin{array}{ll}
1 & 0 \\
0 & 0
\end{array}\right], \quad A_{2}=\left[\begin{array}{ll}
0 & 0 \\
0 & 1
\end{array}\right] .
$$

You can assume they are positive semi-definite (i.e. $A_{1} \succeq 0, A_{2} \succeq 0$ ).","The trick here is to define

$$
\begin{aligned}
A_{1} &=\left[\begin{array}{ll}
1 & 0 \\
0 & 0
\end{array}\right] \\
A_{2} &=\left[\begin{array}{ll}
0 & 0 \\
0 & 1
\end{array}\right]
\end{aligned}
$$

so that $\mathbf{x}^{\top} A_{1} \mathbf{z}=[\mathbf{x}]_{1}[\mathbf{z}]_{1}$ and $\mathbf{x}^{\top} A_{2} \mathbf{z}=[\mathbf{x}]_{2}[\mathbf{z}]_{2}$ (these matrices are psd with eigenvalues 0 and 1). The rest of the proof is identical to the proof for the RBF kernel in class:

(a) $k_{1}(\mathbf{x}, \mathbf{z})=\mathbf{x}^{\top} A_{1} \mathbf{z}=[\mathbf{x}]_{1}[\mathbf{z}]_{1}$, rule (1)

(b) $k_{2}(\mathbf{x}, \mathbf{z})=\frac{2}{\sigma^{2}} k_{1}(\mathbf{x}, \mathbf{z})=\frac{2}{\sigma^{2}}[\mathbf{x}]_{1}[\mathbf{z}]_{1}$, rule (2)

(c) $k_{3}(\mathbf{x}, \mathbf{z})=\exp \left(k_{2}(\mathbf{x}, \mathbf{z})\right)=\exp \left(\frac{2[\mathbf{x}]_{1}[\mathbf{z}]_{1}}{\sigma^{2}}\right)$, rule (3)

(d) $k_{4}(\mathbf{x}, \mathbf{z})=\exp \left(-\frac{[\mathbf{x}]_{1}[\mathbf{x}]_{1}}{\sigma^{2}}\right) \exp \left(\frac{2[\mathbf{x}]_{1}[\mathbf{z}]_{1}}{\sigma^{2}}\right) \exp \left(-\frac{[\mathbf{z}]_{1}[\mathbf{z}]_{1}}{\sigma^{2}}\right)=\exp \left(-\frac{\left\|[\mathbf{x}]_{1}-[\mathbf{z}]_{1}\right\|_{2}^{2}}{\sigma^{2}}\right)$, rule (4) with $f(\mathbf{x})=\exp \left(-\frac{[\mathbf{x}]_{1}[\mathbf{x}]_{1}}{\sigma^{2}}\right)$

(e) Repeating the above with $A_{2}, k_{5}(\mathbf{x}, \mathbf{z})=\exp \left(-\frac{\left\|[\mathbf{x}]_{2}-[\mathbf{z}]_{2}\right\|_{2}^{2}}{\sigma^{2}}\right)$ is a kernel.

(f) Finally, $k_{4}+k_{5}$ is a kernel by rule (5).",
Cornell Spring 2017,4,1,2,Decision Tree,Imagine you build a KD-Tree and label each leaf with the most common label amongst all training points that fall into this leaf. Why would this not be a desirable classifier?,"Because many leaves would not be pure, which makes the most common label a bad estimate.",
Cornell Spring 2017,4,2,4,Decision Tree,Name two reasons why Random Forests are such popular classifiers amongst practitioners?,"1. They only have two hyper-parameters (the number of trees m and the number of features K), but both are really easy to set. You can set $K = \sqrt{d}$ and m as large as you can afford. 2. RF are based on decision/regression trees and require no feature scaling or any of the typical pre-processing of the data. Features can be in completely different units and can be categorical or real valued.",
Cornell Spring 2017,4,3,2,Decision Tree,"Assume you pre-process all your features in the following way: you sort each feature independently. For each feature, you then assign all those inputs that share the lowest feature value a new feature value of 1, all those with the second lowest value a 2, etc. How does this affect the trees that you construct?",It doesn't.,
Cornell Spring 2017,4,4,4,Decision Tree,Under what conditions on your training set will a CART tree (with unlimited depth) obtain 0% training error.,If there are no two training inputs with identical features but different labels.,
Cornell Spring 2017,4,5,3,Decision Tree,"You are building a regression tree with the squared loss impurity. i.e. the labels in the leaf are $L=\left\{y_{1}, \ldots, y_{m}\right\}$ and the loss, under prediction $t$, is $\sum_{y \in L}(y-t)^{2}$. Prove that the average label $t=\frac{1}{m} \sum_{i=1}^{m} y_{i}$ minimizes the loss at a leaf.","$$
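t=\operatorname{argmin}_{t} \sum_{i=1}^{m}\left(t-y_{i}\right)^{2}
$$

Taking the derivative with respect to $t$ and setting it to zero:

$$
\begin{aligned}
2 \sum_{i=1}^{m}\left(t-y_{i}\right) &=0 \\
2 m t-2 \sum_{i=1}^{m} y_{i} &=0 \\
t &=\frac{1}{m} \sum_{i=1}^{m} y_{i}
\end{aligned}
$$

The second derivative is $2 m>0$, so the loss is convex in $t$ and this stationary point is the minimizer.",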
Cornell Spring 2017,4,6,6,Decision Tree,"You are now considering minimizing the absolute loss instead: $\sum_{y \in L}|y-t|$. Define $L_{\leq}=\{y \in L: y \leq t\}$ and $L_{>}=\{y \in L: y>t\}$. Prove that setting $t$ to the median of $L$ minimizes this loss. To simplify things you can assume you have an odd number of samples (i.e. $m=2 r+1$ ) and that all $y_{i} \in L$ are distinct (i.e. $y_{i} \neq y_{j}$ for any $y_{i}, y_{j} \in L$ ). (Without loss of generality it is sufficient to show there is no better splitting value $t^{\prime}$ that is larger than the median. )","Let $t$ be the median of $L$ and let $t^{\prime}>t$. With $m=2 r+1$ distinct labels we have $\left|L_{\leq}\right|=r+1$ and $\left|L_{>}\right|=r$. Moving the prediction from $t$ to $t^{\prime}$ increases $|y-t|$ by exactly $t^{\prime}-t$ for every $y \in L_{\leq}$, and decreases it by at most $t^{\prime}-t$ for every $y \in L_{>}$. Hence $\sum_{y \in L}\left|y-t^{\prime}\right|-\sum_{y \in L}|y-t| \geq(r+1)\left(t^{\prime}-t\right)-r\left(t^{\prime}-t\right)=t^{\prime}-t>0$, so no $t^{\prime}$ larger than the median achieves a lower loss. The case $t^{\prime}<t$ is symmetric.",
Cornell Spring 2017,5,1,3,Ensemble Methods,"Name two algorithms, for which boosting will be ineffective. Briefly justify why.","e.g. k-NN classification, unlimited depth decision trees, kernel SVMs. They have high variance and essentially zero bias. Rubric: one point for each algorithm, one point for correct justification. Maximum 3 points. Special cases: (1) Linear classifiers also gain 1 point, since if the data set isn't linearly separable there is not much use in ensembling linear classifiers. (2) Naive Bayes doesn't get a point. (3) Random labeling gets 1 point since it's not a weak learner (it doesn't learn). (4) Model labeling doesn't get points.",
Cornell Spring 2017,5,2,5,Ensemble Methods,Describe what happens in AdaBoost if two training inputs (in a binary classification problem) are identical in features but have different labels.,Both points will obtain very high weights and eventually will dominate the training data set. Weak learners will no longer be able to classify the weighted data set with better than 50% accuracy and the algorithm will stop. Rubric: minimum 1 point for showing effort. (+1) for stating that the algorithm will stop. (+2) for stating that these two data points will gain weight. (+1) for stating that the weak learner eventually won't be able to distinguish these two data points. Maximum 5 points.,
Cornell Spring 2017,5,3,3,Ensemble Methods,"In neural networks bagging can be performed without random subsampling of the data, i.e., one trains m neural networks independently and ensembles their results. Can you explain why the subsampling is unnecessary in this case?","The random initialization and non-convexity of neural networks ensure that independently trained models will end up in different local minima and obtain different results. The effect is similar to training on slightly different data sets. Rubric: minimum 1 point for showing effort. (+2) for stating that NNs have random initialization. (+1) for stating that NNs converge to a local minimum due to non-convexity. Special case (1): 2 points for mentioning that Stochastic Gradient Descent randomly samples training data, which leads to different weights. Special case (2): 2 points for mentioning that layers such as dropout add embedded randomness.",
Cornell Spring 2017,5,4a.,4,Ensemble Methods,"Assume you have weak learners $h \in \mathcal{H}$ s.t. $h(\mathbf{x}) \in\{+1,-1\}$ for any $\mathbf{x}$. You are trying to apply boosting with the logistic loss function

$$
\mathcal{L}(H)=\sum_{i=1}^{n} \ln \left(1+e^{-y_{i} H\left(\mathbf{x}_{i}\right)}\right) .
$$

(remember, ln here refers to the natural logarithm)

Compute the derivative $\frac{\partial \mathcal{L}(H)}{\partial H\left(\mathbf{x}_{i}\right)}$.","$$
\frac{\partial \mathcal{L}(H)}{\partial H\left(\mathbf{x}_{i}\right)}=\frac{-y_{i}}{1+e^{-y_{i} H\left(\mathbf{x}_{i}\right)}} e^{-y_{i} H\left(\mathbf{x}_{i}\right)}=-\frac{y_{i}}{1+e^{y_{i} H\left(\mathbf{x}_{i}\right)}}
$$
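
As a sketch of the last equality above (an added worked step, not part of the original rubric), multiplying the numerator and denominator by $e^{y_{i} H\left(\mathbf{x}_{i}\right)}$ gives

$$
\frac{e^{-y_{i} H\left(\mathbf{x}_{i}\right)}}{1+e^{-y_{i} H\left(\mathbf{x}_{i}\right)}}=\frac{e^{-y_{i} H\left(\mathbf{x}_{i}\right)} e^{y_{i} H\left(\mathbf{x}_{i}\right)}}{\left(1+e^{-y_{i} H\left(\mathbf{x}_{i}\right)}\right) e^{y_{i} H\left(\mathbf{x}_{i}\right)}}=\frac{1}{e^{y_{i} H\left(\mathbf{x}_{i}\right)}+1}
$$

so the derivative is $-y_{i}$ times a factor in $(0,1)$ that is close to 1 for badly misclassified points and close to 0 for confidently correct ones.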

Rubric: minimum 1 point for showing effort. $(+3)$ if the answer is correct. $(+2)$ if the answer has only a minor mistake (e.g. a flipped sign).",
Cornell Spring 2017,5,4b.,6,Ensemble Methods,"Assume you have weak learners $h \in \mathcal{H}$ s.t. $h(\mathbf{x}) \in\{+1,-1\}$ for any $\mathbf{x}$. You are trying to apply boosting with the logistic loss function

$$
\mathcal{L}(H)=\sum_{i=1}^{n} \ln \left(1+e^{-y_{i} H\left(\mathbf{x}_{i}\right)}\right) .
$$

(remember, ln here refers to the natural logarithm) Let $w_{i}=\frac{1}{1+e^{y_{i} H\left(\mathbf{x}_{i}\right)}}$ and let $\epsilon(h)=\sum_{i: h\left(\mathbf{x}_{i}\right) \neq y_{i}} w_{i}$ be the weighted error of the training set. For simplicity assume we are using a fixed step-size of 1. Show that the next classifier to be added to the ensemble $H$ in order to minimize the loss function is $h=\operatorname{argmin}_{h} \mathcal{L}(H+h)=\arg \min _{h} \epsilon(h)$.","$$
\begin{aligned}
h &=\operatorname{argmin}_{h} \sum_{i=1}^{n} h\left(\mathbf{x}_{i}\right) \frac{\partial \mathcal{L}(H)}{\partial H\left(\mathbf{x}_{i}\right)} \\
&=\operatorname{argmin}_{h}-\sum_{i=1}^{n} w_{i} y_{i} h\left(\mathbf{x}_{i}\right) \\
&=\operatorname{argmin}_{h} \sum_{i: h\left(\mathbf{x}_{i}\right) \neq y_{i}} w_{i}-\sum_{i: h\left(\mathbf{x}_{i}\right)=y_{i}} w_{i} \\
&=\operatorname{argmin}_{h} \epsilon(h)-\left(\sum_{i=1}^{n} w_{i}-\epsilon(h)\right) \\
&=\operatorname{argmin}_{h} \epsilon(h)
\end{aligned}
$$",
Cornell Spring 2017,6,1,5,Neural Networks,"Assume you are given a neural network with $L$ layers to minimize a loss function $\mathcal{L}$

$$
\begin{aligned}
h(\mathbf{x}) &=\mathbf{w}^{\top} \phi_{1}(\mathbf{x}) \\
\phi_{1}(\mathbf{x}) &=\sigma\left(\mathbf{U}_{1} \phi_{2}(\mathbf{x})\right) \\
& \vdots \\
\phi_{\ell}(\mathbf{x}) &=\sigma\left(\mathbf{U}_{\ell} \phi_{\ell+1}(\mathbf{x})\right) \\
& \vdots \\
\phi_{L}(\mathbf{x}) &=\sigma\left(\mathbf{U}_{L} \mathbf{x}\right)
\end{aligned}
$$

(Note that the subscript of $\phi$ starts at 1 at the end of the network, and increases to $L$ as we make our way back to the start) Let us define $a_{\ell}=\mathbf{U}_{\ell} \phi_{\ell+1}(\mathbf{x})$ such that $\phi_{\ell}=\sigma\left(a_{\ell}\right)$. Let $\delta_{\ell}=\frac{\partial \mathcal{L}}{\partial a_{\ell}}$. Express $\frac{\partial \mathcal{L}}{\partial \mathbf{U}_{\ell}}$ in terms of $\delta_{\ell}$. (assume $1<\ell<L$ )","$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \mathbf{U}_{\ell}} &=\frac{\partial \mathcal{L}}{\partial a_{\ell}} \frac{\partial a_{\ell}}{\partial \mathbf{U}_{\ell}} \\
&=\delta_{\ell} \phi_{\ell+1}(\mathbf{x})^{T}
\end{aligned}
$$",
Cornell Spring 2017,6,2,5,Neural Networks,"Assume you are given a neural network with $L$ layers to minimize a loss function $\mathcal{L}$

$$
\begin{aligned}
h(\mathbf{x}) &=\mathbf{w}^{\top} \phi_{1}(\mathbf{x}) \\
\phi_{1}(\mathbf{x}) &=\sigma\left(\mathbf{U}_{1} \phi_{2}(\mathbf{x})\right) \\
& \vdots \\
\phi_{\ell}(\mathbf{x}) &=\sigma\left(\mathbf{U}_{\ell} \phi_{\ell+1}(\mathbf{x})\right) \\
& \vdots \\
\phi_{L}(\mathbf{x}) &=\sigma\left(\mathbf{U}_{L} \mathbf{x}\right)
\end{aligned}
$$

(Note that the subscript of $\phi$ starts at 1 at the end of the network, and increases to $L$ as we make our way back to the start) Assume that the derivative of $\sigma(z)$ is given as $\sigma^{\prime}(z)$. Define $\delta_{\ell+1}$ as a function of $\delta_{\ell}$. (assume $1<\ell<L$ ) where $x=\phi_{L+1}$","$$
\begin{aligned}
\delta_{\ell+1} &=\frac{\partial \mathcal{L}}{\partial a_{\ell+1}} \\
&=\frac{\partial \mathcal{L}}{\partial \phi_{\ell+1}} \frac{\partial \phi_{\ell+1}}{\partial a_{\ell+1}} \\
&=\frac{\partial \mathcal{L}}{\partial a_{\ell}} \frac{\partial a_{\ell}}{\partial \phi_{\ell+1}} \frac{\partial \phi_{\ell+1}}{\partial a_{\ell+1}} \\
&=\sigma^{\prime}\left(a_{\ell+1}\right) \odot \mathbf{U}_{\ell}^{T} \delta_{\ell} \\
&=\sigma^{\prime}\left(\mathbf{U}_{\ell+1} \phi_{\ell+2}\right) \odot \mathbf{U}_{\ell}^{T} \delta_{\ell}
\end{aligned}
$$",
Cornell Spring 2017,6,3,3,Neural Networks,Provide one reason why stochastic gradient descent can be better than traditional (batch) gradient descent when applied to neural networks.,"SGD can jump out of local minima more easily, since it's more noisy. Alternatively, you can note that as you increase your batch size, your update gradient asymptotically approaches the true gradient. Thus, you can split your batch into n parts, yielding n updates with generally better than $\frac{1}{n}$ accuracy relative to the true gradient, yielding more progress per computation time. SGD takes this to an extreme. Both answers are correct, but not equivalent.",
Cornell Spring 2017,6,4,4,Classifiers,Assume you make all transition functions the identity (i.e. $\sigma(z)=z$ ). Prove that the final classifier is simply a linear classifier of the form $h(\mathbf{x})=\hat{\mathbf{w}}^{\top} \mathbf{x}$ for some vector $\hat{\mathbf{w}}$.,"$$
\begin{aligned}
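h(\mathbf{x}) &=\mathbf{w}^{T} \mathbf{U}_{1} \mathbf{U}_{2} \cdots \mathbf{U}_{L} \mathbf{x} \\
&=\hat{\mathbf{w}}^{T} \mathbf{x}, \quad \hat{\mathbf{w}}=\left(\mathbf{U}_{1} \mathbf{U}_{2} \cdots \mathbf{U}_{L}\right)^{T} \mathbf{w}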
\end{aligned}
$$",
Cornell Spring 2017,6,5,4,Optimization,ML-practitioners tend to drop the learning rate during training. Explain why and what effect it has.,"Starting out with a large learning rate has two advantages: 1. it prevents you from getting trapped in sharp local minima, because the weights ""jump around"" too much with each step; and 2. it moves you quickly ""down-hill"" because you take larger steps. Then switching to a smaller learning rate allows the network to converge to the local minimum closest to the current weight position.",
Harvard Spring 2015,1,N/A,10,Clustering,"Imagine that you have N data and you wish to find K clusters using K-Means++. Assuming that N > K, can the K-Means++ algorithm choose the same datum twice to become a cluster center? Why or why not?","The K-Means++ algorithm will never choose the same datum twice to become a center. This is because the distribution over the data items is proportional to the squared distance to the closest cluster center. When a datum is a cluster center, this distribution will assign zero probability for that item.",
Harvard Spring 2015,2,a,5,Clustering,"(Link) In the two figures below, draw the dendrogram for the data on the left, where the y-axis provides their values. In the top figure, use the single-linkage criterion (min over between-group distances) and in the bottom figure use the complete-linkage criterion (max over between-group distances).",Solution is diagram,
Harvard Spring 2015,2,b,5,Clustering,"(Link) In the two figures below, draw the dendrogram for the data on the left, where the y-axis provides their values. In the top figure, use the single-linkage criterion (min over between-group distances) and in the bottom figure use the complete-linkage criterion (max over between-group distances).",Solution is diagram,
Harvard Spring 2015,3,N/A,10,Classifiers,"Suppose that $K_1(x, x')$ and $K_2(x, x')$ are both valid kernel functions. Recall that a valid kernel is one that corresponds to an inner product in some (possibly infinite-dimensional) feature space and produces a matrix $K_{ij} = K(x_i, x_j)$ that is positive semi-definite for any finite set of examples $x_1, x_2, \ldots, x_N$. Show that
$$K(x, x') = \alpha K_1(x, x') + \beta K_2(x, x')$$
is a valid kernel if $K_1(x, x')$ and $K_2(x, x')$ are both valid kernels and $\alpha, \beta > 0$. [Hint: It may be useful to recall that a matrix $K$ is positive semi-definite if $y^T K y \geq 0, \forall y$.]","Say the kernel functions and a set of points create matrices $K_1$, $K_2$, and $K$, corresponding to $K_1(\cdot, \cdot)$, $K_2(\cdot, \cdot)$, and $K(\cdot, \cdot)$, respectively.
It suffices to show that for any vector $y$, $y^T K y \geq 0$. This follows algebraically:
$$y^T K y = y^T (\alpha K_1 + \beta K_2) y = \alpha y^T K_1 y + \beta y^T K_2 y \geq 0 + 0,$$
where the inequality follows from the assumptions $\alpha, \beta > 0$ and that $K_1, K_2$ are valid kernel functions (i.e. $y^T K_i y \geq 0$ for $i = 1, 2$).",
Harvard Spring 2015,4,N/A,10,Classifiers,"Suppose that we have a data set and we train two support vector machines as follows. We train the first SVM on a random subset of the data. Then we add the remainder of the data and train another SVM on the complete data set. How might the size of the optimal margin change from the first to the second SVM? Would you expect it to increase, decrease, stay the same, or do something else? Provide an explanation and/or diagrams to make your case.","The margin for the full data set will decrease or stay the same. Specifically, if the subset contains the support vectors from the full data set, the margin for the full data set stays the same; otherwise, the margin for the full data set decreases.
More data points means more constraints. The margin found on the full data set satisfies all the classification constraints in the subset problem, but the solution may not be optimal for the subset. One can also illustrate this with diagrams.",
Harvard Spring 2015,5,N/A,10,Reinforcement Learning,Suppose Andy has a donut-eating utility function $U_A(\text{donut})$ and Brian has a donut-eating utility function $U_B(\text{donut})$. Let $U_A(\text{donut}) = 7 \times (U_B(\text{donut}))^2 - 42$. Explain whether or not Andy and Brian have the same donut-eating preferences.,"The two utility functions do not represent the same preferences. They would if one were a monotonically increasing function of the other, but this is not the case for a parabolic function.
To show this, note that if $U_B(d_1) = 1$, then $U_A(d_1) = -35$; if $U_B(d_2) = -1$, then $U_A(d_2) = -35$. Brian prefers $d_1$ to $d_2$, but Andy has no preference between the two donuts.
Note that the absolute values of the utility functions do not matter; only the relative values matter. That is, two utility functions $U_A$ and $U_B$ represent the same preferences if and only if, for any $x_1$ and $x_2$, $U_A(x_1) - U_A(x_2) > 0 \iff U_B(x_1) - U_B(x_2) > 0$, $U_A(x_1) - U_A(x_2) < 0 \iff U_B(x_1) - U_B(x_2) < 0$, and $U_A(x_1) - U_A(x_2) = 0 \iff U_B(x_1) - U_B(x_2) = 0$.",
Harvard Spring 2015,6,a,5,Reinforcement Learning,"The update rule for Q-learning is $Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max\limits_{a'} Q(s',a') - Q(s,a) \right]$,
where $s'$ is the state you actually enter after performing action $a$ in state $s$ and $r$ is the reward you actually receive. Consider two states $S = \{s_1, s_2\}$ and actions $A = \{a_1, a_2\}$, and current Q-values. \begin{tabular}{ c c c }
 & $a_1$ & $a_2$ \\ 
$s_1$ & 3 & 2 \\ 
$s_2$ & 4 & 6 
\end{tabular}
\newline
Suppose the agent is in state $s_1$. Using $\epsilon$-greedy, how would it decide to act?","The best action $a_1$ is selected with probability $1 - \epsilon$. An action is selected at random (with uniform probability) with probability $\epsilon$.",
Harvard Spring 2015,6,b,5,Reinforcement Learning,"The update rule for Q-learning is $Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max\limits_{a'} Q(s',a') - Q(s,a) \right]$,
where $s'$ is the state you actually enter after performing action $a$ in state $s$ and $r$ is the reward you actually receive. Consider two states $S = \{s_1, s_2\}$ and actions $A = \{a_1, a_2\}$, and current Q-values. \begin{tabular}{ c c c }
 & $a_1$ & $a_2$ \\ 
$s_1$ & 3 & 2 \\ 
$s_2$ & 4 & 6 
\end{tabular}
\newline
Suppose the agent exploits in $s_1$ and lands in $s_2$. Which $Q$-value would be updated, and what is the value for $\max\limits_{a'} Q(s', a')$ used in the update?","Because the agent exploits in $s_1$, it takes action $a_1$ from $s_1$. Thus $Q(s_1, a_1)$ will be updated. The value for $\max\limits_{a'} Q(s', a')$ used in the update is $Q(s_2, a_2) = 6$.",
Harvard Spring 2015,6,c,5,Reinforcement Learning,State one advantage of policy iteration over value iteration for planning.,"Policy iteration takes at most as many iterations to reach the optimal policy as value iteration, and in practice usually takes far fewer iterations. Policy iteration has a definite stopping condition: when the policy does not change between two successive iterations, the algorithm is complete. Policy iteration can also be modified to take advantage of approximate solutions to the value function, particularly in problems with a large number of states in which the linear system cannot be solved practically by matrix inversion.",
Harvard Spring 2015,7,N/A,15,Clustering,"We are given a mixture model in the form $$p(x|\pi,\{\theta_k \}_{k=1}^{K}) = \sum_{k=1}^{K} \pi_{k}p(x|\theta_k)$$
where $x \in \mathbb{R}^D$. The mean of the $k$th component distribution $p(x | \theta_k)$ is given by $\mu_k$. What is the mean of the overall mixture?","From the definition of expectation
$$E[x|\pi,\{\theta_k \}_{k=1}^{K}] = \int{p(x|\pi,\{\theta_k \}_{k=1}^{K}) \, x \, dx}$$
$$=\int{\sum_{k=1}^{K} \pi_{k}p(x| \theta_k) \, x \, dx}$$
$$=\sum_{k=1}^{K}\pi_{k}\int{ p(x| \theta_k) \, x \, dx}$$
$$=\sum_{k=1}^{K}\pi_{k}E[x|\theta_k] = \sum_{k=1}^{K} \pi_k\mu_k$$",
Harvard Spring 2015,8,a,5,Optimization,"(Link) In diagram A, what does the dashed line (II) depict, in terms of our model, $\theta_{0}$ and $\theta_{0}$? What does the arrow depict?","The dashed line (II) depicts the lower bound on the log marginal likelihood given by the expected complete data log likelihood (plus an entropy term) corresponding to the posterior at $\theta_0$ (because it's tight at $\theta_0$). Specifically it is
$$Q(\theta; \theta_0) = E_{p(z|x,\theta_0)} [\log p(x, z|\theta) - \log p(z|x, \theta_0)] = E_{p(z|x,\theta_0)} [\log p(x, z|\theta)] + H[p(z|x, \theta_0)]$$
The arrow represents maximizing this function with respect to $\theta$ (the M-step). Note that the maximum of this function only depends on the expected complete data log likelihood (the entropy term is fixed).",
Harvard Spring 2015,8,b,5,Optimization,"(Link) In diagram B, what does the dashed line (III) depict? What do the arrows depict?","The dashed line (III) depicts the updated lower bound on the log marginal likelihood given by the Q function at $\theta_1$.
$$Q(\theta; \theta_1) = E_{p(z|x,\theta_1)} [\log p(x, z|\theta) - \log p(z|x, \theta_1)] = E_{p(z|x,\theta_1)} [\log p(x, z|\theta)] + H[p(z|x, \theta_1)]$$
The arrows represent moving from $Q(\theta; \theta_0)$ to $Q(\theta; \theta_1)$, which corresponds to the E-step.",
Harvard Spring 2021,1,a,1,Bayesian network,(Diagram) (Question),Solution,
Harvard Spring 2021,1,b,3,Bayesian network,(Diagram) (Question),Solution,
Harvard Spring 2021,1,c,2,Bayesian network,(Diagram) (Question),Solution,
Harvard Spring 2021,1,d,2,Bayesian network,(Diagram) (Question),Solution,
Harvard Spring 2021,1,e,1,Bayesian network,(Diagram) (Question),Solution,
Harvard Spring 2021,1,f,2,Bayesian network,Would adding any one of the missing edges in the Bayesian network result in the network representing more distributions or fewer distributions? Briefly justify your answer.,"More distributions. There are various ways to see this: it removes a local independence requirement, it adds more paths, and it also adds more parameters.",
Harvard Spring 2021,2,a,2,Hidden Markov Models,(Diagram) (Question),Solution,
Harvard Spring 2021,2,b,2,Hidden Markov Models,(Diagram) (Info) (Question),Solution,
Harvard Spring 2021,2,c,2,Hidden Markov Models,(Diagram) (Info) (Question),Solution,
Harvard Spring 2021,2,d,1,Hidden Markov Models,"You consider using the HMM to predict the next state $p(s_{t+1}|x_1,\cdots, x_t)$ by first identifying the most likely sequence of states $s_1^{*}\cdots s_t^{*}$ given $x_1\cdots, x_t$, and then predicting $$p(s_{t+1}|x_1,\cdots, x_t)\propto p(s_{t+1}|s_1^{*}\cdots s_t^{*}) = p(s_{t+1}|s_{t}^{*})$$ What is wrong with this?",This is wrong because it puts all the probability on the most likely sequence of states (the point estimate) when we should marginalize out over all possible sequences,
Harvard Spring 2021,3,a,2,Clustering,(Diagram) (Question),Solution,
Harvard Spring 2021,3,b,2,Clustering,(Diagram) (Question),Solution,
Harvard Spring 2021,3,c,1,Clustering,(Diagram) (Question),Solution,
Harvard Spring 2021,3,d,2,Clustering,"Consider a general setting with D-dimensional data and principal components with a set of eigenvalues $\{\lambda_{1}, \lambda_{2}, \cdots, \lambda_{D}\}$ that has low variance. Does this tend to be an indicator that reducing the dimension of the data via PCA will be effective or not particularly effective? Briefly justify your answer.","Low variance in the eigenvalues indicates that each component carries a similar amount of ""information"" in the data, uniformly across components. This suggests that PCA may be relatively ineffective, with many components needed to explain the data.",
Harvard Spring 2021,3,e,1,Clustering,Give one data property that would lead you to strongly prefer K-means clustering over Hierarchical agglomerative clustering. Briefly justify your answer.,Two possible answers: (1) high-dimensional data since K-means doesn't suffer from the curse of dimensionality (2) a lot of data - K-means scales linearly not quadratically,
Harvard Spring 2021,4,a,2,Optimization,Question,Solution,
Harvard Spring 2021,4,b,1,Optimization,"[For this problem, write the probability density function for a Normal distribution as $\mathcal{N}\left(x ; \mu, \sigma^2\right)$; this denotes the value of the PDF for a Normal distribution with mean $\mu$ and variance $\sigma^2$ at some point $x$. There is no need to work with the actual expression for a Normal distribution.]
Suppose that a freezer that is used by HUDS contains a noisy sensor that sometimes malfunctions. The measurements are $\left\{x_n\right\}_{n=1}^N$, where each $x_n$ is a real number. Each measurement is sampled independently, according to the following distribution:
- With probability $\alpha, 0<\alpha<1$, the sensor works correctly and returns a value distributed as $\mathcal{N}\left(x ; \mu, \sigma^2\right)$, where $\mu$ is the true temperature and $\sigma>0$.
- With probability $1-\alpha$, the sensor fails and returns a value distributed as $\mathcal{N}\left(x ; 0, \epsilon^2\right)$, for some $\epsilon>0$.

The parameters of the model are $\{\alpha, \mu, \sigma, \epsilon\}$. For measurement $x_n$, we use variable $z_n$ to denote whether the sensor is functioning correctly $\left(z_n=1\right)$ or incorrectly $\left(z_n=0\right)$.

Write down the expression for the probability density $p\left(x_n, z_n ; \alpha, \mu, \sigma, \epsilon\right)$ for the $n$th reading. [Use the ""power trick"", i.e. use the $z$ value as an exponent]
","$p\left(x_n, z_n ; \alpha, \mu, \sigma, \epsilon\right)=\left[\alpha \mathcal{N}\left(x_n ; \mu, \sigma^2\right)\right]^{z_n}\left[(1-\alpha) \mathcal{N}\left(x_n ; 0, \epsilon^2\right)\right]^{1-z_n}$",
Harvard Spring 2021,4,c,3,Optimization,"[For this problem, write the probability density function for a Normal distribution as $\mathcal{N}\left(x ; \mu, \sigma^2\right)$; this denotes the value of the PDF for a Normal distribution with mean $\mu$ and variance $\sigma^2$ at some point $x$. There is no need to work with the actual expression for a Normal distribution.]
Suppose that a freezer that is used by HUDS contains a noisy sensor that sometimes malfunctions. The measurements are $\left\{x_n\right\}_{n=1}^N$, where each $x_n$ is a real number. Each measurement is sampled independently, according to the following distribution:
- With probability $\alpha, 0<\alpha<1$, the sensor works correctly and returns a value distributed as $\mathcal{N}\left(x ; \mu, \sigma^2\right)$, where $\mu$ is the true temperature and $\sigma>0$.
- With probability $1-\alpha$, the sensor fails and returns a value distributed as $\mathcal{N}\left(x ; 0, \epsilon^2\right)$, for some $\epsilon>0$.

The parameters of the model are $\{\alpha, \mu, \sigma, \epsilon\}$. For measurement $x_n$, we use variable $z_n$ to denote whether the sensor is functioning correctly $\left(z_n=1\right)$ or incorrectly $\left(z_n=0\right)$.

Write down the expression for the complete-data log likelihood,
$$
\ln \left(p\left(\left\{x_n, z_n\right\}_{n=1}^N ; \alpha, \mu, \sigma, \epsilon\right)\right) .
$$
[Your answer should be expressed as sums of log terms.]
","Complete data log likelihood is
$$
\begin{aligned}
\ln \left(p\left(\left\{x_n, z_n\right\}_{n=1}^N ; \alpha, \mu, \sigma, \epsilon\right)\right)=& \ln \left(\prod_n p\left(x_n, z_n ; \alpha, \mu, \sigma, \epsilon\right)\right) \\
=& \ln \left(\prod_n\left[\alpha \mathcal{N}\left(x_n ; \mu, \sigma^2\right)\right]^{z_n}\left[(1-\alpha) \mathcal{N}\left(x_n ; 0, \epsilon^2\right)\right]^{1-z_n}\right) \\
=& \sum_{n=1}^N\left(z_n \ln \left(\alpha \mathcal{N}\left(x_n ; \mu, \sigma^2\right)\right)+\left(1-z_n\right) \ln \left((1-\alpha) \mathcal{N}\left(x_n ; 0, \epsilon^2\right)\right)\right) \\
=& \sum_{n=1}^N z_n \ln \alpha+\sum_n z_n \ln \mathcal{N}\left(x_n ; \mu, \sigma^2\right)+\\
& \sum_n\left(1-z_n\right) \ln (1-\alpha)+\sum_n\left(1-z_n\right) \ln \mathcal{N}\left(x_n ; 0, \epsilon^2\right)
\end{aligned}
$$
",
Harvard Spring 2021,4,d,2,Optimization,"[For this problem, write the probability density function for a Normal distribution as $\mathcal{N}\left(x ; \mu, \sigma^2\right)$; this denotes the value of the PDF for a Normal distribution with mean $\mu$ and variance $\sigma^2$ at some point $x$. There is no need to work with the actual expression for a Normal distribution.]
Suppose that a freezer that is used by HUDS contains a noisy sensor that sometimes malfunctions. The measurements are $\left\{x_n\right\}_{n=1}^N$, where each $x_n$ is a real number. Each measurement is sampled independently, according to the following distribution:
- With probability $\alpha, 0<\alpha<1$, the sensor works correctly and returns a value distributed as $\mathcal{N}\left(x ; \mu, \sigma^2\right)$, where $\mu$ is the true temperature and $\sigma>0$.
- With probability $1-\alpha$, the sensor fails and returns a value distributed as $\mathcal{N}\left(x ; 0, \epsilon^2\right)$, for some $\epsilon>0$.

The parameters of the model are $\{\alpha, \mu, \sigma, \epsilon\}$. For measurement $x_n$, we use variable $z_n$ to denote whether the sensor is functioning correctly $\left(z_n=1\right)$ or incorrectly $\left(z_n=0\right)$.


E-step: Derive an expression for the conditional probability,
$$
q_n=p\left(z_n=1 \mid x_n ; \alpha, \mu, \sigma, \epsilon\right) .
$$
[Give an exact expression, not something that is proportional to $q_n$.]
","The conditional probability $q_n$ is given by
$$
\begin{aligned}
q_n=p\left(z_n=1 \mid x_n ; \alpha, \mu, \sigma, \epsilon\right) &=\frac{p\left(x_n \mid z_n=1\right) p\left(z_n=1\right)}{p\left(x_n \mid z_n=1\right) p\left(z_n=1\right)+p\left(x_n \mid z_n=0\right) p\left(z_n=0\right)} \\
&=\frac{\alpha \mathcal{N}\left(x_n ; \mu, \sigma^2\right)}{\alpha \mathcal{N}\left(x_n ; \mu, \sigma^2\right)+(1-\alpha) \mathcal{N}\left(x_n ; 0, \epsilon^2\right)}
\end{aligned}
$$
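
As an illustrative sketch (not part of the original solution), the expression for $q_n$ can be evaluated numerically as follows. The function names, the parameter values, and the sample readings are all assumptions chosen only for demonstration.

```python
import math

def normal_pdf(x, mean, std):
    # Density of N(x; mean, std^2) evaluated at the point x.
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def responsibility(x, alpha, mu, sigma, eps):
    # q_n = p(z_n = 1 | x_n): posterior probability that the sensor worked.
    working = alpha * normal_pdf(x, mu, sigma)
    failed = (1.0 - alpha) * normal_pdf(x, 0.0, eps)
    return working / (working + failed)

# Hypothetical readings and parameters, for illustration only.
readings = [-18.2, -17.5, 0.3, -18.9]
q = [responsibility(x, alpha=0.9, mu=-18.0, sigma=1.0, eps=1.0) for x in readings]
```

With these assumed values, readings near $\mu$ get $q_n$ close to 1, while the reading near 0 gets $q_n$ close to 0.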
",
Harvard Spring 2021,4,e,2,Optimization,"[For this problem, write the probability density function for a Normal distribution as $\mathcal{N}\left(x ; \mu, \sigma^2\right)$; this denotes the value of the PDF for a Normal distribution with mean $\mu$ and variance $\sigma^2$ at some point $x$. There is no need to work with the actual expression for a Normal distribution.]
Suppose that a freezer that is used by HUDS contains a noisy sensor that sometimes malfunctions. The measurements are $\left\{x_n\right\}_{n=1}^N$, where each $x_n$ is a real number. Each measurement is sampled independently, according to the following distribution:
- With probability $\alpha, 0<\alpha<1$, the sensor works correctly and returns a value distributed as $\mathcal{N}\left(x ; \mu, \sigma^2\right)$, where $\mu$ is the true temperature and $\sigma>0$.
- With probability $1-\alpha$, the sensor fails and returns a value distributed as $\mathcal{N}\left(x ; 0, \epsilon^2\right)$, for some $\epsilon>0$.

The parameters of the model are $\{\alpha, \mu, \sigma, \epsilon\}$. For measurement $x_n$, we use variable $z_n$ to denote whether the sensor is functioning correctly $\left(z_n=1\right)$ or incorrectly $\left(z_n=0\right)$.


M-step: Derive an expression for the expected complete-data log likelihood,
$$
\mathbb{E}_{z \sim q}\left[\ln \left(p\left(\left\{x_n, z_n\right\}_{n=1}^N ; \alpha, \mu, \sigma, \epsilon\right)\right)\right] .
$$
Here, "" $z \sim q$ "" means "" $z_n$ is distributed according to $q_n$, for each $n$."" [Your answer should be expressed as sums of $\log$ terms.]
","The expected complete-data log likelihood is
$$
\begin{aligned}
E_{z \sim q} \ln \left(p\left(\left\{x_n, z_n\right\}_{n=1}^N ; \alpha, \mu, \sigma, \epsilon\right)\right)=& \sum_{n=1}^N\left(q_n \ln \left(\alpha \mathcal{N}\left(x_n ; \mu, \sigma^2\right)\right)+\left(1-q_n\right) \ln \left((1-\alpha) \mathcal{N}\left(x_n ; 0, \epsilon^2\right)\right)\right) \\
=& \sum_{n=1}^N\left[q_n \ln \alpha+q_n \ln \mathcal{N}\left(x_n ; \mu, \sigma^2\right)\right] \\
&+\sum_n\left[\left(1-q_n\right) \ln (1-\alpha)+\left(1-q_n\right) \ln \mathcal{N}\left(x_n ; 0, \epsilon^2\right)\right]
\end{aligned}
$$
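
As an illustrative sketch (not part of the original solution), the expected complete-data log likelihood can be computed directly from the sum above, given the responsibilities $q_n$. The function names and argument order are assumptions made for this example.

```python
import math

def log_normal_pdf(x, mean, std):
    # Log density of N(x; mean, std^2) evaluated at the point x.
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def expected_complete_ll(xs, qs, alpha, mu, sigma, eps):
    # Sum over n of q_n * ln(alpha * N(x_n; mu, sigma^2))
    #        + (1 - q_n) * ln((1 - alpha) * N(x_n; 0, eps^2)).
    total = 0.0
    for x, q in zip(xs, qs):
        total += q * (math.log(alpha) + log_normal_pdf(x, mu, sigma))
        total += (1.0 - q) * (math.log(1.0 - alpha) + log_normal_pdf(x, 0.0, eps))
    return total
```

In an M-step this quantity would be maximized over the parameters while the $q_n$ stay fixed at their E-step values.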
",
Harvard Spring 2021,4,f,1,Optimization,What is it about the M-step in typical applications that makes the EM algorithm convenient for working with models with latent variables?,"Typically, the M-step will have a solution via a closed form, analytic expression, making E-M very fast and robust.",
Harvard Spring 2021,5,a,2,Markov Decision Process,(Diagram) (Question),Solution,
Harvard Spring 2021,5,b,2,Markov Decision Process,(Diagram) (Question),Solution,
Harvard Spring 2021,5,c,2,Markov Decision Process,(Diagram) (Question),Solution,
Harvard Spring 2021,5,d,2,Markov Decision Process,(Diagram) (Question),Solution,
Harvard Spring 2021,5,e,1,Markov Decision Process,(Diagram) (Question),Solution,
Harvard Spring 2021,6,a,1,Reinforcement Learning,"What do we mean when we say that Q-learning and SARSA learning are ""model-free"" reinforcement learning methods?","Neither method learns $r(s,a)$ and $p(s'|s,a)$. They do not learn to predict the reward from an action or the next state distribution.",
Harvard Spring 2021,6,b,1,Reinforcement Learning,"True or False: The behavior of a Q-learning agent needs to be ""greedy in the limit"" for Q-learning to learn the Q-values corresponding to the optimal policy",FALSE,
Harvard Spring 2021,6,c,1,Reinforcement Learning,"Briefly, why are Q-learning and SARSA designed to learn Q-values rather than just MDP values $V(s)$; i.e. why learn ""state-action values"" rather than just ""state values""?","The value function $V(s)$ does not provide enough information, without also learning $r(s,a)$ and $p(s'|s,a)$, to know how to act! In comparison, $\pi(s)\in \arg\max_a Q(s,a)$ tells an agent how to act with Q-values.",
Harvard Spring 2021,6,d,1,Reinforcement Learning,"Consider an MDP with two states $S=\{\text{state1}, \text{state2}\}$ and two actions $\{\text{left}, \text{right}\}$ and an RL agent with the following Q-values:
$\begin{array}{ccc} & \text { left } & \text { right } \\ \text { state1 } & 6 & 4 \\ \text { state2 } & 2 & 3\end{array}$
Suppose the agent is in state1. What is the distribution over the action the agent takes when using an $\epsilon$-greedy policy that explores with probability $\epsilon>0$ ?
",With probability $1-\epsilon$ take action left; otherwise (with probability $\epsilon$) choose uniformly at random between left and right. Overall left is taken with probability $1-\epsilon/2$ and right with probability $\epsilon/2$.,
Harvard Spring 2021,6,e,1,Reinforcement Learning,"The update rule for Q-learning is as follows, where $\alpha$ is the learning rate and $\gamma$ the discount factor:
$$
Q(s, a) \leftarrow Q(s, a)+\alpha\left(r+\gamma \cdot \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)-Q(s, a)\right) .
$$
Suppose the agent takes action left in state1, and transitions to state2. Which $Q$-value (or $Q$-values) is updated, and with which action $a^{\prime}$ and state $s^{\prime}$ ?
","The agent updates Q(state1, left) and uses the value of Q(state2, right) = 3 for the update, since right maximizes the Q-value in state2; i.e. s'=state2 and a'=right.",
Harvard Spring 2021,6,f,2,Reinforcement Learning,State one advantage of SARSA over Q-learning and one advantage of Q-learning over SARSA,"Q over SARSA: off-policy, so it can learn the optimal policy even while continuing to adapt to the environment via $\epsilon$-greedy; Q-learning is also less susceptible to ""local minima"" or learning the wrong (suboptimal) policy than SARSA, since exploration in SARSA has to be coupled with ""greedy in the limit"". SARSA over Q-learning: simpler, since it does not have the max over $a'$ component; it also provides risk aversion."