Semester,Question Number,Part,Points,Topic,Type,Question,Solution
MIT Fall 2021,1,a.i,0.4,Features,Text,"Mac O’Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they’re able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
What is the best way to encode the app characteristic 'Genre (Game, Productivity, Education, Information, Social)' as a feature for an input to the neural network? Choose from among the following: multiple unary features (one-hot encoding), multiple binary features (thermometer encoding), an integer or real-valued feature. Also give the exact function that maps each input to its corresponding feature(s).","One-hot, with a bit for each possible genre: Game → 10000, Productivity → 01000, Education → 00100, Information → 00010, Social → 00001"
MIT Fall 2021,1,a.ii,0.4,Features,Text,"Mac O’Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they’re able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
What is the best way to encode the app characteristic 'Suitable for people ages (2–4, 5–10, 11–15, 16 and over)' as a feature for an input to the neural network? Choose from among the following: multiple unary features (one-hot encoding), multiple binary features (thermometer encoding), an integer or real-valued feature. Also give the exact function that maps each input to its corresponding feature(s).","Thermometer, because order should be preserved: 2–4 → 1000, 5–10 → 1100, 11–15 → 1110, 16 and over → 1111"
MIT Fall 2021,1,a.iii,0.4,Features,Text,"Mac O’Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they’re able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
What is the best way to encode the app characteristic 'Was it banned in any previous quarter (True, False)' as a feature for an input to the neural network? Choose from among the following: multiple unary features (one-hot encoding), multiple binary features (thermometer encoding), an integer or real-valued feature. Also give the exact function that maps each input to its corresponding feature(s).","Single binary feature, True: 1, False: 0. We also accepted a True/False encoding since Python correctly does arithmetic with it."
MIT Fall 2021,1,a.iv,0.4,Features,Text,"Mac O’Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they’re able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
What is the best way to encode the app characteristic 'Price of the app (positive number)' as a feature for an input to the neural network? Choose from among the following: multiple unary features (one-hot encoding), multiple binary features (thermometer encoding), an integer or real-valued feature. Also give the exact function that maps each input to its corresponding feature(s).","Real-valued; it may be standardized using (x − µ)/σ, where µ is the mean and σ the standard deviation"
MIT Fall 2021,1,a.v,0.4,Features,Text,"Mac O’Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they’re able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
What is the best way to encode the app characteristic 'Does it have in-game advertising (True, False)' as a feature for an input to the neural network? Choose from among the following: multiple unary features (one-hot encoding), multiple binary features (thermometer encoding), an integer or real-valued feature. Also give the exact function that maps each input to its corresponding feature(s).","Single binary feature, True: 1, False: 0. We also accepted a True/False encoding since Python correctly does arithmetic with it.
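A minimal sketch of the encodings chosen in parts (a.i) to (a.v), assuming plain Python lists as feature vectors. The helper names (one_hot, thermometer, binary, standardize) and the ASCII age-band labels are hypothetical, introduced only for illustration:

```python
# Category lists taken from the question text; ASCII hyphens are an assumption.
GENRES = ['Game', 'Productivity', 'Education', 'Information', 'Social']
AGE_BANDS = ['2-4', '5-10', '11-15', '16 and over']


def one_hot(genre):
    # Exactly one bit is set, at the index of the genre.
    return [1 if g == genre else 0 for g in GENRES]


def thermometer(age_band):
    # Bits 0..k are set for the k-th band, so the ordering is preserved.
    k = AGE_BANDS.index(age_band)
    return [1 if i <= k else 0 for i in range(len(AGE_BANDS))]


def binary(flag):
    # Python bools already act as 0/1 in arithmetic; int() makes it explicit.
    return int(flag)


def standardize(x, mu, sigma):
    # (x - mu) / sigma, with mu and sigma estimated from the training data.
    return (x - mu) / sigma
```

For example, thermometer('11-15') gives [1, 1, 1, 0], matching the 1110 code for that band.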
"
MIT Fall 2021,1,b.i,0.333333333,Neural Networks,Text,"Mac O’Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they’re able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac wants to predict the sales volume (how many times someone will purchase the app each month) for his new app. The sales volume can be negative if many people returned the app for a refund in a given month. What should Mac choose for the number of units in the output layer?",One unit
MIT Fall 2021,1,b.ii,0.333333333,Neural Networks,Text,"Mac O’Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they’re able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac wants to predict the sales volume (how many times someone will purchase the app each month) for his new app. The sales volume can be negative if many people returned the app for a refund in a given month. What should Mac choose for the activation function(s) in the output layer? Choose either Linear, ReLU or sigmoid.",Linear
MIT Fall 2021,1,b.iii,0.333333333,Neural Networks,Text,"Mac O’Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they’re able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac wants to predict the sales volume (how many times someone will purchase the app each month) for his new app. The sales volume can be negative if many people returned the app for a refund in a given month. What should Mac choose for the loss function? Choose from either negative log likelihood or quadratic.",Quadratic
MIT Fall 2021,1,c.i,0.333333333,Neural Networks,Text,"Mac O’Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they’re able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac has data on three other properties: whether an app was featured on the front page, whether it got a favorable review on the Coolest Apps Evar web site, and whether Orange Computer offered to pay to port the app to their site. He would like to train a new neural network to predict these three properties. For this new prediction task, what should Mac choose for the number of units in the output layer?",3 units
MIT Fall 2021,1,c.ii,0.333333333,Neural Networks,Text,"Mac O’Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they’re able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac has data on three other properties: whether an app was featured on the front page, whether it got a favorable review on the Coolest Apps Evar web site, and whether Orange Computer offered to pay to port the app to their site. He would like to train a new neural network to predict these three properties. For this new prediction task, what should Mac choose for the activation function in the output layer? Choose from linear, ReLU, sigmoid or softmax.",Sigmoid
MIT Fall 2021,1,c.iii,0.333333333,Neural Networks,Text,"Mac O’Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they’re able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac has data on three other properties: whether an app was featured on the front page, whether it got a favorable review on the Coolest Apps Evar web site, and whether Orange Computer offered to pay to port the app to their site. He would like to train a new neural network to predict these three properties. For this new prediction task, what should Mac choose for the loss function? Choose from negative log likelihood or quadratic.",Negative log likelihood
MIT Fall 2021,1,d.i,1,Neural Networks,Text,"Mac O’Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they’re able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac’s first attempt at machine learning to predict the sales volume (setup of (b)) uses all customer data from 2020. He randomly partitions the data into train (80%) and validation (20%), and uses one unit, linear activation function, and quadratic loss function. To prevent overfitting, he uses ridge regularization of the weights W, minimizing the optimization objective $J(W; \lambda) = \sum_{i=1}^n \mathcal{L}(h(x^{(i)}; W), y^{(i)}) + \lambda \|W\|^2$ where $\|W\|^{2}$ is the sum of the squares of all output units' weights. Mac discovers that it’s possible to find a value of W such that J(W; λ) = 0 even when λ is very large, nearing ∞. Mac suspects that he might have an error in the code that he wrote to derive the labels (i.e., the monthly sales volumes). Let’s see why. What can Mac conclude about W from this finding?",Every element of W equals 0.
MIT Fall 2021,1,d.ii,1,Neural Networks,Text,"Mac O’Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they’re able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
Mac’s first attempt at machine learning to predict the sales volume (setup of (b)) uses all customer data from 2020. He randomly partitions the data into train (80%) and validation (20%), and uses one unit, linear activation function, and quadratic loss function. To prevent overfitting, he uses ridge regularization of the weights W, minimizing the optimization objective $J(W; \lambda) = \sum_{i=1}^n \mathcal{L}(h(x^{(i)}; W), y^{(i)}) + \lambda \|W\|^2$ where $\|W\|^{2}$ is the sum of the squares of all output units' weights. Mac discovers that it’s possible to find a value of W such that J(W; λ) = 0 even when λ is very large, nearing ∞. Mac suspects that he might have an error in the code that he wrote to derive the labels (i.e., the monthly sales volumes). If every element of W equals 0, what does this imply about the labels?","When W has all entries equal to 0, the prediction at every data point is a constant (the offset). The only way for the squared error to be 0 is for the label of every data point to equal that offset. It seems unlikely that every data label would be exactly the same in this data set, which we assume covers a wide variety of apps."
MIT Fall 2021,1,e,1,Neural Networks,Image,"Mac found and fixed the error. Now, to choose the regularization constant $\lambda$, Mac tried values of $1, 10, 100$, and $1000$, creating the plot below. Unfortunately, he forgot to label the legend! Help Mac by filling in the legend using two of the following: 'Training error', 'Validation error', 'Training time'.",Image filling
MIT Fall 2021,1,f,1,Neural Networks,Image,"Continuing the scenario of (e), which value of $\lambda$ (out of $1, 10, 100$, and $1000$) should Mac choose to obtain the neural network that he will deploy on the app store, and why?","$\lambda=100$, because the validation error is lowest at this value."
MIT Fall 2021,1,g,2,Neural Networks,Text,"Mac O’Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they’re able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning.
When Mac wakes up the next day, he decides to re-run learning using λ = 100, now with a different partition of the data into train and validation sets (since he had previously forgotten to set the random seed). He finds that he gets a very different validation error! To obtain a more stable estimate, Mac decides to split the data into 5 disjoint chunks of 20% of the data. For each chunk, he evaluates on it after training on the union of the other 4 chunks. He gets the following results for the average error within each chunk: 0.15, 0.3, 0.1, 0.2, 0.25. What can Mac conclude is an estimate of the test error of the neural network?",0.2 (the average). This is cross-validation.
MIT Fall 2021,1,h,2,Neural Networks,Image,"The initial results look promising. Mac now wants to add in data from additional, earlier, years. (He is confident his customers have been behaving similarly over many years, so the earlier data is relevant.)

Before curating the older data, Mac decides to use the training data that he has to get a sense of whether more data would help. He creates a learning curve where on the horizontal axis he varies the amount of training data used and on the vertical axis he shows the validation error, using a fixed validation set across all settings considered. He experiments with $\lambda=1,10,100$, but again forgot to include a legend. Fill in the below legend by labeling the curves with the value of $\lambda$ that each corresponds to:",Image filling
MIT Fall 2021,1,i,1,Neural Networks,Image,Based on these plots does it seem likely that even more data will improve validation error (possibly for a different value of $\lambda$)? Explain why or why not.,"Yes, because the validation error continues to decline as the amount of regularization decreases and the amount of data increases. With more data and $\lambda=0$, it is conceivable that the validation error will be even smaller."
MIT Fall 2021,1,j,1,Neural Networks,Text,"Mac O’Larnin is considering selling an app on Frugal Play. You have a friend with inside info at Frugal, and they’re able to share data on how previous apps have performed on the store. Mac decides that he will learn a neural network with no hidden layer (i.e., consisting only of the output layer). He needs help in figuring out the precise formulation for machine learning. Mac experiments with even more training data and additional values of λ, but finds that he cannot decrease the validation error further. Are there changes to the neural network architecture that Mac could make to try to improve prediction performance? Explain.","Mac could add hidden layers with nonlinear activation functions to the neural network."
MIT Fall 2021,2,a,2,Clustering,Image,"Assume that the number of clusters $k=2$ for all of the following questions. 
Walk through each step of the $k$-means algorithm, beginning with the initialization shown in the plot in the top left of the box below. Dots show the observed data. In each plot (go left to right, top to bottom), mark with two '$x$' symbols where the cluster centers are in that iteration of $k$-means. These are already shown in the initial state. Once the $k$-means algorithm has converged, you can leave all subsequent plots unmarked.",Image filling
MIT Fall 2021,2,b,1,Clustering,Image,"Assume that the number of clusters $k=2$ for all of the following questions. 
What is the numerical value of the $k$-means objective for the clustering found in (a), after the algorithm has finished running?",$8 \cdot\left(1^{2}+0.5^{2}\right)=8 \cdot(1.25)=10$
MIT Fall 2021,2,c,1,Clustering,Image,"Assume that the number of clusters $k=2$ for all of the following questions. 
Just as in (a), walk through each step of the $k$-means algorithm, beginning with the initialization shown in the plot in the top left. In each plot (go left to right, top to bottom), mark with two '$x$' symbols where the cluster centers are in that iteration of $k$-means. Once the $k$-means algorithm has converged, you can leave all subsequent figures unmarked.",Image filling
MIT Fall 2021,2,d,1,Clustering,Image,"Assume that the number of clusters $k=2$ for all of the following questions. 
What is the numerical value of the $k$-means objective for the clustering found in (c), after the algorithm has finished running?",$4 \cdot\left(2^{2}\right)+4 \cdot\left(3^{2}\right)=16+36=52$.
MIT Fall 2021,2,e,1,Clustering,Image,"Assume that the number of clusters $k=2$ for all of the following questions. 
According to the $k$-means objective of the learned clusters, which initialization was better?",Initialization (a)
MIT Fall 2021,2,f,2,Clustering,Image,"Assume that the number of clusters $k=2$ for all of the following questions. 
Consider the data in black dots shown in the plot below. We drew one cluster center with an $x$ symbol at $(2,4)$. Draw the second cluster center to satisfy the following property. When we initialize the cluster centers at the two $x$'s and run the $k$-means algorithm to convergence, the final state will be such that one cluster will have all the data points assigned to it, and the other cluster will have no data points assigned to it.","There are two correct answers, either $(0,0)$ or $(0,8)$."
MIT Fall 2021,2,g,2,Clustering,Text,"Assume that the number of clusters k = 2. Christy thinks she came up with a compelling new initialization method for the $k$-means algorithm. Looking at her code below, explain why it is unlikely to give good results.
\begin{lstlisting}[language=Python]
import numpy as np

def kmeans_init(X, n_clusters):
    centers = []
    for i in range(n_clusters):
        centers.append(X[:, X.shape[1]-1-i])
    return np.asarray(centers).T
\end{lstlisting}","Christy's method selects the last n\_clusters data points as the cluster centers. These points may be very close to each other, leading the $k$-means algorithm to converge to a poor local optimum of the $k$-means objective."
MIT Fall 2021,2,h,2,Clustering,Image,"Assume that the number of clusters $k=2$ for all of the following questions. 
Each of the following five data sets has two ground truth clusters, whose points are denoted as '+' and '$o$'. For which of these would the clustering with the smallest $k$-means objective value not recover the ground truth? Assume $k=2$. (Select all that apply.)","(II), (III), (IV), (V)"
MIT Fall 2021,3,a,1,Decision Trees,Text,"We seek to learn a classifier on the following data set given in (point, class) format: ((-3,6),-1),((-1,6),-1),((2,6),+1),((4,6),-1),((-3,5),-1),((-1,5),-1),((2,5),+1),((4,5),-1),((2,3),+1),((4,3),-1),((2,2),+1),((4,2),-1),((-1,1),+1). 
We first learn a linear logistic classifier with offset on this data set, with no regularization. Will it obtain zero training error? Write “yes” or “no” and explain your answer","No, this data set is not linearly separable.
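One way to see this, sketched below with illustrative variable names: the +1 point (2,6) lies on the segment between the -1 points (-1,6) and (4,6). A linear classifier that labels both endpoints -1 must label every point on that segment -1, so it cannot also label (2,6) as +1.

```python
# Two -1 points and one +1 point, all taken from the data set in the question.
neg_a, neg_b = (-1, 6), (4, 6)
pos = (2, 6)

# pos = t * neg_a + (1 - t) * neg_b with t = 0.4, i.e. pos is a convex
# combination of two -1 points.
t = 0.4
combo = tuple(t * a + (1 - t) * b for a, b in zip(neg_a, neg_b))

# Compare with a small tolerance because t is a float.
on_segment = all(abs(c - p) < 1e-9 for c, p in zip(combo, pos))
print(on_segment)
```

This checks only one witness triple; finding one such triple is enough to rule out linear separability.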
"
MIT Fall 2021,3,b,2,Decision Trees,Image,"We seek to learn a classifier on the data set shown on the left with 13 data points labeled $+1$ or $-1$. For your convenience, we include some helpful calculations in the table to the right. 
We now learn a depth-2 decision tree, with min_samples_split=2. We give you a partially completed tree below, where the first split is $x_{1} \geq 3$. Complete the rest of the tree by filling in the boxes with the splits on the second level and the classifications (either $+1$ or $-1$) at the leaves. Use the entropy criterion to choose the splits, and leave empty any boxes that are unused. As a reminder, min_samples_split is the minimum number of data points required to split an internal node.",yes
MIT Fall 2021,3,c,2,Decision Trees,Image,"We seek to learn a classifier on the data set shown on the left with 13 data points labeled $+1$ or $-1$. For your convenience, we include some helpful calculations in the table to the right. 
Now suppose that we set min_samples_split=10. Again, complete the rest of the tree by filling in the boxes. Leave empty any boxes that are unused.",Image filling
MIT Fall 2021,3,d,1,Decision Trees,Text,"In decision trees, what is the purpose of increasing the minimum samples in each split?","It improves generalization (i.e., prevents overfitting to the training data) by requiring more samples to split a node, resulting in smaller tree depth."
MIT Fall 2021,3,e,2,Decision Trees,Image,"We seek to learn a classifier on the data set shown on the left with 13 data points labeled $+1$ or $-1$. For your convenience, we include some helpful calculations in the table to the right. 
What is the training error of the trees learned in parts (b) and (c)?","Tree (b): 1/13.
Tree (c): 4/13."
MIT Fall 2021,3,f,2,Decision Trees,Image,"We seek to learn a classifier on the data set shown on the left with 13 data points labeled $+1$ or $-1$. For your convenience, we include some helpful calculations in the table to the right. 
With min_samples_split=2, if we were to continue building the tree without any restriction to its depth, what would be the training error of the resulting tree?",0
MIT Fall 2021,3,g,2,Decision Trees,Image,"We seek to learn a classifier on the data set shown on the left with 13 data points labeled $+1$ or $-1$. For your convenience, we include some helpful calculations in the table to the right. 
Suppose we give as new features $x_{i}^{3}$, using these in addition to the original features $x_{i}$. Draw the new depth-2 tree that would be learned. Assume the features are organized $x_{1}, x_{2}, x_{1}^{3}, x_{2}^{3}$ and if two features are equally good for the split according to the entropy criterion, then we choose the first one in this order. As in part (b), assume min_samples_split=2.",Image filling
MIT Fall 2021,4,a,3,Classifiers,Image,"This question asks about learning nearest neighbor (NN) classifiers. Assume that we are using Euclidean distance squared as the distance metric, i.e. $d\left(x, x^{\prime}\right)=\left\|x-x^{\prime}\right\|^{2}$. 
Draw on the below figure the decision boundary for a 1-NN classifier on this data set. In each region, denote whether the classification of any point (any point, not just the training data) in that region would be $+1$ or $-1$. (Note, all data points are assumed to be on integer coordinates.)",Image filling
MIT Fall 2021,4,b,2,Classifiers,Image,"This question asks about learning nearest neighbor (NN) classifiers. Assume that we are using Euclidean distance squared as the distance metric, i.e. $d\left(x, x^{\prime}\right)=\left\|x-x^{\prime}\right\|^{2}$. 
 Which training data points, if any, could you remove and keep the decision boundary identical? Answer using their $\left(x_{1}, x_{2}\right)$ coordinates.",Image filling
MIT Fall 2021,4,c,2,Classifiers,Image,"This question asks about learning nearest neighbor (NN) classifiers. Assume that we are using Euclidean distance squared as the distance metric, i.e. $d\left(x, x^{\prime}\right)=\left\|x-x^{\prime}\right\|^{2}$. 
You perform leave-one-out cross-validation of the 1-NN and 3-NN classifiers on this data set, i.e. you use cross-validation with a chunk size of 1 data point. Assume ties go to the $+1$ region. What cross-validation errors do you obtain?",Image filling
MIT Fall 2021,4,d,2,Classifiers,Image,"This question asks about learning nearest neighbor (NN) classifiers. Assume that we are using Euclidean distance squared as the distance metric, i.e. $d\left(x, x^{\prime}\right)=\left\|x-x^{\prime}\right\|^{2}$. 
 Suppose we now use the following feature transformation, $\phi\left(x_{1}, x_{2}\right)=x_{1} x_{2}$, and seek to learn a nearest neighbor classifier in the transformed space. This is equivalent to using a different distance metric, $d\left(x, x^{\prime}\right)=\left\|\phi(x)-\phi\left(x^{\prime}\right)\right\|^{2}$. What is the average leave-one-out cross-validation error of a 3-NN classifier using this new distance metric? Which points would be misclassified (specified using their $\left(x_{1}, x_{2}\right)$ coordinates)?","3-NN: $1/13$. Misclassified points: $(-1,1)$"
MIT Fall 2021,4,e,3,Classifiers,Image,"This question asks about learning nearest neighbor (NN) classifiers. Assume that we are using Euclidean distance squared as the distance metric, i.e. $d\left(x, x^{\prime}\right)=\left\|x-x^{\prime}\right\|^{2}$. 
The plots below, labeled (I) through (IV), show the decision boundaries predicted by a $k$-NN classifier for four different values of $k$: $1, 5, 20, 40$. Map each plot to the corresponding value of $k$.",Image filling
MIT Fall 2021,5,a.i,0.666666667,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and 50% chance of being partially fit.
What action should Ena choose when fully fit with a horizon of 1?",play
MIT Fall 2021,5,a.ii,0.666666667,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and 50% chance of being partially fit.
What action should Ena choose when partially fit with a horizon of 1?",play
MIT Fall 2021,5,a.iii,0.666666667,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and 50% chance of being partially fit.
What action should Ena take when injured with a horizon of 1?",break
MIT Fall 2021,5,b,3,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and 50% chance of being partially fit.
When the horizon is 2 and Ena is partially fit what is the expected reward for taking the best action when partially fit?","Train, 58"
MIT Fall 2021,5,c,2,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and a 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and a 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and a 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and a 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and a 50% chance of being partially fit. When the horizon is 2 and Ena is partially fit, what is the best action and what is its expected reward with a discount factor of 0.5?","Play, 25"
MIT Fall 2021,5,d.i,0.666666667,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and a 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and a 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and a 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and a 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and a 50% chance of being partially fit.
When Ena is fully fit, what is the infinite horizon optimal policy?",play
MIT Fall 2021,5,d.ii,0.666666667,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and a 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and a 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and a 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and a 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and a 50% chance of being partially fit.
When Ena is partially fit, what is the infinite horizon optimal policy?",train
MIT Fall 2021,5,d.iii,0.666666667,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and a 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and a 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and a 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and a 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and a 50% chance of being partially fit.
When Ena is injured, what is the infinite horizon optimal policy?",break
MIT Fall 2021,5,e,2,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and a 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and a 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and a 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and a 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and a 50% chance of being partially fit.
Is there any policy which maximizes the expected reward in the infinite horizon under which Ena should play if injured? Explain.","No. When injured, playing and training both have a negative reward and both keep Ena in the injured state, whereas taking a break has reward 0 and gives a 50% chance of becoming partially fit."
MIT Fall 2021,5,f,3,MDPs,Text,"Ena can be in one of three states: fit, partially fit, or injured. Ena can choose 3 actions: play, train or break. The discount factor is 1. When Ena is fully fit and decides to play the reward is +100. When Ena is fully fit and decides to train the reward is -10. When Ena is fully fit and decides to break there is no reward. When Ena is partially fit and decides to play the reward is +20. When Ena is partially fit and decides to train the reward is -10. When Ena is partially fit and decides to break the reward is -20. When Ena is injured and decides to play the reward is -60. When Ena is injured and decides to train the reward is -30. When Ena is injured and decides to break the reward is 0. When Ena is fully fit and decides to play there is an 80% chance of remaining fully fit and a 20% chance of getting injured. When Ena is fully fit and decides to train there is a 90% chance of remaining fully fit and a 10% chance of getting injured. When Ena is fully fit and decides to break there is a 50% chance of remaining fully fit and a 50% chance of being partially fit. When Ena is partially fit and decides to play there is a 50% chance of remaining partially fit and a 50% chance of getting injured. When Ena is partially fit and decides to train there is a 40% chance of remaining partially fit and a 60% chance of getting fully fit. When Ena is partially fit and decides to break there is a 100% chance of remaining partially fit. When Ena is injured and decides to play there is a 100% chance of remaining injured. When Ena is injured and decides to train there is a 100% chance of remaining injured. When Ena is injured and decides to break there is a 50% chance of remaining injured and a 50% chance of being partially fit.
Djo Ko is another athlete who plays the same sport. Djo Ko has the exact same MDP as Ena’s, except Djo’s team has forgotten the reward for playing when in the fully fit state. Djo’s team also remembers that the horizon 2 best action to take in the partially fit state is exactly the same as that for Ena (determined in part b). Given this information, what is the range of possible values for R(fully fit, play) for Djo Ko? Assume a discount factor of 1.","R(fully fit, play) > 32∗10/6 = 53.33."
MIT Fall 2021,6,a,3,Neural Networks,Text,"A neural network takes in an input $x = (x_1, x_2)$ and outputs $\hat{y} = a x_1 + b x_2$. The loss function is given as $L(\hat{y}, y) = \left(y-\hat{y}\right)^2$, and the true labels are $y = x_1 + x_2$. Suppose $a_0$ and $b_0$ are the initial values of the weights, and $a_k$ and $b_k$ are the weights at iteration $k$. Give equations for the updated weights $a_{k+1}$, $b_{k+1}$ in terms of the current iteration's weights $a_{k}$, $b_{k}$, the step size parameter $\eta$, and the inputs $x_1$, $x_2$.","a_{k+1} = a_k − η*dL/da = a_k − 2η*[(a_k − 1)*x^2_1 + (b_k − 1)*x_1*x_2]
b_{k+1} = b_k − η*dL/db = b_k − 2η*[(b_k − 1)*x^2_2 + (a_k − 1)*x_1*x_2]"
MIT Fall 2021,6,b,2,Neural Networks,Image,"Years ago, MIT student Itu Nes learned about neural networks and how to train them, from taking 6.036. Now Itu is an engineer at Orange Computer, a hot tech company employing machine learning to revolutionize music. Looking back at her notes, Itu realizes that she once wrote down exactly what she now needs to do in her job, but unfortunately some key details are lost. Can you help her figure things out?
Specifically, Itu wants to train this simple single-node neural network:
The network accepts two inputs $x_{1}$ and $x_{2}$, and outputs a prediction $\hat{y}$ based on weights $a$ and $b$. Itu's dataset has points $(x, y)$ where $x=\left(x_{1}, x_{2}\right)$, and $y$ are the true labels. Itu employs the squared error loss function
$$
L(\hat{y}, y)=(y-\hat{y})^{2}
$$
In her notes, Itu wrote about using gradient descent to obtain the optimal weights for the network, by minimizing this loss. Moreover, for each run of the gradient descent, she used a single data point to train the weights. Afterwards, Itu learns that the true labels are $y=x_{1}+x_{2}$. 
Suppose $a_{0}$ and $b_{0}$ are the initial values of the weights, and $a_{k}$ and $b_{k}$ are the weights at iteration $k$. Give equations for the updated weights $a_{k+1}, b_{k+1}$ in terms of the current iteration's weights $a_{k}, b_{k}$, the step size parameter $\eta$, and the inputs $x_{1}, x_{2}$.",Latex
MIT Fall 2021,6,c,2,Neural Networks,Image,"Years ago, MIT student Itu Nes learned about neural networks and how to train them, from taking 6.036. Now Itu is an engineer at Orange Computer, a hot tech company employing machine learning to revolutionize music. Looking back at her notes, Itu realizes that she once wrote down exactly what she now needs to do in her job, but unfortunately some key details are lost. Can you help her figure things out?
Specifically, Itu wants to train this simple single-node neural network:
The network accepts two inputs $x_{1}$ and $x_{2}$, and outputs a prediction $\hat{y}$ based on weights $a$ and $b$. Itu's dataset has points $(x, y)$ where $x=\left(x_{1}, x_{2}\right)$, and $y$ are the true labels. Itu employs the squared error loss function
$$
L(\hat{y}, y)=(y-\hat{y})^{2}
$$
In her notes, Itu wrote about using gradient descent to obtain the optimal weights for the network, by minimizing this loss. Moreover, for each run of the gradient descent, she used a single data point to train the weights. Afterwards, Itu learns that the true labels are $y=x_{1}+x_{2}$. 
Itu sees that when she fixed $x_{1}=1, x_{2}=1$ and ran 10 iterations of gradient descent starting with $a_{0}=2, b_{0}=2$, she recorded that the two weights oscillated back and forth, as captured in this plot pasted into her notebook:

Note that in this plot, the $a$ and $b$ points lie on top of each other. Unfortunately, Itu forgot to write down her code, and she did not write down what value of $\eta$ may have been used to generate this plot. Help her figure out: was this plot a mistake (and explain why), or if not, what value of $\eta$ could have generated it?","This oscillation happens when $\eta=1 / 2$, because $d L / d a=d L / d b=\pm 4$ with the sign alternating between iterations, so each update moves both weights by $2$ (between $2$ and $0$)."
MIT Fall 2021,6,d,3,Neural Networks,Image,"Years ago, MIT student Itu Nes learned about neural networks and how to train them, from taking 6.036. Now Itu is an engineer at Orange Computer, a hot tech company employing machine learning to revolutionize music. Looking back at her notes, Itu realizes that she once wrote down exactly what she now needs to do in her job, but unfortunately some key details are lost. Can you help her figure things out?
Specifically, Itu wants to train this simple single-node neural network:
The network accepts two inputs $x_{1}$ and $x_{2}$, and outputs a prediction $\hat{y}$ based on weights $a$ and $b$. Itu's dataset has points $(x, y)$ where $x=\left(x_{1}, x_{2}\right)$, and $y$ are the true labels. Itu employs the squared error loss function
$$
L(\hat{y}, y)=(y-\hat{y})^{2}
$$
In her notes, Itu wrote about using gradient descent to obtain the optimal weights for the network, by minimizing this loss. Moreover, for each run of the gradient descent, she used a single data point to train the weights. Afterwards, Itu learns that the true labels are $y=x_{1}+x_{2}$. 
Itu sees that when she fixed $x_{1}=1, x_{2}=1$ and ran 10 iterations of gradient descent starting with $a_{0}=2, b_{0}=0$, she recorded that the two weights remained unchanged, as captured in this plot pasted into her notebook:

Again: was this plot a mistake (and explain why), or if not, what value of $\eta$ could have generated it?","Any $\eta$, e.g. $\eta=5$, because $d L / d a=0$ and $d L / d b=0$ for these parameters. Alternatively $\eta=0$ will also leave $a$ and $b$ at their initial values."
MIT Fall 2021,6,e,2,Neural Networks,Image,"Years ago, MIT student Itu Nes learned about neural networks and how to train them, from taking 6.036. Now Itu is an engineer at Orange Computer, a hot tech company employing machine learning to revolutionize music. Looking back at her notes, Itu realizes that she once wrote down exactly what she now needs to do in her job, but unfortunately some key details are lost. Can you help her figure things out?
Specifically, Itu wants to train this simple single-node neural network:
The network accepts two inputs $x_{1}$ and $x_{2}$, and outputs a prediction $\hat{y}$ based on weights $a$ and $b$. Itu's dataset has points $(x, y)$ where $x=\left(x_{1}, x_{2}\right)$, and $y$ are the true labels. Itu employs the squared error loss function
$$
L(\hat{y}, y)=(y-\hat{y})^{2}
$$
In her notes, Itu wrote about using gradient descent to obtain the optimal weights for the network, by minimizing this loss. Moreover, for each run of the gradient descent, she used a single data point to train the weights. Afterwards, Itu learns that the true labels are $y=x_{1}+x_{2}$. 
Itu sees that when she fixed $x_{1}$ and $x_{2}$ and ran 10 iterations of gradient descent with $\eta=0.01$ starting with $a_{0}=b_{0}=2$, she recorded that $b$ stayed unchanged, but $a$ decayed to 1, as captured in this plot pasted into her notebook:

Again: was this plot a mistake (and explain why), or if not, what values of $x_{1}, x_{2}$ could have generated it?","Resulted from choosing $x_{1}=4, x_{2}=0$ (other nonzero, positive values of $x_{1}$ also work)"
MIT Fall 2021,7,a.i,1,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). Specifically, the input $X$ is a $4 \times 1$ column vector, $\hat{y}$ is a $1\times 1$ scalar. $W^2$ is a $3 \times 1$ vector. We also know that, $Z^1 = (W^1)^T X$ and $Z^2 = (W^2)^T A^1$. What are the dimensions of the matrix $W^1$?",4x3
MIT Fall 2021,7,a.ii,1,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). Specifically, the input $X$ is a $4 \times 1$ column vector, $\hat{y}$ is a $1\times 1$ scalar. $W^2$ is a $3 \times 1$ vector. We also know that, $Z^1 = (W^1)^T X$ and $Z^2 = (W^2)^T A^1$. What are the dimensions of $Z^2$?",1x1
MIT Fall 2021,7,b.i,1,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). There is only one data point, which is $X = [1, 1, 1, 1]^T$ and $y = [1]$. If $W^1$ and $W^2$ are both matrices/vectors of all ones, what is the resulting loss, where the loss is $L = (y - \hat{y})^2$?",121
MIT Fall 2021,7,b.ii,1,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). There is only one data point, which is $X = [1, 1, 1, 1]^T$ and $y = [1]$. If $W^1$ is a matrix of all $-1$’s (all negative ones) and $W^2$ is a vector of all $1$’s (positive ones), what is the resulting loss, where the loss is $L = (y - \hat{y})^2$?",1
MIT Fall 2021,7,c.i,2,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). Determine the expression for $\frac{\partial L}{\partial W^1}$. You may leave your expression in terms of $X, y, \hat{y}, W^2$ and $\frac{\partial A^1}{\partial Z^1}$.",∂L/∂W^1 = −2X(∂A^1/∂Z^1 * W^2 * (y − ŷ))^T
MIT Fall 2021,7,c.ii,2,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). Specifically, the input $X$ is a $4 \times 1$ column vector, $\hat{y}$ is a $1\times 1$ scalar. $W^2$ is a $3 \times 1$ vector. We also know that, $Z^1 = (W^1)^T X$ and $Z^2 = (W^2)^T A^1$. What are the dimensions of $\frac{\partial L}{\partial W^1}$",4x3
MIT Fall 2021,7,d.i,1,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). We now use back-propagation to update the weights during each iteration. Assume that we only have one data point (X, y) available to use, and the step size parameter is 0.01. Assume $X = [1, 1, 1, 1]^T, y = [1]$. Further assume that we start with $W^1$ as a matrix of $-1$’s (negative ones) while $W^2$ is a vector of $1$’s (positive ones). How many components of $W^1$ will get updated (i.e. have their value changed) after one iteration of back-propagation?",Zero components
MIT Fall 2021,7,d.ii,1,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). We now use back-propagation to update the weights during each iteration. Assume that we only have one data point (X, y) available to use, and the step size parameter is 0.01. Assume $X = [0, 0, 0, 0]^T, y = [0]$. Further assume that we start off with $W^1$ and $W^2$ as matrices/vectors of all ones. How many components of $W^1$ will get updated (i.e. have their value changed) after one iteration of back-propagation?",Zero components
MIT Fall 2021,7,d.iii,1,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). We now use back-propagation to update the weights during each iteration. Assume that we only have one data point (X, y) available to use, and the step size parameter is 0.01. Assume $X = [1, 1, 1, 1]^T, y = [1]$. Further assume that we start off with $W^1$ and $W^2$ as matrices/vectors of all ones. How many components of $W^1$ will get updated (i.e. have their value changed) after one iteration of back-propagation?",All components (12)
MIT Fall 2021,7,d.iv,1,Neural Networks,Text,"A neural network is given as Z1 = X * W1, A1 = f1(Z1), Z2 = W2 * A1, \hat{y} = f2(Z2). We now use back-propagation to update the weights during each iteration. Assume that we only have one data point (X, y) available to use, and the step size parameter is 0.01. Assume $X = [1, 1, 1, 1]^T, y = [1]$. Further assume that we start off with $W^1$ as a matrix of all ones and $W^2 = [0, 1, 0]^T$. How many components of $W^1$ will get updated (i.e. have their value changed) after one iteration of back-propagation?",4 components
MIT Fall 2021,8,a.i,0.166666667,CNNs,Text,"MIT grad student Rec Urrent would like to submit an entry to win this year’s Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel 3 × 3 images of 2D tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one 2 × 2 filter. Let’s help Rec win this competition.
What are the dimensions of the output image if a 2x2 filter is convolved with a 3x3 image for padding of 0 and stride of 1?",2x2
MIT Fall 2021,8,a.ii,0.166666667,CNNs,Text,"MIT grad student Rec Urrent would like to submit an entry to win this year’s Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel 3 × 3 images of 2D tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one 2 × 2 filter. Let’s help Rec win this competition.
What are the dimensions of the output image if a 2x2 filter is convolved with a 3x3 image for padding of 1 and stride of 1?",4x4
MIT Fall 2021,8,a.iii,0.166666667,CNNs,Text,"MIT grad student Rec Urrent would like to submit an entry to win this year’s Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel 3 × 3 images of 2D tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one 2 × 2 filter. Let’s help Rec win this competition.
What are the dimensions of the output image if a 2x2 filter is convolved with a 3x3 image for padding of 2 and stride of 1?",6x6
MIT Fall 2021,8,a.iv,0.166666667,CNNs,Text,"MIT grad student Rec Urrent would like to submit an entry to win this year’s Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel 3 × 3 images of 2D tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one 2 × 2 filter. Let’s help Rec win this competition.
What are the dimensions of the output image if a 2x2 filter is convolved with a 3x3 image for padding of 0 and stride of 2?",1x1
MIT Fall 2021,8,a.v,0.166666667,CNNs,Text,"MIT grad student Rec Urrent would like to submit an entry to win this year’s Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel 3 × 3 images of 2D tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one 2 × 2 filter. Let’s help Rec win this competition.
What are the dimensions of the output image if a 2x2 filter is convolved with a 3x3 image for padding of 1 and stride of 2?",2x2
MIT Fall 2021,8,a.vi,0.166666667,CNNs,Text,"MIT grad student Rec Urrent would like to submit an entry to win this year’s Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel 3 × 3 images of 2D tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one 2 × 2 filter. Let’s help Rec win this competition.
What are the dimensions of the output image if a 2x2 filter is convolved with a 3x3 image for padding of 2 and stride of 2?",3x3
MIT Fall 2021,8,b.i,0.5,CNNs,Text,When performing binary classification what activation function should one use in the final output layer?,sigmoid
MIT Fall 2021,8,b.ii,0.5,CNNs,Text,When performing binary classification what loss function should one use?,negative log likelihood loss
MIT Fall 2021,8,c.i,0.5,CNNs,Text,"If Rec wants to allow for more than two classes when performing classification, which activation function should they use in the final output layer?",softmax
MIT Fall 2021,8,c.ii,0.5,CNNs,Text,"If Rec wants to allow for more than two classes when performing classification, which loss function should they use?",cross entropy
MIT Fall 2021,8,d.i,0.25,CNNs,Text,w is the weight vector of the classifier network. What are the dimensions of w for binary classification?,"w = [1,1]"
MIT Fall 2021,8,d.ii,0.25,CNNs,Text,b is the bias of the classifier network. What are the dimensions of b for binary classification?,b = 1
MIT Fall 2021,8,d.iii,0.25,CNNs,Text,w is the weight vector of the classifier network. What are the dimensions of w for k-class classification?,"w = [1, k]"
MIT Fall 2021,8,d.iv,0.25,CNNs,Text,b is the bias of the classifier network. What are the dimensions of b for k-class classification?,b = k
MIT Fall 2021,8,e,2,CNNs,Image,"MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel $3 \times 3$ images of $2 \mathrm{D}$ tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one $2 \times 2$ filter. Let's help Rec win this competition. 
What are the spatial dimensions of the output image if a $2 \times 2$ filter is convolved with a $3 \times 3$ image for paddings of 0, 1, and 2, and strides of 1 and 2? Fill in the dimensions below:","Stride 1: $2 \times 2$, $4 \times 4$, $6 \times 6$; stride 2: $1 \times 1$, $2 \times 2$, $3 \times 3$ (for paddings 0, 1, and 2 respectively)"
MIT Fall 2021,8,f,1,CNNs,Image,"MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel $3 \times 3$ images of $2 \mathrm{D}$ tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one $2 \times 2$ filter. Let's help Rec win this competition. 
Rec writes a bit of Python code to implement their tiny CNN classifier for images of 2D tetris pieces, following examples they have seen in 6.036. They include in the comments the dimensions of the numpy arrays, where known.

For performing binary classification, what activation function should Rec use for final_act and which loss function should Rec use?",Sigmoid + Negative Log Likelihood Loss
MIT Fall 2021,8,g,1,CNNs,Image,"MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel $3 \times 3$ images of $2 \mathrm{D}$ tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one $2 \times 2$ filter. Let's help Rec win this competition. 
If Rec wants to allow for more than two classes, which activation function should they use for final_act and which loss function?",Softmax + Cross Entropy
MIT Fall 2021,8,h,1,CNNs,Image,"MIT grad student Rec Urrent would like to submit an entry to win this year's Grand ML Tetris Competition, which gives awards to the smallest neural networks which can identify tetris pieces with the highest accuracy. Rec seeks to make a convolutional neural network that can accurately classify single-channel $3 \times 3$ images of $2 \mathrm{D}$ tetris pieces as being either a line-shaped piece, or a corner-shaped piece, using just one $2 \times 2$ filter. Let's help Rec win this competition. 
What are the dimensions of $w$ and $b$ for i) binary classification vs. ii) $k$-class classification?","For binary classification $w$ is: $[1,1]$ and $b$ is: $[1]$
For $k$-class classification $w$ is: $[1, k]$ and $b$ is: $[k]$"
MIT Fall 2021,8,i,1,CNNs,Text,"Write an expression for the derivative of the binary classification loss with respect to z2, where z = conv2d(x, fcoef, padding=0, stride=1), a = ReLU(z), a_sum = a.sum(dim=-1).sum(dim=-1), z2 = w.T @ a_sum + b.
You may express your answer using g for the output of final_act and y for the example label.",g-y
MIT Fall 2021,8,j.i,0.5,CNNs,Text,"Using ∂L/∂b = (g − y), z = conv2d(x, fcoef, padding=0, stride=1), a = ReLU(z), a_sum = a.sum(dim=-1).sum(dim=-1), z2 = w.T @ a_sum + b, write an expression for the gradient of the loss with respect to w of the output layer when the loss is the negative log likelihood of predicted output g and actual output y. You may express your answer in terms of a_sum.",dL/dw = a_sum (g − y)
MIT Fall 2021,8,j.ii,0.5,CNNs,Text,"Using ∂L/∂b = (g − y), z = conv2d(x, fcoef, padding=0, stride=1), a = ReLU(z), a_sum = a.sum(dim=-1).sum(dim=-1), z2 = w.T @ a_sum + b, write an expression for the gradient of the loss with respect to b of the output layer when the loss is the negative log likelihood of predicted output g and actual output y. You may express your answer in terms of a_sum.",dL/db = g − y
MIT Fall 2021,8,k,1,CNNs,Image,"Assume we apply a filter with weights $[[f_1, f_2],[f_3, f_4]]$ to this $3 \times 3$ image:
with stride 1 and padding 0 and perform back propagation. Which filter weights may have non-zero gradients? Why? Under what conditions will those gradients be non-zero?",$f_1$ and $f_3$ are the only weights that will receive gradients because only those weights get multiplied by non-zero features. The gradients to those weights will be non-zero if $2(f_1+f_3)>0$ because of the ReLU activation function.