{
       "Semester": "Spring 2022",
       "Question Number": "4",
       "Part": "a",
       "Points": 1.5,
       "Topic": "Regression",
       "Type": "Text",
       "Question": "We're given a data set $D=\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$, where $x^{(i)} \in R^{d}$ and $y^{(i)} \in R$. Let $X$ be a $d \times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \times n$ vector containing the values of $y^{(i)}$. Using the analytical regression (ordinary least-squares) formula, we can compute\n $$\n W_{o l s}=\left(X X^{T}\right)^{-1} X Y^{T}\n $$\n Using ridge regression, we can compute\n $$\n W_{\text {ridge }}=\left(X X^{T}+\lambda I\right)^{-1} X Y^{T}\n $$\n We decide to try to use these methods to initialize a single-unit 'neural network' with a linear activation function and no offset:\n $$\n h(x ; W)=W^{T} x .\n $$\n Assume that $X X^{T}$ is invertible and not equal to the identity matrix, and that neither $W_{o l s}$ nor $W_{\text {ridge }}$ is equal to $(0,0, \ldots, 0)$. Note also that we are not using an explicit offset/bias term.\n If we initialized our unit with $W_{\text {ols }}$ and did batch gradient descent (summing the error over all the data points) with squared loss, no regularization, and a fixed small step size, which of the following would most typically happen.\nA. The weights would change substantially at the beginning, but then converge back to the values we initialized with.\nB. The weights would not change.\nC. The weights would make small oscillations around the initial weights.\nD. The weights would converge to a different value.\nE. Something else would happen.",
       "Solution": "B. The weights would not change"
}
