{
       "Question number": "1",
       "Sub-Question number": "d",
       "Question": "We're given a data set $D=\\left\\{\\left(x^{(i)}, y^{(i)}\\right)\\right\\}_{i=1}^{n}$, where $x^{(i)} \\in R^{d}$ and $y^{(i)} \\in R$. Let $X$ be a $d \\times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \\times n$ vector containing the values of $y^{(i)}$. Using the ordinary least-squares formula, we can compute $$ W_{o l s}=\\left(X X^{T}\\right)^{-1} X Y^{T} $$ Using ridge regression, we can compute $$ W_{\\text {ridge }}=\\left(X X^{T}+\\lambda I\\right)^{-1} X Y^{T} $$ We decide to try to use these methods to initialize a single-unit \"\"neural network\"\" with a linear activation function. Assume that $X X^{T}$ is neither singular nor equal to the identity matrix, and that neither $W_{\\text {ols }}$ nor $W_{\\text {ridge }}$ is equal to $(0,0, \\ldots, 0)$. If we initialized our neuron with $W_{\\text {ols }}$ and did stochastic gradient descent (one data point at a time) with squared loss and a fixed small step size, explain why the weights would make small oscillations around the initial weights.",
       "Solution": "In expectation the steps should be small motions around the optimum. If someone says it will (or might) do something else (like hop out of this and get stuck somewhere else) that\u2019s most of the credit if it is well reasoned and explained."
}