{
       "Semester": "Spring 2022",
       "Question Number": "4",
       "Part": "i",
       "Points": 1.0,
       "Topic": "Regression",
       "Type": "Text",
       "Question": "We're given a data set $D=\\left\\{\\left(x^{(i)}, y^{(i)}\\right)\\right\\}_{i=1}^{n}$, where $x^{(i)} \\in R^{d}$ and $y^{(i)} \\in R$. Let $X$ be a $d \\times n$ matrix in which the $x^{(i)}$ are the columns and let $Y$ be a $1 \\times n$ vector containing the values of $y^{(i)}$. Using the analytical regression (ordinary least-squares) formula, we can compute\n $$\n W_{\\text{ols}}=\\left(X X^{T}\\right)^{-1} X Y^{T}\n $$\n Using ridge regression, we can compute\n $$\n W_{\\text{ridge}}=\\left(X X^{T}+\\lambda I\\right)^{-1} X Y^{T}\n $$\n We decide to try to use these methods to initialize a single-unit 'neural network' with a linear activation function and no offset:\n $$\n h(x ; W)=W^{T} x .\n $$\n Assume that $X X^{T}$ is invertible and not equal to the identity matrix, and that neither $W_{\\text{ols}}$ nor $W_{\\text{ridge}}$ is equal to $(0,0, \\ldots, 0)$. Note also that we are not using an explicit offset/bias term. Ryu thinks getting initial parameters from Ryo's hypothesis might be a good way to initialize a two-layer neural network. Consider the case where we have a simple neural network with\n - Two units in the hidden layer\n - Sigmoid activation function in the hidden layer\n - Linear activation function on the output unit\n So, the hypothesis is:\n $$\n g=w_{a} \\sigma\\left(w_{b} x+w_{c}\\right)+w_{e} \\sigma\\left(w_{f} x+w_{g}\\right)+w_{d}\n $$\n If Ryu first trained Ryo's hypothesis to reach a local optimum using gradient descent to obtain $w_{a}, w_{b}, w_{c}, w_{d}$, then initialized the more complex network above using $w_{e}=w_{a}$, $w_{f}=w_{b}$, and $w_{g}=w_{c}$, with $w_{d}$ as before, and did batch gradient descent with squared loss and a fixed small step size, explain why the weights would converge to a different value.",
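       "Solution": "Because the two hidden units are initialized exactly the same, the gradients for both of them will be the same at every step of batch gradient descent, so the two units remain identical throughout training. It is therefore as if we had a single linear unit, ran it through a sigmoid, scaled it by the combined output weight $w_{a}+w_{e}$, and then added an offset. At initialization that combined weight is $2 w_{a}$ rather than $w_{a}$, so the network no longer computes the function that was at a local optimum, and gradient descent moves the weights to different values."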
}
