{
       "Semester": "Spring 2022",
       "Question Number": "2",
       "Part": "d",
       "Points": 2.0,
       "Topic": "Neural Networks",
       "Type": "Image",
       "Question": "A recent paper gives us reason to think that, rather than regularizing the weights of a deep network used for classification, it is good to regularize the $z$ values in the last layer, before the activation function is applied.\n\nConsider a neural network with one hidden layer and a single output unit where the activation function $f$ is a sigmoid, as shown below.\nWe can specify the DARC objective function $J(\\theta, \\lambda)$, with parameters $\\theta=\\left(w_{11}, w_{12}, w_{21}, w_{22}, v_{1}, v_{2}\\right)$, which depends on the data set $D=\\left\\{\\left(x^{(i)}, y^{(i)}\\right)\\right\\}_{i=1}^{N}$, as\n$$\nJ(\\theta, \\lambda)=\\sum_{i} \\mathcal{L}_{nll}\\left(f\\left(z^{(i)}\\right), y^{(i)}\\right)+\\lambda \\sum_{i}\\left(z^{(i)}\\right)^{2}\n$$\nwhere $z^{(i)}$ is the value of the output unit on example $i$ before it goes into the activation function.\nWould the DARC strategy of regularizing $z$ be good if we were, instead, doing regression and $f(x)=x$? Explain why or why not.",
       "Solution": "No. With $f(x)=x$ the output is $z$ itself, so it must be able to attain its target value; penalizing the magnitude of $z$ pulls the output toward zero and keeps it from reaching a nonzero target."
}