{
       "Semester": "Spring 2022",
       "Question Number": "2",
       "Part": "b",
       "Points": 2.0,
       "Topic": "Neural Networks",
       "Type": "Image",
       "Question": "A recent paper gives us reason to think that, rather than regularizing the weights used in a deep network that is used for classification, it is good to regularize the $z$ values, before the activation function is applied, in the last layer.\n\nConsider a neural network with one hidden layer and a single output unit where the activation function $f$ is a sigmoid, as shown below.\nWe can specify the DARC objective function $J(\\theta, \\lambda)$, where the parameters $\\theta=\\left(w_{11}, w_{12}, w_{21}, w_{22}, v_{1}, v_{2}\\right)$ which depends on data set $D=\\left\\{\\left(x^{(i)}, y^{(i)}\\right)\\right\\}_{i-1}^{N}$ as\n$$\nJ(\\theta, \\lambda)=\\sum_{i} \\mathcal{L}_{n l l}\\left(f\\left(z^{(i)}\\right), y^{(i)}\\right)+\\lambda \\sum_{i}\\left(z^{(i)}\\right)^{2}\n$$\nwhere $z^{(i)}$ is the value of the output unit on example $i$ before it goes into the activation function. \n What is the derivative with respect to $w_{11}$ of the typical regularization term, which penalizes the squares of the weights? How do these two regularizers differ?\n",
       "Solution": "$2 \\lambda w_{11}$. One depends on the input."
}