{
       "Semester": "Spring 2019",
       "Question Number": "8",
       "Part": "b",
       "Points": 2.666666667,
       "Topic": "Neural Networks",
       "Type": "Text",
       "Question": "In this problem we will investigate regularization for neural networks.\nKim constructs a fully connected neural network with $L=2$ layers using mean squared error (MSE) loss and ReLU activation functions for the hidden layer, and a linear activation for the output layer. The network is trained with a gradient descent algorithm on a data set of $n$ points $\\left\\{\\left(x^{(1)}, y^{(1)}\\right), \\ldots,\\left(x^{(n)}, y^{(n)}\\right)\\right\\}$.\nRecall that the update rule for weights $W^{1}$ can be specified in terms of step size $\\eta$ and the gradient of the loss function with respect to weights $W^{1}$. This gradient can be expressed in terms of the activations $A^{l}$, weights $W^{l}$, pre-activations $Z^{l}$, and partials $\\frac{\\partial L}{\\partial A^{2}}$, $\\frac{\\partial A^{l}}{\\partial Z^{l}}$, for $l=1,2$ :\n$$\nW^{1}:=W^{1}-\\eta \\sum_{i=1}^{n} \\frac{\\partial L\\left(h\\left(x^{(i)} ; W\\right), y^{(i)}\\right)}{\\partial W^{1}}\n$$\nwhere $h(\\cdot)$ is the input-output mapping implemented by the entire neural network, and\n$$\n\\frac{\\partial L}{\\partial W^{1}}=\\frac{\\partial Z^{1}}{\\partial W^{1}} \\cdot \\frac{\\partial A^{1}}{\\partial Z^{1}} \\cdot W^{2} \\cdot \\frac{\\partial A^{2}}{\\partial Z^{2}} \\cdot \\frac{\\partial L}{\\partial A^{2}}\n$$\nThe new update rule for weights $W^{1}$ which also penalizes the sum of squared values of all individual weights in the network:\n$$\nL^{n e w}=L\\left(h\\left(x^{(i)} ; W\\right), y^{(i)}\\right)+\\lambda\\|W\\|^{2}\n$$\nwhere $\\lambda$ denotes the regularization trade-off parameter is W^{1}:=(1-2 \\lambda \\eta) W^{1}-\\eta \\sum \\frac{\\partial L}{\\partial W^{1}}, where $\\alpha=1-2 \\lambda \\eta$. Explain how this new update rule helps the neural network reduce over\ftting to the data.",
       "Solution": "For reasonable $\\lambda$ and $\\eta$, the weights are scaled by a factor less than 1 at each iteration. (If $1-2 \\lambda \\eta>1$, the weights will rapidly grow and diverge.) A value of $|\\alpha|<1$ pushes the weights toward zero in general, except those weights that are needed to fit substantial subsets of the data (i.e., those weights that are needed to keep the data loss term $L$ low)."
}