{
       "Semester": "Spring 2019",
       "Question Number": "8",
       "Part": "c",
       "Points": 2.666666667,
       "Topic": "Neural Networks",
       "Type": "Text",
       "Question": "In this problem we will investigate regularization for neural networks.\nKim constructs a fully connected neural network with $L=2$ layers using mean squared error (MSE) loss and ReLU activation functions for the hidden layer, and a linear activation for the output layer. The network is trained with a gradient descent algorithm on a data set of $n$ points $\\left\\{\\left(x^{(1)}, y^{(1)}\\right), \\ldots,\\left(x^{(n)}, y^{(n)}\\right)\\right\\}$.\nRecall that the update rule for weights $W^{1}$ can be specified in terms of step size $\\eta$ and the gradient of the loss function with respect to weights $W^{1}$. This gradient can be expressed in terms of the activations $A^{l}$, weights $W^{l}$, pre-activations $Z^{l}$, and partials $\\frac{\\partial L}{\\partial A^{2}}$, $\\frac{\\partial A^{l}}{\\partial Z^{l}}$, for $l=1,2$ :\n$$\nW^{1}:=W^{1}-\\eta \\sum_{i=1}^{n} \\frac{\\partial L\\left(h\\left(x^{(i)} ; W\\right), y^{(i)}\\right)}{\\partial W^{1}}\n$$\nwhere $h(\\cdot)$ is the input-output mapping implemented by the entire neural network, and\n$$\n\\frac{\\partial L}{\\partial W^{1}}=\\frac{\\partial Z^{1}}{\\partial W^{1}} \\cdot \\frac{\\partial A^{1}}{\\partial Z^{1}} \\cdot W^{2} \\cdot \\frac{\\partial A^{2}}{\\partial Z^{2}} \\cdot \\frac{\\partial L}{\\partial A^{2}}\n$$\nGiven that we are training a neural network with gradient descent, what happens when we increase the regularization trade-off parameter $\\lambda$ too much, while holding the step size $\\eta$ fixed?",
       "Solution": "With too large a $\\lambda, \\alpha$ may approach zero and the weights would be too strongly penalized and thus tend to zero, preventing the neural network from fitting the available training data. That is to say, the network is pushed towards an overly \"generalized\" constant output based on zero or near-zero weights. With even larger values of $\\lambda, \\alpha$ may become negative causing oscillations in weights. With $|\\alpha|$ larger than 1 , the weights will grow in magnitude without bound."
}