{
       "Question number": "6",
       "Sub-Question number": "3",
       "Question": "Provide one reason why stochastic gradient descent can be better than traditional (batch) gradient descent when applied to neural networks.",
       "Solution": "SGD's gradient estimates are noisier, so it can escape local minima more easily. Alternatively, note that as the batch size increases, the update gradient asymptotically approaches the true gradient, but with diminishing returns. Thus, splitting a batch into $n$ parts yields $n$ updates, each generally retaining better than $\\frac{1}{n}$ of the full batch's accuracy relative to the true gradient, which gives more progress per unit of computation time. SGD takes this to the extreme of one example per update. Both answers are correct, but they are not equivalent."
}