{
       "Question number": "6",
       "Sub-Question number": "1",
       "Question": "Assume you are given a neural network with $L$ layers to minimize a loss function $\\mathcal{L}$\n\n$$\n\\begin{aligned}\nh(\\mathbf{x}) &=\\mathbf{w}^{\\top} \\phi_{1}(\\mathbf{x}) \\\\\n\\phi_{1}(\\mathbf{x}) &=\\sigma\\left(\\mathbf{U}_{1} \\phi_{2}(\\mathbf{x})\\right) \\\\\n& \\vdots \\\\\n\\phi_{\\ell}(\\mathbf{x}) &=\\sigma\\left(\\mathbf{U}_{\\ell} \\phi_{\\ell+1}(\\mathbf{x})\\right) \\\\\n& \\vdots \\\\\n\\phi_{L}(\\mathbf{x}) &=\\sigma\\left(\\mathbf{U}_{L} \\mathbf{x}\\right)\n\\end{aligned}\n$$\n\n(Note that the subscript of $\\phi$ starts at 1 at the end of the network, and increases to $L$ as we make our way back to the start) Let us define $a_{\\ell}=\\mathbf{U}_{\\ell} \\phi_{\\ell+1}(\\mathbf{x})$ such that $\\phi_{\\ell}=\\sigma\\left(a_{\\ell}\\right)$. Let $\\delta_{\\ell}=\\frac{\\partial \\mathcal{L}}{\\partial a_{\\ell}}$. Express $\\frac{\\partial \\mathcal{L}}{\\partial \\mathbf{U}_{\\ell}}$ in terms of $\\delta_{\\ell}$. (assume $1<\\ell<L$ )",
       "Solution": "$$\n\\begin{aligned}\n\\frac{\\partial \\mathcal{L}}{\\partial \\mathbf{U}_{\\ell}} &=\\frac{\\partial \\mathcal{L}}{\\partial a_{\\ell}} \\frac{\\partial a_{\\ell}}{\\partial \\mathbf{U}_{\\ell}} \\\\\n&=\\delta_{\\ell} \\phi_{\\ell+1}(\\mathbf{x})^{T}\n\\end{aligned}\n$$"
}