{
       "Semester": "Fall 2019",
       "Question Number": "4",
       "Part": "g",
       "Points": 1.5,
       "Topic": "Neural Networks",
       "Type": "Text",
       "Question": "Otto N. Coder is exploring different autoencoder architectures. Consider the following autoencoder with input $x \\in \\mathbb{R}^{d}$ and output $y^{\\text {pred }} \\in \\mathbb{R}^{d}$. The autoencoder has one hidden layer with $m$ hidden units: $z^{(1)}, a^{(1)} \\in \\mathbb{R}^{m}$. Assume $x, z^{(2)}$, and $y^{\\text {pred }}$ have dimensions $d \\times 1$. Also let $z^{(1)}$ and $a^{(1)}$ have dimensions $m \\times 1$.\nOtto trains the autoencoder with back-propagation. The loss for a given datapoint $x, y$ is:\n$$\nJ(x, y)=\\frac{1}{2}\\left\\|y^{\\text {pred }}-y\\right\\|^{2}=\\frac{1}{2}\\left(y^{\\text {pred }}-y\\right)^{T}\\left(y^{\\text {pred }}-y\\right)\n$$\nCompute the following intermediate partial derivatives. For the following questions, write your answer in terms of $x, y, y^{\\text {pred }}, W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}, f^{(1)}, f^{(2)}$ and any previously computed or provided partial derivative. Also note that:\n1. Let $\\partial f^{(1)} / \\partial z^{(1)}$ be an $m \\times 1$ matrix, provided to you.\n2. Let $\\partial f^{(2)} / \\partial z^{(2)}$ be a $d \\times 1$ matrix, provided to you.\n3. If $A x=y$ where $A$ is an $m \\times n$ matrix and $x$ is $n \\times 1$ and $y$ is $m \\times 1$, then let $\\partial y / \\partial A=x$.\n4. In your answers below, we will assume multiplications are matrix multiplication; to indicate element-wise multiplication, use the symbol *.\nLeila says having more layers is better. Let $m$ be much smaller than $d$. Leila adds 10 more hidden layers, all with linear activation, before Otto's current hidden layer (which has sigmoid activation function $f^{(1)}$) such that each hidden layer has $m$ units. What would you expect to see with your training and test accuracy, compared to just having one hidden layer with activation $f^{(1)}$?",
       "Solution": "The 10 additional linear hidden layers do not add any expressivity to the network: a composition of linear layers is itself a single linear map, so the network can represent the same functions as before. We would therefore expect similar training and test accuracy compared to the network with only the single $f^{(1)}$ hidden layer. Reaching that accuracy may, however, require a different number of training iterations on the same data."
}