{
       "Semester": "Fall 2019",
       "Question Number": "4",
       "Part": "h",
       "Points": 1.5,
       "Topic": "Neural Networks",
       "Type": "Text",
       "Question": "Otto N. Coder is exploring different autoencoder architectures. Consider the following autoencoder with input $x \\in \\mathbb{R}^{d}$ and output $y^{\\text{pred}} \\in \\mathbb{R}^{d}$. The autoencoder has one hidden layer with $m$ hidden units: $z^{(1)}, a^{(1)} \\in \\mathbb{R}^{m}$. Assume $x, z^{(2)}$, and $y^{\\text{pred}}$ have dimensions $d \\times 1$. Also let $z^{(1)}$ and $a^{(1)}$ have dimensions $m \\times 1$. \nOtto trains the autoencoder with back-propagation. The loss for a given datapoint $x, y$ is:\n$$\nJ(x, y)=\\frac{1}{2}\\left\\|y^{\\text{pred}}-y\\right\\|^{2}=\\frac{1}{2}\\left(y^{\\text{pred}}-y\\right)^{T}\\left(y^{\\text{pred}}-y\\right)\n$$\nCompute the following intermediate partial derivatives. For the following questions, write your answer in terms of $x, y, y^{\\text{pred}}, W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}, f^{(1)}, f^{(2)}$ and any previously computed or provided partial derivative. Also note that:\n1. Let $\\partial f^{(1)} / \\partial z^{(1)}$ be an $m \\times 1$ matrix, provided to you.\n2. Let $\\partial f^{(2)} / \\partial z^{(2)}$ be a $d \\times 1$ matrix, provided to you.\n3. If $A x=y$ where $A$ is an $m \\times n$ matrix and $x$ is $n \\times 1$ and $y$ is $m \\times 1$, then let $\\partial y / \\partial A=x$.\n4. In your answers below, we will assume multiplications are matrix multiplication; to indicate element-wise multiplication, use the symbol *.\nOtto's other friend Leila says having more layers is better. Let $m$ be much smaller than $d$. Leila adds 10 more hidden layers, all with linear activation, before Neil suggests using several layers with non-linear activation functions instead. He says Otto should regularize the number of active hidden units. 
Loosely speaking, we consider the average activation of a hidden unit $j$ in our hidden layer 1 (which has sigmoid activation function $f^{(1)}$) to be the average of the activation of $a_{j}^{(1)}$ over the points $x_{i}$ in our training dataset of size $N$:\n$$\n\\hat{p}_{j}=\\frac{1}{N} \\sum_{i=1}^{N} a_{j}^{(1)}\\left(x_{i}\\right)\n$$\nAssume we would like to enforce the constraint that the average activation for each hidden unit $\\hat{p}_{j}$ is close to some hyperparameter $p$. Usually, $p$ is very small (say $p<0.05$).\nWhat is the best format for a regularization penalty given hyperparameter $p$ and the average activation for all our hidden units: $\\hat{p}_{j}$? Select one of the following:\nA. Hinge loss: $\\sum_{j} \\max \\left(0,\\left(1-\\hat{p}_{j}\\right) p\\right)$\nB. NLL: $\\sum_{j}\\left(-p \\log \\frac{p}{\\hat{p}_{j}}-(1-p) \\log \\frac{(1-p)}{\\left(1-\\hat{p}_{j}\\right)}\\right)$\nC. Squared loss: $\\sum_{j}\\left(\\hat{p}_{j}-p\\right)^{2}$\nD. $\\ell_{2}$ norm: $\\sum_{j}\\left(\\hat{p}_{j}\\right)^{2}$",
       "Solution": "Either NLL or squared loss should work, since both encourage $\\hat{p}_{j}$ to be close to $p$. NLL loss might better handle a wide range in the magnitudes of $\\hat{p}_{j}$."
}