{
       "Semester": "Spring 2018",
       "Question Number": "4",
       "Part": "b",
       "Points": 4.0,
       "Topic": "Classifiers",
       "Type": "Text",
       "Question": "Consider a classification problem in which there are $K$ possible output classes, $1, \\ldots, K$. We have studied using NLL as a loss function in such cases, but that assumes that all mistaken classifications are equally significant. Instead, we'll consider a case where some mistakes are worse than others (e.g., mis-identifying a cow as a horse is not as bad as calling it a mouse). Define the cost matrix $c_{g, a}$ to be the cost for guessing class $g$ when the actual class is $a$. For convenience, we'll write $c_{j}$ for the column of the matrix $\\left[c_{1, j}, c_{2, j}, \\ldots, c_{K, j}\\right]^{T}$ representing the costs of all the possible guesses when $j$ is the actual value. We will use a simple neural network with a softmax activation function, so our prediction $p$ will be a $K \\times 1$ vector: $$ \\begin{aligned} &p=\\operatorname{softmax}(z) \\\\ &z=W^{T} x \\end{aligned} $$ Assume inputs are $d \\times 1$ so $W$ is $d \\times K$. Our loss function, for a prediction vector $p$ when the target output is value $y \\in\\{1, \\ldots, K\\}$ is the expected cost of the prediction: $$ L_{c}(p, y)=\\sum_{k=1}^{K} p_{k} c_{k y}=p^{T} c_{y} $$ So, the overall training objective is to minimize, over a data set of $n$ points, $$ J_{c}(W)=\\sum_{i=1}^{n} L_{c}\\left(p^{(i)}, y^{(i)}\\right) $$ (a) Select which of the following cost matrices $c$ corresponds to each situation described below. A. $\\left[\\begin{array}{llll}1 & 0 & 0 & 0 \\\\ 0 & 1 & 0 & 0 \\\\ 0 & 0 & 1 & 0 \\\\ 0 & 0 & 0 & 1\\end{array}\\right]$ B. $\\left[\\begin{array}{llll}0 & 1 & 1 & 1 \\\\ 1 & 0 & 1 & 1 \\\\ 1 & 1 & 0 & 1 \\\\ 1 & 1 & 1 & 0\\end{array}\\right]$ C. $\\left[\\begin{array}{llll}0 & 0 & 1 & 1 \\\\ 0 & 0 & 1 & 1 \\\\ 1 & 1 & 0 & 0 \\\\ 1 & 1 & 0 & 0\\end{array}\\right]$ D. $\\left[\\begin{array}{cccc}0 & .5 & .5 & .5 \\\\ 2 & 0 & 1 & 1 \\\\ 2 & 1 & 0 & 1 \\\\ 2 & 1 & 1 & 0\\end{array}\\right]$ E. $\\left[\\begin{array}{cccc}0 & 2 & 2 & 2 \\\\ .5 & 0 & 1 & 1 \\\\ .5 & 1 & 0 & 1 \\\\ .5 & 1 & 1 & 0\\end{array}\\right]$. What would the change to the weights $W$ be, in one step of stochastic gradient descent on $J_{c}$, with input $x$ and target output $y$, and step size $\\eta$ ? Computing $\\partial p / \\partial z$ is kind of hairy. It is a $K \\times K$ matrix. You can write your answer in terms of it without computing it. You may also use $x, y, W$, and/or $c$ in your solution.",
       "Solution": "$$\n-\\eta \\cdot x \\cdot\\left(\\frac{\\partial p}{\\partial z} \\cdot c_{y}\\right)^{T}\n$$\n. To calculate the SGD update, we first need to calculate $\\frac{\\partial J_{c}}{\\partial W}$. We use chain rule.\n$$\n\\frac{\\partial J_{c}}{\\partial W}=\\frac{\\partial J_{c}}{\\partial L_{c}} \\frac{\\partial L_{c}}{\\partial p} \\frac{\\partial p}{\\partial z} \\frac{\\partial z}{\\partial W}=1 \\cdot\\left(c_{y}^{T}\\right)\\left(\\frac{\\partial p^{T}}{\\partial z}\\right) \\cdot x=x \\cdot\\left(\\frac{\\partial p}{\\partial z} \\cdot c_{y}\\right)^{T}\n$$\nThe SGD update is then\n$$\n-\\eta \\cdot x \\cdot\\left(\\frac{\\partial p}{\\partial z} \\cdot c_{y}\\right)^{T}\n$$"
}