{
       "Question number": "5",
       "Sub-Question number": "c.i",
       "Question": "We'll consider here a simple one-dimensional convolutional neural network layer. This is a feature map created by a single filter whose parameters we must learn. The filter represents a local pattern detector that is applied in every position of the input signal. The feature map therefore transforms an input vector (one-dimensional signal) into another vector (one-dimensional feature map). To train such a layer on its own, i.e., not as part of a bigger network as we typically do, we can imagine having training pairs $(x, y)$ where $x$ is the input signal as a vector and $y$ is a binary vector representing whether the relevant pattern appeared in a particular position or not. Specifically,\n- Input $x$ is a one-dimensional vector of length $d$.\n- Target $y$ is also a one-dimensional vector of length $d$. One \"pixel\" in the output, $y_{j}$, has value 1 if the input pixels $x_{j-1}, x_{j}, x_{j+1}$, centered at $j$, exhibit the target pattern and 0 if they do not.\n- The filter is represented by a weight vector $w$ consisting of three values.\n- The output of the network is a vector $\\hat{y}$ whose $j^{\\text{th}}$ coordinate (pixel) is $\\hat{y}_{j}=\\sigma\\left(z_{j}\\right)$ where $z_{j}=\\left[x_{j-1} ; x_{j} ; x_{j+1}\\right]^{T} w$ and $\\sigma(\\cdot)$ is the sigmoid function. Assume that $x_{0}$ and $x_{d+1}$ are 0 for the purposes of computing outputs.\n- We have a training set $D=\\left(x^{(1)}, y^{(1)}\\right), \\ldots,\\left(x^{(n)}, y^{(n)}\\right)$.\n- We measure the loss between the target binary vector $y$ and the network output $\\hat{y}$ pixel by pixel using cross-entropy (Negative Log-Likelihood or NLL). The aggregate loss over the whole training set is\n$$\nL(w, D)=\\sum_{i=1}^{n} \\sum_{j=1}^{d} \\operatorname{NLL}\\left(y_{j}^{(i)}, \\hat{y}_{j}^{(i)}\\right)\n$$\nWe'd like to use a simple stochastic gradient descent (SGD) algorithm for estimating the filter parameters $w$. But wait... how is the algorithm stochastic? 
Given the framing of our problem there may be multiple ways to write a valid SGD update.\nFor each of the following cases of SGD, write an update rule for $w$, in terms of step size $\\eta, \\nabla_{w} \\mathrm{NLL}$, target pixel values $y_{j}^{(i)}$ and actual pixel values $\\hat{y}_{j}^{(i)}$.\nUpdate based on all pixels of example ",
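       "Solution": "For a randomly chosen training example $i$, sum the per-pixel gradients over all $d$ pixels of that example: $w \\leftarrow w-\\eta \\sum_{j=1}^{d} \\nabla_{w} \\mathrm{NLL}\\left(y_{j}^{(i)}, \\hat{y}_{j}^{(i)}\\right)$"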
}