{
       "Semester": "Fall 2019",
       "Question Number": "7",
       "Part": "d",
       "Points": 2.0,
       "Topic": "RNNs",
       "Type": "Text",
       "Question": "In class we have seen recurrent neural networks (RNNs) structured as:\n$$\n\\begin{aligned}\nz_{t}^{1} &=W^{s s} s_{t-1}+W^{s x} x_{t} \\\\\ns_{t} &=f_{1}\\left(z_{t}^{1}\\right) \\\\\nz_{t}^{2} &=W^{o} s_{t} \\\\\np_{t} &=f_{2}\\left(z_{t}^{2}\\right)\n\\end{aligned}\n$$\nwhere we have set biases to zero. Here $x_{t}$ is the input and $y_{t}$ the actual output for $\\left(x_{t}, y_{t}\\right)$ sequences used for training, with $p_{t}$ as the RNN output (during or after training).\nAssume our first RNN, call it RNN-A, has $s_{t}, x_{t}, p_{t}$ all being vectors of shape $2 \\times 1$. In addition, the activation functions are simply $f_{1}(z)=z$ and $f_{2}(z)=z$.\nNow consider a modified RNN, call it RNN-B, that does the following:\n$$\n\\begin{aligned}\nz_{t}^{1} &=W^{s s x}\\left[\\begin{array}{c}\ns_{t-1} \\\\\nx_{t}\n\\end{array}\\right] \\\\\ns_{t} &=z_{t}^{1} \\\\\nz_{t}^{2} &=W^{o x}\\left[\\begin{array}{c}\ns_{t} \\\\\nx_{t}\n\\end{array}\\right] \\\\\np_{t} &=f_{2}\\left(z_{t}^{2}\\right)\n\\end{aligned}\n$$\nwhere $s_{t}, x_{t}, p_{t}$ are all vectors of shape $2 \\times 1$, and $\\left[\\begin{array}{c}s_{t-1} \\\\ x_{t}\\end{array}\\right]$ and $\\left[\\begin{array}{c}s_{t} \\\\ x_{t}\\end{array}\\right]$ are vectors of shape $4 \\times 1$.\nImagine we are using RNN-B to generate a description sentence given an input word, as in language modeling. The input is a single $2 \\times 1$ vector embedding, $x_{1}$, that encodes the input word. The output will be a sequence of words $p_{1}, p_{2}, \\ldots, p_{n}$ that provide a description of that word. In this setting, what would be an appropriate activation function $f_{2}$?",
       "Solution": "Softmax, so that $p_{t}$ is a probability distribution over the possible next words, from which the best next word can be selected."
}