{
       "Semester": "Spring 2021",
       "Question Number": "12",
       "Part": "c",
       "Points": 3.0,
       "Topic": "Decision Trees",
       "Type": "Image",
       "Question": "Here is a standard regression tree of a fixed size. It has 5 scalar parameters $\\left(s_{1}, s_{2}, v_{1}, v_{2}, v_{3}\\right)$ and two discrete choices of feature to split on, denoted by integers $j$ and $k$.\nWe are given a training data set $\\mathcal{D}_{\\operatorname{train}}=\\left\\{\\left(x^{(j)}, y^{(j)}\\right)\\right\\}$ where the dimension of $x^{(j)}$ is $d$.\nTerry would like to make a \"smoother\" tree by replacing the tests at the nodes with neuralnetwork logistic classifiers and by combining predictions from the branches, so that we can think of the tree as a parametric model and optimize the parameters using gradient descent. More concretely, at each internal node, the test will be replaced by $\\mathrm{NN}(x ; \\theta)$, a neural network that takes an entire input vector $x$, of dimension $d$, as input and generates an output in the range $[0,1]$ by using a sigmoid unit on the output.\nYou can think of any node $T_{i}$ of a tree as producing an output value as follows:\n- If $T_{\\mathrm{i}}$ is a leaf, then the output on input $x, T_{\\mathrm{i}}(x)$, is a constant $v_{\\mathrm{i}}$. (corresponding to \"yes\" branch), then the output on input $x$ is\n$$\nT_{i}(x)=\\left(1-\\mathrm{NN}\\left(x ; \\theta^{(i)}\\right)\\right) T_{\\mathrm{na}}(x)+\\mathrm{NN}\\left(x ; \\theta^{(i)}\\right) T_{\\mathrm{yas}}(x) .\n$$\nThat is, it is a weighted combination of the results of the children, where the neural network at the parent node, with parameters $\\theta^{(i)}$, modulates the combination of the results of the children.\n\nWe will consider the specific case where NN is a single unit with a sigmoidal activation function, so that\n$$\n\\mathrm{NN}\\left(x ; W^{(i)}, W_{0}^{(\\mathrm{i})}\\right)=\\sigma\\left(W^{(i)^{T}} x+W_{0}^{(i)}\\right)\n$$\nwhere $W^{(i)}$ is a vector of length $d$ and $W_{0}^{(i)}$ is a scalar and $\\sigma$ is the sigmoid function.\nWhat is $\\partial T_{1}(x) / \\partial W^{(1)}$ in this particular model? Please use the following shorthand:\n- $T=T_{1}(x)$\n- $O=\\mathrm{NN}\\left(x ; W^{(1)}, W_{0}^{(1)}\\right)$\n- $T_{\\text {no }}=$ the output of the \"no\" branch of $T_{1}$\n- $T_{\\text {yes }}=$ the output of the \"yes\" branch of $T_{1}$\nExpress your answer in terms of these quantities, $x$, and parameters $\\left(W^{(1)}, W^{(2)}, W_{0}^{(1)}, W_{0}^{(2)}, v_{1}, v_{2}, v_{3}\\right)$, as needed, but do not leave any derivatives in it.",
       "Solution": "Using shorthands:\n$$\nT=(1-O) T_{\\text {no }}+O T_{\\text {yas }}\n$$\nOnly $O$ is a function of $W^{(1)}$. Also recall that the derivative of the sigmoid can be simplified as: $\\sigma^{\\prime}(g(w))=\\sigma(g(w))(1-\\sigma(g(w))) g^{\\prime}(w)$. Therefore, (more Latex here)"
}