{
       "Semester": "Spring 2021",
       "Question Number": "12",
       "Part": "b",
       "Points": 7.0,
       "Topic": "Decision Trees",
       "Type": "Image",
       "Question": "Here is a standard regression tree of a fixed size. It has 5 scalar parameters $\\left(s_{1}, s_{2}, v_{1}, v_{2}, v_{3}\\right)$ and two discrete choices of feature to split on, denoted by integers $j$ and $k$.\nWe are given a training data set $\\mathcal{D}_{\\operatorname{train}}=\\left\\{\\left(x^{(j)}, y^{(j)}\\right)\\right\\}$ where the dimension of $x^{(j)}$ is $d$.\nTerry would like to make a \"smoother\" tree by replacing the tests at the nodes with neural network logistic classifiers and by combining predictions from the branches, so that we can think of the tree as a parametric model and optimize the parameters using gradient descent. More concretely, at each internal node, the test will be replaced by $\\mathrm{NN}(x ; \\theta)$, a neural network that takes an entire input vector $x$, of dimension $d$, as input and generates an output in the range $[0,1]$ by using a sigmoid unit on the output.\nYou can think of any node $T_{i}$ of a tree as producing an output value as follows:\n- If $T_{i}$ is a leaf, then the output on input $x$, $T_{i}(x)$, is a constant $v_{i}$.\n- If $T_{i}$ is an internal node with children $T_{\\mathrm{no}}$ (corresponding to \"no\" branch) and $T_{\\mathrm{yes}}$ (corresponding to \"yes\" branch), then the output on input $x$ is\n$$\nT_{i}(x)=\\left(1-\\mathrm{NN}\\left(x ; \\theta^{(i)}\\right)\\right) T_{\\mathrm{no}}(x)+\\mathrm{NN}\\left(x ; \\theta^{(i)}\\right) T_{\\mathrm{yes}}(x) .\n$$\nThat is, it is a weighted combination of the results of the children, where the neural network at the parent node, with parameters $\\theta^{(i)}$, modulates the combination of the results of the children.\n\nWe will consider the specific case where NN is a single unit with a sigmoidal activation function, so that\n$$\n\\mathrm{NN}\\left(x ; W^{(i)}, W_{0}^{(i)}\\right)=\\sigma\\left(W^{(i)^{T}} x+W_{0}^{(i)}\\right)\n$$\nwhere $W^{(i)}$ is a vector of length $d$, $W_{0}^{(i)}$ is a scalar, and $\\sigma$ is the sigmoid function.\n\nConsider the dataset shown in the plot below right, where $d=2$. Each integer value on the plot (one of $5$, $-2$, or $8$) corresponds to a datapoint whose input $x$ features are the coordinates of the point on the plot and whose output $y$ value is the printed number.\nProvide the parameters of a tree-predictor, corresponding to the model shown above left, that make accurate predictions on the dataset.",
       "Solution": "$W^{(1)}=[100,100]^{T}$\n$W_{0}^{(1)}=0$\n$W^{(2)}=[-100,100]^{T}$, or $W^{(2)}=[100,-100]^{T}$\n$W_{0}^{(2)}=100$, or $W_{0}^{(2)}=-100$ (should match with the answer above).\n$v_{1}=-2$, or $v_{1}=5$ (depends on the answer above).\n$v_{2}=5$, or $v_{2}=-2$ (depends on the answer above).\n$v_{3}=8$"
}