Semester,Question Number,Part,Points,Topic,Type,Question,Solution
Harvard Spring 2015,1,N/A,10,Clustering,Text,"Imagine that you have N data points and you wish to find K clusters using K-Means++. Assuming that N > K, can the K-Means++ algorithm choose the same datum twice to become a cluster center? Why or why not?","The K-Means++ algorithm will never choose the same datum twice to become a center. Each new center is sampled from a distribution over the data in which a datum's probability is proportional to its squared distance to the closest existing cluster center. A datum that is already a cluster center has distance zero to itself, so this distribution assigns it zero probability."
Harvard Spring 2015,2,a,5,Clustering,Image,"(Link) In the two figures below, draw the dendrogram for the data on the left, where the y-axis provides their values. In the top figure, use the single-linkage criterion (min over between-group distances) and in the bottom figure use the complete-linkage criterion (max over between-group distances).",Solution is diagram
Harvard Spring 2015,2,b,5,Clustering,Image,"(Link) In the two figures below, draw the dendrogram for the data on the left, where the y-axis provides their values. In the top figure, use the single-linkage criterion (min over between-group distances) and in the bottom figure use the complete-linkage criterion (max over between-group distances).",Solution is diagram
Harvard Spring 2015,3,N/A,10,Classifiers,Text,"Suppose that K1(x,x′) and K2(x,x′) are both valid kernel functions. Recall that a valid kernel is one that corresponds to an inner product in some (possibly infinite-dimensional) feature space and produces a matrix Kij = K(xi, xj) that is positive semi-definite for any finite set of examples x1, x2, . . . , xN. Show that
K(x, x′) = αK1(x, x′) + βK2(x, x′)
is a valid kernel if K1(x, x′) and K2(x, x′) are both valid kernels and α, β > 0. [Hint: It may be useful to recall that a matrix K is positive semi-definite if $y^T K y \geq 0$ for all $y$.]","For any finite set of points, let K1, K2, and K be the matrices produced by the kernel functions K1(·, ·), K2(·, ·), and K(·, ·), respectively.
It suffices to show that $y^T K y \geq 0$ for any vector $y$. This follows algebraically:
$$y^T K y = y^T (\alpha K_1 + \beta K_2) y$$
$$= \alpha y^T K_1 y + \beta y^T K_2 y$$
$$\geq 0 + 0 = 0$$
where the inequality follows from the assumptions that α, β > 0 and that K1 and K2 are valid kernel functions (i.e. $y^T K_1 y \geq 0$ and $y^T K_2 y \geq 0$)."
Harvard Spring 2015,4,N/A,10,Classifiers,Text,"Suppose that we have a data set and we train two support vector machines as follows. We train the first SVM on a random subset of the data. Then we add the remainder of the data and train another SVM on the complete data set. How might the size of the optimal margin change from the first to the second SVM? Would you expect it to increase, decrease, stay the same, or do something else? Provide an explanation and/or diagrams to make your case.","The margin for the full data set will decrease or stay the same. Specifically, if the subset contains the support vectors from the full data set, the margin for the full data set stays the same; otherwise, the margin for the full data set decreases.
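As a rough illustration of this claim, the sketch below uses a deliberately simplified setting that is an assumption of this example, not part of the original question: one-dimensional, linearly separable data with every negative point to the left of every positive point. In that setting the hard-margin SVM margin is half the gap between the closest points of opposite classes. The helper name margin_1d and the sampled data are hypothetical.

```python
import random


def margin_1d(points):
    # points: list of (x, label) pairs with label -1 or +1.
    # Assumes every -1 point lies to the left of every +1 point, so the
    # max-margin threshold sits midway between the two closest points.
    left = max(x for x, y in points if y == -1)
    right = min(x for x, y in points if y == +1)
    return (right - left) / 2.0


random.seed(0)
negatives = [(random.uniform(-5.0, -0.5), -1) for _ in range(20)]
positives = [(random.uniform(0.5, 5.0), +1) for _ in range(20)]
full = negatives + positives

# Random subset that keeps at least one point from each class.
subset = random.sample(negatives, 5) + random.sample(positives, 5)

m_subset = margin_1d(subset)
m_full = margin_1d(full)
print('subset margin:', m_subset)
print('full margin:  ', m_full)
```

Because the subset is contained in the full data set, the gap measured on the full data can only be equal or smaller, whatever the random draw.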
More data points mean more constraints. The margin found on the full data set satisfies all the classification constraints in the subset problem, but it need not be the optimal solution for the subset, so the optimal margin on the subset is at least as large. One can also illustrate this with diagrams."
Harvard Spring 2015,5,N/A,10,Reinforcement Learning,Text,Suppose Andy has a donut-eating utility function UA(donut) and Brian has a donut-eating utility function UB(donut). Suppose that UA(donut) = 7 × (UB(donut))^2 − 42. Explain whether or not Andy and Brian have the same donut-eating preferences.,"The two utility functions do not represent the same preferences. They would if one were a monotonically increasing function of the other, but this is not the case for a parabolic function.
To show this, suppose UB(d1) = 1 and UB(d2) = −1. Then UA(d1) = 7 × 1^2 − 42 = −35 and UA(d2) = 7 × (−1)^2 − 42 = −35. Brian prefers d1 to d2, but Andy has no preference between the two donuts.
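A quick numerical check of the relationship UA = 7 × UB^2 − 42 is sketched below. The two donuts d1 and d2 and the utilities 1 and −1 assigned to them for Brian are illustrative assumptions; the helper names u_andy and prefers are hypothetical.

```python
def u_andy(u_brian):
    # Andy's utility as a function of Brian's, from the problem statement.
    return 7 * u_brian ** 2 - 42


# Assumed utilities for Brian on two hypothetical donuts.
u_b = {'d1': 1, 'd2': -1}
u_a = {donut: u_andy(u) for donut, u in u_b.items()}


def prefers(utility, x, y):
    # True when x is strictly preferred to y under the given utility table.
    return utility[x] > utility[y]


brian_prefers_d1 = prefers(u_b, 'd1', 'd2')
andy_prefers_d1 = prefers(u_a, 'd1', 'd2')
andy_prefers_d2 = prefers(u_a, 'd2', 'd1')

print('Andy utilities:', u_a)
print('Brian strictly prefers d1:', brian_prefers_d1)
print('Andy strictly prefers either donut:', andy_prefers_d1 or andy_prefers_d2)
```

Squaring maps 1 and −1 to the same value, so Brian's strict preference between the two donuts becomes indifference for Andy.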
Note that the absolute values of the utility functions do not matter; only the relative values matter. That is, two utility functions UA and UB represent the same preferences if and only if, for all x1 and x2, UA(x1) − UA(x2) > 0 ⇐⇒ UB(x1) − UB(x2) > 0, UA(x1) − UA(x2) < 0 ⇐⇒ UB(x1) − UB(x2) < 0, UA(x1) − UA(x2) = 0 ⇐⇒ UB(x1) − UB(x2) = 0."
Harvard Spring 2015,6,a,5,Reinforcement Learning,Text,"The update rule for Q-learning is $Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max\limits_{a^{'}}Q(s^{'},a^{'}) - Q(s,a) \right]$,
where $s^{'}$ is the state you actually enter after performing action $a$ in state $s$ and $r$ is the reward you actually receive. Consider two states $S = \{s1, s2\}$ and actions $A = \{a1, a2\}$, with the following current Q-values. \begin{tabular}{ c c c }
& $a1$ & $a2$ \\ 
$s1$ & 3 & 2 \\ 
$s2$ & 4 & 6 
\end{tabular}
\newline
Suppose the agent is in state s1. Using $\epsilon$-greedy, how would it decide to act?","With probability $1 - \epsilon$, the agent selects the greedy action a1, since Q(s1, a1) = 3 > Q(s1, a2) = 2. With probability $\epsilon$, it selects an action uniformly at random."
Harvard Spring 2015,6,b,5,Reinforcement Learning,Text,"The update rule for Q-learning is $Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max\limits_{a^{'}}Q(s^{'},a^{'}) - Q(s,a) \right]$,
where $s^{'}$ is the state you actually enter after performing action $a$ in state $s$ and $r$ is the reward you actually receive. Consider two states $S = \{s1, s2\}$ and actions $A = \{a1, a2\}$, with the following current Q-values. \begin{tabular}{ c c c }
& $a1$ & $a2$ \\ 
$s1$ & 3 & 2 \\ 
$s2$ & 4 & 6 
\end{tabular}
\newline
Suppose the agent exploits in $s1$ and lands in $s2$. Which $Q$-value would be updated, and what is the value for $\max\limits_{a^{'}} Q(s′, a′)$ used in the update?","Because the agent exploits in s1, it takes action $a1$ from $s1$. Thus Q(s1, a1) will be updated. The value for $\max\limits_{a^{'}} Q(s′, a′)$ used in the update is $Q(s2, a2) = 6$."
Harvard Spring 2015,6,c,5,Reinforcement Learning,Text,State one advantage of policy iteration over value iteration for planning.,"Policy iteration takes at most as many iterations to reach the optimal policy as value iteration, and in practice usually takes far fewer. Policy iteration has a definite stopping condition: when the policy is unchanged between two successive iterations, the algorithm is complete. Policy iteration can also be modified to take advantage of approximate solutions to the value function, particularly in problems with a large number of states in which the linear system cannot be solved practically by matrix inversion."
Harvard Spring 2015,7,N/A,15,Clustering,Text,"We are given a mixture model in the form $$p(x|\pi,\{\theta_k \}_{k=1}^{K}) = \sum_{k=1}^{K} \pi_{k}p(x|\theta_k)$$
where $x \in \mathbb{R}^D$. The mean of the kth component distribution $p(x | \theta_k)$ is given by $\mu_k$. What is the mean of the overall mixture?","From the definition of expectation,
$$E[x|\pi,\{\theta_k \}_{k=1}^{K}] = \int{x \, p(x|\pi,\{\theta_k \}_{k=1}^{K}) \, dx}$$
$$=\int{x \sum_{k=1}^{K} \pi_{k} \, p(x| \theta_k) \, dx}$$
$$=\sum_{k=1}^{K}\pi_{k}\int{x \, p(x| \theta_k) \, dx}$$
$$=\sum_{k=1}^{K}\pi_{k}E[x|\theta_k] = \sum_{k=1}^{K} \pi_k\mu_k$$"
Harvard Spring 2015,8,a,5,Optimization,Image,"(Link) In diagram A, what does the dashed line (II) depict, in terms of our model, $\theta_{0}$ and $\theta_{0}$? What does the arrow depict?","The dashed line (II) depicts the lower bound on the log marginal likelihood given by the expected complete data log likelihood (plus an entropy term) corresponding to the posterior at $\theta_0$ (because it’s tight at $\theta_0$). Specifically, it is
$$Q(\theta; \theta_0) = E_{p(z|x,\theta_0)} [\log p(x, z|\theta) - \log p(z|x, \theta_0)]$$ $$= E_{p(z|x,\theta_0)} [\log p(x, z|\theta)] + H[p(z|x, \theta_0)]$$
The arrow represents maximizing this function with respect to $\theta$ (the M-step). Note that the maximum of this function only depends on the expected complete data log likelihood (the entropy term is fixed)."
Harvard Spring 2015,8,b,5,Optimization,Image,"(Link) In diagram B, what does the dashed line (III) depict? What do the arrows depict?","The dashed line (III) depicts the updated lower bound on the log marginal likelihood given by the Q function at $\theta_1$.
$$Q(\theta; \theta_1) = E_{p(z|x,\theta_1)} [\log p(x, z|\theta) - \log p(z|x, \theta_1)]$$ $$= E_{p(z|x,\theta_1)} [\log p(x, z|\theta)] + H[p(z|x, \theta_1)]$$
The arrows represent moving from Q(θ; θ0) to Q(θ; θ1), which corresponds to the E-step."