﻿Semester,Question Number,Part,Points,Topic,Type,Question,Solution
Harvard Spring 2021,1,a,1,Bayesian network,Image,(Diagram) (Question),Solution
Harvard Spring 2021,1,b,3,Bayesian network,Image,(Diagram) (Question),Solution
Harvard Spring 2021,1,c,2,Bayesian network,Image,(Diagram) (Question),Solution
Harvard Spring 2021,1,d,2,Bayesian network,Image,(Diagram) (Question),Solution
Harvard Spring 2021,1,e,1,Bayesian network,Image,(Diagram) (Question),Solution
Harvard Spring 2021,1,f,2,Bayesian network,Text,Would adding any one of the missing edges in the Bayesian network result in the network representing more distributions or fewer distributions? Briefly justify your answer.,"More distributions. There are various ways to see this: it removes a local independence requirement, it adds more paths, and it also adds more parameters."
Harvard Spring 2021,2,a,2,Hidden Markov Models,Image,(Diagram) (Question),Solution
Harvard Spring 2021,2,b,2,Hidden Markov Models,Image,(Diagram) (Info) (Question),Solution
Harvard Spring 2021,2,c,2,Hidden Markov Models,Image,(Diagram) (Info) (Question),Solution
Harvard Spring 2021,2,d,1,Hidden Markov Models,Text,"You consider using the HMM to predict the next state $p(s_{t+1}|x_1,\cdots, x_t)$ by first identifying the most likely sequence of states $s_1^{*}\cdots s_t^{*}$ given $x_1,\cdots, x_t$, and then predicting $$p(s_{t+1}|x_1,\cdots, x_t)\propto p(s_{t+1}|s_1^{*}\cdots s_t^{*}) = p(s_{t+1}|s_{t}^{*})$$ What is wrong with this?",This is wrong because it puts all the probability on the most likely sequence of states (the point estimate) when we should marginalize out over all possible sequences
Harvard Spring 2021,3,a,2,Clustering,Image,(Diagram) (Question),Solution
Harvard Spring 2021,3,b,2,Clustering,Image,(Diagram) (Question),Solution
Harvard Spring 2021,3,c,1,Clustering,Image,(Diagram) (Question),Solution
Harvard Spring 2021,3,d,2,Clustering,Text,"Consider a general setting with D-dimensional data and principal components with a set of eigenvalues $\{\lambda_{1}, \lambda_{2}, \cdots, \lambda_{D}\}$ that has low variance. Does this tend to be an indicator that reducing the dimension of the data via PCA will be effective or not particularly effective? Briefly justify your answer.","Low variance in the eigenvalues indicates that each component carries a similar amount of ""information"" in the data, uniformly across components. This suggests that PCA may be relatively ineffective, with a large number of components needed to explain the data."
Harvard Spring 2021,3,e,1,Clustering,Text,Give one data property that would lead you to strongly prefer K-means clustering over hierarchical agglomerative clustering. Briefly justify your answer.,Two possible answers: (1) high-dimensional data since K-means doesn't suffer from the curse of dimensionality (2) a lot of data - K-means scales linearly not quadratically
Harvard Spring 2021,4,a,2,Optimization,Image,Question,Solution
Harvard Spring 2021,4,b,1,Optimization,Text,"[For this problem, write the probability density function for a Normal distribution as $\mathcal{N}\left(x ; \mu, \sigma^2\right)$; this denotes the value of the PDF for a Normal distribution with mean $\mu$ and variance $\sigma^2$ at some point $x$. There is no need to work with the actual expression for a Normal distribution.]
Suppose that a freezer that is used by HUDS contains a noisy sensor that sometimes malfunctions. The measurements are $\left\{x_n\right\}_{n=1}^N$, where each $x_n$ is a real number. Each measurement is sampled independently, according to the following distribution:
- With probability $\alpha, 0<\alpha<1$, the sensor works correctly and returns a value distributed as $\mathcal{N}\left(x ; \mu, \sigma^2\right)$, where $\mu$ is the true temperature and $\sigma>0$.
- With probability $1-\alpha$, the sensor fails and returns a value distributed as $\mathcal{N}\left(x ; 0, \epsilon^2\right)$, for some $\epsilon>0$.

The parameters of the model are $\{\alpha, \mu, \sigma, \epsilon\}$. For measurement $x_n$, we use variable $z_n$ to denote whether the sensor is functioning correctly $\left(z_n=1\right)$ or incorrectly $\left(z_n=0\right)$.

Write down the expression for the probability density $p\left(x_n, z_n ; \alpha, \mu, \sigma, \epsilon\right)$ for the $n$th reading. [Use the ""power trick"", i.e. use the $z$ value as an exponent]
","$p\left(x_n, z_n ; \alpha, \mu, \sigma, \epsilon\right)=\left[\alpha \mathcal{N}\left(x_n ; \mu, \sigma^2\right)\right]^{z_n}\left[(1-\alpha) \mathcal{N}\left(x_n ; 0, \epsilon^2\right)\right]^{1-z_n}$"
Harvard Spring 2021,4,c,3,Optimization,Text,"[For this problem, write the probability density function for a Normal distribution as $\mathcal{N}\left(x ; \mu, \sigma^2\right)$; this denotes the value of the PDF for a Normal distribution with mean $\mu$ and variance $\sigma^2$ at some point $x$. There is no need to work with the actual expression for a Normal distribution.]
Suppose that a freezer that is used by HUDS contains a noisy sensor that sometimes malfunctions. The measurements are $\left\{x_n\right\}_{n=1}^N$, where each $x_n$ is a real number. Each measurement is sampled independently, according to the following distribution:
- With probability $\alpha, 0<\alpha<1$, the sensor works correctly and returns a value distributed as $\mathcal{N}\left(x ; \mu, \sigma^2\right)$, where $\mu$ is the true temperature and $\sigma>0$.
- With probability $1-\alpha$, the sensor fails and returns a value distributed as $\mathcal{N}\left(x ; 0, \epsilon^2\right)$, for some $\epsilon>0$.

The parameters of the model are $\{\alpha, \mu, \sigma, \epsilon\}$. For measurement $x_n$, we use variable $z_n$ to denote whether the sensor is functioning correctly $\left(z_n=1\right)$ or incorrectly $\left(z_n=0\right)$.

Write down the expression for the complete-data log likelihood,
$$
\ln \left(p\left(\left\{x_n, z_n\right\}_{n=1}^N ; \alpha, \mu, \sigma, \epsilon\right)\right) .
$$
[Your answer should be expressed as sums of log terms.]
","Complete data log likelihood is
$$
\begin{aligned}
\ln \left(p\left(\left\{x_n, z_n\right\}_{n=1}^N ; \alpha, \mu, \sigma, \epsilon\right)\right)=& \ln \left(\prod_n p\left(x_n, z_n ; \alpha, \mu, \sigma, \epsilon\right)\right) \\
=& \ln \left(\prod_n\left[\alpha \mathcal{N}\left(x_n ; \mu, \sigma^2\right)\right]^{z_n}\left[(1-\alpha) \mathcal{N}\left(x_n ; 0, \epsilon^2\right)\right]^{1-z_n}\right) \\
=& \sum_{n=1}^N\left(z_n \ln \left(\alpha \mathcal{N}\left(x_n ; \mu, \sigma^2\right)\right)+\left(1-z_n\right) \ln \left((1-\alpha) \mathcal{N}\left(x_n ; 0, \epsilon^2\right)\right)\right) \\
=& \sum_{n=1}^N z_n \ln \alpha+\sum_{n=1}^N z_n \ln \mathcal{N}\left(x_n ; \mu, \sigma^2\right)+\\
& \sum_{n=1}^N\left(1-z_n\right) \ln (1-\alpha)+\sum_{n=1}^N\left(1-z_n\right) \ln \mathcal{N}\left(x_n ; 0, \epsilon^2\right)
\end{aligned}
$$
"
Harvard Spring 2021,4,d,2,Optimization,Text,"[For this problem, write the probability density function for a Normal distribution as $\mathcal{N}\left(x ; \mu, \sigma^2\right)$; this denotes the value of the PDF for a Normal distribution with mean $\mu$ and variance $\sigma^2$ at some point $x$. There is no need to work with the actual expression for a Normal distribution.]
Suppose that a freezer that is used by HUDS contains a noisy sensor that sometimes malfunctions. The measurements are $\left\{x_n\right\}_{n=1}^N$, where each $x_n$ is a real number. Each measurement is sampled independently, according to the following distribution:
- With probability $\alpha, 0<\alpha<1$, the sensor works correctly and returns a value distributed as $\mathcal{N}\left(x ; \mu, \sigma^2\right)$, where $\mu$ is the true temperature and $\sigma>0$.
- With probability $1-\alpha$, the sensor fails and returns a value distributed as $\mathcal{N}\left(x ; 0, \epsilon^2\right)$, for some $\epsilon>0$.

The parameters of the model are $\{\alpha, \mu, \sigma, \epsilon\}$. For measurement $x_n$, we use variable $z_n$ to denote whether the sensor is functioning correctly $\left(z_n=1\right)$ or incorrectly $\left(z_n=0\right)$.


E-step: Derive an expression for the conditional probability,
$$
q_n=p\left(z_n=1 \mid x_n ; \alpha, \mu, \sigma, \epsilon\right) .
$$
[Give an exact expression, not something that is proportional to $q_n$.]
","The conditional probability $q_n$ is given by
$$
\begin{aligned}
q_n=p\left(z_n=1 \mid x_n ; \alpha, \mu, \sigma, \epsilon\right) &=\frac{p\left(x_n \mid z_n=1\right) p\left(z_n=1\right)}{p\left(x_n \mid z_n=1\right) p\left(z_n=1\right)+p\left(x_n \mid z_n=0\right) p\left(z_n=0\right)} \\
&=\frac{\alpha \mathcal{N}\left(x_n ; \mu, \sigma^2\right)}{\alpha \mathcal{N}\left(x_n ; \mu, \sigma^2\right)+(1-\alpha) \mathcal{N}\left(x_n ; 0, \epsilon^2\right)}
\end{aligned}
$$
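
As a numerical sanity check on the expression above, the following minimal Python sketch evaluates $q_n$ for a single reading. The helper names normal_pdf and responsibility are illustrative assumptions, not part of the original solution, and the Normal density is written out explicitly only so that the sketch is self-contained.

```python
import math


def normal_pdf(x, mean, var):
    # Density of a Normal distribution with the given mean and variance at x.
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibility(x, alpha, mu, sigma, eps):
    # q_n = p(z_n = 1 | x_n): posterior probability that the sensor worked.
    working = alpha * normal_pdf(x, mu, sigma ** 2)        # alpha * N(x; mu, sigma^2)
    failed = (1.0 - alpha) * normal_pdf(x, 0.0, eps ** 2)  # (1 - alpha) * N(x; 0, eps^2)
    # Note: for extreme readings both terms can underflow to zero;
    # a log-space version would be more robust in that case.
    return working / (working + failed)
```

With equal mixing weights and identical components, a reading is equally likely to have come from either one, so responsibility(0.0, 0.5, 0.0, 1.0, 1.0) evaluates to 0.5.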
"
Harvard Spring 2021,4,e,2,Optimization,Text,"[For this problem, write the probability density function for a Normal distribution as $\mathcal{N}\left(x ; \mu, \sigma^2\right)$; this denotes the value of the PDF for a Normal distribution with mean $\mu$ and variance $\sigma^2$ at some point $x$. There is no need to work with the actual expression for a Normal distribution.]
Suppose that a freezer that is used by HUDS contains a noisy sensor that sometimes malfunctions. The measurements are $\left\{x_n\right\}_{n=1}^N$, where each $x_n$ is a real number. Each measurement is sampled independently, according to the following distribution:
- With probability $\alpha, 0<\alpha<1$, the sensor works correctly and returns a value distributed as $\mathcal{N}\left(x ; \mu, \sigma^2\right)$, where $\mu$ is the true temperature and $\sigma>0$.
- With probability $1-\alpha$, the sensor fails and returns a value distributed as $\mathcal{N}\left(x ; 0, \epsilon^2\right)$, for some $\epsilon>0$.

The parameters of the model are $\{\alpha, \mu, \sigma, \epsilon\}$. For measurement $x_n$, we use variable $z_n$ to denote whether the sensor is functioning correctly $\left(z_n=1\right)$ or incorrectly $\left(z_n=0\right)$.


M-step: Derive an expression for the expected complete-data log likelihood,
$$
\mathbb{E}_{z \sim q}\left[\ln \left(p\left(\left\{x_n, z_n\right\}_{n=1}^N ; \alpha, \mu, \sigma, \epsilon\right)\right)\right] .
$$
Here, "" $z \sim q$ "" means "" $z_n$ is distributed according to $q_n$, for each $n$."" [Your answer should be expressed as sums of $\log$ terms.]
","The expected complete-data log likelihood is
$$
\begin{aligned}
\mathbb{E}_{z \sim q}\left[\ln \left(p\left(\left\{x_n, z_n\right\}_{n=1}^N ; \alpha, \mu, \sigma, \epsilon\right)\right)\right]=& \sum_{n=1}^N\left(q_n \ln \left(\alpha \mathcal{N}\left(x_n ; \mu, \sigma^2\right)\right)+\left(1-q_n\right) \ln \left((1-\alpha) \mathcal{N}\left(x_n ; 0, \epsilon^2\right)\right)\right) \\
=& \sum_{n=1}^N\left[q_n \ln \alpha+q_n \ln \mathcal{N}\left(x_n ; \mu, \sigma^2\right)\right] \\
&+\sum_{n=1}^N\left[\left(1-q_n\right) \ln (1-\alpha)+\left(1-q_n\right) \ln \mathcal{N}\left(x_n ; 0, \epsilon^2\right)\right]
\end{aligned}
$$
"
Harvard Spring 2021,4,f,1,Optimization,Text,What is it about the M-step in typical applications that makes the EM algorithm convenient for working with models with latent variables?,"Typically, the M-step has a closed-form, analytic solution, making EM very fast and robust."
Harvard Spring 2021,5,a,2,Markov Decision Process,Image,(Diagram) (Question),Solution
Harvard Spring 2021,5,b,2,Markov Decision Process,Image,(Diagram) (Question),Solution
Harvard Spring 2021,5,c,2,Markov Decision Process,Image,(Diagram) (Question),Solution
Harvard Spring 2021,5,d,2,Markov Decision Process,Image,(Diagram) (Question),Solution
Harvard Spring 2021,5,e,1,Markov Decision Process,Image,(Diagram) (Question),Solution
Harvard Spring 2021,6,a,1,Reinforcement Learning,Text,"What do we mean when we say that Q-learning and SARSA learning are ""model-free"" reinforcement learning methods?","Neither method learns $r(s,a)$ and $p(s'|s,a)$. They do not learn to predict the reward from an action or the next state distribution."
Harvard Spring 2021,6,b,1,Reinforcement Learning,Text,"True or False: The behavior of a Q-learning agent needs to be ""greedy in the limit"" for Q-learning to learn the Q-values corresponding to the optimal policy",FALSE
Harvard Spring 2021,6,c,1,Reinforcement Learning,Text,"Briefly, why are Q-learning and SARSA designed to learn Q-values rather than just MDP values V(s); i.e. why learn ""state-action values"" rather than just ""state values""?","The value function $V(s)$ does not provide enough information, without also learning $r(s,a)$ and $p(s'|s,a)$, to know how to act! In comparison, $\pi(s)\in \arg\max_a Q(s,a)$ tells an agent how to act using only the Q-values."
Harvard Spring 2021,6,d,1,Reinforcement Learning,Text,"Consider an MDP with two states $S=\{$ state 1 , state 2$\}$ and two actions $\{$ left, right $\}$ and an RL agent with the following Q-values:
$\begin{array}{ccc} & \text { left } & \text { right } \\ \text { state1 } & 6 & 4 \\ \text { state } 2 & 2 & 3\end{array}$
Suppose the agent is in state1. What is the distribution over the action the agent takes when using an $\epsilon$-greedy policy that explores with probability $\epsilon>0$ ?
",With probability $1-\epsilon$ take action left; otherwise take one of left and right uniformly at random (so left has total probability $1-\epsilon/2$ and right has probability $\epsilon/2$)
Harvard Spring 2021,6,e,1,Reinforcement Learning,Text,"The update rule for Q-learning is as follows, where $\alpha$ is the learning rate and $\gamma$ the discount factor:
$$
Q(s, a) \leftarrow Q(s, a)+\alpha\left(r+\gamma \cdot \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)-Q(s, a)\right) .
$$
Suppose the agent takes action left in state1, and transitions to state2. Which $Q$-value (or $Q$-values) is updated, and with which action $a^{\prime}$ and state $s^{\prime}$ ?
","The agent updates Q(state1, left) and uses the value of Q(state2, right) for the update, i.e. s'=state2 and a'=right, since right is the action that maximizes Q(state2, a')."
Harvard Spring 2021,6,f,2,Reinforcement Learning,Text,State one advantage of SARSA over Q-learning and one advantage of Q-learning over SARSA,"Q-learning over SARSA: off-policy, can learn the optimal policy even while continuing to adapt to the environment via $\epsilon$-greedy; Q-learning is also less susceptible to ""local minima"" or learning the wrong (suboptimal) policy than SARSA, since exploration in SARSA has to be coupled with ""greedy in the limit"". SARSA over Q-learning: simpler, does not have the $\max_{a'}$ component; provides risk-aversion."