Lecture 12 - Parameter Learning of partially observed BNs
Introduction
Topics covered:
- Partially-Observed Graphical Models (GMs)
- Expectation-Maximization (EM)
- K-Means Clustering
Partially-Observed Graphical Models
Mixture Models
- A density model $p(x)$ may be multi-modal.
- Can we model it as a mixture of uni-modal distributions?
Unobserved Variables
- A variable can be unobserved (latent) because:
- It is difficult or impossible to measure (e.g., causes of a disease, evolutionary ancestors).
- It is only sometimes measured (e.g., faulty sensors).
- It is an imaginary quantity meant to provide a simplified but useful view of the data generation process (e.g., mixture assignments).
- Discrete latent variables can be used for cluster assignments.
- Continuous latent variables can be used for dimensionality reduction.
Challenges in Learning with Latent Variables
- In the fully observed IID setting, the complete log-likelihood decomposes into a sum of local terms:
\[\ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)\]
- With latent variables, the sum over $z$ sits inside the log, so all parameters become coupled via marginalization:
\[\ell(\theta; D) = \log \sum_{z} p(x, z \mid \theta) = \log \sum_{z} p(z \mid \theta_z)\, p(x \mid z, \theta_x)\]
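One way to see the coupling (a standard step, not spelled out above): the gradient of the marginal log-likelihood is an expectation under the posterior over $z$, which itself depends on every parameter:
\[\nabla_\theta\, \ell(\theta; D) = \nabla_\theta \log \sum_{z} p(x, z \mid \theta) = \sum_{z} \frac{p(x, z \mid \theta)}{\sum_{z'} p(x, z' \mid \theta)}\, \nabla_\theta \log p(x, z \mid \theta) = \mathbb{E}_{p(z \mid x, \theta)}\!\left[\nabla_\theta \log p(x, z \mid \theta)\right]\]
So even the update for $\theta_x$ involves $p(z \mid x, \theta)$, which depends on both $\theta_z$ and $\theta_x$; the clean decomposition of the fully observed case is lost.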
Strategy (sketched in code below):
1. Guess a value of $Z$.
2. Apply Maximum Likelihood Estimation (MLE) to estimate the best model parameters given $Z$.
3. Infer the most likely $Z$ based on the MLE parameter estimates.
4. Return to step 2 until $Z$ stops changing.
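A minimal sketch of this alternating strategy, assuming two hypothetical, model-specific callables `estimate_params_mle` (step 2) and `infer_latents` (step 3); this is an illustration of the loop structure, not a specific algorithm from the lecture:

```python
import numpy as np

def alternating_mle(x, z_init, estimate_params_mle, infer_latents, max_iters=100):
    """Alternate between MLE given a guessed Z and re-inferring Z (hard-assignment scheme).

    estimate_params_mle(x, z) -> theta   # step 2: MLE of parameters given latent guesses
    infer_latents(x, theta)   -> z       # step 3: most likely latents given parameters
    (Both helpers are hypothetical placeholders for a concrete model.)
    """
    z = z_init
    theta = None
    for _ in range(max_iters):
        theta = estimate_params_mle(x, z)   # step 2
        z_new = infer_latents(x, theta)     # step 3
        if np.array_equal(z_new, z):        # step 4: stop when Z no longer changes
            break
        z = z_new
    return theta, z
```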
Expectation-Maximization (EM)
Gaussian Mixture Models (GMMs)
- Consider a mixture of $K$ Gaussian components; this model can be used for unsupervised clustering.
- $Z$ is a latent class indicator:
\[p(z_n) = \text{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}\]
- $X$ is a conditional Gaussian variable with a class-specific mean and covariance:
\[p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp \left\{ -\frac{1}{2} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \right\}\]
- Likelihood (marginalizing over $z_n$):
\[p(x_n \mid \mu, \Sigma) = \sum_k \pi_k N(x_n \mid \mu_k, \Sigma_k)\]
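A small numerical sketch of this mixture density, using `scipy.stats.multivariate_normal` for the class-conditional Gaussians; the parameter values are made up for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pi, mus, Sigmas):
    """Evaluate p(x | mu, Sigma) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi_k * multivariate_normal(mean=mu_k, cov=Sigma_k).pdf(x)
               for pi_k, mu_k, Sigma_k in zip(pi, mus, Sigmas))

# Illustrative 2-component mixture in 2D (made-up parameters).
pi = np.array([0.3, 0.7])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), pi, mus, Sigmas))
```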
EM for GMMs
Algorithm:
- Start: Guess values for the centroids $\mu_k$, covariances $\Sigma_k$, and mixing weights $\pi_k$ of each of the $K$ clusters.
- Loop (a NumPy sketch of these updates follows the algorithm):
  - E-Step: Compute the expected values of the sufficient statistics of the hidden variables under the current parameter estimates:
  \[\tau_n^{k(t)} = p\left(z_n^k = 1 \mid x_n, \mu^{(t)}, \Sigma^{(t)}\right) = \frac{\pi_k^{(t)} N(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_i \pi_i^{(t)} N(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}\]
  - M-Step: Using the current expected values of the hidden variables, compute the parameters that maximize the expected complete log-likelihood:
  \[\pi_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)}}{N} \qquad \mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}} \qquad \Sigma_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} (x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^T}{\sum_n \tau_n^{k(t)}}\]
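A compact NumPy/SciPy sketch of these two updates (a minimal illustration with simple random initialization, not a numerically hardened implementation; variable names are my own):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    """EM for a Gaussian mixture: X is (N, d); returns (pi, mus, Sigmas, tau)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization: random data points as means, identity covariances, uniform weights.
    mus = X[rng.choice(N, K, replace=False)]
    Sigmas = np.stack([np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E-step: responsibilities tau[n, k] = p(z_n^k = 1 | x_n, current parameters).
        tau = np.stack([pi[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
                        for k in range(K)], axis=1)        # shape (N, K)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates for pi, mu, Sigma.
        Nk = tau.sum(axis=0)                               # effective counts per component
        pi = Nk / N
        mus = (tau.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (tau[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mus, Sigmas, tau
```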
K-Means vs. EM
- K-means clustering is the hard-assignment version of EM for a mixture of Gaussians (see the sketch below):
- K-Means E-Step (assign each point to its nearest centroid):
\[z_n^{(t)} = \arg\min_k\, (x_n - \mu_k^{(t)})^T \left(\Sigma_k^{(t)}\right)^{-1} (x_n - \mu_k^{(t)})\]
- K-Means M-Step (recompute each centroid as the mean of its assigned points):
\[\mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}\]
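A minimal K-means sketch with NumPy, fixing $\Sigma_k = I$ in the E-step above so the assignment reduces to Euclidean distance (names are illustrative):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Hard-assignment EM (K-means): X is (N, d); returns (mus, z)."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), K, replace=False)]          # initial centroids
    z = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # E-step: assign each point to its nearest centroid (argmin of squared distance).
        dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (N, K)
        z = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its assigned points.
        new_mus = np.stack([X[z == k].mean(axis=0) if np.any(z == k) else mus[k]
                            for k in range(K)])
        if np.allclose(new_mus, mus):                       # converged: centroids stable
            break
        mus = new_mus
    return mus, z
```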
Why Does EM Work?
- The expected complete log-likelihood, plus the entropy of $q$, is a lower bound on the log-likelihood:
\[\ell(\theta; x) \geq \langle \ell_c(\theta; x, z) \rangle_q + H_q\]
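Filling in the step behind this bound (a standard derivation via Jensen's inequality, valid for any distribution $q(z \mid x)$):
\[\ell(\theta; x) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z \mid x)\, \frac{p(x, z \mid \theta)}{q(z \mid x)} \geq \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} = \langle \ell_c(\theta; x, z) \rangle_q + H_q\]
Equality holds when $q(z \mid x) = p(z \mid x, \theta)$, which is exactly the E-step choice.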
- EM Algorithm: each iteration tightens and then maximizes this bound, so the log-likelihood never decreases and the algorithm converges to a local optimum. The M-step maximizes the expected complete log-likelihood $Q(\theta', \theta_t) = \langle \ell_c(\theta'; x, z) \rangle_{p(z \mid x, \theta_t)}$:
\[\theta_{t+1} = \arg\max_{\theta'} Q(\theta', \theta_t)\]
Foreshadowing Variational Inference
- EM: Let $q_t(z \mid x) = p(z \mid x, \theta_t)$. Maximize $p(x \mid \theta)$ by iterating:
\[\theta_{t+1} = \arg\max_{\theta'} Q(\theta', \theta_t)\]
- Variational Inference:
  - Let $q(z \mid x)$ be some family of distributions that is easier to optimize over:
  \[\log p(x \mid \theta) \geq \mathbb{E}_{z \sim q}[\log p(x, z \mid \theta)] + H(q)\]
  - This lower bound is the ELBO (Evidence Lower Bound):
  \[\text{ELBO} = \mathbb{E}_{z \sim q}[\log p(x, z \mid \theta)] + H(q)\]
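A standard identity (not spelled out above) connecting the ELBO back to EM: the gap between the log-evidence and the ELBO is exactly a KL divergence,
\[\log p(x \mid \theta) = \text{ELBO} + \text{KL}\big(q(z \mid x) \,\|\, p(z \mid x, \theta)\big)\]
so maximizing the ELBO over $q$ pushes $q$ toward the true posterior; EM is the special case where $q$ is unrestricted and the KL term can be driven to zero.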
Conclusion
Questions?