Lecture 01 - Course Introduction, Introduction to PGMs

Course administrative details and an introduction to PGMs within the scope of this course

Logistics

Course Schedule / Calendar

Grading

Grading Scale

Homework Policies

Project

Extra Credit: Lecture Notes

Attendance / Participation

Academic Integrity

Mental Health & Wellbeing

Introduction to Probabilistic Graphical Models (PGMs)

What is a Graphical Model?

A graphical model is a technique for representing a probability distribution in order to make inferences in situations involving uncertainty, which is essential for tasks like speech recognition and computer vision. Typically, we have a dataset of samples: $D = \{ X_{1}^{(i)}, X_{2}^{(i)}, \dots, X_{m}^{(i)} \}_{i=1}^{N}$. The relationships between the variables in each sample $X$ can be represented by a graph $G$. From this, we derive the model $M_G$.

A graph for a model

Graphical models allow us to consider fundamental questions with respect to three branching categories:

  1. Representation: how can we compactly describe a joint distribution over many random variables?
  2. Inference: given the model and some observations, how do we answer queries such as $ P(X_i \mid D) $?
  3. Learning: given data, how do we choose the model that best fits it?

The following sections investigate how PGMs help us answer these questions.

Representation

Joint Probability Distribution of Multiple Variables

The joint probability distribution of a set of random variables assigns a probability to every possible combination of values that the variables can take. Given a set of variables $ X_1, X_2, \dots, X_n $, their joint probability distribution is expressed as:

$ P(X_1, X_2, \dots, X_n) $

This represents the probability that all variables take specific values simultaneously. If the variables are Boolean (i.e., each $ X_i $ can be either 0 or 1), then the joint probability distribution represents the probability of each possible combination of 0s and 1s across the variables.

Example: Joint Probability for a Set of Boolean Variables

Suppose we have a set of eight Boolean variables $ X_1, X_2, \dots, X_8 $. These variables can each take values from the set $ \{ 0, 1 \} $. The joint probability distribution describes the probability of each combination of these 8 variables.

The total number of configurations in this joint probability distribution is:

$ 2^{8} = 256 $

This means there are 256 different possible states for these 8 Boolean variables. For example, one possible configuration might be:

$ X_1 = 0, X_2 = 1, X_3 = 0, \dots, X_8 = 1 $

The probability for this configuration would be:

$ P(X_1 = 0, X_2 = 1, \dots, X_8 = 1) $

Thus, for $n$ Boolean variables, we would have $2^{n}$ possible configurations in the joint probability distribution.
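As a sanity check, these configurations can be enumerated directly. Below is a minimal Python sketch; the uniform probabilities are a placeholder assumption, not from the lecture:

```python
from itertools import product

n = 8  # number of Boolean variables

# Enumerate every joint configuration (x1, ..., x8) in {0, 1}^8.
configs = list(product([0, 1], repeat=n))
print(len(configs))  # 2**8 = 256

# Placeholder joint distribution: one probability per configuration.
# We use the uniform distribution purely for illustration.
joint = {c: 1.0 / len(configs) for c in configs}

# One arbitrary completion of the configuration X1=0, X2=1, ..., X8=1.
print(joint[(0, 1, 0, 0, 0, 0, 0, 1)])  # 1/256 under the uniform placeholder
```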


Can We Disallow Certain State Configurations?

Yes, certain state configurations can be disallowed based on domain knowledge, assumptions, or constraints. This can be done either by conditioning the distribution on certain variables or by modifying the joint distribution to assign zero probability to those configurations (and then renormalizing so the remaining probabilities sum to one).

Example: Conditioning and Disallowing Certain Configurations

If we have the same 8 Boolean variables $ X_1, X_2, \dots, X_8 $, and we know that certain combinations are impossible based on domain constraints, we can condition the joint probability distribution to exclude these configurations.

For example, suppose the configuration $ X_1 = 0, X_2 = 1, X_3 = 0, \dots, X_8 = 1 $ is not possible. We can set:

$ P(X_1 = 0, X_2 = 1, X_3 = 0, \dots, X_8 = 1) = 0 $

Alternatively, we may choose to focus only on a subset of possible configurations. For example, if we know that $ X_5 $ and $ X_7 $ cannot both be 1 simultaneously, we can redefine the joint distribution as:

$ P(X_{1}, X_{2}, \dots, X_{8} \mid X_{5} \neq 1 \, \text{or} \, X_{7} \neq 1) $

This limits the set of allowed configurations to those that respect this constraint.
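A minimal sketch of this zero-and-renormalize operation, assuming the joint is stored as a dense NumPy table (the starting distribution is an arbitrary placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder joint table over 8 Boolean variables: shape (2,)*8, sums to 1.
joint = rng.random((2,) * 8)
joint /= joint.sum()

# Disallow every configuration where X5 = 1 and X7 = 1 simultaneously.
# Axes are 0-indexed, so X5 is axis 4 and X7 is axis 6.
joint[:, :, :, :, 1, :, 1, :] = 0.0

# Renormalize so the remaining allowed configurations sum to one.
joint /= joint.sum()
```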


Focus on Subset of Configurations

It is possible to focus on a subset of configurations by restricting the model. This can be achieved by conditioning the joint probability distribution on a set of variables or by focusing only on valid state configurations.

Example: Conditioning and Subset Focus

If we are only interested in configurations where $ X_1 = 1 $, we can condition the distribution on $ X_1 = 1 $. The resulting probability distribution is then:

$ P(X_{2}, X_{3}, \dots, X_{8} \mid X_{1} = 1) $

This effectively narrows the focus of the distribution to only the cases where $ X_1 = 1 $, and ignores all other configurations of $ X_1 $.
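In the same dense-table representation, conditioning is a slice followed by renormalization; a sketch under the same placeholder assumptions as above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder joint table over 8 Boolean variables.
joint = rng.random((2,) * 8)
joint /= joint.sum()

# P(X2, ..., X8 | X1 = 1): slice out X1 = 1 (axis 0) and renormalize
# by P(X1 = 1), which is the total mass of that slice.
conditional = joint[1, ...] / joint[1, ...].sum()
print(conditional.shape)  # (2,)*7 -- one axis per remaining variable
```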

In Markov Models or Markov Decision Processes (MDPs), the model’s assumptions can be used to simplify the dependency structure and focus on specific subsets of states or variables. For example, in MDPs, the agent’s actions may only influence the next state and immediate reward, and the probability distribution over future states may depend only on the current state and action, not the entire history.
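As a concrete illustration of that assumption, a transition model keyed only by the current state and action suffices; the toy two-state MDP below is hypothetical, not from the lecture:

```python
# Hypothetical two-state MDP: P(next_state | state, action) depends only
# on (state, action), never on how the agent arrived at `state`.
transitions = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 0.9, "s0": 0.1},
    ("s1", "move"): {"s1": 0.3, "s0": 0.7},
}

# The probability of a whole trajectory factorizes step by step.
def trajectory_prob(states, actions):
    p = 1.0
    for s, a, s_next in zip(states, actions, states[1:]):
        p *= transitions[(s, a)][s_next]
    return p

print(trajectory_prob(["s0", "s1", "s1"], ["move", "stay"]))  # 0.8 * 0.9
```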

By restricting the joint probability distribution, you can focus on the most relevant or feasible configurations in the problem.

Inference

To answer queries like $ P(X_i \mid D) $, we apply Bayes’ Rule:

$ P(X_i \mid D) = \frac{P(X_i, D)}{P(D)} $

Where $ P(X_i, D) $ is the joint probability of $ X_i $ and the observed data $ D $. This is computed by marginalizing over the remaining unobserved variables:

$ P(X_i \mid D) = \frac{\sum_{X_j,\; j \neq i} P(X_1, \dots, X_n, D)}{P(D)} $

Here, the sum is over all possible configurations of the unobserved variables. The denominator $ P(D) $ is the marginal likelihood.
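Here is a brute-force version of this computation as a minimal sketch, assuming the joint is a dense table and the evidence $ D $ is the single observation $ X_8 = 1 $ (both assumptions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder joint over (X1, ..., X8); suppose D is the observation X8 = 1.
joint = rng.random((2,) * 8)
joint /= joint.sum()

# P(X1 | X8 = 1): the numerator sums the joint over X2, ..., X7 with X8
# fixed to 1; the denominator is the marginal likelihood P(X8 = 1).
numerator = joint[..., 1].sum(axis=tuple(range(1, 7)))  # shape (2,), indexed by X1
p_D = joint[..., 1].sum()
posterior = numerator / p_D
print(posterior, posterior.sum())  # posterior sums to 1
```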

In simpler cases, if the variables are independent, the joint distribution factorizes. For instance, if $ X_i $ is independent of the observed variables in $ D $, then:

$ P(X_i \mid D) = P(X_i) $

This reduces the computation, as we don’t need to sum over other variables.

Graphical models like Bayesian Networks exploit conditional independencies between variables, allowing us to factorize the joint distribution:

$ P(X_1, \dots, X_n) = \prod_{i} P(X_i \mid \text{parents}(X_i)) $

This structure simplifies inference by reducing the number of configurations to sum over, making it more efficient.
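To illustrate the factorized computation, here is a minimal sketch for a three-variable chain $ X_1 \rightarrow X_2 \rightarrow X_3 $; the structure and the CPT values are hypothetical, chosen only for illustration:

```python
# A tiny Bayes net X1 -> X2 -> X3 with hypothetical CPTs.
# The joint factorizes as P(X1) * P(X2 | X1) * P(X3 | X2).
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # [x1][x2]
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # [x2][x3]

def joint(x1, x2, x3):
    """Evaluate the factorized joint; no 2^3-entry table is ever built."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# Marginal P(X3 = 1) by summing the factorized joint over x1 and x2.
p = sum(joint(x1, x2, 1) for x1 in (0, 1) for x2 in (0, 1))
print(p)  # 0.30 with these placeholder CPTs
```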

Note: if all $ X_i $ are independent, then $ P(X_8 \mid X_1) = P(X_8) $.

As a more specific case, view the sample calculation below:

Learning: Finding the Right Model

When we talk about learning in the context of probabilistic models, we usually aim to choose the model $ M $ from a set of possible models $ \mathcal{H} $ that best explains or fits the observed data $ D $.

The goal is to find the model $ M $ that maximizes some objective function, typically the likelihood of the data, or some score function, denoted as $ F(D; M) $. This can be written as:

$ M = \arg\max_{M \in \mathcal{H}} F(D; M) $

Where:

  • $ \mathcal{H} $ is the hypothesis space, i.e., the set of candidate models;
  • $ F(D; M) $ is the objective or score function measuring how well model $ M $ explains the data;
  • $ D $ is the observed data.
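A minimal sketch of this arg-max, assuming $ F $ is the log-likelihood and $ \mathcal{H} $ is a small grid of Bernoulli (coin-flip) models; the data and the grid are hypothetical:

```python
import math

# Observed data: hypothetical coin flips (1 = heads).
D = [1, 1, 0, 1, 0, 1, 1, 1]

# Hypothesis space H: Bernoulli models indexed by theta on a coarse grid.
H = [i / 100 for i in range(1, 100)]

# Score F(D; M): log-likelihood of the data under model theta.
def F(D, theta):
    return sum(math.log(theta if x == 1 else 1 - theta) for x in D)

M = max(H, key=lambda theta: F(D, theta))
print(M)  # close to the empirical frequency 6/8 = 0.75
```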

Constraining the Hypothesis Space $ \mathcal{H} $

To make the search for the optimal model more efficient, we need to constrain the hypothesis space $ \mathcal{H} $. Without constraints, the space of possible models may be too large to search efficiently.

Suggested methods to constrain $ \mathcal{H} $:

  1. Prior Knowledge: Use domain knowledge to narrow down the set of possible models. This could involve specifying certain structures, such as limiting the models to certain types of graphical models (e.g., Bayesian networks, Markov chains) or making assumptions about the data (e.g., assuming Gaussian distributions).

  2. Regularization: Apply regularization techniques to prevent overfitting and guide the search toward simpler models (see the sketch after this list). This can include:
    • L1 regularization (Lasso) or L2 regularization (Ridge) to penalize large model parameters.
    • Bayesian priors that favor certain kinds of models over others.
  3. Model Family: Instead of searching over all possible models, restrict the search to a family of models (e.g., linear regression, decision trees, etc.), which reduces the hypothesis space $ \mathcal{H} $.

  4. Simplifying Assumptions: For instance, in graphical models, you might assume conditional independence between certain variables to reduce the complexity of the search space.
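For instance, a hedged sketch of how an L2 penalty (item 2) modifies the score, where the penalty weight $ \lambda $ and parameter vector $ \theta_M $ are illustrative notation, not from the lecture:

$ F_{\text{reg}}(D; M) = \log P(D \mid \theta_M) - \lambda \lVert \theta_M \rVert_2^2 $

Maximizing $ F_{\text{reg}} $ instead of the raw likelihood biases the search toward models with small parameters, effectively shrinking $ \mathcal{H} $.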

Example of how imposing a dependency structure simplifies the representation of a PGM of the inside of a cell:

$ P(X_{1}, X_{2}, X_{3}, X_{4}, X_{5}, X_{6}, X_{7}, X_{8}) = P(X_{1}) \, P(X_{2}) \, P(X_{3} \mid X_{1}) \, P(X_{4} \mid X_{2}) \, P(X_{5} \mid X_{2}) \, P(X_{6} \mid X_{3}, X_{4}) \, P(X_{7} \mid X_{6}) \, P(X_{8} \mid X_{6}, X_{5}) $
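To quantify the saving (a quick count, assuming all eight variables are Boolean): the full joint needs one free parameter per configuration, minus one for normalization, while each factor above needs one free parameter per configuration of its parents:

$ \underbrace{2^{8} - 1}_{\text{full joint}} = 255 \qquad \text{vs.} \qquad \underbrace{1 + 1 + 2 + 2 + 2 + 4 + 2 + 4}_{\text{factors above}} = 18 $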

The benefits of this representation include the reduction in parameters shown above, a modular specification of local dependencies, and more efficient inference and learning.

Why PGMs?

How does a PGM differ from a GM?

The two types of GMs are:

  • Bayesian network: a directed acyclic graph (DAG).
  • Markov random field: an undirected graph.

For instance, the Bayes net uses a directed acyclic graph (DAG). Each node in a Bayes net has a Markov blanket, composed of its parents, its children, and its children’s children and parents. Every node is conditionally independent of the nodes outside its Markov blanket, given the blanket. Therefore, the local conditional probabilities together with the graph structure completely determine the joint probability distribution. This model represents causal relationships and can be used to generate new data. The graphs are acyclic in order to ensure no feedback loops, maintaining a clear causality structure.

By contrast, the Markov random field uses an undirected graph. Every node is conditionally independent of the rest of the graph, given its immediate neighbors. To determine the joint probability distribution, we need to know the local contingency functions (potentials) defined over the graph’s cliques. This model represents correlation between variables, but cannot explicitly generate new data. Note that these structures can include cycles, unlike Bayesian networks.
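To make the contrast concrete, here is a minimal sketch of an MRF joint as a normalized product of clique potentials; the chain structure and the potential values are hypothetical:

```python
from itertools import product

# A three-node chain MRF A - B - C with hypothetical pairwise potentials
# psi(a, b) and psi(b, c); a larger value means a more "compatible" pair.
psi_ab = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
psi_bc = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

# Partition function Z: sum of potential products over all configurations.
Z = sum(psi_ab[(a, b)] * psi_bc[(b, c)]
        for a, b, c in product((0, 1), repeat=3))

def p(a, b, c):
    """Joint P(a, b, c) = psi_ab(a, b) * psi_bc(b, c) / Z."""
    return psi_ab[(a, b)] * psi_bc[(b, c)] / Z

print(p(0, 0, 0))  # compatible configurations get the highest probability
```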

Towards structural specification of probability distributions

It is worth briefly tracing the evolution and key developments of graphical models over time. An (incomplete) genealogy of GMs appears below:

Genealogy of Graphical Models

Fancier GMs

GMs can be applied in many more advanced ways to solve complex problems in areas like reinforcement learning and machine translation.

Why GMs?