Lecture 02 - Statistics review
A statistics review, including MLE and MAP
Logistics Review
- Class webpage: lengerichlab.github.io/pgm-spring-2025
- Lecture scribe sign-up sheet
- Readings, Class Announcements, Assignment Submissions: Canvas
- Instructor: Ben Lengerich
- Office Hours: Thursday 2:30-3:30pm, 7278 Medical Sciences Center
- Email: lengerich@wisc.edu
- TA: Chenyang Jiang
- Office Hours: Monday 11am-12pm, 1219 Medical Sciences Center
- Email: cjiang77@wisc.edu
Homework Info
- Released: due the following Friday at midnight.
- PDF and LaTeX solution template (.tex) available on the course website.
- Submit via: Canvas
- Preferred format: PDF with your solution written in the provided solution box using LaTeX.
Probability Basics
- Random Variables:
- Discrete: Values that come from a countable set, meaning they can take on specific, distinct values. For example, the outcome of a coin flip (0 for tails, 1 for heads) or the number of students in a classroom. Discrete random variables are often associated with scenarios where the possible outcomes can be listed or enumerated.
- Continuous: Values that come from an interval or range on the real number line, meaning they can take on infinitely many possible values within that range. For instance, a person’s height (e.g., 170.5 cm) or the time it takes to complete a task (e.g., 2.34 hours). Continuous random variables are typically associated with measurements and involve smooth distributions.
- PMF and PDF:
- Probability Mass Function (PMF): $P(X = x)$ for discrete $X$.
- Probability Density Function (PDF): $f(x)$ for continuous $X$.
Bernoulli Distribution
$$P(X = x) = \theta^x (1 - \theta)^{1-x}, \quad x \in \{0, 1\}$$
The Bernoulli distribution is one of the simplest discrete probability distributions, representing a single trial with two possible outcomes: success ($x = 1$) and failure ($x = 0$). It is parameterized by $\theta$, which denotes the probability of success. This distribution is commonly used to model binary outcomes, such as flipping a coin, where the probability of heads or tails remains constant across trials.
- Example: A fair coin flip ($\theta = 0.5$).
In this case, the probability of success (e.g., heads) and failure (e.g., tails) are both equal to 0.5, making it a symmetric Bernoulli distribution. This example demonstrates how Bernoulli distributions are foundational for understanding more complex probabilistic models, such as the Binomial distribution.
Gaussian Distribution
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
The Gaussian distribution, also known as the normal distribution, is a continuous probability distribution widely used in statistics and machine learning. It is characterized by its symmetric, bell-shaped curve, which is determined by two parameters:
- $\mu$ (mean): Represents the center of the distribution, indicating where most data points are likely to be concentrated.
- $\sigma$ (standard deviation): Controls the spread or dispersion of the data, with larger values indicating a wider distribution.
The Gaussian distribution is fundamental in statistical theory due to the Central Limit Theorem, which states that the sum of a large number of independent random variables tends to follow a Gaussian distribution, regardless of the original distributions of the variables.
- “Normal”: because of the Central Limit Theorem. The term “normal” arises from the Central Limit Theorem, which underscores the prevalence of this distribution in natural and social phenomena. For example, measurements like height, weight, and test scores often approximate a normal distribution.
- “Standard Normal”: when $\mu = 0$, $\sigma = 1$. The standard normal distribution is a special case of the Gaussian distribution with a mean of 0 and a standard deviation of 1. It is often used in z-score calculations and statistical inference, where raw data are transformed into a standardized form for comparison across datasets.
Central Limit Theorem
- Let $X_1, X_2, \dots, X_n$ be i.i.d. random variables with mean $\mu$ and variance $\sigma^2$.
- Define the sample mean:
$$\overline{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$$
- Then, as $n \to \infty$:
$$\frac{\overline{X}_n - \mu}{\sigma / \sqrt{n}} \to N(0, 1)$$
- Explanation:
The Central Limit Theorem (CLT) is a fundamental result in probability and statistics, stating that the distribution of the sample mean of i.i.d. random variables approaches a normal distribution as the sample size $ n $ increases, regardless of the original distribution of the variables. Key points include:
- Independence and Identically Distributed (i.i.d.): The variables $ X_1, X_2, \dots, X_n $ must be independent and drawn from the same distribution.
- Sample Mean ($ \overline{X}_n $): Represents the average of the observations, serving as an estimator of the population mean $ \mu $.
- Standardized Form: The standardized version of $ \overline{X}_n $, given by $ \frac{\overline{X}_n - \mu}{\sigma / \sqrt{n}} $, converges to the standard normal distribution $ N(0, 1) $.
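To see the theorem in action, here is a minimal NumPy sketch (illustrative, not from the lecture) that draws repeated samples from a skewed Exponential(1) distribution and checks that the standardized sample means behave like $N(0, 1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0        # Exponential(1) has mean 1 and standard deviation 1
n, trials = 500, 5000       # sample size and number of repetitions

# Draw `trials` independent samples of size n and standardize each sample mean
samples = rng.exponential(scale=1.0, size=(trials, n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

# By the CLT, z should be approximately N(0, 1) even though the data are skewed
print(f"mean of z: {z.mean():.3f}  (expect ~0)")
print(f"std of z:  {z.std():.3f}  (expect ~1)")
```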
Joint, Marginal, and Conditional Probabilities
- Joint: $P(A, B)$, the probability of two events occurring together.
- Marginal: $P(A) = \sum_B P(A, B)$, the sum of joint probabilities over one variable.
- Conditional: $P(A \mid B) = \frac{P(A, B)}{P(B)}$, the probability of $A$ given $B$.
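All three quantities are easy to compute from a probability table. Below is a small sketch (hypothetical numbers) that stores a joint distribution over two binary events as a matrix and recovers the marginals and conditionals from it:

```python
import numpy as np

# Hypothetical joint distribution P(A, B); rows index A, columns index B
P = np.array([[0.10, 0.20],
              [0.30, 0.40]])

P_A = P.sum(axis=1)       # marginal: P(A) = sum_B P(A, B)
P_B = P.sum(axis=0)       # marginal: P(B) = sum_A P(A, B)
P_A_given_B = P / P_B     # conditional: P(A | B) = P(A, B) / P(B), column-wise

print("P(A):", P_A)                # [0.3 0.7]
print("P(B):", P_B)                # [0.4 0.6]
print("P(A|B):\n", P_A_given_B)    # each column sums to 1
```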
Expectation and Variance
- Expectation:
- Discrete: $E[X] = \sum_x xP(X = x)$
- Continuous: $E[X] = \int x f(x) dx$
- Variance:
- Definition: $Var(X) = E[(X - E[X])^2]$
- Equivalent: $Var(X) = E[X^2] - (E[X])^2$
Linearity of Expectation
- Property:
- $E[aX + b] = aE[X] + b$
- Multiple Variables:
- $E[X_1 + X_2] = E[X_1] + E[X_2]$
Expectation of Functions
- Formula:
- Discrete: $E[g(X)] = \sum_x g(x)P(X = x)$
- Continuous: $E[g(X)] = \int g(x)f(x)dx$
- Example (Discrete):
- If $X \sim Bernoulli(\theta)$ and $g(X) = X^2$, then: $E[g(X)] = 1^2\theta + 0^2(1 - \theta) = \theta$
- Example (Continuous):
- If $X \sim Uniform(0, 1)$ and $g(X) = X^2$, then: $E[g(X)] = \int_0^1 x^2 dx = \frac{1}{3}$
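Both worked examples are easy to sanity-check by Monte Carlo; a short sketch (with $\theta = 0.4$ chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 200_000, 0.4   # theta is an arbitrary illustrative value

# Discrete: X ~ Bernoulli(theta), g(X) = X^2; since X is 0 or 1, E[X^2] = theta
x = rng.binomial(1, theta, size=n)
print(f"E[X^2] ≈ {np.mean(x**2):.3f}  (exact: {theta})")

# Continuous: X ~ Uniform(0, 1), g(X) = X^2; E[X^2] = 1/3
u = rng.uniform(0, 1, size=n)
print(f"E[U^2] ≈ {np.mean(u**2):.3f}  (exact: {1/3:.3f})")
```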
Variance of Functions
- Definition:
- $Var(g(X)) = E[(g(X) - E[g(X)])^2]$
- Equivalent: $Var(g(X)) = E[g(X)^2] - (E[g(X)])^2$
Covariance and Correlation
- Covariance:
- $Cov(X, Y) = E[(X - E[X])(Y - E[Y])]$
- Covariance measures the degree to which two random variables change together. A positive value indicates that as one variable increases, the other tends to increase as well, while a negative value indicates the opposite.
- Properties:
- $Cov(X, X) = Var(X)$: The covariance of a variable with itself is its variance.
- If $X$ and $Y$ are independent: $Cov(X, Y) = 0$, meaning there is no linear relationship between the variables. (The converse does not hold: zero covariance does not imply independence.)
- Correlation:
- $\rho(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X)Var(Y)}}$: Correlation is a normalized version of covariance, which scales the relationship between $X$ and $Y$ to the range $[-1, 1]$.
- $\rho = 1$: Perfect positive linear relationship.
- $\rho = 0$: No linear relationship.
- $\rho = -1$: Perfect negative linear relationship.
- Correlation ($\rho$) provides a scale-free measure of linear dependence: unlike covariance, it is unchanged by positive rescaling of $X$ or $Y$ (e.g., a change of units), which makes it comparable across datasets.
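NumPy exposes both quantities directly; a quick sketch with simulated data (coefficients chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(scale=0.5, size=1000)   # y depends linearly on x, plus noise

cov_xy = np.cov(x, y)[0, 1]        # off-diagonal entry of the 2x2 covariance matrix
rho_xy = np.corrcoef(x, y)[0, 1]   # normalized to [-1, 1]

print(f"Cov(X, Y) ≈ {cov_xy:.2f}")   # ≈ 2 * Var(X) ≈ 2
print(f"rho(X, Y) ≈ {rho_xy:.2f}")   # close to +1: strong positive linear relationship
```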
Bayes’ Rule
- Formula: $P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$
- Example (Medical test):
- $P(disease \mid positive \ test) = \frac{P(positive \ test \mid disease) P(disease)}{P(positive \ test)}$
- In this example, the goal is to calculate the probability that a patient has a disease given a positive test result. This is a typical application of Bayes’ Rule in medical diagnostics.
- $P(positive \ test \mid disease)$ represents the sensitivity of the test (true positive rate), indicating the likelihood of a positive result if the patient has the disease.
- $P(disease)$ is the prior probability, or the prevalence of the disease in the population.
- $P(positive \ test)$ is the overall probability of a positive test, which accounts for both true positives and false positives.
- By combining these factors, Bayes’ Rule helps adjust the prior belief about the disease based on the observed test result, allowing for more accurate diagnosis. This is especially useful when the disease is rare, as the prior probability significantly impacts the result.
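A quick numerical illustration (the test characteristics below are hypothetical, chosen to show the rare-disease effect):

```python
# Hypothetical test: 99% sensitivity, 5% false-positive rate, 0.1% prevalence
sensitivity = 0.99        # P(positive | disease)
false_positive = 0.05     # P(positive | no disease)
prevalence = 0.001        # P(disease)

# Law of total probability: P(positive) over both disease states
p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)

# Bayes' rule
posterior = sensitivity * prevalence / p_positive
print(f"P(disease | positive test) ≈ {posterior:.4f}")  # ≈ 0.0194
```

Despite the accurate-sounding test, the posterior is under 2%, because the low prior probability dominates.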
Estimation Methods
Introduction to Estimation
- Goal of Estimation:
- Infer unknown parameters $\theta$ from observed data.
- Estimation aims to use sample data to make inferences about the population, bridging the gap between data and statistical models.
- Types of Estimation:
- Point Estimation:
- Provides a single best guess for the parameter (e.g., Maximum Likelihood Estimation, MLE).
- Example: Estimating the mean of a population using the sample mean.
- Interval Estimation:
- Provides a range of plausible values for the parameter (e.g., confidence intervals).
- Example: Giving a 95% confidence interval for the mean, indicating that the true mean is likely within this range.
- Common Methods:
- Maximum Likelihood Estimation (MLE):
- Finds the parameter value that maximizes the likelihood of observing the given data.
- Example: Used in logistic regression or fitting distributions.
- Maximum A Posteriori (MAP):
- Combines prior knowledge with the data to estimate the parameter, often used in Bayesian inference.
- Method of Moments:
- Matches sample moments (e.g., mean, variance) with theoretical moments to estimate parameters.
Maximum Likelihood Estimation (MLE)
- Definition:
- Find $\hat{\theta}$ that maximizes the likelihood of observing the given data.
- $\hat{\theta} = \underset{\theta}{\text{argmax}} \, L(\theta)$, where $L(\theta) = P(\text{data} \mid \theta)$.
- Interpretation:
- $L(\theta)$ represents the probability of the observed data given $\theta$.
- The MLE selects the parameter $\theta$ that makes the observed data the most “likely.”
- It’s a widely used method due to its simplicity and strong statistical properties under large sample conditions.
Example (MLE with Bernoulli Distribution)
- Dataset: $X = \{1, 0, 1, 1, 0\}$.
- Bernoulli distribution:
- $P(X = 1 \mid \theta) = \theta$ and $P(X = 0 \mid \theta) = 1 - \theta$.
- Likelihood function: $L(\theta) = \prod_i \theta^{x_i} (1 - \theta)^{1 - x_i}$.
- Log-likelihood function:
- $\ell(\theta) = \log L(\theta) = \sum_{i=1}^n x_i \log \theta + (1 - x_i) \log (1 - \theta)$.
- Derivative and solution:
- $\frac{d\ell}{d\theta} = \frac{k}{\theta} - \frac{n - k}{1 - \theta}$, where $k = \sum x_i$.
- $\hat{\theta} = \frac{k}{n}$.
This example shows the computational steps to estimate $\theta$, demonstrating how MLE converts complex probabilistic models into simpler optimization problems.
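The closed form can be verified numerically on the dataset above; a minimal sketch:

```python
import numpy as np

X = np.array([1, 0, 1, 1, 0])   # dataset from the example
k, n = X.sum(), len(X)
theta_mle = k / n               # closed-form MLE
print("MLE:", theta_mle)        # 0.6

# Check: the log-likelihood is indeed maximized at theta = k/n
def log_likelihood(theta):
    return k * np.log(theta) + (n - k) * np.log(1 - theta)

grid = np.linspace(0.01, 0.99, 99)
print("argmax on grid:", grid[np.argmax(log_likelihood(grid))])  # ≈ 0.6
```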
Limitations of MLE
- The MLE:
- Does not always exist (e.g., when data provides insufficient information).
- Is not necessarily unique (e.g., multimodal likelihood functions).
- May not be admissible (e.g., in small sample sizes, other estimators may perform better).
Maximum A Posteriori (MAP) Estimation
- Definition:
- $\hat{\theta}_{MAP} = \underset{\theta}{\text{argmax}} \, P(\theta \mid \text{data}) \propto P(\text{data} \mid \theta) P(\theta)$.
- Interpretation:
- $P(\text{data} \mid \theta)$ is the likelihood.
- $P(\theta)$ incorporates prior knowledge or beliefs about $\theta$.
- MAP extends MLE by integrating prior information, providing more robust estimates in certain scenarios.
Regularization and MAP
- MLE with Regularization:
- Adds a penalty term to the log-likelihood to avoid overfitting:
- $\hat{\theta}_{\text{reg}} = \underset{\theta}{\text{argmax}} \, [\log L(\theta) - \lambda R(\theta)]$.
- MAP as Penalized MLE:
- By letting $P(\theta) \propto e^{-\lambda R(\theta)}$, MAP naturally incorporates the penalty:
- $\hat{\theta}_{MAP} = \underset{\theta}{\text{argmax}} \, [\log L(\theta) + \log P(\theta)] = \hat{\theta}_{\text{reg}}$.
Method of Moments
- Definition:
- Matches sample moments (e.g., mean, variance) to theoretical moments to estimate parameters.
- Example:
- Bernoulli Distribution:
- $E[X] = \theta$, so $\hat{\theta} = \bar{X}$ (sample mean).
- Gaussian Distribution:
- $E[X] = \mu$, $Var(X) = \sigma^2$.
- $\hat{\mu} = \bar{X}$, $\hat{\sigma}^2 = \frac{1}{n} \sum (X_i - \bar{X})^2$.
The method of moments provides a simple and intuitive alternative to MLE, particularly for models where moments are directly tied to parameters.
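A short sketch of the Gaussian case on simulated data (true parameters $\mu = 5$, $\sigma = 2$ assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)   # true mu = 5, sigma^2 = 4

# Match sample moments to theoretical moments
mu_hat = data.mean()                        # E[X] = mu
sigma2_hat = np.mean((data - mu_hat)**2)    # Var(X) = sigma^2

print(f"mu_hat ≈ {mu_hat:.3f}, sigma2_hat ≈ {sigma2_hat:.3f}")  # ≈ 5 and ≈ 4
```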
Linear Regression
- Model Definition
- $y = X \beta + \epsilon$, where:
- $y$: Response variable (dependent variable).
- $X$: Design matrix (independent variables or features).
- $\beta$: Coefficients (parameters to estimate).
- $\epsilon$: Error term (often assumed to be $N(0, \sigma^2)$).
- Goal: Estimate $\beta$.
Linear Regression Evaluation Metrics
- Coefficient of Determination ($R^2$):
- $R^2 = 1 - \frac{SS_{residual}}{SS_{total}}$
- Measures the proportion of variance explained by the model.
- Mean Squared Error (MSE):
- $MSE = \frac{1}{n} \sum (y_i - \hat{y}_i)^2$
- Measures the squared error between predictions and actual values.
- Mean Absolute Error (MAE):
- $MAE = \frac{1}{n} \sum |y_i - \hat{y}_i|$
- Measures the absolute error between predictions and actual values.
Ordinary Least Squares (OLS)
- Objective:
- Minimize the sum of squared residuals: $\hat{\beta}_{OLS} = \arg\min_\beta \|y - X\beta\|^2$
- Residuals:
- $e_i = y_i - \hat{y}_i$
- Solution:
- $\hat{\beta}_{OLS} = (X^T X)^{-1} X^T y$
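The closed-form solution is a one-liner with NumPy; a minimal sketch on simulated data (true coefficients chosen for illustration). Solving the normal equations with `np.linalg.solve` is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# OLS: beta_hat = (X^T X)^{-1} X^T y, computed via the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # ≈ [1, 2, -3]
```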
Regularization in Linear Regression (MAP)
- Ridge Regression (L2 Regularization):
- Adds an $L2$ penalty: $\hat{\beta}_{ridge} = \arg\min_\beta \|y - X\beta\|^2 + \lambda \|\beta\|^2$
- Equivalent MAP interpretation:
- Prior on coefficients: $\beta \sim N(0, \frac{\sigma^2}{\lambda})$.
- MAP estimate maximizes $P(\beta \mid y)$ by incorporating prior information.
- Lasso Regression (L1 Regularization):
- Adds an $L1$ penalty: $\hat{\beta}_{lasso} = \arg\min_\beta \|y - X\beta\|^2 + \lambda \|\beta\|_1$
- Equivalent to $\beta \sim Laplace(0, \frac{\sigma^2}{\lambda})$.
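Ridge also has a closed form, $(X^T X + \lambda I)^{-1} X^T y$, which makes the shrinkage effect easy to see; lasso has no closed form and is typically solved iteratively (e.g., by coordinate descent). A minimal ridge sketch on simulated data (illustrative coefficients):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.5, size=50)

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X^T X + lambda I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(ridge(X, y, 0.0))    # lambda = 0 recovers OLS, ≈ [1, 2, -3]
print(ridge(X, y, 100.0))  # larger lambda shrinks coefficients toward 0
```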
Extensions of Linear Regression
- Polynomial Regression:
- Adds polynomial terms: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots$
- Generalized Linear Models:
- Extends to non-normal distributions through a link function: $g(E[Y]) = X\beta$
- Interaction Terms:
- Includes interactions between predictors: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
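Polynomial and interaction models are still linear in $\beta$, so OLS applies unchanged once the design matrix is expanded; a brief sketch (quadratic model with assumed coefficients):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(scale=0.3, size=200)

# Expand the design matrix with polynomial terms; the model stays linear in beta
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # ≈ [1.0, 0.5, -2.0]
```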
Optimization
Convexity
- Definition: $ \forall \lambda \in [0,1], \, f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y). $
- Convex functions satisfy a relaxed linearity, meaning the function value at any point along the line segment between $x$ and $y$ is no greater than the weighted average of the function values at $x$ and $y$.
Strictly Convex Function
- Definition: $ \forall \lambda \in (0,1), \, f(\lambda x + (1-\lambda)y) < \lambda f(x) + (1-\lambda)f(y). $
- Strict convexity ensures the function curves strictly below the weighted average line unless $x=y$, which provides stronger guarantees in optimization.
Strongly Convex Function
- Definition: $\exists \mu > 0, \, x \mapsto f(x) - \mu \|x\|^2 \text{ is convex.}$
- Alternatively: $f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y) - \mu \lambda (1-\lambda)\|x-y\|^2.$
- Strong convexity strengthens the guarantees of convexity by penalizing deviations, providing faster convergence in optimization algorithms.
Convexity Aids Optimization
- Implications
- If $ f $ is convex and differentiable: $ f(y) \geq f(x) + \nabla f(x)^T (y-x). $
- Key results:
- Convex function: All local minima are global minima.
- Strictly convex function: A local minimum is unique and global.
- Strongly convex function: Ensures a unique local minimum with better optimization convergence properties.
Convexity simplifies optimization problems by ensuring the absence of local minima other than the global one, and strong convexity further aids optimization with faster convergence.
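As a concrete illustration (a toy example, not from the lecture): $f(x) = \frac{1}{2} x^T A x - b^T x$ with positive definite $A$ is strongly convex, so plain gradient descent reaches its unique global minimum $A^{-1} b$:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])       # positive definite => f is strongly convex
b = np.array([1.0, 1.0])

x = np.zeros(2)
step = 0.2                       # any step below 2 / lambda_max(A) converges
for _ in range(200):
    x -= step * (A @ x - b)      # gradient of f(x) = 0.5 x^T A x - b^T x

print("gradient descent:", x)               # ≈ [0.2, 0.4]
print("exact minimizer: ", np.linalg.solve(A, b))
```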
Convexity is Overrated
- Practical Perspective:
- Yann LeCun highlights that suitable architectures often outweigh the necessity of convexity.
- Example: Shallow (convex) classifiers vs. Deep (non-convex) classifiers.
- Non-Convex Advantages:
- Non-convex loss functions can improve accuracy and speed.
- Even shallow architectures like SVMs benefit from non-convex loss.
While convexity ensures theoretical guarantees, real-world machine learning tasks often achieve better performance by embracing non-convex approaches, especially in modern neural networks.