Lecture 21: LLMs from a Probabilistic Perspective 2: Training on Unlabeled Data

layout: distill
title: "Lecture 21: LLMs from a Probabilistic Perspective 2: Training on Unlabeled Data"
date: 2025-04-24

lecturers:

  • name: Ben Lengerich url: "https://lengerichlab.github.io/"

authors:

  • name: Joshua Salinas

  • name: Mitchell Stephens


Announcements

  • Project presentations: April 29 and May 1.
  • Submit peer review forms on Canvas each presentation day to earn up to a 2% bonus; forms are due by Friday, May 2.

Unsupervised Training of LLMs

Maximum Likelihood Estimation (MLE)

  • GPT models maximize the likelihood of observed sequences:

    $\max_{\theta} \sum_{i=1}^T \log P_{\theta}(x_i \mid x_{<i})$

  • This autoregressive factorization corresponds to a directed probabilistic graphical model.
  • Predicting even simple tokens (e.g., “is” in a factual sentence) requires grammar, factual knowledge, and context resolution.
  • Positional encodings are added to input embeddings to provide sequence order information.
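
As a minimal sketch of this objective (assuming PyTorch; the shapes and vocabulary size are illustrative, not from the lecture), the autoregressive MLE loss reduces to a cross-entropy between shifted token sequences:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch of 2 sequences, length 8, vocabulary of 100 tokens.
batch, seq_len, vocab = 2, 8, 100
tokens = torch.randint(0, vocab, (batch, seq_len))   # observed sequence x
logits = torch.randn(batch, seq_len, vocab)          # stand-in for model outputs

# Align the prediction at position t with the token at position t+1,
# so each term is log P_theta(x_i | x_<i).
log_probs = F.log_softmax(logits[:, :-1], dim=-1)
targets = tokens[:, 1:]
nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
loss = nll.mean()   # minimizing this maximizes sum_i log P_theta(x_i | x_<i)
```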

Historical Context

  • MLE-based language models long predate modern LLMs.
  • A 2007 Google n-gram language model was trained on roughly 2 trillion tokens and stored about 300 billion n-grams.
  • Early models focused on n-grams, not embeddings.
  • Lack of deep representation and contextual modeling.
  • The LLM breakthrough came with transformers, large datasets, and GPU acceleration.

GPU Importance

  • CPUs typically have 4-8 cores with complex control units.
  • GPUs split tasks across hundreds of simpler cores for efficient matrix multiplication.
  • Downside: individual GPU cores cannot communicate directly and have slower memory access, but their combined throughput still makes matrix computations far faster overall.
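
A quick sketch of this tradeoff (assuming PyTorch with a CUDA device available; the matrix sizes are arbitrary):

```python
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

# CPU: a handful of complex cores work through the multiply.
t0 = time.perf_counter()
_ = a @ b
print(f"CPU: {time.perf_counter() - t0:.3f}s")

if torch.cuda.is_available():
    # GPU: the same multiply split across many simple cores.
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()        # GPU kernels launch asynchronously
    t0 = time.perf_counter()
    _ = a_gpu @ b_gpu
    torch.cuda.synchronize()        # wait for the kernel before stopping the clock
    print(f"GPU: {time.perf_counter() - t0:.3f}s")
```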


Scale and Emergent Capabilities

  • As model scale increases, new capabilities emerge unexpectedly.
  • Examples: in-context learning, chain-of-thought reasoning.
  • Refer to: Scaling Laws for Neural Language Models (Kaplan et al., 2020).

Example:
Predicting “is” in “The capital of France __ Paris” requires:

  • Subject-verb agreement
  • Recognition of factual structures
  • Geographical knowledge

In-Context Learning

  • After pretraining, an LLM can adapt $P_{\theta}(x)$ to new tasks purely by conditioning on the prompt, with no parameter updates.
  • Zero-Shot: Only given instructions, no examples.
  • Few-Shot with Instruction: Task and examples are provided.
  • Few-Shot with Examples Only: The model must infer the task from examples without instructions.
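
Illustrative prompt strings for the three regimes (the translation task and phrasings are hypothetical examples, not from the lecture):

```python
# Zero-shot: instruction only, no examples.
zero_shot = "Translate English to French:\ncheese ->"

# Few-shot with instruction: task description plus demonstrations.
few_shot_instruction = (
    "Translate English to French:\n"
    "sea otter -> loutre de mer\n"
    "cheese ->"
)

# Few-shot, examples only: the task must be inferred from the pattern.
few_shot_examples_only = (
    "sea otter -> loutre de mer\n"
    "plush giraffe -> girafe en peluche\n"
    "cheese ->"
)
```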


Chain-of-Thought

  • Chain-of-Thought prompts the model to explain its reasoning.
  • Example: showing all steps to derive a complex equation.
  • Helps the model adapt quickly and respond more accurately, especially for complex tasks.
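
A sketch of such a prompt (the word problems are illustrative; the key is that the demonstration answer spells out its intermediate steps):

```python
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    # The demonstration shows its reasoning step by step before the answer:
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
    "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. "
    "How many apples do they have?\n"
    "A:"
)
```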

Why Does This Work?

  • No definitive proof, but hypotheses include:
    • Task identification as implicit Bayesian inference.
    • Transformers performing in-context gradient descent.
    • Possibly a combination of both.

Scale

  • One hypothesis: scaling adds randomness and diversity, so an overparameterized model behaves like an implicit ensemble whose components' errors partially cancel, mitigating overfitting.
  • More parameters also increase variance and randomness during training.
  • This can be beneficial for specialized models but makes data filtering and preparation challenging.
  • Fitting on poor-quality data leads to bad distributions.


Challenges of MLE

KL Divergence Minimization

  • MLE objective (the equivalence is derived just after this list):

    \[\arg\max_{\theta} \mathbb{E}_{x \sim P_{\text{data}}}[\log P_{\theta}(x)] = \arg\min_{\theta} \text{KL}(P_{\text{data}} \| P_{\theta})\]
  • Issues:

    • Repetitiveness, memorization of training data.
    • Lack of semantic grounding or task objectives.
    • Must assign probabilities to incoherent sequences.
    • Not task-aware — doesn’t optimize for accuracy or meaning.
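
To see the equivalence stated above, expand the KL divergence and note that the first term (the entropy of the data) does not depend on $\theta$:

\[\text{KL}(P_{\text{data}} \| P_{\theta}) = \mathbb{E}_{x \sim P_{\text{data}}}[\log P_{\text{data}}(x)] - \mathbb{E}_{x \sim P_{\text{data}}}[\log P_{\theta}(x)]\]

Maximizing the expected log-likelihood therefore minimizes the KL divergence exactly.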

Entropy and Confidence

  • Token-level entropy:

    $H_t = -\sum_v P(x_t = v \mid x_{<t}) \log P(x_t = v \mid x_{<t})$

  • Lower entropy = higher confidence.
  • Low entropy can lead to generic or repetitive outputs.
  • Sampling strategies (greedy, top-k, nucleus) help control diversity.
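
A minimal sketch (assuming PyTorch; the vocabulary size is illustrative) of computing $H_t$ and drawing a nucleus sample:

```python
import torch
import torch.nn.functional as F

def token_entropy(logits):
    """H_t = -sum_v P(x_t = v | x_<t) log P(x_t = v | x_<t)."""
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p.clamp_min(1e-12))).sum(-1)

def nucleus_sample(logits, top_p=0.9):
    """Sample from the smallest token set whose cumulative probability >= top_p."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(-1)
    # Drop tokens outside the nucleus (the top token is always kept).
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum()
    return sorted_idx[torch.multinomial(sorted_probs, 1)]

logits = torch.randn(100)            # stand-in next-token logits, vocab of 100
print(token_entropy(logits).item(), nucleus_sample(logits).item())
```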

  • ELBO (Evidence Lower Bound):

    $\log p(x \mid \theta) \geq \mathbb{E}_{z \sim q}[\log p(x,z \mid \theta)] + H(q)$

  • Maximizing ELBO encourages higher entropy in latent variables.
  • MLE lacks explicit entropy control.
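
For reference, the bound follows from Jensen's inequality applied with any distribution $q(z)$:

\[\log p(x \mid \theta) = \log \sum_z q(z) \frac{p(x, z \mid \theta)}{q(z)} \geq \mathbb{E}_{z \sim q}\left[\log \frac{p(x, z \mid \theta)}{q(z)}\right] = \mathbb{E}_{z \sim q}[\log p(x, z \mid \theta)] + H(q)\]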

Solutions to MLE Limitations

  • Entropy Regularization (sketched in code after this list):

    $\theta^* = \arg\max_{\theta} \mathbb{E}_{x \sim P_{\text{data}}}[\log P_{\theta}(x)] + \lambda \cdot H[P_{\theta}(x)]$

  • Label Smoothing (also sketched after this list):

    $y'_i = \begin{cases} 1 - \epsilon & i = y \\ \epsilon / (V - 1) & i \neq y \end{cases}$

  • Other methods:

    • Contrastive learning (e.g., noise contrastive estimation).
    • Preference- or utility-based training (e.g., RLHF).
    • Risk minimization (optimize downstream task loss).
    • Scheduled sampling.
    • Objectives for coverage/diversity.
    • Penalizing low entropy.
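
Minimal sketches of the first two fixes (assuming PyTorch; the $\lambda$ and $\epsilon$ values are illustrative):

```python
import torch
import torch.nn.functional as F

def entropy_regularized_loss(logits, target, lam=0.01):
    """NLL minus lambda * entropy: penalizes overconfident distributions."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(log_probs, target)
    entropy = -(log_probs.exp() * log_probs).sum(-1).mean()
    return nll - lam * entropy   # lower loss for higher-entropy predictions

def label_smoothed_loss(logits, target, eps=0.1):
    """Cross-entropy against the smoothed targets y' defined above."""
    V = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, eps / (V - 1))    # off-target mass
    smooth.scatter_(-1, target.unsqueeze(-1), 1 - eps)    # true-class mass
    return -(smooth * log_probs).sum(-1).mean()

logits = torch.randn(4, 100)                 # batch of 4, vocabulary of 100
target = torch.randint(0, 100, (4,))
print(entropy_regularized_loss(logits, target), label_smoothed_loss(logits, target))
```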

References and Tools

  • Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.