---
layout: distill
title: "Lecture 21: LLMs from a Probabilistic Perspective 2: Training on Unlabeled Data"
date: 2025-04-24
lecturers:
  - name: Ben Lengerich
    url: "https://lengerichlab.github.io/"
authors:
  - name: Joshua Salinas
  - name: Mitchell Stephens
---
## Announcements

- Project presentations: April 29 and May 1.
- Submit peer review forms on Canvas each day to earn up to a 2% bonus.
  - Due by Friday, May 2.
## Unsupervised Training of LLMs

### Maximum Likelihood Estimation (MLE)

- GPT models maximize the likelihood of observed sequences (a minimal sketch follows this list):
  $\max_{\theta} \sum_{i=1}^T \log P_{\theta}(x_i \mid x_{<i})$
- This factorization defines a directed probabilistic graphical model: each token depends only on the tokens before it.
- Predicting even simple tokens (e.g., “is” in a factual sentence) requires grammar, factual knowledge, and context resolution.
- Positional encodings are added to input embeddings to provide sequence-order information.
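
A minimal sketch of this objective in Python, using toy, hand-specified probabilities rather than a real model (the helper name `sequence_log_likelihood` is hypothetical):

```python
import numpy as np

def sequence_log_likelihood(step_probs, tokens):
    """Sum log P(x_i | x_<i) over a sequence.

    step_probs: array of shape (T, V); row i is the model's distribution
                over the vocabulary at step i, given the prefix x_<i.
    tokens:     length-T list of observed token ids.
    """
    return sum(np.log(step_probs[i, t]) for i, t in enumerate(tokens))

# Toy example: vocabulary of 3 tokens, sequence of length 2.
probs = np.array([[0.7, 0.2, 0.1],   # P(x_1)
                  [0.1, 0.8, 0.1]])  # P(x_2 | x_1)
print(sequence_log_likelihood(probs, [0, 1]))  # log 0.7 + log 0.8
```

Training adjusts $\theta$ so that this sum, taken over the whole corpus, is as large as possible.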
### Historical Context

- Large MLE-based language models date back to at least 2007.
- Google’s 2007 MLE-based language model was trained on 2 trillion tokens and contained 300 billion n-grams.
- Early models relied on n-gram counts, not learned embeddings.
- They lacked deep representations and contextual modeling.
- The LLM breakthrough came with transformers, large datasets, and GPU acceleration.
### GPU Importance

- CPUs typically have 4-8 cores with complex control units.
- GPUs split tasks across hundreds of simpler cores, which makes matrix multiplication highly efficient.
- Downside: GPU cores don’t communicate directly and have slower memory access per core, but the massive parallelism still yields faster computation overall.
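
A quick illustration of the payoff; this sketch assumes PyTorch is installed and simply falls back to CPU when no GPU is available:

```python
import torch

# Each entry of C is an independent dot product, so the work maps
# naturally onto many simple cores running in parallel.
device = "cuda" if torch.cuda.is_available() else "cpu"

A = torch.randn(4096, 4096, device=device)
B = torch.randn(4096, 4096, device=device)
C = A @ B  # dispatched across many GPU threads when device == "cuda"
print(device, C.shape)
```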
### Scale and Emergent Capabilities

- As model scale increases, new capabilities emerge unexpectedly.
- Examples: in-context learning, chain-of-thought reasoning.
- Refer to: Scaling Laws for Neural Language Models (Kaplan et al., 2020).
Example: predicting “is” in “The capital of France __ Paris” requires:

- Subject-verb agreement
- Recognition of factual structures
- Geographical knowledge
### In-Context Learning

- After seeing just a few examples in the prompt, the LLM adapts its predictive distribution $P(x)$ to the task without any weight updates (prompt sketches follow this list).
- Zero-shot: only instructions are given, no examples.
- Few-shot with instruction: both the task description and examples are provided.
- Few-shot with examples only: the model must infer the task from the examples alone.
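
The three regimes differ only in how the prompt is written. A sketch of the prompt formats, using a translation task as a stand-in example (in the style of the GPT-3 paper):

```python
# Zero-shot: instruction only, no examples.
zero_shot = "Translate English to French.\nEnglish: cheese\nFrench:"

# Few-shot with instruction: task description plus demonstrations.
few_shot_with_instruction = (
    "Translate English to French.\n"
    "English: sea otter -> French: loutre de mer\n"
    "English: cheese -> French:"
)

# Few-shot with examples only: the task must be inferred from the pattern.
few_shot_examples_only = (
    "sea otter -> loutre de mer\n"
    "plush giraffe -> girafe peluche\n"
    "cheese ->"
)
```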
### Chain-of-Thought

- Chain-of-thought prompting asks the model to spell out its reasoning.
- Example: showing all of the steps used to derive a complex equation.
- This helps the model adapt quickly and respond more accurately, especially on complex tasks (a prompt sketch follows this list).
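
A sketch of the difference between a standard prompt and a chain-of-thought prompt; the arithmetic example is illustrative, adapted from common chain-of-thought demonstrations:

```python
# Standard few-shot prompt: the demonstration gives only the answer.
standard = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: 11.\n"
    "Q: <new question>\nA:"
)

# Chain-of-thought prompt: the demonstration spells out the reasoning,
# encouraging the model to generate intermediate steps as well.
chain_of_thought = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
    "Q: <new question>\nA:"
)
```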
### Why Does This Work?

- There is no definitive proof, but hypotheses include:
  - Task identification as implicit Bayesian inference.
  - Transformers performing in-context gradient descent.
  - Possibly a combination of both.
### Scale

- Scaling adds randomness and diversity; errors across components can cancel in an ensemble-like effect, which helps prevent overfitting.
- More parameters increase variance and randomness.
- This can benefit specialized models, but it makes data filtering and preparation challenging.
- Fitting to poor-quality data yields a poor learned distribution.
## Challenges of MLE

### KL Divergence Minimization

- The MLE objective is equivalent to minimizing the forward KL divergence from the data distribution (checked numerically in the sketch after this list):
  \[\arg\max_{\theta} \mathbb{E}_{x \sim P_{\text{data}}}[\log P_{\theta}(x)] = \arg\min_{\theta} \text{KL}(P_{\text{data}} \,\|\, P_{\theta})\]
- Issues:
  - Repetitiveness and memorization of training data.
  - Lack of semantic grounding or task objectives.
  - The model must assign probabilities to incoherent sequences.
  - Not task-aware: it doesn’t optimize for accuracy or meaning.
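
The equivalence above can be checked numerically. A minimal sketch over a hand-picked three-token distribution, using the identity $\mathbb{E}_{x \sim P_{\text{data}}}[\log P_{\theta}(x)] = -\text{KL}(P_{\text{data}} \,\|\, P_{\theta}) - H(P_{\text{data}})$:

```python
import numpy as np

p_data  = np.array([0.5, 0.3, 0.2])   # "true" data distribution (toy)
p_model = np.array([0.4, 0.4, 0.2])   # model distribution (toy)

expected_ll = np.sum(p_data * np.log(p_model))   # MLE objective
kl = np.sum(p_data * np.log(p_data / p_model))   # KL(P_data || P_model)
neg_entropy = np.sum(p_data * np.log(p_data))    # -H(P_data)

# E_{x~P_data}[log P_model(x)] = -KL(P_data || P_model) - H(P_data)
assert np.isclose(expected_ll, -kl + neg_entropy)
```

Since $H(P_{\text{data}})$ does not depend on $\theta$, maximizing the expected log-likelihood is the same as minimizing the KL divergence.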
### Entropy and Confidence

- Token-level entropy:
  $H_t = -\sum_v P(x_t = v \mid x_{<t}) \log P(x_t = v \mid x_{<t})$
- Lower entropy = higher confidence.
- Low entropy can lead to generic or repetitive outputs.
- Sampling strategies (greedy, top-k, nucleus) help control diversity (see the sketch after this list).
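
A minimal sketch of token-level entropy and two of these sampling strategies; the function names and the toy distribution are illustrative:

```python
import numpy as np

def entropy(p):
    """H_t = -sum_v p(v) log p(v), with the convention 0 log 0 = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def top_k_sample(p, k, rng):
    """Keep only the k most likely tokens, renormalize, then sample."""
    idx = np.argsort(p)[-k:]
    return rng.choice(idx, p=p[idx] / p[idx].sum())

def nucleus_sample(p, top_p, rng):
    """Keep the smallest set of tokens whose cumulative mass >= top_p."""
    order = np.argsort(p)[::-1]
    cum = np.cumsum(p[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    return rng.choice(keep, p=p[keep] / p[keep].sum())

rng = np.random.default_rng(0)
p = np.array([0.55, 0.25, 0.10, 0.05, 0.05])  # toy next-token distribution
print(entropy(p))                   # low entropy => confident prediction
print(top_k_sample(p, 2, rng))      # samples only from the top-2 tokens
print(nucleus_sample(p, 0.9, rng))  # samples from the 90% probability mass
```

Greedy decoding corresponds to always taking `np.argmax(p)`; top-k and nucleus sampling trade a little likelihood for diversity.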
### Links to Variational Inference

- ELBO (Evidence Lower Bound):
  $\log p(x \mid \theta) \geq \mathbb{E}_{z \sim q}[\log p(x, z \mid \theta)] + H(q)$
- Maximizing the ELBO encourages higher entropy in the latent variables.
- MLE, by contrast, has no explicit entropy control.
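
The bound follows from Jensen’s inequality applied to any variational distribution $q(z)$:

\[\log p(x \mid \theta) = \log \mathbb{E}_{z \sim q}\!\left[\frac{p(x, z \mid \theta)}{q(z)}\right] \geq \mathbb{E}_{z \sim q}[\log p(x, z \mid \theta)] - \mathbb{E}_{z \sim q}[\log q(z)] = \mathbb{E}_{z \sim q}[\log p(x, z \mid \theta)] + H(q).\]

The $H(q)$ term is what rewards entropy in the latent variables; the plain MLE objective has no analogous term.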
## Solutions to MLE Limitations

- Entropy regularization:
  $\theta^* = \arg\max_{\theta} \mathbb{E}_{x \sim P_{\text{data}}}[\log P_{\theta}(x)] + \lambda \cdot H[P_{\theta}(x)]$
- Label smoothing (a sketch follows this list):
  $y'_i = \begin{cases} 1 - \epsilon & y_i = 1 \\ \epsilon / (V - 1) & y_i = 0 \end{cases}$
- Other methods:
  - Contrastive learning (e.g., noise contrastive estimation).
  - Preference- or utility-based training (e.g., RLHF).
  - Risk minimization (optimizing a downstream task loss).
  - Scheduled sampling.
  - Objectives for coverage/diversity.
  - Penalizing low entropy.
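
A minimal sketch of label smoothing and its effect on the cross-entropy loss; the function names and the example distribution are hypothetical:

```python
import numpy as np

def smooth_labels(y, V, eps=0.1):
    """One-hot target -> smoothed target: 1 - eps on the true class,
    eps / (V - 1) spread over the other V - 1 classes."""
    target = np.full(V, eps / (V - 1))
    target[y] = 1.0 - eps
    return target

def cross_entropy(target, probs):
    return -np.sum(target * np.log(probs))

V = 5
probs = np.array([0.70, 0.10, 0.10, 0.05, 0.05])  # model's predicted distribution
hard = np.eye(V)[0]              # one-hot target for class 0
soft = smooth_labels(0, V, 0.1)  # smoothed target

print(cross_entropy(hard, probs))  # loss against the hard target
print(cross_entropy(soft, probs))  # smoothed loss penalizes overconfidence
```

Because the smoothed target is never exactly one-hot, the model cannot drive its predicted entropy to zero without paying a loss penalty, which counteracts the overconfidence failure mode of plain MLE.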
## References and Tools

- Radford et al., 2018. Improving Language Understanding by Generative Pre-Training.
- Holtzman et al., 2020. The Curious Case of Neural Text Degeneration.
- Kaplan et al., 2020. Scaling Laws for Neural Language Models.
- Hugging Face TxT360: https://huggingface.co/spaces/LLM360/TxT360
- GPU Machine Learning: On-Premises vs. Cloud.