Lecture 19: The Attention Mechanism
An in-depth exploration of attention mechanisms in neural networks, from their motivation and foundational forms (hard and soft attention) to their applications in image captioning and the Transformer architecture.
The Attention Mechanism
Motivation: Why Attention?
In sequence modeling tasks like machine translation, traditional models such as RNNs or LSTMs encode an input sequence into a single fixed-length vector. This vector is then used by a decoder to generate the target sequence. However, this strategy compresses all the information into a bottleneck, making it difficult for the model to retain long-range dependencies or fine-grained contextual relationships.
Let’s consider the task of translating the English sentence “Where is the library?” into Spanish. The translation depends heavily on the word “the”, which maps to “la” in “la biblioteca”. However, if the input becomes more complex—e.g., “Where is the huge public library?”—the correct translation of “the” still needs to align with “la”, even though several words now intervene.
This raises a question: Do we really need to encode the whole sentence into one fixed vector just to translate a single word? Or can we instead selectively focus on relevant parts of the input as needed?

The attention mechanism was introduced as a way to soften this bottleneck. Rather than relying on a single fixed-length encoding, the model learns to focus on the most relevant parts of the input sequence for each step in the output.
Hard Attention
Hard attention is a mechanism where the model makes a discrete decision about where to attend at each time step. This means it chooses a single position (or subset) in the input sequence to focus on, essentially making a zero-one selection. The benefit of this approach is that it can provide interpretable decisions, which makes it attractive in explainable ML.
However, hard attention comes with a major drawback: it is non-differentiable. Because the selection is discrete, we cannot use standard backpropagation to train the model. Instead, training must rely on techniques like reinforcement learning, which are often slower and more unstable.

Soft Attention
To overcome the training difficulties of hard attention, Bahdanau et al. (2015) proposed soft attention, a differentiable alternative. Instead of selecting a single word, soft attention assigns a probability distribution over all input words, effectively computing a weighted average of their representations.
At each decoder step, the model produces an attention weight $\alpha_{ij}$ for every encoder hidden state $h_j$, reflecting its relevance to the current decoding step $i$. These weights are then used to compute a context vector $c_i$, which is a weighted combination of the encoder states.


The soft attention mechanism can be formalized in terms of the following quantities:
- $h_j$ are the encoder hidden states,
- $s_{i-1}$ is the decoder’s previous state,
- $e_{ij}$ is an alignment score indicating how well $h_j$ aligns with the output at step $i$,
- $\alpha_{ij}$ is the normalized attention weight.
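Putting these quantities together (following the formulation of Bahdanau et al., 2015), the scores are produced by a small learned alignment network $a(\cdot)$, normalized with a softmax, and used to average the encoder states:
\[e_{ij} = a(s_{i-1}, h_j), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k}\exp(e_{ik})}, \qquad c_i = \sum_{j} \alpha_{ij} h_j\]
The context vector $c_i$ is then fed to the decoder alongside $s_{i-1}$ when producing the next output token.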

Variants of Attention
Monotonic Attention
Sometimes, we know in advance that the alignment between input and output will be monotonic—that is, the output follows the input order (e.g., in speech recognition). In such cases, we can restrict the attention to move forward without revisiting earlier tokens.

Global vs. Local Attention
Another important distinction in attention types lies in the scope of attention:
- Global attention attends over all source positions. This is powerful but computationally expensive.
- Local attention focuses on a small window of positions near a predicted central point, reducing cost and often improving performance on long sequences (a rough sketch follows below).
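The following NumPy sketch illustrates the local variant in simplified form: it restricts the softmax to a window around a chosen center position. The function name, window parameterization, and raw scores are illustrative choices, not a specific paper's exact formulation.

```python
import numpy as np

def local_attention_weights(scores, center, window):
    """Softmax over scores, restricted to [center - window, center + window].

    scores : (source_len,) raw alignment scores for one decoder step
    center : int, predicted focus position in the source sequence
    window : int, half-width of the local window
    """
    masked = np.full_like(scores, -np.inf)
    lo, hi = max(0, center - window), min(len(scores), center + window + 1)
    masked[lo:hi] = scores[lo:hi]
    exp = np.exp(masked - masked[lo:hi].max())   # numerically stable softmax
    return exp / exp.sum()                       # zero weight outside the window

# Example: 10 source positions, focus near position 6 with a half-width of 2
weights = local_attention_weights(np.random.randn(10), center=6, window=2)
```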


Attention in Image Captioning
To better understand attention mechanisms, let’s look at the task of image captioning, where a model generates a textual description for a given image.
Baseline: CNN + RNN
A common pipeline uses a Convolutional Neural Network (CNN) to extract image features and an RNN (or LSTM) to decode those features into a sequence of words.

The CNN processes the input image (e.g., of size $H \times W \times 3$) into a feature vector of dimension $D$. This feature vector initializes the hidden state of the RNN, which then sequentially predicts words.
However, this approach suffers from a critical limitation: the RNN only sees the entire image once at the beginning. During decoding, it lacks the ability to focus on different regions of the image that might be relevant to different words in the caption.
Soft Attention for Image Captioning
To address this, soft attention can be applied to image captioning. Instead of encoding the entire image into one global feature, the CNN outputs a grid of spatial features—each corresponding to a different region of the image.
At each timestep, the decoder attends to this grid and computes a weighted combination of features (the context vector), allowing the model to dynamically focus on different regions when predicting different words.

Let $L$ be the number of spatial locations in the CNN feature grid and $D$ the dimensionality of each feature. At each decoder step, the model produces attention weights $\alpha \in \mathbb{R}^L$ over the locations, and the context vector is:
\[z = \sum_{l=1}^{L} \alpha_l \cdot \text{feature}_l\]
This allows the decoder to generate more precise and descriptive captions.
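A minimal NumPy sketch of this step, assuming for simplicity that the decoder state lives in the same $D$-dimensional space as the spatial features and that relevance is scored with a plain dot product (real captioning systems typically learn a small MLP scorer):

```python
import numpy as np

def spatial_attention(features, decoder_state):
    """Soft attention over a grid of CNN features.

    features      : (L, D) array, one D-dimensional vector per spatial location
    decoder_state : (D,) current decoder hidden state

    Returns the context vector z (D,) and the attention weights alpha (L,).
    """
    scores = features @ decoder_state      # (L,) relevance of each location
    scores -= scores.max()                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    z = alpha @ features                   # weighted sum over locations
    return z, alpha

# Example: a 14x14 grid of 512-dimensional features (L = 196, D = 512)
feats = np.random.randn(196, 512)
z, alpha = spatial_attention(feats, np.random.randn(512))
```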
CNNs and Hard vs. Soft Attention
Historically, CNNs were considered to operate in a hard attention-like manner during image captioning because they compress the visual scene into a single global vector, limiting spatial flexibility.
In contrast, with soft attention, we maintain a spatial grid of features and summarize it via a learned distribution.

Key difference:
- Hard attention chooses one region (e.g., $a$ or $b$) based on probabilities $p_a, p_b$, etc.
- Soft attention computes: $z = p_a a + p_b b + p_c c + p_d d$
Because the weighted average lets gradients flow through all paths, soft attention is amenable to standard backpropagation and gradient descent.
Multi-Headed Attention
In practice, using multiple attention heads—each learning to focus on different aspects of the input—has proven highly effective. This idea is a core component of the Transformer architecture.
Each head computes its own attention distribution and context vector. These are then concatenated and transformed to produce the final representation.
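A condensed NumPy sketch of this compute-concatenate-project pattern (the projection matrices are random placeholders standing in for learned parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, d_model, rng):
    """Multi-head self-attention over X of shape (seq_len, d_model)."""
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Per-head projections (random placeholders for learned weights)
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_head))   # (seq_len, seq_len) weights
        heads.append(A @ V)                      # this head's context vectors
    W_out = rng.standard_normal((num_heads * d_head, d_model))
    return np.concatenate(heads, axis=-1) @ W_out  # concatenate, then project

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 64))                 # 5 tokens, d_model = 64
out = multi_head_attention(X, num_heads=8, d_model=64, rng=rng)
```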

Applications include:
- Program synthesis (Allamanis et al., 2016): One head for copying variable names, another for generating common tokens.
- Machine translation (Vaswani et al., 2017): Independently learned heads capture different linguistic patterns.
This diversity is key to the success of the Transformer.
Self-Attention
After exploring traditional attention mechanisms in encoder-decoder architectures, we now shift focus to self-attention, a core component of modern models like the Transformer.
Self-attention, also referred to as intra-attention, is a mechanism that allows each position in a sequence to attend to all other positions in the same sequence. This enables the model to capture long-range dependencies and contextual relationships efficiently—without relying on recurrence.
What is Self-Attention?
In self-attention, every token in the input sequence is transformed by interacting with all other tokens via attention weights. The goal is to build a representation of a token that incorporates contextual information from the entire sequence.

Let the embedded inputs be $r_1, r_2, r_3, r_4$ for the sentence “Let’s represent this sentence.” For each word $r_i$, we:
- Compare it with all others using a learned compatibility (or alignment) function.
- Use the resulting scores to compute attention weights.
- Generate a new representation $r_i'$, which is a weighted combination of all representations.
- Pass this new vector through a feedforward neural network (FFNN) to obtain the output $r_i''$ (see the sketch below).
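A minimal NumPy sketch of these four steps, using plain dot products as the compatibility function and random placeholder weights for the FFNN:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
R = rng.standard_normal((4, d))           # r_1 .. r_4 for the four tokens

# 1) Compatibility score between every pair of tokens (dot product here)
scores = R @ R.T                          # (4, 4)

# 2) Attention weights via a softmax over each row
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# 3) New representation r_i' as a weighted combination of all r_j
R_prime = weights @ R                     # (4, d)

# 4) Position-wise FFNN (placeholder weights) gives the output r_i''
W1, W2 = rng.standard_normal((d, 4 * d)), rng.standard_normal((4 * d, d))
R_out = np.maximum(0, R_prime @ W1) @ W2  # ReLU hidden layer, linear output
```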
Why Use Self-Attention?
Self-attention offers several key advantages over recurrence or convolution-based methods:
- Constant path length between any two tokens. In RNNs, to connect distant tokens, information must pass through many sequential steps. Self-attention enables direct access between all pairs, improving long-range dependency modeling.
- Gating/multiplicative interactions, similar to mechanisms in LSTMs or GRUs, help regulate flow of information and improve expressivity.
- Parallelizability: Since there’s no recurrence, all tokens can be processed simultaneously (in parallel), allowing massive speedups during training.
- Potential to replace recurrent sequence modeling entirely: this was the foundational motivation behind the Transformer architecture.
Is Self-Attention Efficient?
Despite the power of self-attention, a natural concern is computational cost. Self-attention has a complexity of:
\[\mathcal{O}(\text{length}^2 \cdot \text{dim})\]
because it computes pairwise interactions for all token pairs in the sequence. Let’s compare this to alternatives:
| Layer Type | FLOPs (Floating-Point Operations) |
|---|---|
| Self-Attention | $\mathcal{O}(\text{length}^2 \cdot \text{dim})$ |
| RNN (LSTM) | $\mathcal{O}(\text{length} \cdot \text{dim}^2)$ |
| Convolution | $\mathcal{O}(\text{length} \cdot \text{dim}^2 \cdot \text{kernel width})$ |
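As an illustrative back-of-the-envelope comparison (numbers chosen purely for illustration): for a sequence of length 70 with dimension 512, self-attention costs roughly $70^2 \cdot 512 \approx 2.5$ million operations per layer, while a recurrent layer costs roughly $70 \cdot 512^2 \approx 18$ million. In general, self-attention uses fewer FLOPs whenever the sequence length is smaller than the representation dimension.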
In practice, self-attention is very efficient on GPUs/TPUs due to its parallelizability and matrix-friendly computation style.
The Transformer
Word Embeddings
The Transformer starts with word embeddings—continuous-valued vector representations of tokens. These embeddings can be initialized in various ways:
- Random embeddings: A simple approach without prior knowledge.
- TF-IDF embeddings: Weighted by term frequency-inverse document frequency.
- Pretrained embeddings: From models like Word2Vec or GloVe that capture semantic relationships across corpora.

These embeddings form the input representation, capturing static word meaning. However, static vectors lack contextual information.
Contextual Embeddings
Words can take on different meanings depending on context. In the sentence:
“The King doth wake tonight and takes his rouse…”
the word “King” must be interpreted in its Shakespearean context. Transformers update embeddings by applying self-attention, allowing representations to be refined by surrounding words.


These changes to the embedding are governed by self-attention layers, which compute context-aware representations.
Encoder Self-Attention
The encoder in the Transformer takes the input sequence and transforms it into a series of contextualized embeddings. This is achieved using self-attention, which allows every input token to attend to all others.
The attention weights are computed using the scaled dot-product attention:
\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]
Where:
- $Q$: query matrix
- $K$: key matrix
- $V$: value matrix
- $d_k$: dimensionality of the key vectors (used for scaling)
This process enables each token to gather information from the rest of the sequence, weighted by relevance.

Each encoder block contains two key sub-layers:
- A multi-head self-attention mechanism that enables learning different attention patterns across multiple subspaces.
- A position-wise feed-forward network (FFNN) applied independently to each position.
Each sub-layer is surrounded by:
- Residual connection: Adds the original input back to the sub-layer output.
- Layer normalization: Stabilizes and accelerates training.
This entire structure is repeated $N$ times to produce progressively refined token representations that incorporate increasing levels of contextual information.
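Putting these pieces together, here is a highly simplified, single-head NumPy sketch of one encoder block, with learned parameters replaced by random placeholders and multi-head attention, dropout, and learned layer-norm gains/biases omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 64, 256, 10

def layer_norm(x, eps=1e-6):
    mu, sigma = x.mean(-1, keepdims=True), x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_block(X):
    # Sub-layer 1: (single-head) self-attention + residual + layer norm
    Wq, Wk, Wv = (0.1 * rng.standard_normal((d_model, d_model)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    attn_out = softmax(Q @ K.T / np.sqrt(d_model)) @ V
    X = layer_norm(X + attn_out)              # residual connection, then norm

    # Sub-layer 2: position-wise feed-forward + residual + layer norm
    W1 = 0.1 * rng.standard_normal((d_model, d_ff))
    W2 = 0.1 * rng.standard_normal((d_ff, d_model))
    ffn_out = np.maximum(0, X @ W1) @ W2      # ReLU MLP applied to each position
    return layer_norm(X + ffn_out)

# Stack N blocks to progressively refine the token representations
X = rng.standard_normal((seq_len, d_model))
for _ in range(2):                            # N = 2 here for illustration
    X = encoder_block(X)
```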
Decoder Self-Attention
The decoder block in the Transformer uses masked self-attention to prevent peeking into future tokens. This mechanism calculates attention scores via dot products of query and key vectors, then uses softmax to generate weights for summing the value vectors.

This allows the decoder to produce output tokens sequentially, using prior outputs and the encoder’s output as context.
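A minimal NumPy sketch of the masking step: score entries for future positions are set to $-\infty$ before the softmax, so each position attends only to itself and to earlier positions.

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Decoder-style self-attention: no peeking at future tokens."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len)
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1s above the diagonal
    scores = np.where(mask == 1, -np.inf, scores)      # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                 # weighted sum of values

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((6, 32))               # 6 tokens, d_k = 32
out = masked_self_attention(Q, K, V)
```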
Transformer Tricks
Key design elements that make the Transformer effective:
- Self-Attention: Enables each token to attend to every other token, capturing long-range dependencies.
- Multi-Head Attention: Allows the model to learn various attention patterns in parallel.
- Scaled Dot-Product Attention: Prevents scale issues by dividing dot products by $\sqrt{d_k}$.
- Positional Encoding: Injects order information to compensate for lack of recurrence or convolution.
Positional Encodings
Positional encodings (PEs) enable Transformers to model sequence order. In the original Transformer (Vaswani et al., 2017), they take the form of fixed sinusoidal functions of token position that are added to the word embeddings.
However, newer research suggests that Transformers can still learn positional information even without explicit encodings.

This raises interesting questions about inductive biases and how much structure Transformers can learn from data alone.
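For concreteness, a minimal NumPy sketch of the fixed sinusoidal encodings used in the original Transformer, which are added to the word embeddings before the first layer:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2k]   = sin(pos / 10000^(2k / d_model))
       PE[pos, 2k+1] = cos(pos / 10000^(2k / d_model))"""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

# A 50-token sequence with 64-dimensional embeddings
pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)
```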