Lecture 19: The Attention Mechanism
An in-depth exploration of attention mechanisms in neural networks, from their motivation and foundational forms (hard and soft attention) to their applications in image captioning and the Transformer architecture.
The Attention Mechanism
Motivation: Why Attention?
In sequence modeling tasks like machine translation, traditional models such as RNNs or LSTMs encode an input sequence into a single fixed-length vector. This vector is then used by a decoder to generate the target sequence. However, this strategy compresses all the information into a bottleneck, making it difficult for the model to retain long-range dependencies or fine-grained contextual relationships.
Let’s consider the task of translating the English sentence “Where is the library?” into Spanish. The translation depends heavily on the word “the”, which maps to “la” in “la biblioteca”. However, if the input becomes more complex—e.g., “Where is the huge public library?”—the correct translation of “the” still needs to align with “la”, even though several words now intervene.
This raises a question: Do we really need to encode the whole sentence into one fixed vector just to translate a single word? Or can we instead selectively focus on relevant parts of the input as needed?

The attention mechanism was introduced as a way to soften this bottleneck. Rather than relying on a single fixed-length encoding, the model learns to focus on the most relevant parts of the input sequence for each step in the output.
Hard Attention
Hard attention is a mechanism where the model makes a discrete decision about where to attend at each time step. This means it chooses a single position (or subset) in the input sequence to focus on, essentially making a zero-one selection. The benefit of this approach is that it can provide interpretable decisions, which makes it attractive in explainable ML.
However, hard attention comes with a major drawback: it is non-differentiable. Because the selection is discrete, we cannot use standard backpropagation to train the model. Instead, training must rely on techniques like reinforcement learning, which are often slower and more unstable.

Soft Attention
To overcome the training difficulties of hard attention, Bahdanau et al. (2015) proposed soft attention, a differentiable alternative. Instead of selecting a single word, soft attention assigns a probability distribution over all input words, effectively computing a weighted average of their representations.
At each decoder step, the model produces an attention weight $\alpha_{ij}$ for every encoder hidden state $h_j$, reflecting its relevance to the current decoding step $i$. These weights are then used to compute a context vector $c_i$, which is a weighted combination of the encoder states.


The soft attention mechanism can be formalized in terms of the following quantities:
- $h_j$ are the encoder hidden states,
- $s_{i-1}$ is the decoder’s previous state,
- $e_{ij}$ is an alignment score indicating how well $h_j$ aligns with the output at step $i$,
- $\alpha_{ij}$ is the normalized attention weight.
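Putting these quantities together (following the formulation of Bahdanau et al., 2015), the scores are produced by a small learned alignment network $a(\cdot)$, normalized with a softmax, and used to average the encoder states:
\[e_{ij} = a(s_{i-1}, h_j), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k}\exp(e_{ik})}, \qquad c_i = \sum_{j} \alpha_{ij} h_j\]
The context vector $c_i$ is then fed to the decoder alongside $s_{i-1}$ when producing the next output token.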

Variants of Attention
Monotonic Attention
Sometimes, we know in advance that the alignment between input and output will be monotonic—that is, the output follows the input order (e.g., in speech recognition). In such cases, we can restrict the attention to move forward without revisiting earlier tokens.

Global vs. Local Attention
Another important distinction in attention types lies in the scope of attention:
- Global attention attends over all source positions. This is powerful but computationally expensive.
- Local attention focuses on a small window of positions near a predicted central point, reducing cost and often improving performance on long sequences (a rough sketch follows below).
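The following NumPy sketch illustrates the local variant in simplified form: it restricts the softmax to a window around a chosen center position. The function name, window parameterization, and raw scores are illustrative choices, not a specific paper's exact formulation.

```python
import numpy as np

def local_attention_weights(scores, center, window):
    """Softmax over scores, restricted to [center - window, center + window].

    scores : (source_len,) raw alignment scores for one decoder step
    center : int, predicted focus position in the source sequence
    window : int, half-width of the local window
    """
    masked = np.full_like(scores, -np.inf)
    lo, hi = max(0, center - window), min(len(scores), center + window + 1)
    masked[lo:hi] = scores[lo:hi]
    exp = np.exp(masked - masked[lo:hi].max())   # numerically stable softmax
    return exp / exp.sum()                       # zero weight outside the window

# Example: 10 source positions, focus near position 6 with a half-width of 2
weights = local_attention_weights(np.random.randn(10), center=6, window=2)
```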


Attention in Image Captioning
To better understand attention mechanisms, let’s look at the task of image captioning, where a model generates a textual description for a given image.
Baseline: CNN + RNN
A common pipeline uses a Convolutional Neural Network (CNN) to extract image features and an RNN (or LSTM) to decode those features into a sequence of words.

The CNN processes the input image (e.g., of size $H \times W \times 3$) into a feature vector of dimension $D$. This feature vector initializes the hidden state of the RNN, which then sequentially predicts words.
However, this approach suffers from a critical limitation: the RNN only sees the entire image once at the beginning. During decoding, it lacks the ability to focus on different regions of the image that might be relevant to different words in the caption.
Soft Attention for Image Captioning
To address this, soft attention can be applied to image captioning. Instead of encoding the entire image into one global feature, the CNN outputs a grid of spatial features—each corresponding to a different region of the image.
At each timestep, the decoder attends to this grid and computes a weighted combination of features (the context vector), allowing the model to dynamically focus on different regions when predicting different words.

Let $L$ be the number of spatial locations in the CNN feature grid and $D$ the dimensionality of each feature. At each decoder step, the model produces attention weights $\alpha \in \mathbb{R}^L$ over the locations, and the context vector is:
\[z = \sum_{l=1}^{L} \alpha_l \cdot \text{feature}_l\]
This allows the decoder to generate more precise and descriptive captions.
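A minimal NumPy sketch of this step, assuming for simplicity that the decoder state lives in the same $D$-dimensional space as the spatial features and that relevance is scored with a plain dot product (real captioning systems typically learn a small MLP scorer):

```python
import numpy as np

def spatial_attention(features, decoder_state):
    """Soft attention over a grid of CNN features.

    features      : (L, D) array, one D-dimensional vector per spatial location
    decoder_state : (D,) current decoder hidden state

    Returns the context vector z (D,) and the attention weights alpha (L,).
    """
    scores = features @ decoder_state      # (L,) relevance of each location
    scores -= scores.max()                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    z = alpha @ features                   # weighted sum over locations
    return z, alpha

# Example: a 14x14 grid of 512-dimensional features (L = 196, D = 512)
feats = np.random.randn(196, 512)
z, alpha = spatial_attention(feats, np.random.randn(512))
```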
CNNs and Hard vs. Soft Attention
Historically, CNNs were considered to operate in a hard attention-like manner during image captioning because they compress the visual scene into a single global vector, limiting spatial flexibility.
In contrast, with soft attention, we maintain a spatial grid of features and summarize it via a learned distribution.

Key difference:
- Hard attention chooses one region (e.g., $a$ or $b$) based on probabilities $p_a, p_b$, etc.
- Soft attention computes: $z = p_a a + p_b b + p_c c + p_d d$
Because the weighted average lets gradients flow through all paths, soft attention is amenable to standard backpropagation and gradient descent.
Multi-Headed Attention
In practice, using multiple attention heads—each learning to focus on different aspects of the input—has proven highly effective. This idea is a core component of the Transformer architecture.
Each head computes its own attention distribution and context vector. These are then concatenated and transformed to produce the final representation.
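A condensed NumPy sketch of this compute-concatenate-project pattern (the projection matrices are random placeholders standing in for learned parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, d_model, rng):
    """Multi-head self-attention over X of shape (seq_len, d_model)."""
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Per-head projections (random placeholders for learned weights)
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_head))   # (seq_len, seq_len) weights
        heads.append(A @ V)                      # this head's context vectors
    W_out = rng.standard_normal((num_heads * d_head, d_model))
    return np.concatenate(heads, axis=-1) @ W_out  # concatenate, then project

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 64))                 # 5 tokens, d_model = 64
out = multi_head_attention(X, num_heads=8, d_model=64, rng=rng)
```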

Applications include:
- Program synthesis (Allamanis et al., 2016): One head for copying variable names, another for generating common tokens.
- Machine translation (Vaswani et al., 2017): Independently learned heads capture different linguistic patterns.
This diversity is key to the success of the Transformer.
Self-Attention
After exploring traditional attention mechanisms in encoder-decoder architectures, we now shift focus to self-attention, a core component of modern models like the Transformer.
Self-attention, also referred to as intra-attention, is a mechanism that allows each position in a sequence to attend to all other positions in the same sequence. This enables the model to capture long-range dependencies and contextual relationships efficiently—without relying on recurrence.
What is Self-Attention?
In self-attention, every token in the input sequence is transformed by interacting with all other tokens via attention weights. The goal is to build a representation of a token that incorporates contextual information from the entire sequence.

Let the embedded inputs be $r_1, r_2, r_3, r_4$ for the sentence “Let’s represent this sentence.” For each word $r_i$, we:
- Compare it with all others using a learned compatibility (or alignment) function.
- Use the resulting scores to compute attention weights.
- Generate a new representation $r_i'$, which is a weighted combination of all representations.
- Pass this new vector through a feedforward neural network (FFNN) to obtain the output $r_i''$ (see the sketch below).
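A minimal NumPy sketch of these four steps, using plain dot products as the compatibility function and random placeholder weights for the FFNN:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
R = rng.standard_normal((4, d))           # r_1 .. r_4 for the four tokens

# 1) Compatibility score between every pair of tokens (dot product here)
scores = R @ R.T                          # (4, 4)

# 2) Attention weights via a softmax over each row
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# 3) New representation r_i' as a weighted combination of all r_j
R_prime = weights @ R                     # (4, d)

# 4) Position-wise FFNN (placeholder weights) gives the output r_i''
W1, W2 = rng.standard_normal((d, 4 * d)), rng.standard_normal((4 * d, d))
R_out = np.maximum(0, R_prime @ W1) @ W2  # ReLU hidden layer, linear output
```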
Why Use Self-Attention?
Self-attention offers several key advantages over recurrence or convolution-based methods:
- Constant path length between any two tokens. In RNNs, to connect distant tokens, information must pass through many sequential steps. Self-attention enables direct access between all pairs, improving long-range dependency modeling.
- Gating/multiplicative interactions, similar to mechanisms in LSTMs or GRUs, help regulate flow of information and improve expressivity.
- Parallelizability: Since there’s no recurrence, all tokens can be processed simultaneously (in parallel), allowing massive speedups during training.
- Potential to replace recurrent sequence modeling entirely: this was the foundational motivation behind the Transformer architecture.
Is Self-Attention Efficient?
Despite the power of self-attention, a natural concern is computational cost. Self-attention has a complexity of:
\[\mathcal{O}(\text{length}^2 \cdot \text{dim})\]
because it computes pairwise interactions for all token pairs in the sequence. Let’s compare this to alternatives:
| Layer Type | FLOPs (Floating-Point Operations) |
|---|---|
| Self-Attention | $\mathcal{O}(\text{length}^2 \cdot \text{dim})$ |
| RNN (LSTM) | $\mathcal{O}(\text{length} \cdot \text{dim}^2)$ |
| Convolution | $\mathcal{O}(\text{length} \cdot \text{dim}^2 \cdot \text{kernel width})$ |
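As an illustrative back-of-the-envelope comparison (numbers chosen purely for illustration): for a sequence of length 70 with dimension 512, self-attention costs roughly $70^2 \cdot 512 \approx 2.5$ million operations per layer, while a recurrent layer costs roughly $70 \cdot 512^2 \approx 18$ million. In general, self-attention uses fewer FLOPs whenever the sequence length is smaller than the representation dimension.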
In practice, self-attention is very efficient on GPUs/TPUs due to its parallelizability and matrix-friendly computation style.
The Transformer
Word Embeddings
The Transformer starts with word embeddings—continuous-valued vector representations of tokens. These embeddings can be initialized in various ways:
- Random embeddings: A simple approach without prior knowledge.
- TF-IDF embeddings: Weighted by term frequency-inverse document frequency.
- Pretrained embeddings: From models like Word2Vec or GloVe that capture semantic relationships across corpora.

These embeddings form the input representation, capturing static word meaning. However, static vectors lack contextual information.
Contextual Embeddings
Words can take on different meanings depending on context. In the sentence:
“The King doth wake tonight and takes his rouse…”
the word “King” must be interpreted in its Shakespearean context. Transformers update embeddings by applying self-attention, allowing representations to be refined by surrounding words.


These changes to the embedding are governed by self-attention layers, which compute context-aware representations.
Encoder Self-Attention
The encoder in the Transformer takes the input sequence and transforms it into a series of contextualized embeddings. This is achieved using self-attention, which allows every input token to attend to all others.
The attention weights are computed using the scaled dot-product attention:
\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]
Where:
- $Q$: query matrix
- $K$: key matrix
- $V$: value matrix
- $d_k$: dimensionality of the key vectors (used for scaling)
This process enables each token to gather information from the rest of the sequence, weighted by relevance.

Each encoder block contains two key sub-layers:
- A multi-head self-attention mechanism that enables learning different attention patterns across multiple subspaces.
- A position-wise feed-forward network (FFNN) applied independently to each position.
Each sub-layer is surrounded by:
- Residual connection: Adds the original input back to the sub-layer output.
- Layer normalization: Stabilizes and accelerates training.
This entire structure is repeated $N$ times to produce progressively refined token representations that incorporate increasing levels of contextual information.
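Putting these pieces together, here is a highly simplified, single-head NumPy sketch of one encoder block, with learned parameters replaced by random placeholders and multi-head attention, dropout, and learned layer-norm gains/biases omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 64, 256, 10

def layer_norm(x, eps=1e-6):
    mu, sigma = x.mean(-1, keepdims=True), x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_block(X):
    # Sub-layer 1: (single-head) self-attention + residual + layer norm
    Wq, Wk, Wv = (0.1 * rng.standard_normal((d_model, d_model)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    attn_out = softmax(Q @ K.T / np.sqrt(d_model)) @ V
    X = layer_norm(X + attn_out)              # residual connection, then norm

    # Sub-layer 2: position-wise feed-forward + residual + layer norm
    W1 = 0.1 * rng.standard_normal((d_model, d_ff))
    W2 = 0.1 * rng.standard_normal((d_ff, d_model))
    ffn_out = np.maximum(0, X @ W1) @ W2      # ReLU MLP applied to each position
    return layer_norm(X + ffn_out)

# Stack N blocks to progressively refine the token representations
X = rng.standard_normal((seq_len, d_model))
for _ in range(2):                            # N = 2 here for illustration
    X = encoder_block(X)
```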
Decoder Self-Attention
The decoder block in the Transformer uses masked self-attention to prevent peeking into future tokens. This mechanism calculates attention scores via dot products of query and key vectors, then uses softmax to generate weights for summing the value vectors.

This allows the decoder to produce output tokens sequentially, using prior outputs and the encoder’s output as context.
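A minimal NumPy sketch of the masking step: score entries for future positions are set to $-\infty$ before the softmax, so each position attends only to itself and to earlier positions.

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Decoder-style self-attention: no peeking at future tokens."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len)
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1s above the diagonal
    scores = np.where(mask == 1, -np.inf, scores)      # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                 # weighted sum of values

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((6, 32))               # 6 tokens, d_k = 32
out = masked_self_attention(Q, K, V)
```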
Transformer Tricks
Key design elements that make the Transformer effective:
- Self-Attention: Enables each token to attend to every other token, capturing long-range dependencies.
- Multi-Head Attention: Allows the model to learn various attention patterns in parallel.
- Scaled Dot-Product Attention: Prevents scale issues by dividing dot products by $\sqrt{d_k}$.
- Positional Encoding: Injects order information to compensate for lack of recurrence or convolution.
Positional Encodings
Positional encodings (PEs) enable Transformers to model sequence order. In the original Transformer (Vaswani et al., 2017), they take the form of fixed sinusoidal functions of token position that are added to the word embeddings.
However, newer research suggests that Transformers can still learn positional information even without explicit encodings.

This raises interesting questions about inductive biases and how much structure Transformers can learn from data alone.
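For concreteness, a minimal NumPy sketch of the fixed sinusoidal encodings used in the original Transformer, which are added to the word embeddings before the first layer:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2k]   = sin(pos / 10000^(2k / d_model))
       PE[pos, 2k+1] = cos(pos / 10000^(2k / d_model))"""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

# A 50-token sequence with 64-dimensional embeddings
pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)
```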