The attention mechanism is a fundamental component of modern neural networks, particularly in natural language processing and computer vision. It allows models to focus on relevant parts of the input when processing information.
This interactive demo visualizes how the attention mechanism works.
The demo uses causal masking (lower triangular) typical in language models, where tokens can only attend to previous positions.
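A causal (lower-triangular) mask can be sketched as follows; the sequence length and scores here are illustrative stand-ins, not values from the demo:

```python
import numpy as np

seq_len = 4

# Lower-triangular causal mask: position i may attend only to positions j <= i.
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Masked-out scores are set to -inf so the softmax assigns them zero weight.
scores = np.random.randn(seq_len, seq_len)
scores = np.where(mask, scores, -np.inf)
```

Setting masked positions to negative infinity (rather than zero) is the standard trick: after the softmax, those positions contribute exactly zero attention weight.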
The attention mechanism computes a weighted average of value vectors, where weights are determined by the compatibility between queries and keys:
Step-by-step process: the input embeddings are projected into queries, keys, and values; scaled compatibility scores between queries and keys are normalized with a softmax; and the output is the resulting weighted sum of the values. Each step is detailed below.

Why scale by $\sqrt{d_k}$? Without scaling, the dot products grow with the key dimension, which would push the softmax into regions with extremely small gradients.
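The effect of scaling can be seen numerically. A quick sketch (the dimension and number of keys are chosen arbitrarily): unscaled dot products of random unit-variance vectors have magnitude on the order of $\sqrt{d_k}$, which saturates the softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512

# Dot products of random unit-variance vectors have variance ~ d_k.
q = rng.standard_normal(d_k)
k = rng.standard_normal((8, d_k))

raw = k @ q                   # typical magnitude ~ sqrt(d_k)
scaled = raw / np.sqrt(d_k)   # typical magnitude ~ 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Unscaled scores concentrate nearly all probability mass on one position,
# leaving near-zero gradients elsewhere; scaled scores stay better spread.
print(softmax(raw).max(), softmax(scaled).max())
```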
Query, Key, Value Transformations: The input embeddings are transformed into three different representations using learned weight matrices: $Q = XW_Q$, $K = XW_K$, $V = XW_V$ where $X$ is the input embedding matrix. These transformations allow each token to simultaneously ask for information (Query), offer information (Key), and contain information (Value).
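These three projections can be sketched as follows (the shapes are illustrative, and the random matrices stand in for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

X = rng.standard_normal((seq_len, d_model))  # input embedding matrix

# Learned weight matrices (random stand-ins here).
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

# Q = X W_Q, K = X W_K, V = X W_V -- each row is one token's query/key/value.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
```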
Attention Weights: $A = \operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$, where each row sums to 1. Compatibility scores between queries and keys are computed with the scaled dot product; the factor $\sqrt{d_k}$ keeps the scores from growing with the key dimension, which would otherwise push the softmax into regions with extremely small gradients. The row-wise softmax then normalizes the scores into attention weights that form a probability distribution over positions for each query.
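Computing the weight matrix $A$ with the causal mask applied can be sketched like this (shapes are arbitrary; $Q$ and $K$ are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))

scores = Q @ K.T / np.sqrt(d_k)           # scaled compatibility scores

mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.where(mask, scores, -np.inf)  # causal mask

# Row-wise softmax (subtracting the row max for numerical stability):
# each row of A sums to 1.
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)
```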
Output Computation: $O = AV$. The final output is a weighted combination of value vectors: each output position receives a weighted sum of all value vectors, with the weights given by the attention scores. This allows the model to dynamically aggregate information from relevant positions in the input sequence based on learned patterns of relevance.
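Putting the three steps together, a minimal single-head sketch (not the demo's actual code; weights are random stand-ins for learned parameters):

```python
import numpy as np

def attention(X, W_Q, W_K, W_V, causal=True):
    """Single-head scaled dot-product attention: O = softmax(QK^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        n = scores.shape[0]
        scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V  # weighted combination of value vectors

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
W = [rng.standard_normal((8, 8)) for _ in range(3)]
O = attention(X, *W)
```

With the causal mask, the first position can only attend to itself, so its output is exactly its own value vector.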