Attention Mechanism

The attention mechanism is a fundamental component of modern neural networks, particularly in natural language processing and computer vision. It allows models to focus on relevant parts of the input when processing information.

This interactive demo visualizes how attention works with:

  • Query (Q): What each token "asks for"
  • Key (K): What each token "offers"
  • Value (V): The actual information content
  • Attention Weights: How much focus each token receives

The demo uses causal masking (lower triangular), typical of decoder-style language models, where each token can attend only to itself and earlier positions.
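
As a rough illustration (not taken from the demo's own code; the variable names are illustrative), a causal mask is just a lower-triangular matrix of allowed positions, and the disallowed entries are later set to negative infinity before the softmax:

    import numpy as np

    seq_len = 3  # e.g. the three tokens "The", "cat", "sat"

    # Row i may attend to columns 0..i (itself and all earlier tokens)
    causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    print(causal_mask)
    # [[ True False False]
    #  [ True  True False]
    #  [ True  True  True]]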

Mathematical Foundation

The attention mechanism computes a weighted average of value vectors, where weights are determined by the compatibility between queries and keys:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Step-by-step process:

  1. Linear Transformations: Q = XW_Q, K = XW_K, V = XW_V
  2. Compute Scores: S = QK^T / sqrt(d_k) (scaled dot product)
  3. Apply Mask: Set scores at future positions to -∞ for causal attention
  4. Normalize: A = softmax(S), applied row-wise so the attention weights in each row sum to 1
  5. Weighted Sum: O = AV (final output) - the full pipeline is sketched in code below
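
A minimal NumPy sketch of these five steps, using the demo's fixed dimensions (d_model = 4, d_k = 4), a three-token sequence, and random weights; the variable names are illustrative rather than taken from the demo's source:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_k, seq_len = 4, 4, 3

    X = rng.normal(size=(seq_len, d_model))  # input embeddings, one row per token
    W_Q = rng.normal(size=(d_model, d_k))    # learned in practice; random here, as in the demo
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_k))

    # 1. Linear transformations
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V

    # 2. Scaled dot-product scores
    S = Q @ K.T / np.sqrt(d_k)

    # 3. Causal mask: future positions get -inf
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    S = np.where(mask, S, -np.inf)

    # 4. Row-wise softmax -> attention weights
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)

    # 5. Weighted sum of value vectors
    O = A @ V
    print(A)  # lower triangular, each row sums to 1
    print(O)  # one output embedding per input token

The arrays X, Q, K, V, A, and O correspond to the matrices shown in the interactive panels below.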

Why scale by sqrt(d_k)? The scaling prevents the dot products from becoming too large, which would otherwise push the softmax into regions with extremely small gradients.
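
A quick illustrative check (not part of the demo) of why the scaling matters: for random unit-variance vectors the dot product has variance of roughly d_k, so without the division the softmax collapses onto a single position as d_k grows:

    import numpy as np

    rng = np.random.default_rng(1)

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    for d_k in (4, 64, 1024):
        q, k = rng.normal(size=(2, 8, d_k))  # 8 random query/key pairs
        scores = (q * k).sum(axis=-1)        # unscaled dot products, variance ~ d_k
        print(d_k, np.round(softmax(scores), 3))                 # grows ever more peaked
        print(d_k, np.round(softmax(scores / np.sqrt(d_k)), 3))  # scaled: stays spread out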

How to Use This Demo

  • Select a sequence from the dropdown to see how attention processes different inputs
  • Hover over input tokens to see their corresponding Q, K, V vectors highlighted in the matrices
  • Hover over output embeddings to trace how they were computed from attention weights and values
  • Click "Randomize Weights" to see different random weight initializations and their effects
  • Observe the causal mask in the attention weights matrix - notice how future positions are masked out

Color Coding:

  • Blue (Q): Query vectors - what each token "asks for"
  • Green (K): Key vectors - what each token "offers"
  • Red (V): Value vectors - information each token contains
  • Light Blue (O): Output embeddings - final attention results

Understanding the Visualization

Educational Notes:

  • This demo shows untrained attention with random initialization
  • It's designed to build visual intuition for attention mechanics, not realistic outputs
  • In real applications, these weights are learned through training
  • Fixed dimensions (d_model = 4, d_k = 4) allow for clear visualization

Key Insights:

  • Notice how the attention weights matrix is lower triangular due to causal masking
  • Each row in the attention weights sums to 1.0 (softmax normalization) - this and the lower-triangular structure are checked in the snippet below
  • Different weight initializations can dramatically change the attention patterns
  • The output is always a weighted combination of the value vectors
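
A small self-contained check of the two structural properties above, built the same way as the earlier sketch (random Q and K, so the specific numbers are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d_k = 3, 4
    Q, K = rng.normal(size=(2, n, d_k))

    S = Q @ K.T / np.sqrt(d_k)
    S = np.where(np.tril(np.ones((n, n), dtype=bool)), S, -np.inf)
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)

    assert np.allclose(A.sum(axis=-1), 1.0)  # every row sums to 1
    assert np.allclose(A, np.tril(A))        # future positions carry zero weight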

Real-world Applications:

  • Language models (GPT, BERT) use multi-head attention - several attention operations running in parallel, as sketched after this list
  • Machine translation systems rely heavily on attention
  • Computer vision transformers use attention for image processing
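
For context on the multi-head attention mentioned above, here is a hedged sketch (not part of this demo; the function and parameter names are assumptions): the model dimension is split across several heads, each head runs the same scaled dot-product attention independently, and the head outputs are concatenated and mixed by an output projection. The causal mask is omitted for brevity.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
        n, d_model = X.shape
        d_head = d_model // n_heads
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V               # (n, d_model) each
        # Split into heads: (n_heads, n, d_head)
        Q, K, V = (M.reshape(n, n_heads, d_head).transpose(1, 0, 2) for M in (Q, K, V))
        S = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # per-head scores
        O = softmax(S) @ V                                # per-head outputs
        O = O.transpose(1, 0, 2).reshape(n, d_model)      # concatenate the heads
        return O @ W_O                                    # final output projection

    rng = np.random.default_rng(0)
    X = rng.normal(size=(3, 8))                           # 3 tokens, d_model = 8
    W = [rng.normal(size=(8, 8)) for _ in range(4)]
    print(multi_head_attention(X, *W, n_heads=2).shape)   # (3, 8)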

Input Sequence & Query, Key, Value Matrices

[Interactive panel: the input tokens "The", "cat", "sat" with their 4-dimensional embeddings, shown alongside the corresponding Query (Q), Key (K), and Value (V) matrices]
Q = XW_Q, K = XW_K, V = XW_V, where X is the input embedding matrix

Attention Weights, Values & Output

[Interactive panel: the attention weights matrix (A), the value vectors (V), and the resulting output embeddings (O) for the tokens "The", "cat", "sat"]
Attention weights: A = softmax(QK^T / sqrt(d_k)), where each row sums to 1
Output: O = AV - a weighted combination of the value vectors