Attention Mechanism

The attention mechanism is a fundamental component of modern neural networks, particularly in natural language processing and computer vision. It allows models to focus on relevant parts of the input when processing information.

This interactive demo visualizes how attention works with:

  • Query (Q): What each token "asks for"
  • Key (K): What each token "offers"
  • Value (V): The actual information content
  • Attention Weights: How much focus each token receives

The demo uses causal masking (lower triangular), typical of decoder-style language models, where each token can attend only to itself and earlier positions.
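
As a rough illustration (not taken from the demo's own code; the variable names are illustrative), a causal mask is just a lower-triangular matrix of allowed positions, and the disallowed entries are later set to negative infinity before the softmax:

    import numpy as np

    seq_len = 3  # e.g. the three tokens "The", "cat", "sat"

    # Row i may attend to columns 0..i (itself and all earlier tokens)
    causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    print(causal_mask)
    # [[ True False False]
    #  [ True  True False]
    #  [ True  True  True]]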

Mathematical Foundation

The attention mechanism computes a weighted average of value vectors, where weights are determined by the compatibility between queries and keys:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Step-by-step process:

  1. Linear Transformations: Q = XW_Q, K = XW_K, V = XW_V
  2. Compute Scores: S = QK^T / sqrt(d_k) (scaled dot product)
  3. Apply Mask: Set scores at future positions to -∞ for causal attention
  4. Normalize: A = softmax(S), applied row-wise so the attention weights in each row sum to 1
  5. Weighted Sum: O = AV (final output) - the full pipeline is sketched in code below
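
A minimal NumPy sketch of these five steps, using the demo's fixed dimensions (d_model = 4, d_k = 4), a three-token sequence, and random weights; the variable names are illustrative rather than taken from the demo's source:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_k, seq_len = 4, 4, 3

    X = rng.normal(size=(seq_len, d_model))  # input embeddings, one row per token
    W_Q = rng.normal(size=(d_model, d_k))    # learned in practice; random here, as in the demo
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_k))

    # 1. Linear transformations
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V

    # 2. Scaled dot-product scores
    S = Q @ K.T / np.sqrt(d_k)

    # 3. Causal mask: future positions get -inf
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    S = np.where(mask, S, -np.inf)

    # 4. Row-wise softmax -> attention weights
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)

    # 5. Weighted sum of value vectors
    O = A @ V
    print(A)  # lower triangular, each row sums to 1
    print(O)  # one output embedding per input token

The arrays X, Q, K, V, A, and O correspond to the matrices shown in the interactive panels below.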

Why scale by sqrt(d_k)? The scaling prevents the dot products from becoming too large, which would otherwise push the softmax into regions with extremely small gradients.
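
A quick illustrative check (not part of the demo) of why the scaling matters: for random unit-variance vectors the dot product has variance of roughly d_k, so without the division the softmax collapses onto a single position as d_k grows:

    import numpy as np

    rng = np.random.default_rng(1)

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    for d_k in (4, 64, 1024):
        q, k = rng.normal(size=(2, 8, d_k))  # 8 random query/key pairs
        scores = (q * k).sum(axis=-1)        # unscaled dot products, variance ~ d_k
        print(d_k, np.round(softmax(scores), 3))                 # grows ever more peaked
        print(d_k, np.round(softmax(scores / np.sqrt(d_k)), 3))  # scaled: stays spread out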

How to Use This Demo

  • Select a sequence from the dropdown to see how attention processes different inputs
  • Hover over input tokens to see their corresponding Q, K, V vectors highlighted in the matrices
  • Hover over output embeddings to trace how they were computed from attention weights and values
  • Click "Randomize Weights" to see different random weight initializations and their effects
  • Observe the causal mask in the attention weights matrix - notice how future positions are masked out

Color Coding:

  • Blue (Q): Query vectors - what each token "asks for"
  • Green (K): Key vectors - what each token "offers"
  • Red (V): Value vectors - information each token contains
  • Light Blue (O): Output embeddings - final attention results

Understanding the Visualization

Educational Notes:

  • This demo shows untrained attention with random initialization
  • It's designed to build visual intuition for attention mechanics, not realistic outputs
  • In real applications, these weights are learned through training
  • Fixed dimensions (d_model = 4, d_k = 4) allow for clear visualization

Key Insights:

  • Notice how the attention weights matrix is lower triangular due to causal masking
  • Each row in the attention weights sums to 1.0 (softmax normalization) - this and the lower-triangular structure are checked in the snippet below
  • Different weight initializations can dramatically change the attention patterns
  • The output is always a weighted combination of the value vectors
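
A small self-contained check of the two structural properties above, built the same way as the earlier sketch (random Q and K, so the specific numbers are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d_k = 3, 4
    Q, K = rng.normal(size=(2, n, d_k))

    S = Q @ K.T / np.sqrt(d_k)
    S = np.where(np.tril(np.ones((n, n), dtype=bool)), S, -np.inf)
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)

    assert np.allclose(A.sum(axis=-1), 1.0)  # every row sums to 1
    assert np.allclose(A, np.tril(A))        # future positions carry zero weight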

Real-world Applications:

  • Language models (GPT, BERT) use multi-head attention - several attention operations running in parallel, as sketched after this list
  • Machine translation systems rely heavily on attention
  • Computer vision transformers use attention for image processing
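
For context on the multi-head attention mentioned above, here is a hedged sketch (not part of this demo; the function and parameter names are assumptions): the model dimension is split across several heads, each head runs the same scaled dot-product attention independently, and the head outputs are concatenated and mixed by an output projection. The causal mask is omitted for brevity.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
        n, d_model = X.shape
        d_head = d_model // n_heads
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V               # (n, d_model) each
        # Split into heads: (n_heads, n, d_head)
        Q, K, V = (M.reshape(n, n_heads, d_head).transpose(1, 0, 2) for M in (Q, K, V))
        S = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # per-head scores
        O = softmax(S) @ V                                # per-head outputs
        O = O.transpose(1, 0, 2).reshape(n, d_model)      # concatenate the heads
        return O @ W_O                                    # final output projection

    rng = np.random.default_rng(0)
    X = rng.normal(size=(3, 8))                           # 3 tokens, d_model = 8
    W = [rng.normal(size=(8, 8)) for _ in range(4)]
    print(multi_head_attention(X, *W, n_heads=2).shape)   # (3, 8)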

Input Sequence & Query, Key, Value Matrices

[Interactive panel: the input tokens "The", "cat", "sat" with their 4-dimensional embeddings, shown alongside the corresponding Query (Q), Key (K), and Value (V) matrices]
Q = XW_Q, K = XW_K, V = XW_V, where X is the input embedding matrix

Attention Weights, Values & Output

[Interactive panel: the attention weights matrix (A), the value vectors (V), and the resulting output embeddings (O) for the tokens "The", "cat", "sat"]
Attention weights: A = softmax(QK^T / sqrt(d_k)), where each row sums to 1
Output: O = AV - a weighted combination of the value vectors