Instructions:
• Select a sequence from the dropdown to see how attention processes different inputs
• Hover over input tokens to see their corresponding Q, K, V vectors highlighted
• Hover over output embeddings to trace how they were computed from attention weights and values
• Click "Randomize Weights" to see different random weight initializations
• All computations follow Attention(Q, K, V) = softmax(QK^T / √d_k) V (see the sketch below)
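
A minimal NumPy sketch of that formula (the function name is illustrative, and the causal mask used by the demo is omitted here; a mask sketch appears further below):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, without any masking."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq, seq) dot-product scores
    scores -= scores.max(axis=-1, keepdims=True)      # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax: each row sums to 1
    return weights @ V, weights                       # output O = AV and the weights A
```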

Educational Purpose:
• This demo shows untrained attention with random initialization
• Designed to build visual intuition for attention mechanics, not realistic outputs
• Uses causal masking (lower triangular), as is typical in language models; a mask sketch follows this list
• Fixed dimensions: d_model = 4, d_k = 4 for clear visualization
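
A sketch of that lower-triangular mask applied to a score matrix before the softmax, assuming NumPy and the demo's 3-token, d_k = 4 setup (the random scores here are placeholders):

```python
import numpy as np

seq_len, d_k = 3, 4                                   # 3 tokens, d_k = 4 as in the demo
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len)) / np.sqrt(d_k)

# Causal (lower-triangular) mask: token i may only attend to tokens 0..i
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.where(mask, scores, -np.inf)              # blocked positions become -inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
# weights[0] is [1, 0, 0]: the first token can only attend to itself
```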

Visualization:
Blue (Q): Query vectors - what each token "asks for"
Green (K): Key vectors - what each token "offers"
Red (V): Value vectors - information each token contains
Attention Weights: Shows which tokens attend to which (causal mask applied)
Light Blue (O): Output embeddings - final attention results

Input Sequence & Query, Key, Value Matrices

Input Tokens (X)
The  [ 0.92,  0.68,  0.34,  0.28]
cat  [-0.44,  0.15, -0.42,  0.97]
sat  [-0.42,  0.80,  0.14, -0.89]
Query (Q), Key (K), Value (V) matrices
Q = XW_Q, K = XW_K, V = XW_V, where X is the input embedding matrix
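
A sketch of that projection step using the three input embeddings shown above; the weight initialization (standard normal, fixed seed) is an assumption standing in for whatever "Randomize Weights" draws:

```python
import numpy as np

d_model, d_k = 4, 4

# Input embeddings X for "The", "cat", "sat" (values from the panel above)
X = np.array([
    [ 0.92,  0.68,  0.34,  0.28],   # The
    [-0.44,  0.15, -0.42,  0.97],   # cat
    [-0.42,  0.80,  0.14, -0.89],   # sat
])

# Untrained projection weights, re-drawn each time the demo randomizes
rng = np.random.default_rng(0)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # each is a (3 tokens, d_k) matrix
```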

Attention Weights, Values & Output

Attention Weights (A), Values (V), Output (O)
The  [-0.57,  0.29,  0.24, -0.44]
cat  [-0.46,  0.29, -0.01,  0.03]
sat  [-0.27,  0.12, -0.05, -0.11]
Attention weights: A = softmax(QK^T / √d_k), where each row sums to 1
Output: O = AV, a weighted combination of the value vectors
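
A concrete consequence of the causal mask and the row-wise softmax: the first row of A is forced to [1, 0, 0], so the output for "The" is exactly its own value vector, O_The = V_The. For "cat" the masked scores are [q_cat·k_The / √d_k, q_cat·k_cat / √d_k, −∞], so the second row of A has the form [a, 1 − a, 0] and O_cat = a·V_The + (1 − a)·V_cat.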