Instructions:
• Select a sequence from the dropdown to see how attention processes different inputs
• Hover over input tokens to see their corresponding Q, K, V vectors highlighted
• Hover over output embeddings to trace how they were computed from attention weights and values
• Click "Randomize Weights" to see different random weight initializations
• All computations follow: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
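As a reference for the formula above, here is a minimal NumPy sketch of scaled dot-product attention (without the causal mask, which is covered in the next section); the variable names are illustrative and not taken from the demo's source:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d_k) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax, each row sums to 1
    return weights @ V                              # weighted combination of value vectors
```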

Educational Purpose:
• This demo shows untrained attention with random initialization
• Designed to build visual intuition for attention mechanics, not realistic outputs
• Uses causal masking (lower triangular) typical in language models; see the sketch after this list
• Fixed dimensions: d_model = 4, d_k = 4 for clear visualization
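A sketch of the lower-triangular causal mask mentioned above, applied to the attention scores before the softmax; the -inf convention and the three-token sequence length are assumptions for illustration, not details taken from the demo:

```python
import numpy as np

seq_len = 3                                               # three tokens: "The", "cat", "sat"
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))   # position i attends only to j <= i

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)              # block attention to future tokens
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                              # exp(-inf) = 0 for masked positions
    return weights / weights.sum(axis=-1, keepdims=True)
```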

Visualization:
Blue (Q): Query vectors - what each token "asks for"
Green (K): Key vectors - what each token "offers"
Red (V): Value vectors - information each token contains
Attention Weights: Shows which tokens attend to which (causal mask applied)
Light Blue (O): Output embeddings - final attention results

Input Sequence & Query, Key, Value Matrices

Input Tokens
The: [0.60, -0.64, -0.77, -0.89]
cat: [0.94, 0.81, 0.97, -0.59]
sat: [0.18, 0.86, -0.04, 0.07]
Query (Q) | Key (K) | Value (V)
Q = XW_Q, K = XW_K, V = XW_V, where X is the input embedding matrix
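A sketch of this projection step, using the three token embeddings shown above as X; the random weight matrices here stand in for whatever "Randomize Weights" produces in the demo and will not reproduce its exact numbers:

```python
import numpy as np

d_model, d_k = 4, 4
X = np.array([[ 0.60, -0.64, -0.77, -0.89],   # "The"
              [ 0.94,  0.81,  0.97, -0.59],   # "cat"
              [ 0.18,  0.86, -0.04,  0.07]])  # "sat"

rng = np.random.default_rng(0)                # illustrative random, untrained weights
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # each has shape (3, 4)
```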

Attention Weights, Values & Output

Attention Weights (A) | Values (V) | Output (O)
The: [-0.45, -0.03, -0.44, -0.10]
cat: [-0.05, 0.17, 0.13, -0.22]
sat: [0.05, 0.26, 0.26, -0.15]
Attention weights: A = softmax(QKᵀ / √d_k), where each row sums to 1
Output: O = AV, a weighted combination of the value vectors
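Putting the two formulas together, a self-contained sketch that computes A and O for the demo's three tokens and checks that each row of A sums to 1; the weights are random, so the numbers will differ from those shown above:

```python
import numpy as np

d_k = 4
X = np.array([[ 0.60, -0.64, -0.77, -0.89],   # "The"
              [ 0.94,  0.81,  0.97, -0.59],   # "cat"
              [ 0.18,  0.86, -0.04,  0.07]])  # "sat"

rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.standard_normal((4, d_k)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(d_k)                       # raw attention scores
mask = np.tril(np.ones(scores.shape, dtype=bool))     # causal (lower triangular) mask
scores = np.where(mask, scores, -np.inf)
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)                    # A = softmax(QKᵀ / √d_k)
assert np.allclose(A.sum(axis=-1), 1.0)               # each row sums to 1

O = A @ V                                             # output embeddings, one per input token
print(O.round(2))
```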