Instructions:
• Select a sequence from the dropdown to see how attention processes different inputs
• Hover over input tokens to see their corresponding Q, K, V vectors highlighted
• Hover over output embeddings to trace how they were computed from attention weights and values
• Click "Randomize Weights" to see different random weight initializations
• All computations follow: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
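As a reference for the formula above, here is a minimal NumPy sketch of scaled dot-product attention (without the causal mask, which is covered in the next section); the variable names are illustrative and not taken from the demo's source:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d_k) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax, each row sums to 1
    return weights @ V                              # weighted combination of value vectors
```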

Educational Purpose:
• This demo shows untrained attention with random initialization
• Designed to build visual intuition for attention mechanics, not realistic outputs
• Uses causal masking (lower triangular) typical in language models; see the sketch after this list
• Fixed dimensions: d_model = 4, d_k = 4 for clear visualization
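A sketch of the lower-triangular causal mask mentioned above, applied to the attention scores before the softmax; the -inf convention and the three-token sequence length are assumptions for illustration, not details taken from the demo:

```python
import numpy as np

seq_len = 3                                               # three tokens: "The", "cat", "sat"
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))   # position i attends only to j <= i

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)              # block attention to future tokens
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                              # exp(-inf) = 0 for masked positions
    return weights / weights.sum(axis=-1, keepdims=True)
```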

Visualization:
Blue (Q): Query vectors - what each token "asks for"
Green (K): Key vectors - what each token "offers"
Red (V): Value vectors - information each token contains
Attention Weights: Shows which tokens attend to which (causal mask applied)
Light Blue (O): Output embeddings - final attention results

Input Sequence & Query, Key, Value Matrices

Input Tokens
The: [0.60, -0.64, -0.77, -0.89]
cat: [0.94, 0.81, 0.97, -0.59]
sat: [0.18, 0.86, -0.04, 0.07]
Query (Q) | Key (K) | Value (V)
Q = XW_Q, K = XW_K, V = XW_V, where X is the input embedding matrix
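A sketch of this projection step, using the three token embeddings shown above as X; the random weight matrices here stand in for whatever "Randomize Weights" produces in the demo and will not reproduce its exact numbers:

```python
import numpy as np

d_model, d_k = 4, 4
X = np.array([[ 0.60, -0.64, -0.77, -0.89],   # "The"
              [ 0.94,  0.81,  0.97, -0.59],   # "cat"
              [ 0.18,  0.86, -0.04,  0.07]])  # "sat"

rng = np.random.default_rng(0)                # illustrative random, untrained weights
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # each has shape (3, 4)
```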

Attention Weights, Values & Output

Attention Weights (A) | Values (V) | Output (O)
The: [-0.45, -0.03, -0.44, -0.10]
cat: [-0.05, 0.17, 0.13, -0.22]
sat: [0.05, 0.26, 0.26, -0.15]
Attention weights: A = softmax(QKᵀ / √d_k), where each row sums to 1
Output: O = AV, a weighted combination of the value vectors
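Putting the two formulas together, a self-contained sketch that computes A and O for the demo's three tokens and checks that each row of A sums to 1; the weights are random, so the numbers will differ from those shown above:

```python
import numpy as np

d_k = 4
X = np.array([[ 0.60, -0.64, -0.77, -0.89],   # "The"
              [ 0.94,  0.81,  0.97, -0.59],   # "cat"
              [ 0.18,  0.86, -0.04,  0.07]])  # "sat"

rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.standard_normal((4, d_k)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(d_k)                       # raw attention scores
mask = np.tril(np.ones(scores.shape, dtype=bool))     # causal (lower triangular) mask
scores = np.where(mask, scores, -np.inf)
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)                    # A = softmax(QKᵀ / √d_k)
assert np.allclose(A.sum(axis=-1), 1.0)               # each row sums to 1

O = A @ V                                             # output embeddings, one per input token
print(O.round(2))
```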