The attention mechanism is a fundamental component of modern neural networks, particularly in natural language processing and computer vision. It allows models to focus on relevant parts of the input when processing information.
This interactive demo visualizes how the attention mechanism works.
The demo uses causal masking (lower triangular) typical in language models, where tokens can only attend to previous positions.
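A causal (lower-triangular) mask can be sketched as follows; the sequence length and scores here are illustrative stand-ins, not values from the demo:

```python
import numpy as np

seq_len = 4

# Lower-triangular causal mask: position i may attend only to positions j <= i.
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Masked-out scores are set to -inf so the softmax assigns them zero weight.
scores = np.random.randn(seq_len, seq_len)
scores = np.where(mask, scores, -np.inf)
```

Setting masked positions to negative infinity (rather than zero) is the standard trick: after the softmax, those positions contribute exactly zero attention weight.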
The attention mechanism computes a weighted average of value vectors, where weights are determined by the compatibility between queries and keys:
Step-by-step process: the input embeddings are projected into queries, keys, and values; scaled compatibility scores between queries and keys are normalized with a softmax; and the output is the resulting weighted sum of the values. Each step is detailed below.

Why scale by $\sqrt{d_k}$? Without scaling, the dot products grow with the key dimension, which would push the softmax into regions with extremely small gradients.
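The effect of scaling can be seen numerically. A quick sketch (the dimension and number of keys are chosen arbitrarily): unscaled dot products of random unit-variance vectors have magnitude on the order of $\sqrt{d_k}$, which saturates the softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512

# Dot products of random unit-variance vectors have variance ~ d_k.
q = rng.standard_normal(d_k)
k = rng.standard_normal((8, d_k))

raw = k @ q                   # typical magnitude ~ sqrt(d_k)
scaled = raw / np.sqrt(d_k)   # typical magnitude ~ 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Unscaled scores concentrate nearly all probability mass on one position,
# leaving near-zero gradients elsewhere; scaled scores stay better spread.
print(softmax(raw).max(), softmax(scaled).max())
```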
Query, Key, Value Transformations: The input embeddings are transformed into three different representations using learned weight matrices: $Q = XW_Q$, $K = XW_K$, $V = XW_V$ where $X$ is the input embedding matrix. These transformations allow each token to simultaneously ask for information (Query), offer information (Key), and contain information (Value).
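These three projections can be sketched as follows (the shapes are illustrative, and the random matrices stand in for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

X = rng.standard_normal((seq_len, d_model))  # input embedding matrix

# Learned weight matrices (random stand-ins here).
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

# Q = X W_Q, K = X W_K, V = X W_V -- each row is one token's query/key/value.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
```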
Attention Weights: $A = \operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$, where each row sums to 1. Compatibility scores between queries and keys are computed with the scaled dot product; the factor $\sqrt{d_k}$ keeps the scores from growing with the key dimension, which would otherwise push the softmax into regions with extremely small gradients. The row-wise softmax then normalizes the scores into attention weights that form a probability distribution over positions for each query.
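Computing the weight matrix $A$ with the causal mask applied can be sketched like this (shapes are arbitrary; $Q$ and $K$ are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))

scores = Q @ K.T / np.sqrt(d_k)           # scaled compatibility scores

mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.where(mask, scores, -np.inf)  # causal mask

# Row-wise softmax (subtracting the row max for numerical stability):
# each row of A sums to 1.
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)
```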
Output Computation: $O = AV$. The final output is a weighted combination of value vectors: each output position receives a weighted sum of all value vectors, with the weights given by the attention scores. This allows the model to dynamically aggregate information from relevant positions in the input sequence based on learned patterns of relevance.
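Putting the three steps together, a minimal single-head sketch (not the demo's actual code; weights are random stand-ins for learned parameters):

```python
import numpy as np

def attention(X, W_Q, W_K, W_V, causal=True):
    """Single-head scaled dot-product attention: O = softmax(QK^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        n = scores.shape[0]
        scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V  # weighted combination of value vectors

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
W = [rng.standard_normal((8, 8)) for _ in range(3)]
O = attention(X, *W)
```

With the causal mask, the first position can only attend to itself, so its output is exactly its own value vector.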