Attention Mechanism
Attention allows a model to focus on the most relevant parts of the input when producing each output. It is the core mechanism behind the Transformer architecture and most modern NLP systems.
The Intuition
When translating "The cat sat on the mat" into French, the model should "attend" to "cat" when generating "chat". Attention does this by computing a relevance score between each output position and every input position, then using those scores to weight the inputs.
Attention(Q, K, V) = softmax(Q * K^T / √d_k) * V
where Q = Query, K = Key, V = Value, and d_k is the dimension of the key vectors (the √d_k scaling keeps the dot products from saturating the softmax).
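To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function name and the tensor shapes are illustrative choices, not taken from any particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D arrays (seq_len, dim)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                                 # relevance of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))   # numerically stable softmax
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights                                     # weighted sum of values + attention map

# Tiny random example: 3 query positions, 4 key/value positions, d_k = d_v = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape)           # (3, 8)
print(attn.sum(axis=-1))   # each row of attention weights sums to 1
```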
Simplified Attention Example
This demonstrates how attention scores are computed between words.
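Below is a minimal NumPy sketch of this idea. The sentence and the randomly initialized toy word vectors are illustrative stand-ins for the learned embeddings a real model would use.

```python
import numpy as np

# Toy word vectors (made up for illustration; real models learn these).
words = ["The", "cat", "sat", "on", "the", "mat"]
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(len(words), 4))

# In self-attention the same vectors act as queries, keys, and values.
d_k = embeddings.shape[-1]
scores = embeddings @ embeddings.T / np.sqrt(d_k)   # similarity of every word to every word
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)      # softmax: each row sums to 1

# Show which words "cat" attends to most strongly.
cat_idx = words.index("cat")
for word, w in sorted(zip(words, weights[cat_idx]), key=lambda p: -p[1]):
    print(f"{word:>4}: {w:.2f}")
```

Each row of the resulting matrix is a probability distribution over the sequence, telling the model how much weight to give every other word when building that word's representation.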
Types of Attention
Self-Attention
Each word attends to all words in the same sequence.
Cross-Attention
Decoder attends to encoder outputs (e.g., in translation).
Multi-Head Attention
Multiple attention "heads" run in parallel, each with its own learned projections, and their outputs are concatenated (see the sketch below).
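As an illustration of the multi-head idea, here is a NumPy sketch that splits the model dimension into heads, applies scaled dot-product attention per head, and concatenates the results. The weight matrices Wq, Wk, Wv, and Wo are random stand-ins for parameters a real model would learn.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Split d_model into num_heads heads, attend per head, then concatenate."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                        # (seq_len, d_model) each
    split = lambda M: M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)               # (num_heads, seq_len, d_head)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, seq_len, seq_len)
    heads = softmax(scores) @ Vh                            # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                                      # final output projection

# Example: 6 tokens, d_model = 8, 2 heads.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 8))
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads=2).shape)  # (6, 8)
```

Because each head has its own projections, different heads can specialize in different relations between tokens, which is a key reason multi-head attention outperforms a single large attention map.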