Attention Mechanism

Attention allows models to focus on relevant parts of the input when producing each output. It's the core innovation behind the Transformer architecture and much of modern NLP.

The Intuition

When translating "The cat sat on the mat" to French, the model should "attend" to "cat" when generating "chat". Attention makes this possible by computing a relevance score for every pair of output and input positions, then using those scores to weight the inputs.

Attention(Q, K, V) = softmax(Q * K^T / √d_k) * V

Q = Query, K = Key, V = Value; d_k is the dimension of the key vectors, and dividing by √d_k keeps the dot products from growing too large before the softmax.

Simplified Attention Example

This demonstrates how attention scores are computed between words.


Types of Attention

Self-Attention

Each word attends to all words in the same sequence.
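A short sketch using PyTorch's built-in scaled_dot_product_attention (assuming PyTorch 2.x); the tensor is random toy data standing in for token embeddings.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 4)  # (batch, sequence length, embedding dim) -- random toy data

# Self-attention: query, key, and value all come from the same sequence
out = F.scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([1, 3, 4]) -- one output vector per input position
```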

Cross-Attention

The decoder attends to the encoder's outputs (e.g., in translation, each generated target word attends to the source sentence).
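The same PyTorch call works for cross-attention; only the sources of the tensors change. The encoder and decoder tensors below are random placeholders rather than outputs of a real model.

```python
import torch
import torch.nn.functional as F

encoder_out = torch.randn(1, 5, 4)  # 5 source tokens (e.g. the English sentence)
decoder_in  = torch.randn(1, 3, 4)  # 3 target tokens generated so far

# Cross-attention: queries from the decoder, keys and values from the encoder
out = F.scaled_dot_product_attention(decoder_in, encoder_out, encoder_out)
print(out.shape)  # torch.Size([1, 3, 4]) -- one output vector per decoder position
```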

Multi-Head Attention

Several attention heads run in parallel, each with its own learned projections of Q, K, and V; their outputs are concatenated and projected back to the model dimension.
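A sketch using torch.nn.MultiheadAttention, which splits the embedding across heads, runs them in parallel, and concatenates the results; the dimensions here are arbitrary toy values.

```python
import torch

# 2 heads, each working on an 8 / 2 = 4-dimensional slice of the embedding
mha = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = torch.randn(1, 3, 8)         # (batch, sequence, embedding) -- random toy data
out, weights = mha(x, x, x)      # self-attention with both heads in parallel
print(out.shape, weights.shape)  # torch.Size([1, 3, 8]) torch.Size([1, 3, 3])
```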