Word Embeddings

Word embeddings represent words as dense vectors in continuous space, where similar words are close together. They capture semantic meaning and relationships between words.

Key Insight: "You shall know a word by the company it keeps" (J.R. Firth) — words that appear in similar contexts tend to have similar meanings.

Why Embeddings?

One-Hot Encoding ✗

Traditional approach: sparse, high-dimensional.

cat: [1, 0, 0, 0, ..., 0]
dog: [0, 1, 0, 0, ..., 0]
Vocabulary size: 50,000+

✗ No semantic similarity (every pair of one-hot vectors is orthogonal), and dimensionality grows with the vocabulary

Word Embeddings ✓

Dense vectors in low-dimensional space.

cat: [0.2, -0.5, 0.8, ...]
dog: [0.3, -0.4, 0.7, ...]
Dimensions: 100-300

✓ Captures meaning, similar words are close
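
To make the contrast concrete, here is a minimal NumPy sketch with a toy three-word vocabulary; the dense values are random placeholders, since real embeddings are learned from data.

```python
import numpy as np

vocab = ["cat", "dog", "car"]                    # toy vocabulary
index = {w: i for i, w in enumerate(vocab)}

# One-hot: a sparse vector as long as the vocabulary, with a single 1
def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

# Dense embedding: a small lookup table of real-valued vectors
# (random placeholders here; real embeddings are learned from text)
embedding_dim = 4
embedding_table = np.random.randn(len(vocab), embedding_dim)

print(one_hot("cat"))                    # [1. 0. 0.]
print(embedding_table[index["cat"]])     # e.g. [ 0.21 -0.48  0.83  0.05]
```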

Popular Embedding Methods

Word2Vec (2013)

Learn embeddings by predicting context words.

CBOW: Predict word from context
Skip-gram: Predict context from word
Trained on large text corpus

Fast, captures semantic relationships (king - man + woman ≈ queen)
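
As a rough illustration, the sketch below trains a skip-gram Word2Vec model with gensim (4.x API assumed) on a tiny made-up corpus; results at this scale are meaningless, but the calls are the same ones used on real data.

```python
from gensim.models import Word2Vec

# Tiny toy corpus: a list of tokenized sentences (real training needs far more text)
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["dogs", "and", "cats", "are", "pets"],
]

# sg=1 selects skip-gram; sg=0 would use CBOW
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

print(model.wv["cat"][:5])            # first 5 dimensions of the "cat" vector
print(model.wv.most_similar("cat"))   # nearest neighbours in the toy vector space
```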

GloVe (2014)

Global Vectors for Word Representation.

Uses word co-occurrence statistics
Combines global matrix factorization + local context
Pre-trained on Wikipedia, Common Crawl

Good for capturing word analogies and relationships
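
GloVe is usually used through its pre-trained text files, where each line holds a word followed by its vector components. A minimal loader sketch, assuming glove.6B.100d.txt has been downloaded from the Stanford NLP site:

```python
import numpy as np

def load_glove(path="glove.6B.100d.txt"):
    # Each line: word v1 v2 ... vd (space-separated)
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove()
print(glove["king"].shape)   # (100,)
```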

FastText (2016)

Extension of Word2Vec with subword information.

Represents words as bag of character n-grams
Example: "where" → ["wh", "whe", "her", "ere", "re"]
Handles out-of-vocabulary words

✓ Works with rare words, morphologically rich languages
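
A minimal sketch with gensim's FastText (4.x API assumed), showing that a word never seen during training still gets a vector built from its character n-grams:

```python
from gensim.models import FastText

sentences = [
    ["where", "is", "the", "cat"],
    ["the", "dog", "is", "here"],
]

# min_n/max_n control the character n-gram lengths used as subwords
model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=6, epochs=50)

# "wherever" never appears in the corpus, but FastText still composes
# a vector for it from overlapping character n-grams
print("wherever" in model.wv.key_to_index)   # False
print(model.wv["wherever"][:5])              # vector built from subwords
```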

Contextual Embeddings (BERT, GPT)

Different embeddings for same word in different contexts.

"bank" in "river bank" vs "savings bank" → different vectors
Generated by transformer models
State-of-the-art for most NLP tasks

✓ Captures context, polysemy | ✗ Computationally expensive
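
A sketch of the "bank" example using Hugging Face transformers with bert-base-uncased (assumed to be installed/downloadable); the same surface word gets noticeably different vectors in the two sentences.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]                        # vector for "bank"

v1 = bank_vector("i sat by the river bank")
v2 = bank_vector("i deposited cash at the savings bank")
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0).item())  # well below 1.0
```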

Embedding Properties

Semantic Similarity

Similar words have similar vectors.

cosine_similarity(cat, dog) ≈ 0.8
cosine_similarity(cat, car) ≈ 0.2
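
Similarity is usually measured with cosine similarity; a quick sketch with made-up vectors (the numbers above are illustrative, not from a real model):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy vectors standing in for learned embeddings
cat = np.array([0.2, -0.5, 0.8])
dog = np.array([0.3, -0.4, 0.7])
car = np.array([-0.6, 0.9, 0.1])

print(cosine_similarity(cat, dog))   # high: related words
print(cosine_similarity(cat, car))   # low/negative: unrelated words
```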

Analogies (Vector Arithmetic)

Relationships encoded in vector space.

king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
walking - walk + swim ≈ swimming
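
These analogies can be checked directly with gensim's most_similar, which adds and subtracts vectors before ranking neighbours; the sketch assumes gensim's downloader is available and fetches a small pre-trained GloVe model on first use.

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")   # small pre-trained GloVe model

# king - man + woman  ->  expected near "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# paris - france + italy  ->  expected near "rome"
print(wv.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))
```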

Clustering

Related words cluster together.

Animals: cat, dog, bird, fish
Colors: red, blue, green, yellow
Countries: USA, France, Japan, Brazil
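
Clustering pre-trained vectors with k-means recovers groups like these; a sketch assuming gensim's downloader and scikit-learn are available (cluster label numbering is arbitrary).

```python
import numpy as np
from sklearn.cluster import KMeans
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")

words = ["cat", "dog", "bird", "fish",
         "red", "blue", "green", "yellow",
         "france", "japan", "brazil", "canada"]
X = np.stack([wv[w] for w in words])

# Expect three clusters: animals, colors, countries
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for word, label in zip(words, labels):
    print(label, word)
```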

Using Pre-trained Embeddings

Popular Pre-trained Models

Word2Vec: Google News (~100B training words, 3M-word vocabulary, 300d)
GloVe: Wikipedia + Gigaword (6B tokens, 50-300d)
FastText: Common Crawl (600B tokens, 300d)
BERT: Contextual, 768d (base) or 1024d (large)
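
One convenient way to fetch the static models (BERT lives in Hugging Face transformers instead) is gensim's downloader; the model names below come from the gensim-data catalogue, and the downloads can be large.

```python
import gensim.downloader as api

print(list(api.info()["models"].keys())[:5])   # browse the available models

glove = api.load("glove-wiki-gigaword-300")    # GloVe, 300d
# word2vec = api.load("word2vec-google-news-300")          # Word2Vec, large download
# fasttext = api.load("fasttext-wiki-news-subwords-300")   # FastText with subwords

print(glove["computer"].shape)   # (300,)
```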

Applications

Text Classification: embeddings as input features for classifiers (see the sketch after this list)
Semantic Search: find similar documents using vector similarity
Recommendation: recommend similar items based on embedding distance
Machine Translation: map words across languages in a shared space
Named Entity Recognition: input features for sequence models
Sentiment Analysis: capture emotional nuances
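
As a concrete example of the text-classification use case, the sketch below averages pre-trained GloVe vectors into sentence features and fits a logistic regression on a tiny made-up sentiment dataset (illustration only, not a realistic setup).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")

def sentence_vector(text):
    # Average the embeddings of the in-vocabulary tokens
    tokens = [t for t in text.lower().split() if t in wv]
    return np.mean([wv[t] for t in tokens], axis=0)

# Tiny made-up sentiment dataset
texts = ["great movie loved it", "wonderful film amazing acting",
         "terrible movie hated it", "awful boring waste of time"]
labels = [1, 1, 0, 0]

X = np.stack([sentence_vector(t) for t in texts])
clf = LogisticRegression().fit(X, labels)

print(clf.predict([sentence_vector("an amazing wonderful film")]))   # expect [1]
```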

Choosing Embeddings

Static embeddings (Word2Vec, GloVe): Fast, good for simple tasks
FastText: Best for morphologically rich languages, rare words
Contextual (BERT, GPT): State-of-the-art, but slower and heavier
Dimension size: 100-300 for static, 768+ for contextual
Always try pre-trained embeddings before training from scratch

Key Takeaway: Word embeddings transform words into meaningful vectors. Word2Vec and GloVe are classics, but contextual embeddings (BERT) are now state-of-the-art for most tasks.