Word Embeddings

Word embeddings represent words as dense vectors in continuous space, where similar words are close together. They capture semantic meaning and relationships between words.

Key Insight: "You shall know a word by the company it keeps" (J.R. Firth) — words that appear in similar contexts tend to have similar meanings.

Why Embeddings?

One-Hot Encoding ✗

Traditional approach: sparse, high-dimensional.

cat: [1, 0, 0, 0, ..., 0]
dog: [0, 1, 0, 0, ..., 0]
Vocabulary size: 50,000+

✗ No semantic similarity (every pair of one-hot vectors is orthogonal), and dimensionality grows with the vocabulary

Word Embeddings ✓

Dense vectors in low-dimensional space.

cat: [0.2, -0.5, 0.8, ...]
dog: [0.3, -0.4, 0.7, ...]
Dimensions: 100-300

✓ Captures meaning, similar words are close
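
To make the contrast concrete, here is a minimal NumPy sketch with a toy three-word vocabulary; the dense values are random placeholders, since real embeddings are learned from data.

```python
import numpy as np

vocab = ["cat", "dog", "car"]                    # toy vocabulary
index = {w: i for i, w in enumerate(vocab)}

# One-hot: a sparse vector as long as the vocabulary, with a single 1
def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

# Dense embedding: a small lookup table of real-valued vectors
# (random placeholders here; real embeddings are learned from text)
embedding_dim = 4
embedding_table = np.random.randn(len(vocab), embedding_dim)

print(one_hot("cat"))                    # [1. 0. 0.]
print(embedding_table[index["cat"]])     # e.g. [ 0.21 -0.48  0.83  0.05]
```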

Popular Embedding Methods

Word2Vec (2013)

Learn embeddings by predicting context words.

CBOW: Predict word from context
Skip-gram: Predict context from word
Trained on large text corpus

Fast, captures semantic relationships (king - man + woman ≈ queen)
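
As a rough illustration, the sketch below trains a skip-gram Word2Vec model with gensim (4.x API assumed) on a tiny made-up corpus; results at this scale are meaningless, but the calls are the same ones used on real data.

```python
from gensim.models import Word2Vec

# Tiny toy corpus: a list of tokenized sentences (real training needs far more text)
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["dogs", "and", "cats", "are", "pets"],
]

# sg=1 selects skip-gram; sg=0 would use CBOW
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

print(model.wv["cat"][:5])            # first 5 dimensions of the "cat" vector
print(model.wv.most_similar("cat"))   # nearest neighbours in the toy vector space
```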

GloVe (2014)

Global Vectors for Word Representation.

Uses word co-occurrence statistics
Combines global matrix factorization + local context
Pre-trained on Wikipedia, Common Crawl

Good for capturing word analogies and relationships
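
GloVe is usually used through its pre-trained text files, where each line holds a word followed by its vector components. A minimal loader sketch, assuming glove.6B.100d.txt has been downloaded from the Stanford NLP site:

```python
import numpy as np

def load_glove(path="glove.6B.100d.txt"):
    # Each line: word v1 v2 ... vd (space-separated)
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove()
print(glove["king"].shape)   # (100,)
```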

FastText (2016)

Extension of Word2Vec with subword information.

Represents words as bag of character n-grams
Example: "where" → ["wh", "whe", "her", "ere", "re"]
Handles out-of-vocabulary words

✓ Works with rare words, morphologically rich languages
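
A minimal sketch with gensim's FastText (4.x API assumed), showing that a word never seen during training still gets a vector built from its character n-grams:

```python
from gensim.models import FastText

sentences = [
    ["where", "is", "the", "cat"],
    ["the", "dog", "is", "here"],
]

# min_n/max_n control the character n-gram lengths used as subwords
model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=6, epochs=50)

# "wherever" never appears in the corpus, but FastText still composes
# a vector for it from overlapping character n-grams
print("wherever" in model.wv.key_to_index)   # False
print(model.wv["wherever"][:5])              # vector built from subwords
```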

Contextual Embeddings (BERT, GPT)

Different embeddings for same word in different contexts.

"bank" in "river bank" vs "savings bank" → different vectors
Generated by transformer models
State-of-the-art for most NLP tasks

✓ Captures context, polysemy | ✗ Computationally expensive
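
A sketch of the "bank" example using Hugging Face transformers with bert-base-uncased (assumed to be installed/downloadable); the same surface word gets noticeably different vectors in the two sentences.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]                        # vector for "bank"

v1 = bank_vector("i sat by the river bank")
v2 = bank_vector("i deposited cash at the savings bank")
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0).item())  # well below 1.0
```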

Embedding Properties

Semantic Similarity

Similar words have similar vectors.

cosine_similarity(cat, dog) ≈ 0.8
cosine_similarity(cat, car) ≈ 0.2
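
Similarity is usually measured with cosine similarity; a quick sketch with made-up vectors (the numbers above are illustrative, not from a real model):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy vectors standing in for learned embeddings
cat = np.array([0.2, -0.5, 0.8])
dog = np.array([0.3, -0.4, 0.7])
car = np.array([-0.6, 0.9, 0.1])

print(cosine_similarity(cat, dog))   # high: related words
print(cosine_similarity(cat, car))   # low/negative: unrelated words
```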

Analogies (Vector Arithmetic)

Relationships encoded in vector space.

king - man + woman ≈ queen
Paris - France + Italy ≈ Rome
walking - walk + swim ≈ swimming
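
These analogies can be checked directly with gensim's most_similar, which adds and subtracts vectors before ranking neighbours; the sketch assumes gensim's downloader is available and fetches a small pre-trained GloVe model on first use.

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")   # small pre-trained GloVe model

# king - man + woman  ->  expected near "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# paris - france + italy  ->  expected near "rome"
print(wv.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))
```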

Clustering

Related words cluster together.

Animals: cat, dog, bird, fish
Colors: red, blue, green, yellow
Countries: USA, France, Japan, Brazil
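
Clustering pre-trained vectors with k-means recovers groups like these; a sketch assuming gensim's downloader and scikit-learn are available (cluster label numbering is arbitrary).

```python
import numpy as np
from sklearn.cluster import KMeans
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")

words = ["cat", "dog", "bird", "fish",
         "red", "blue", "green", "yellow",
         "france", "japan", "brazil", "canada"]
X = np.stack([wv[w] for w in words])

# Expect three clusters: animals, colors, countries
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for word, label in zip(words, labels):
    print(label, word)
```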

Using Pre-trained Embeddings

Popular Pre-trained Models

Word2Vec: Google News (~100B training words, 3M-word vocabulary, 300d)
GloVe: Wikipedia + Gigaword (6B tokens, 50-300d)
FastText: Common Crawl (600B tokens, 300d)
BERT: Contextual, 768d (base) or 1024d (large)
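
One convenient way to fetch the static models (BERT lives in Hugging Face transformers instead) is gensim's downloader; the model names below come from the gensim-data catalogue, and the downloads can be large.

```python
import gensim.downloader as api

print(list(api.info()["models"].keys())[:5])   # browse the available models

glove = api.load("glove-wiki-gigaword-300")    # GloVe, 300d
# word2vec = api.load("word2vec-google-news-300")          # Word2Vec, large download
# fasttext = api.load("fasttext-wiki-news-subwords-300")   # FastText with subwords

print(glove["computer"].shape)   # (300,)
```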

Applications

Text Classification: embeddings as input features for classifiers (see the sketch after this list)
Semantic Search: find similar documents using vector similarity
Recommendation: recommend similar items based on embedding distance
Machine Translation: map words across languages in a shared space
Named Entity Recognition: input features for sequence models
Sentiment Analysis: capture emotional nuances
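
As a concrete example of the text-classification use case, the sketch below averages pre-trained GloVe vectors into sentence features and fits a logistic regression on a tiny made-up sentiment dataset (illustration only, not a realistic setup).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")

def sentence_vector(text):
    # Average the embeddings of the in-vocabulary tokens
    tokens = [t for t in text.lower().split() if t in wv]
    return np.mean([wv[t] for t in tokens], axis=0)

# Tiny made-up sentiment dataset
texts = ["great movie loved it", "wonderful film amazing acting",
         "terrible movie hated it", "awful boring waste of time"]
labels = [1, 1, 0, 0]

X = np.stack([sentence_vector(t) for t in texts])
clf = LogisticRegression().fit(X, labels)

print(clf.predict([sentence_vector("an amazing wonderful film")]))   # expect [1]
```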

Choosing Embeddings

Static embeddings (Word2Vec, GloVe): Fast, good for simple tasks
FastText: Best for morphologically rich languages, rare words
Contextual (BERT, GPT): State-of-the-art, but slower and heavier
Dimension size: 100-300 for static, 768+ for contextual
Always try pre-trained embeddings before training from scratch

Key Takeaway: Word embeddings transform words into meaningful vectors. Word2Vec and GloVe are classics, but contextual embeddings (BERT) are now state-of-the-art for most tasks.