NLP Basics & Tokenization
Natural Language Processing (NLP) enables computers to understand, interpret, and generate human language. The first step is converting text into a format models can process: tokenization.
Challenge: Language is complex, ambiguous, and context-dependent. NLP bridges the gap between human communication and machine understanding.
Text Preprocessing Pipeline
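Raw text typically flows through normalization (e.g. lowercasing and cleanup), tokenization, and finally mapping tokens to integer IDs a model can consume. A minimal sketch of such a pipeline; the regex, function name, and toy vocabulary here are illustrative, not taken from any particular library:

```python
import re

def preprocess(text, vocab):
    """Toy pipeline: normalize -> tokenize -> map tokens to integer IDs."""
    text = text.lower()                                 # normalization
    tokens = re.findall(r"\w+|[^\w\s]", text)           # word/punctuation tokenization
    unk_id = vocab.get("<unk>")
    return [vocab.get(tok, unk_id) for tok in tokens]   # numericalization

# Illustrative vocabulary; real systems learn this from a training corpus.
vocab = {"<unk>": 0, "hello": 1, "world": 2, ",": 3, "!": 4}
print(preprocess("Hello, world!", vocab))  # [1, 3, 2, 4]
```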
Tokenization Methods
Word Tokenization
Split by whitespace and punctuation.
Output: ["Hello", ",", "world", "!"]
✓ Simple, intuitive | ✗ Large vocabulary, can't handle unknown words
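A simple way to get the output above is a regular expression that keeps words and punctuation marks as separate tokens. This is only a sketch; it ignores contractions, hyphenation, and other edge cases a real tokenizer must handle:

```python
import re

def word_tokenize(text):
    # \w+ grabs runs of word characters; [^\w\s] grabs single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```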
Character Tokenization
Split into individual characters.
Output: ["H", "e", "l", "l", "o"]
✓ Small vocabulary, no unknown words | ✗ Long sequences, loses word meaning
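Character tokenization is essentially a one-liner:

```python
def char_tokenize(text):
    # Every character (including spaces and punctuation) becomes its own token.
    return list(text)

print(char_tokenize("Hello"))  # ['H', 'e', 'l', 'l', 'o']
```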
Subword Tokenization (BPE, WordPiece)
Split into meaningful subword units. Best of both worlds!
Output: ["un", "happiness"] or ["un", "happy", "ness"]
✓ Handles unknown words, reasonable vocabulary size | Used in: BERT, GPT, T5
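The core of BPE is simple: start from characters and repeatedly merge the most frequent adjacent pair of symbols until a target number of merges (or vocabulary size) is reached. The sketch below trains a few merges on a made-up corpus around "unhappy"/"happiness"; the corpus and merge count are purely illustrative:

```python
import re
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    """Replace every occurrence of the pair with a single merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in corpus.items()}

# Words pre-split into characters, with their frequencies in a toy corpus.
corpus = {"u n h a p p y": 3, "h a p p i n e s s": 2, "u n h a p p i n e s s": 1}

for _ in range(6):  # the number of merges is a hyperparameter
    pairs = pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    corpus = merge_pair(best, corpus)
    print("merged:", best)
```

Production tokenizers (e.g. SentencePiece, HuggingFace tokenizers) add word-boundary markers, fallback handling, and fast lookup, but the merge loop above is the core idea.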
Text Normalization
Stemming
Chop off word endings to get root.
flies → fli
better → better
Fast but crude; the output may not be a real word
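With NLTK's Porter stemmer (assuming nltk is installed), the examples above look like this:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("flies"))      # 'fli'    -- not a real word
print(stemmer.stem("better"))     # 'better' -- irregular forms are left unchanged
print(stemmer.stem("happiness"))  # 'happi'
```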
Lemmatization
Reduce to dictionary form (lemma).
flies → fly
better → good
More accurate, uses linguistic knowledge
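With NLTK's WordNet lemmatizer (requires a one-time download of the WordNet data); note that the lemma depends on the part-of-speech tag you pass:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # fetch WordNet data if not already present
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("flies", pos="n"))   # 'fly'
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'   -- adjective sense
print(lemmatizer.lemmatize("better"))           # 'better' -- default POS is noun
```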
Common NLP Tasks
Challenges in NLP
Key Takeaway: Tokenization is the foundation of NLP. Modern models use subword tokenization (BPE, WordPiece) to balance vocabulary size and coverage.