NLP Basics & Tokenization

Natural Language Processing (NLP) enables computers to understand, interpret, and generate human language. The first step is converting text into a format models can process: tokenization.

Challenge: Language is complex, ambiguous, and context-dependent. NLP bridges the gap between human communication and machine understanding.

Text Preprocessing Pipeline

1. Lowercasing: Convert all text to lowercase for consistency
2. Tokenization: Split text into words or subwords
3. Remove Punctuation: Strip special characters (optional)
4. Remove Stop Words: Filter common words (the, is, at)
5. Stemming/Lemmatization: Reduce words to root form
6. Vectorization: Convert tokens to numbers
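
Taken together, the pipeline might look like the minimal sketch below. The stop-word list, regex patterns, and suffix rules here are illustrative toys, not production tools; real pipelines use libraries like NLTK or spaCy.

```python
import re

# Toy stop-word list and a crude suffix "stemmer" (illustrative stand-ins
# for real library tools such as NLTK or spaCy).
STOP_WORDS = {"the", "is", "at", "a", "an", "and", "of"}

def preprocess(text):
    text = text.lower()                                   # 1. lowercasing
    tokens = re.findall(r"[a-z']+", text)                 # 2-3. tokenize + drop punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 4. remove stop words
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]  # 5. crude stemming

tokens = preprocess("The cats are running at the park!")
print(tokens)  # ['cat', 'are', 'runn', 'park']

# 6. vectorization: map each distinct token to an integer id
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
print([vocab[t] for t in tokens])  # [1, 0, 3, 2]
```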

Tokenization Methods

Word Tokenization

Split by whitespace and punctuation.

Input: "Hello, world!"
Output: ["Hello", ",", "world", "!"]

✓ Simple, intuitive | ✗ Large vocabulary, can't handle unknown words
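
A minimal regex-based word tokenizer reproduces the example above; the pattern is a simplification, and libraries such as NLTK cover many more edge cases (contractions, hyphens, URLs):

```python
import re

def word_tokenize(text):
    # Words (\w+) and individual punctuation marks as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```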

Character Tokenization

Split into individual characters.

Input: "Hello"
Output: ["H", "e", "l", "l", "o"]

✓ Small vocabulary, no unknown words | ✗ Long sequences, loses word meaning
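
In Python this is a one-liner, since a string is already a sequence of characters:

```python
print(list("Hello"))  # ['H', 'e', 'l', 'l', 'o']
```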

Subword Tokenization (BPE, WordPiece)

Split into meaningful subword units. Best of both worlds!

Input: "unhappiness"
Output: ["un", "happiness"] or ["un", "happy", "ness"]

✓ Handles unknown words, reasonable vocabulary size | Used in: BERT, GPT, T5
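
The sketch below shows the core BPE training loop, after Sennrich et al. (2016): repeatedly count adjacent symbol pairs in the corpus and merge the most frequent pair into a new symbol. The toy corpus, end-of-word marker, and merge count are illustrative:

```python
import re
from collections import Counter

def pair_counts(vocab):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Rewrite every occurrence of the winning pair as one merged symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words as space-separated symbols with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(5):
    pairs = pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best)
# ('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')
```

To tokenize new text, the learned merges are replayed in order, so an unseen word like "lowest" decomposes into known pieces: "low" + "est</w>".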


Text Normalization

Stemming

Chop off word endings to get root.

running → run
flies → fli
better → better

Fast but crude; the output may not be a real word

Lemmatization

Reduce to dictionary form (lemma).

running → run
flies → fly
better → good

More accurate; relies on dictionary lookups and part-of-speech information
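
Both are available in NLTK. A quick comparison, assuming nltk is installed and the WordNet data has been downloaded:

```python
# pip install nltk, then a one-time: import nltk; nltk.download("wordnet")
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "flies", "better"]:
    print(word, "->", stemmer.stem(word))
# running -> run, flies -> fli, better -> better

# The lemmatizer needs a part-of-speech hint (v=verb, n=noun, a=adjective).
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("flies", pos="n"))    # fly
print(lemmatizer.lemmatize("better", pos="a"))   # good
```

The part-of-speech hint is the linguistic knowledge in action: without it, the lemmatizer defaults to treating words as nouns and would leave "running" and "better" unchanged.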

Common NLP Tasks

Text Classification: Categorize documents (spam, sentiment, topic)
Named Entity Recognition: Extract entities (person, location, organization)
Machine Translation: Translate between languages
Question Answering: Answer questions from text
Text Summarization: Generate concise summaries
Sentiment Analysis: Determine emotional tone

Challenges in NLP

⚠️ Ambiguity: "I saw her duck" (bird or action?)
⚠️ Context Dependence: "bank" (riverbank or financial institution?)
⚠️ Sarcasm & Irony: "Great, another meeting!"
⚠️ Multiple Languages: Different grammar, scripts, idioms
⚠️ Informal Text: Slang, typos, abbreviations (lol, brb)

Key Takeaway: Tokenization is the foundation of NLP. Modern models use subword tokenization (BPE, WordPiece) to balance vocabulary size and coverage.