NLP Basics & Tokenization
Natural Language Processing (NLP) enables computers to understand, interpret, and generate human language. The first step is converting text into a format models can process: tokenization.
Challenge: Language is complex, ambiguous, and context-dependent. NLP bridges the gap between human communication and machine understanding.
Text Preprocessing Pipeline
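Raw text typically flows through normalization (e.g. lowercasing and cleanup), tokenization, and finally mapping tokens to integer IDs a model can consume. A minimal sketch of such a pipeline; the regex, function name, and toy vocabulary here are illustrative, not taken from any particular library:

```python
import re

def preprocess(text, vocab):
    """Toy pipeline: normalize -> tokenize -> map tokens to integer IDs."""
    text = text.lower()                                 # normalization
    tokens = re.findall(r"\w+|[^\w\s]", text)           # word/punctuation tokenization
    unk_id = vocab.get("<unk>")
    return [vocab.get(tok, unk_id) for tok in tokens]   # numericalization

# Illustrative vocabulary; real systems learn this from a training corpus.
vocab = {"<unk>": 0, "hello": 1, "world": 2, ",": 3, "!": 4}
print(preprocess("Hello, world!", vocab))  # [1, 3, 2, 4]
```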
Tokenization Methods
Word Tokenization
Split by whitespace and punctuation.
Output: ["Hello", ",", "world", "!"]
✓ Simple, intuitive | ✗ Large vocabulary, can't handle unknown words
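A simple way to get the output above is a regular expression that keeps words and punctuation marks as separate tokens. This is only a sketch; it ignores contractions, hyphenation, and other edge cases a real tokenizer must handle:

```python
import re

def word_tokenize(text):
    # \w+ grabs runs of word characters; [^\w\s] grabs single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```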
Character Tokenization
Split into individual characters.
Output: ["H", "e", "l", "l", "o"]
✓ Small vocabulary, no unknown words | ✗ Long sequences, loses word meaning
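Character tokenization is essentially a one-liner:

```python
def char_tokenize(text):
    # Every character (including spaces and punctuation) becomes its own token.
    return list(text)

print(char_tokenize("Hello"))  # ['H', 'e', 'l', 'l', 'o']
```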
Subword Tokenization (BPE, WordPiece)
Split into meaningful subword units. Best of both worlds!
Output: ["un", "happiness"] or ["un", "happy", "ness"]
✓ Handles unknown words, reasonable vocabulary size | Used in: BERT, GPT, T5
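The core of BPE is simple: start from characters and repeatedly merge the most frequent adjacent pair of symbols until a target number of merges (or vocabulary size) is reached. The sketch below trains a few merges on a made-up corpus around "unhappy"/"happiness"; the corpus and merge count are purely illustrative:

```python
import re
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    """Replace every occurrence of the pair with a single merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in corpus.items()}

# Words pre-split into characters, with their frequencies in a toy corpus.
corpus = {"u n h a p p y": 3, "h a p p i n e s s": 2, "u n h a p p i n e s s": 1}

for _ in range(6):  # the number of merges is a hyperparameter
    pairs = pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    corpus = merge_pair(best, corpus)
    print("merged:", best)
```

Production tokenizers (e.g. SentencePiece, HuggingFace tokenizers) add word-boundary markers, fallback handling, and fast lookup, but the merge loop above is the core idea.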
Text Normalization
Stemming
Chop off word endings to get root.
flies → fli
better → better
Fast but crude; the output may not be a real word
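With NLTK's Porter stemmer (assuming nltk is installed), the examples above look like this:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("flies"))      # 'fli'    -- not a real word
print(stemmer.stem("better"))     # 'better' -- irregular forms are left unchanged
print(stemmer.stem("happiness"))  # 'happi'
```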
Lemmatization
Reduce to dictionary form (lemma).
flies → fly
better → good
More accurate, uses linguistic knowledge
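With NLTK's WordNet lemmatizer (requires a one-time download of the WordNet data); note that the lemma depends on the part-of-speech tag you pass:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # fetch WordNet data if not already present
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("flies", pos="n"))   # 'fly'
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'   -- adjective sense
print(lemmatizer.lemmatize("better"))           # 'better' -- default POS is noun
```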
Common NLP Tasks
Challenges in NLP
Key Takeaway: Tokenization is the foundation of NLP. Modern models use subword tokenization (BPE, WordPiece) to balance vocabulary size and coverage.