Text Classification

Text classification assigns predefined categories to text documents. It's one of the most common NLP tasks, powering spam filters, sentiment analysis, topic categorization, and more.

Task: Given text input, predict which category it belongs to (binary or multi-class).

Common Applications

Spam Detection: spam vs. ham (legitimate email)
Sentiment Analysis: positive, negative, neutral
Topic Classification: sports, politics, tech, etc.
Intent Detection: user intent in chatbots
Review Rating: 1-5 stars from review text
Content Moderation: toxic, offensive, safe

Approaches

Traditional ML (Feature Engineering)

Extract features manually, then use ML classifiers.

1. Extract features with TF-IDF or Bag of Words
2. Train a classifier: Naive Bayes, Logistic Regression, or SVM
3. Fast, interpretable, and works well with small datasets

✓ Simple, fast, good baseline | ✗ Manual feature engineering

Deep Learning (End-to-End)

Learn features automatically from raw text.

1. Represent text with embeddings (Word2Vec, GloVe, or contextual BERT embeddings)
2. Feed them to a neural network: CNN, RNN/LSTM, or Transformer
3. State-of-the-art performance on most benchmarks

✓ Automatic features, high accuracy | ✗ Needs more data, slower

Traditional ML Pipeline


Deep Learning Architectures

CNN for Text

Convolutional filters capture local patterns (n-grams).

Fast, parallelizable, good for short texts
Captures local features well
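
As a concrete illustration, a minimal 1D-CNN text classifier might look like the following Keras sketch; the vocabulary size, sequence length, number of classes, and layer sizes are illustrative assumptions.

```python
# Minimal sketch of a 1D-CNN text classifier in Keras; all sizes are
# illustrative assumptions, and inputs are assumed to be integer token IDs.
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, seq_len, num_classes = 20_000, 200, 3

model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len,), dtype="int32"),
    layers.Embedding(vocab_size, 128),
    layers.Conv1D(128, kernel_size=5, activation="relu"),  # filters act like learned 5-gram detectors
    layers.GlobalMaxPooling1D(),                           # keep the strongest match per filter
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```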

RNN/LSTM

Processes text sequentially and captures long-range dependencies.

Good for sequences, context-aware
Slower to train than CNNs

Transformers (BERT, RoBERTa)

Self-attention provides bidirectional context over the whole input.

State-of-the-art on most benchmarks
Pre-trained models available (transfer learning)

Transfer Learning with BERT

1. Load pre-trained BERT: start from a model trained on billions of words.
2. Add a classification head: a dense layer sized to your number of classes.
3. Fine-tune on your data: train on the task-specific dataset.
4. Achieve SOTA performance: often with relatively small labeled datasets.
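
A minimal fine-tuning sketch using the Hugging Face transformers and datasets libraries; the IMDB dataset, model checkpoint, subset sizes, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of fine-tuning BERT for binary classification with
# Hugging Face; dataset choice, subset sizes, and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # adds a fresh classification head

dataset = load_dataset("imdb")           # stand-in sentiment dataset

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="bert-clf",
                         num_train_epochs=2,
                         per_device_train_batch_size=16,
                         learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),
                  eval_dataset=dataset["test"].select(range(500)))
trainer.train()
```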

Handling Imbalanced Data

Class Weights: penalize misclassifying the minority class more heavily
Oversampling: duplicate or synthesize minority-class examples (e.g., SMOTE)
Undersampling: reduce the number of majority-class examples
Data Augmentation: paraphrase or back-translate minority-class examples
Focal Loss: focus training on hard examples, down-weight easy ones
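
For the class-weights option, a minimal scikit-learn sketch; the toy labels and feature matrix are illustrative stand-ins.

```python
# Minimal sketch of class weighting with scikit-learn; labels y are assumed
# to be imbalanced (e.g., 95% ham, 5% spam) and X is a stand-in feature matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 95 + [1] * 5)   # toy imbalanced labels
X = np.random.rand(100, 20)        # stand-in features (e.g., TF-IDF vectors)

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # minority class gets a larger weight

# Or let the classifier compute the weights directly:
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```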

Best Practices

Start with simple baseline (Naive Bayes + TF-IDF)
Clean text: lowercase, remove special chars, handle contractions
Use pre-trained embeddings (Word2Vec, GloVe) or BERT
Try both traditional ML and deep learning
Cross-validate to avoid overfitting
Monitor precision, recall, and F1, not just accuracy (see the evaluation sketch after this list)
Use class weights for imbalanced data
Fine-tune BERT for state-of-the-art results
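
A short sketch of such an evaluation with scikit-learn's classification_report; the y_true and y_pred arrays are placeholders.

```python
# Report per-class precision, recall, and F1 rather than accuracy alone;
# the label arrays below are placeholder values.
from sklearn.metrics import classification_report

y_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
print(classification_report(y_true, y_pred, target_names=["ham", "spam"]))
```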

Key Takeaway: Text classification is fundamental to NLP. Start with simple baselines (Naive Bayes + TF-IDF), then move to deep learning (BERT) for best performance.