Model Evaluation

How do you know if your model is good? Model evaluation measures performance, detects overfitting, and helps you choose the best model. Never deploy a model without proper evaluation!

Golden Rule: Always evaluate on data the model hasn't seen during training (test set).

Train/Test Split

• Training: 70-80% of the data. The model learns patterns here.
• Validation: 10-15% of the data. Used to tune hyperparameters.
• Test: 10-20% of the data. Used for the final performance check.
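As a concrete illustration, a three-way split can be built with two calls to scikit-learn's train_test_split. This is a minimal sketch; the Iris dataset and the exact 70/15/15 ratios are assumptions made for the example:

```python
# Minimal sketch: a 70/15/15 train/validation/test split with scikit-learn.
# The Iris dataset and the exact ratios are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off 30% of the data to become validation + test sets.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
# Then split that 30% in half: 15% validation, 15% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))  # sizes of the three splits
```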

Classification Metrics

Accuracy

Percentage of correct predictions.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

⚠️ Misleading with imbalanced classes

Precision

Of predicted positives, how many are actually positive?

Precision = TP / (TP + FP)

Use when: False positives are costly (spam filter)

Recall (Sensitivity)

Of actual positives, how many did we find?

Recall = TP / (TP + FN)

Use when: False negatives are costly (disease detection)

F1-Score

Harmonic mean of precision and recall.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Balances precision and recall

Confusion Matrix

                    Predicted Positive      Predicted Negative
Actual Positive     True Positive (TP)      False Negative (FN)
Actual Negative     False Positive (FP)     True Negative (TN)

Regression Metrics

MAE (Mean Absolute Error)

Average absolute difference.

MAE = (1/n) Σ |y - ŷ|

Easy to interpret, same units as target

MSE (Mean Squared Error)

Average squared difference.

MSE = (1/n) Σ (y - ŷ)²

Penalizes large errors more

RMSE (Root MSE)

Square root of MSE.

RMSE = √MSE

Same units as target, interpretable

R² (R-squared)

Proportion of variance explained.

R² = 1 - (SS_res / SS_tot)

Typically 0 to 1 (can be negative for a model worse than predicting the mean); higher is better
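As a quick sketch, the regression metrics above can be computed with scikit-learn and NumPy; the y_true and y_pred values below are made-up example numbers:

```python
# Minimal sketch: regression metrics with scikit-learn and NumPy.
# y_true and y_pred are made-up example values.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

print(f"MAE:  {mae:.3f}")   # 0.500
print(f"MSE:  {mse:.3f}")   # 0.375
print(f"RMSE: {rmse:.3f}")  # 0.612
print(f"R²:   {r2:.3f}")    # 0.949
```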

Cross-Validation

More reliable than a single train/test split: across the folds, every data point is used for both training and testing.

K-Fold Cross-Validation

1. Split data into K equal folds
2. For each fold:
• Use it as test set
• Use remaining K-1 folds as training
• Train model and evaluate
3. Average performance across all K folds

Typical: K=5 or K=10


Overfitting vs Underfitting

Underfitting

Model too simple.

• High training error
• High test error
• Model hasn't learned

Fix: More complex model, more features

Good Fit

Just right!

• Low training error
• Low test error
• Generalizes well

Goal: Achieve this balance

Overfitting

Model too complex.

• Low training error
• High test error
• Memorized training data

Fix: Regularization, more data, simpler model
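
To see the difference in practice, a common diagnostic is to compare training and test accuracy. The sketch below contrasts a very shallow decision tree with an unrestricted one that is prone to memorizing; the synthetic dataset and tree depths are illustrative assumptions:

```python
# Minimal sketch: diagnosing over/underfitting by comparing train vs. test accuracy.
# The synthetic dataset and tree depths are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, None):  # very shallow tree vs. unrestricted depth
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train accuracy={tree.score(X_train, y_train):.2f}, "
          f"test accuracy={tree.score(X_test, y_test):.2f}")

# A large gap between train and test accuracy signals overfitting;
# low accuracy on both signals underfitting.
```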

Best Practices

• Always use a separate test set for the final evaluation
• Use cross-validation for model selection
• Choose metrics appropriate for your problem
• Consider class imbalance when evaluating
• Plot learning curves to diagnose overfitting
• Never tune hyperparameters on the test set
• Report multiple metrics, not just one

Key Takeaway: Proper evaluation is critical. Use appropriate metrics, cross-validation, and always test on unseen data to ensure your model generalizes well.