Model Evaluation

How do you know if your model is good? Model evaluation measures performance, detects overfitting, and helps you choose the best model. Never deploy a model without proper evaluation!

Golden Rule: Always evaluate on data the model hasn't seen during training (test set).

Train/Test Split

• Training: 70-80% of the data. The model learns patterns here.
• Validation: 10-15% of the data. Used to tune hyperparameters.
• Test: 10-20% of the data. Used for the final performance check.
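As a concrete illustration, a three-way split can be built with two calls to scikit-learn's train_test_split. This is a minimal sketch; the Iris dataset and the exact 70/15/15 ratios are assumptions made for the example:

```python
# Minimal sketch: a 70/15/15 train/validation/test split with scikit-learn.
# The Iris dataset and the exact ratios are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off 30% of the data to become validation + test sets.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
# Then split that 30% in half: 15% validation, 15% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))  # sizes of the three splits
```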

Classification Metrics

Accuracy

Percentage of correct predictions.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

⚠️ Misleading with imbalanced classes

Precision

Of predicted positives, how many are actually positive?

Precision = TP / (TP + FP)

Use when: False positives are costly (spam filter)

Recall (Sensitivity)

Of actual positives, how many did we find?

Recall = TP / (TP + FN)

Use when: False negatives are costly (disease detection)

F1-Score

Harmonic mean of precision and recall.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Balances precision and recall

Confusion Matrix

                    Predicted Positive      Predicted Negative
Actual Positive     True Positive (TP)      False Negative (FN)
Actual Negative     False Positive (FP)     True Negative (TN)

Regression Metrics

MAE (Mean Absolute Error)

Average absolute difference.

MAE = (1/n) Σ |y - ŷ|

Easy to interpret, same units as target

MSE (Mean Squared Error)

Average squared difference.

MSE = (1/n) Σ (y - ŷ)²

Penalizes large errors more

RMSE (Root MSE)

Square root of MSE.

RMSE = √MSE

Same units as target, interpretable

R² (R-squared)

Proportion of variance explained.

R² = 1 - (SS_res / SS_tot)

Typically 0 to 1 (can be negative for a model worse than predicting the mean); higher is better
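As a quick sketch, the regression metrics above can be computed with scikit-learn and NumPy; the y_true and y_pred values below are made-up example numbers:

```python
# Minimal sketch: regression metrics with scikit-learn and NumPy.
# y_true and y_pred are made-up example values.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

print(f"MAE:  {mae:.3f}")   # 0.500
print(f"MSE:  {mse:.3f}")   # 0.375
print(f"RMSE: {rmse:.3f}")  # 0.612
print(f"R²:   {r2:.3f}")    # 0.949
```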

Cross-Validation

More reliable than a single train/test split: across the folds, every data point is used for both training and testing.

K-Fold Cross-Validation

1. Split data into K equal folds
2. For each fold:
• Use it as test set
• Use remaining K-1 folds as training
• Train model and evaluate
3. Average performance across all K folds

Typical: K=5 or K=10


Overfitting vs Underfitting

Underfitting

Model too simple.

• High training error
• High test error
• Model hasn't learned

Fix: More complex model, more features

Good Fit

Just right!

• Low training error
• Low test error
• Generalizes well

Goal: Achieve this balance

Overfitting

Model too complex.

• Low training error
• High test error
• Memorized training data

Fix: Regularization, more data, simpler model
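
To see the difference in practice, a common diagnostic is to compare training and test accuracy. The sketch below contrasts a very shallow decision tree with an unrestricted one that is prone to memorizing; the synthetic dataset and tree depths are illustrative assumptions:

```python
# Minimal sketch: diagnosing over/underfitting by comparing train vs. test accuracy.
# The synthetic dataset and tree depths are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, None):  # very shallow tree vs. unrestricted depth
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train accuracy={tree.score(X_train, y_train):.2f}, "
          f"test accuracy={tree.score(X_test, y_test):.2f}")

# A large gap between train and test accuracy signals overfitting;
# low accuracy on both signals underfitting.
```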

Best Practices

• Always use a separate test set for the final evaluation
• Use cross-validation for model selection
• Choose metrics appropriate for your problem
• Consider class imbalance when evaluating
• Plot learning curves to diagnose overfitting
• Never tune hyperparameters on the test set
• Report multiple metrics, not just one

Key Takeaway: Proper evaluation is critical. Use appropriate metrics, cross-validation, and always test on unseen data to ensure your model generalizes well.