Loss Functions & Optimizers

Loss functions measure how wrong your model is. Optimizers adjust weights to minimize loss. Together, they drive the learning process in neural networks.

Training Loop: Forward pass → Calculate loss → Backward pass (gradients) → Update weights with optimizer → Repeat

Common Loss Functions

Mean Squared Error (MSE)

For regression tasks. Penalizes large errors heavily.

MSE = (1/n) Σ (y - ŷ)²

Use for: Predicting continuous values (prices, temperatures)
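The formula above can be sketched in a few lines of plain Python (a minimal illustration, not a library implementation):

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

# Errors of 1 and 3 -> (1² + 3²) / 2 = 5.0; the error of 3 dominates
print(mse([2.0, 5.0], [3.0, 8.0]))  # 5.0
```

Note how squaring makes the larger error contribute 9× as much as the smaller one, which is why MSE penalizes large errors heavily.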

Binary Cross-Entropy

For binary classification (2 classes).

BCE = -[y log(ŷ) + (1-y) log(1-ŷ)]

Use for: Spam/not spam, fraud detection, yes/no predictions
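A minimal sketch of the BCE formula, averaged over samples (the epsilon clamp is a standard numerical guard, added here to avoid log(0)):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average BCE over samples; eps keeps log() away from 0."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clamp predictions into (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident and correct -> low loss; confident and wrong -> high loss
print(binary_cross_entropy([1], [0.9]))  # ~0.105
print(binary_cross_entropy([1], [0.1]))  # ~2.303
```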

Categorical Cross-Entropy

For multi-class classification (3+ classes).

CCE = -Σ y_i log(ŷ_i)

Use for: Image classification, text categorization

Huber Loss

Combines MSE and MAE: quadratic for small errors, linear for large ones. Robust to outliers.

Less sensitive to outliers than MSE
Smoother than MAE

Use for: Regression with noisy data
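The two regimes can be written out directly (a sketch with the common threshold δ = 1.0; δ is a tunable hyperparameter):

```python
def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        e = abs(y - p)
        if e <= delta:
            total += 0.5 * e ** 2               # MSE-like region
        else:
            total += delta * (e - 0.5 * delta)  # MAE-like region
    return total / len(y_true)

# Small error: quadratic penalty. Outlier: linear penalty, far gentler than MSE.
print(huber([0.0], [0.5]))   # 0.125
print(huber([0.0], [10.0]))  # 9.5 (plain MSE would give 100)
```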


Optimization Algorithms

Stochastic Gradient Descent (SGD)

Update weights using gradient of loss.

θ = θ - α·∇L
α = learning rate

✓ Simple, works well | ✗ Can be slow, sensitive to learning rate
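The update rule θ = θ - α·∇L, applied to a toy one-parameter loss f(θ) = θ² (whose gradient is 2θ):

```python
def sgd_step(theta, grad, lr=0.1):
    """One SGD update: move against the gradient, scaled by the learning rate."""
    return theta - lr * grad

theta = 5.0
for _ in range(50):
    theta = sgd_step(theta, 2 * theta)  # gradient of θ² is 2θ
print(theta)  # approaches the minimum at θ = 0
```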

SGD with Momentum

Accelerate in consistent directions, dampen oscillations.

v = β·v + ∇L
θ = θ - α·v
β = momentum (typically 0.9)

✓ Faster convergence, escapes local minima better
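The two-line update above, again on the toy loss f(θ) = θ² (the velocity term v accumulates gradient history):

```python
def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    """SGD with momentum: accumulate velocity, then step along it."""
    v = beta * v + grad    # v = β·v + ∇L
    theta = theta - lr * v  # θ = θ - α·v
    return theta, v

theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, 2 * theta)  # gradient of θ² is 2θ
print(theta)  # spirals in toward the minimum at θ = 0
```

On this one-dimensional bowl the velocity causes some overshoot before settling, which is exactly the dampened-oscillation behavior momentum is designed to control in ravine-shaped loss surfaces.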

RMSprop

Adaptive learning rate per parameter.

Divides the learning rate by the root of a running average of squared gradients
Good for non-stationary objectives

✓ Works well for RNNs
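A single-parameter sketch of the RMSprop update (decay rate β = 0.9 and lr = 0.01 are typical but hypothetical choices here):

```python
import math

def rmsprop_step(theta, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    """RMSprop: scale each step by the root of a running average of grad²."""
    s = beta * s + (1 - beta) * grad ** 2       # running average of squared grads
    theta = theta - lr * grad / (math.sqrt(s) + eps)
    return theta, s

theta, s = 5.0, 0.0
for _ in range(1000):
    theta, s = rmsprop_step(theta, 2 * theta, s)  # minimize θ²
print(theta)  # settles near the minimum at θ = 0
```

Because the step is divided by the gradient's own running magnitude, parameters with consistently large gradients take smaller effective steps, and vice versa.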

Adam (Adaptive Moment Estimation)

Combines momentum and RMSprop. Most popular optimizer.

Maintains running averages of both gradients and squared gradients
Adaptive learning rates + momentum

✓ Works well out-of-the-box, default choice for most tasks
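The standard Adam formulas for one parameter, as a sketch. Note the self-normalizing behavior: the very first step has magnitude ≈ lr regardless of how large the raw gradient is.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (t is the 1-based step count)."""
    m = b1 * m + (1 - b1) * grad          # 1st moment: running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2     # 2nd moment: running mean of grad²
    m_hat = m / (1 - b1 ** t)             # bias correction for zero initialization
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# A huge gradient of 1000 still produces a first step of only ~lr = 0.001
new_theta, m, v = adam_step(5.0, 1000.0, 0.0, 0.0, t=1)
print(5.0 - new_theta)  # ~0.001
```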

AdamW

Adam with decoupled weight decay (better regularization).

Fixes weight decay implementation in Adam
Better generalization

✓ State-of-the-art for transformers and large models
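The "decoupled" part can be shown in isolation. In Adam with L2 regularization, wd·θ is added to the gradient and then distorted by the adaptive scaling; in AdamW the decay is applied directly to the weight after the Adam step (a sketch with hypothetical lr and wd values):

```python
def adamw_weight_update(theta, adam_direction, lr=0.001, wd=0.01):
    """AdamW: take the Adam step, then shrink the weight by lr * wd * theta."""
    return theta - lr * adam_direction - lr * wd * theta

# With a zero gradient direction, only the decay acts: the weight shrinks
# by lr * wd * theta each step, independent of the adaptive scaling.
print(adamw_weight_update(10.0, adam_direction=0.0))  # just under 10.0
```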

Learning Rate

The most important hyperparameter. Controls step size during optimization.

Too High

Overshoots minimum, loss diverges.

Training unstable
Loss explodes

Just Right

Converges smoothly to minimum.

Steady progress
Good convergence

Too Low

Slow progress, may get stuck.

Very slow training
May not converge

Learning Rate Schedules

Step Decay: Reduce LR by factor every N epochs
Exponential Decay: Gradually decrease over time
Cosine Annealing: Oscillate LR in cosine pattern
Warm-up: Start low, increase, then decrease
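Three of the schedules above as simple functions of the epoch number (the drop factor, decay constant, and cycle length are hypothetical example values):

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Halve the learning rate every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

def exponential_decay(lr0, epoch, k=0.05):
    """Smooth exponential decrease over time."""
    return lr0 * math.exp(-k * epoch)

def cosine_annealing(lr0, epoch, total=100):
    """Follow a half-cosine from lr0 down to 0 over `total` epochs."""
    return 0.5 * lr0 * (1 + math.cos(math.pi * epoch / total))

print(step_decay(0.1, 25))        # 0.025 (two halvings by epoch 25)
print(cosine_annealing(0.1, 50))  # ~0.05 (halfway through the cycle)
```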

Choosing Loss & Optimizer

Task Type → Recommended Loss
Regression → MSE, MAE, Huber
Binary Classification → Binary Cross-Entropy
Multi-class Classification → Categorical Cross-Entropy
Image Segmentation → Dice Loss, IoU Loss

Optimizer Recommendations:
Default choice: Adam (lr=0.001) or AdamW
For transformers: AdamW with warm-up
For CNNs: SGD with momentum (lr=0.01-0.1)
For RNNs: Adam or RMSprop
Always use learning rate scheduling

Key Takeaway: Loss functions define what to optimize. Optimizers determine how to optimize. Adam is a safe default, but experiment with learning rates and schedules for best results.