Loss Functions & Optimizers

Loss functions measure how wrong your model is. Optimizers adjust weights to minimize loss. Together, they drive the learning process in neural networks.

Training Loop: Forward pass → Calculate loss → Backward pass (gradients) → Update weights with optimizer → Repeat

Common Loss Functions

Mean Squared Error (MSE)

For regression tasks. Penalizes large errors heavily.

MSE = (1/n) Σ (y - ŷ)²

Use for: Predicting continuous values (prices, temperatures)
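The formula above can be sketched in a few lines of plain Python (a minimal illustration, not a library implementation):

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

# Errors of 1 and 3 -> (1² + 3²) / 2 = 5.0; the error of 3 dominates
print(mse([2.0, 5.0], [3.0, 8.0]))  # 5.0
```

Note how squaring makes the larger error contribute 9× as much as the smaller one, which is why MSE penalizes large errors heavily.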

Binary Cross-Entropy

For binary classification (2 classes).

BCE = -[y log(ŷ) + (1-y) log(1-ŷ)]

Use for: Spam/not spam, fraud detection, yes/no predictions
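A minimal sketch of the BCE formula, averaged over samples (the epsilon clamp is a standard numerical guard, added here to avoid log(0)):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average BCE over samples; eps keeps log() away from 0."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clamp predictions into (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident and correct -> low loss; confident and wrong -> high loss
print(binary_cross_entropy([1], [0.9]))  # ~0.105
print(binary_cross_entropy([1], [0.1]))  # ~2.303
```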

Categorical Cross-Entropy

For multi-class classification (3+ classes).

CCE = -Σ y_i log(ŷ_i)

Use for: Image classification, text categorization

Huber Loss

Combines MSE and MAE: quadratic for small errors, linear for large ones. Robust to outliers.

Less sensitive to outliers than MSE
Smoother than MAE

Use for: Regression with noisy data
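The two regimes can be written out directly (a sketch with the common threshold δ = 1.0; δ is a tunable hyperparameter):

```python
def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        e = abs(y - p)
        if e <= delta:
            total += 0.5 * e ** 2               # MSE-like region
        else:
            total += delta * (e - 0.5 * delta)  # MAE-like region
    return total / len(y_true)

# Small error: quadratic penalty. Outlier: linear penalty, far gentler than MSE.
print(huber([0.0], [0.5]))   # 0.125
print(huber([0.0], [10.0]))  # 9.5 (plain MSE would give 100)
```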


Optimization Algorithms

Stochastic Gradient Descent (SGD)

Update weights using gradient of loss.

θ = θ - α·∇L
α = learning rate

✓ Simple, works well | ✗ Can be slow, sensitive to learning rate
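The update rule θ = θ - α·∇L, applied to a toy one-parameter loss f(θ) = θ² (whose gradient is 2θ):

```python
def sgd_step(theta, grad, lr=0.1):
    """One SGD update: move against the gradient, scaled by the learning rate."""
    return theta - lr * grad

theta = 5.0
for _ in range(50):
    theta = sgd_step(theta, 2 * theta)  # gradient of θ² is 2θ
print(theta)  # approaches the minimum at θ = 0
```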

SGD with Momentum

Accelerate in consistent directions, dampen oscillations.

v = β·v + ∇L
θ = θ - α·v
β = momentum (typically 0.9)

✓ Faster convergence, escapes local minima better
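The two-line update above, again on the toy loss f(θ) = θ² (the velocity term v accumulates gradient history):

```python
def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    """SGD with momentum: accumulate velocity, then step along it."""
    v = beta * v + grad    # v = β·v + ∇L
    theta = theta - lr * v  # θ = θ - α·v
    return theta, v

theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, 2 * theta)  # gradient of θ² is 2θ
print(theta)  # spirals in toward the minimum at θ = 0
```

On this one-dimensional bowl the velocity causes some overshoot before settling, which is exactly the dampened-oscillation behavior momentum is designed to control in ravine-shaped loss surfaces.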

RMSprop

Adaptive learning rate per parameter.

Divides the learning rate by the root of a running average of squared gradients
Good for non-stationary objectives

✓ Works well for RNNs
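A single-parameter sketch of the RMSprop update (decay rate β = 0.9 and lr = 0.01 are typical but hypothetical choices here):

```python
import math

def rmsprop_step(theta, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    """RMSprop: scale each step by the root of a running average of grad²."""
    s = beta * s + (1 - beta) * grad ** 2       # running average of squared grads
    theta = theta - lr * grad / (math.sqrt(s) + eps)
    return theta, s

theta, s = 5.0, 0.0
for _ in range(1000):
    theta, s = rmsprop_step(theta, 2 * theta, s)  # minimize θ²
print(theta)  # settles near the minimum at θ = 0
```

Because the step is divided by the gradient's own running magnitude, parameters with consistently large gradients take smaller effective steps, and vice versa.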

Adam (Adaptive Moment Estimation)

Combines momentum and RMSprop. Most popular optimizer.

Maintains running averages of both gradients and squared gradients
Adaptive learning rates + momentum

✓ Works well out-of-the-box, default choice for most tasks
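The standard Adam formulas for one parameter, as a sketch. Note the self-normalizing behavior: the very first step has magnitude ≈ lr regardless of how large the raw gradient is.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (t is the 1-based step count)."""
    m = b1 * m + (1 - b1) * grad          # 1st moment: running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2     # 2nd moment: running mean of grad²
    m_hat = m / (1 - b1 ** t)             # bias correction for zero initialization
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# A huge gradient of 1000 still produces a first step of only ~lr = 0.001
new_theta, m, v = adam_step(5.0, 1000.0, 0.0, 0.0, t=1)
print(5.0 - new_theta)  # ~0.001
```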

AdamW

Adam with decoupled weight decay (better regularization).

Fixes weight decay implementation in Adam
Better generalization

✓ State-of-the-art for transformers and large models
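The "decoupled" part can be shown in isolation. In Adam with L2 regularization, wd·θ is added to the gradient and then distorted by the adaptive scaling; in AdamW the decay is applied directly to the weight after the Adam step (a sketch with hypothetical lr and wd values):

```python
def adamw_weight_update(theta, adam_direction, lr=0.001, wd=0.01):
    """AdamW: take the Adam step, then shrink the weight by lr * wd * theta."""
    return theta - lr * adam_direction - lr * wd * theta

# With a zero gradient direction, only the decay acts: the weight shrinks
# by lr * wd * theta each step, independent of the adaptive scaling.
print(adamw_weight_update(10.0, adam_direction=0.0))  # just under 10.0
```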

Learning Rate

The most important hyperparameter. Controls step size during optimization.

Too High

Overshoots minimum, loss diverges.

Training unstable
Loss explodes

Just Right

Converges smoothly to minimum.

Steady progress
Good convergence

Too Low

Slow progress, may get stuck.

Very slow training
May not converge

Learning Rate Schedules

Step Decay: Reduce LR by factor every N epochs
Exponential Decay: Gradually decrease over time
Cosine Annealing: Oscillate LR in cosine pattern
Warm-up: Start low, increase, then decrease
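Three of the schedules above as simple functions of the epoch number (the drop factor, decay constant, and cycle length are hypothetical example values):

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Halve the learning rate every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

def exponential_decay(lr0, epoch, k=0.05):
    """Smooth exponential decrease over time."""
    return lr0 * math.exp(-k * epoch)

def cosine_annealing(lr0, epoch, total=100):
    """Follow a half-cosine from lr0 down to 0 over `total` epochs."""
    return 0.5 * lr0 * (1 + math.cos(math.pi * epoch / total))

print(step_decay(0.1, 25))        # 0.025 (two halvings by epoch 25)
print(cosine_annealing(0.1, 50))  # ~0.05 (halfway through the cycle)
```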

Choosing Loss & Optimizer

Task Type → Recommended Loss
Regression → MSE, MAE, Huber
Binary Classification → Binary Cross-Entropy
Multi-class Classification → Categorical Cross-Entropy
Image Segmentation → Dice Loss, IoU Loss

Optimizer Recommendations:
Default choice: Adam (lr=0.001) or AdamW
For transformers: AdamW with warm-up
For CNNs: SGD with momentum (lr=0.01-0.1)
For RNNs: Adam or RMSprop
Always use learning rate scheduling

Key Takeaway: Loss functions define what to optimize. Optimizers determine how to optimize. Adam is a safe default, but experiment with learning rates and schedules for best results.