Training Techniques

Training deep neural networks requires more than choosing a loss function and an optimizer. The techniques below help models train faster, generalize better, and avoid common pitfalls.

Goal: Train models that generalize well to unseen data, not just memorize training data.

Regularization Techniques

Dropout

Randomly drop neurons during training to prevent co-adaptation.

During training: randomly set each neuron's output to 0 with probability p (typically 0.5)
During inference: use all neurons and scale activations by (1-p); inverted dropout instead scales by 1/(1-p) during training, so inference needs no change

✓ Simple, effective, widely used | Typical: p=0.5 for hidden layers, p=0.2 for input
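
A minimal sketch of dropout in a small classifier, assuming PyTorch; the layer sizes and p value are illustrative:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes hidden activations during training
    nn.Linear(256, 10),
)

model.train()   # dropout active
model.eval()    # dropout disabled; PyTorch uses inverted dropout, so no extra scaling at inference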

L1/L2 Regularization (Weight Decay)

Add penalty for large weights to loss function.

L1: Loss + λ Σ|w| (sparse weights)
L2: Loss + λ Σw² (small weights, most common)

Prevents overfitting by keeping weights small
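
A minimal sketch of both penalties, assuming PyTorch; model, criterion, outputs, and targets are placeholders and the lambda values are illustrative. L2 is usually applied through the optimizer's weight_decay, while L1 is added to the loss by hand:

import torch

# L2 (weight decay): built into most optimizers
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# L1: add lambda * sum(|w|) to the loss manually
l1_lambda = 1e-5
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = criterion(outputs, targets) + l1_lambda * l1_penalty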

Early Stopping

Stop training when validation loss stops improving.

Monitor validation loss each epoch
Stop if no improvement for N epochs (patience)

Simple, effective, prevents overfitting
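
A minimal early-stopping loop, assuming PyTorch; train_one_epoch and evaluate are hypothetical helpers and the patience value is illustrative:

import torch

best_val_loss = float("inf")
patience, epochs_without_improvement = 5, 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)        # hypothetical helper
    val_loss = evaluate(model, val_loader)      # hypothetical helper

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "best.pt")   # keep the best weights so far
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"No improvement for {patience} epochs, stopping at epoch {epoch}")
            break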

Data Augmentation

Artificially expand training data with transformations.

Images: rotation, flip, crop, color jitter
Text: synonym replacement, back-translation
Audio: pitch shift, time stretch, noise

Increases effective dataset size, improves generalization
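
A minimal image-augmentation sketch, assuming torchvision; the specific transforms and parameters are illustrative:

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
])
# Applied on the fly, so every epoch sees a slightly different version of each image.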

Normalization Techniques

Batch Normalization

Normalize activations within each mini-batch.

For each feature (channel): normalize across the mini-batch to mean=0, std=1
Then scale and shift with learnable parameters (γ, β)

✓ Faster training, higher learning rates, acts as regularization
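
A minimal batch-norm sketch, assuming PyTorch; the layer sizes are illustrative:

import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),   # normalizes each channel over the mini-batch, then applies learnable scale/shift
    nn.ReLU(),
)
# In eval() mode, BatchNorm switches to the running mean/variance accumulated during training.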

Layer Normalization

Normalize across features (not batch).

Works with any batch size
Better for RNNs and transformers

Used in: Transformers, language models
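
A minimal layer-norm sketch, assuming PyTorch; the tensor shape is illustrative:

import torch
import torch.nn as nn

x = torch.randn(4, 10, 512)      # (batch, sequence, features), as in a transformer
layer_norm = nn.LayerNorm(512)   # normalizes over the feature dimension for each token independently
y = layer_norm(x)                # identical behavior for batch size 1 or 4096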

Training Strategies

Mini-batch Training
Update weights after each small batch (typically 32-256 samples)
Learning Rate Warm-up
Gradually increase LR at start of training
Gradient Clipping
Prevent exploding gradients by capping magnitude
Mixed Precision Training
Use FP16 for speed, FP32 for stability (warm-up, clipping, and mixed precision are sketched after this list)
Transfer Learning
Start with pre-trained weights, fine-tune
Curriculum Learning
Train on easy examples first, then harder ones
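
A minimal training-loop sketch combining learning-rate warm-up, gradient clipping, and mixed precision, assuming PyTorch on a CUDA device; model, criterion, and train_loader are placeholders, and warmup_steps, base_lr, and the clipping threshold are illustrative:

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()      # handles FP16 loss scaling
warmup_steps, base_lr, step = 1000, 3e-4, 0

for inputs, targets in train_loader:
    step += 1
    # Warm-up: ramp the learning rate linearly from 0 to base_lr
    lr = base_lr * min(1.0, step / warmup_steps)
    for group in optimizer.param_groups:
        group["lr"] = lr

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # FP16 forward pass for speed
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()

    # Clipping: unscale first so the threshold applies to the true gradients
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    scaler.step(optimizer)                # FP32 master weights updated safely
    scaler.update()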

Debugging Training

⚠️ Loss not decreasing
Check learning rate, verify data loading, inspect gradients (see the gradient-check sketch after this list)
⚠️ Loss exploding (NaN)
Lower learning rate, use gradient clipping, check for bugs
⚠️ Overfitting
Add dropout, use data augmentation, get more data
⚠️ Underfitting
Increase model capacity, train longer, reduce regularization
⚠️ Slow convergence
Increase learning rate, use batch normalization, check initialization
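
A small diagnostic sketch for the gradient-related problems above, assuming PyTorch; run it after loss.backward() but before optimizer.step():

import torch

total_norm = 0.0
for name, param in model.named_parameters():
    if param.grad is None:
        print(f"{name}: no gradient (parameter may be detached from the graph)")
        continue
    if torch.isnan(param.grad).any():
        print(f"{name}: NaN gradient; lower the learning rate or clip gradients")
    total_norm += param.grad.norm().item() ** 2
print(f"global gradient norm: {total_norm ** 0.5:.4f}")
# A norm near 0 suggests vanishing gradients; a huge norm suggests exploding gradients.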

Best Practices

Always shuffle training data
Use batch normalization after conv/dense layers
Apply dropout before final layers
Monitor both training and validation metrics
Save checkpoints regularly (see the checkpoint sketch after this list)
Use early stopping to prevent overfitting
Visualize training curves (loss, accuracy)
Start with small learning rate, increase if stable
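
A minimal checkpointing sketch, assuming PyTorch; model, optimizer, epoch, and val_loss are placeholders from an existing training loop:

import torch

checkpoint = {
    "epoch": epoch,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "val_loss": val_loss,
}
torch.save(checkpoint, f"checkpoint_epoch_{epoch}.pt")

# Resuming later:
ckpt = torch.load(f"checkpoint_epoch_{epoch}.pt")
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])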

Key Takeaway: Successful training combines regularization (dropout, weight decay), normalization (batch norm), and smart strategies (early stopping, data augmentation). Monitor metrics and iterate!