## Quick Comparison
| Loss Function | Typical uses | Task | Output type | Key property |
|---|---|---|---|---|
| MSE (Mean Squared Error) | RNNs for time series, CNNs for depth/bounding box regression, any continuous output | Regression | Continuous value | Penalizes large errors heavily; sensitive to outliers |
| MAE (Mean Absolute Error) | Forecasting models (price, demand) where outliers exist and shouldn't dominate | Regression | Continuous value | Treats all errors equally; robust to outliers |
| Huber | Object detection (bounding box regression in YOLO, Faster R-CNN) | Regression | Continuous value | MSE near zero, MAE for large errors; best of both |
| Binary Cross-Entropy | Text classification (spam/not spam), sentiment analysis, CNNs for binary image tasks | Binary classification | Sigmoid probability | Heavily penalizes confident wrong predictions |
| Categorical Cross-Entropy | LLMs (next token prediction), CNNs for image classification, multi-class NLP tasks | Multi-class classification | Softmax probabilities | Only cares about probability given to the correct class |
| Hinge | SVMs, classical text classification before the deep learning era | Binary classification (SVM) | Raw score | Zero once the margin is satisfied; creates a margin around the decision boundary |
| KL Divergence | VAEs, knowledge distillation (teacher→student LLMs), RLHF in LLMs | Distribution matching | Probability distributions | Not symmetric; zero only when distributions are identical |
| Contrastive | Siamese networks, early sentence embedding models, face verification | Metric learning (pairs) | Embedding vectors | Pulls similar pairs together, pushes dissimilar beyond margin |
| Triplet | FaceNet, sentence transformers, recommendation system embeddings | Metric learning (triplets) | Embedding vectors | Enforces relative ordering: anchor closer to positive than negative |
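The three regression rows can be made concrete with a small NumPy sketch (the residual values are made up for illustration): the same predictions scored by MSE, MAE, and Huber show how differently each treats a single outlier.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    r = y_true - y_pred
    quad = 0.5 * r ** 2                      # MSE-like near zero
    lin = delta * (np.abs(r) - 0.5 * delta)  # MAE-like in the tails
    return np.mean(np.where(np.abs(r) <= delta, quad, lin))

y_true = np.array([1.0, 2.0, 3.0, 100.0])    # last point is an outlier
y_pred = np.array([1.1, 1.9, 3.2, 4.0])

print(mse(y_true, y_pred))    # ~2304: dominated by the squared outlier
print(mae(y_true, y_pred))    # ~24.1: outlier contributes only linearly
print(huber(y_true, y_pred))  # ~23.9: quadratic near zero, linear on the outlier
```

The printed values make the "sensitive to outliers" column tangible: one bad point inflates MSE by two orders of magnitude while MAE and Huber barely differ.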
What loss function do LLMs use? — Categorical Cross-Entropy (next-token prediction). At each position the model outputs a probability distribution over the entire vocabulary via softmax, and cross-entropy measures how much probability was assigned to the correct next token. Minimizing this over billions of tokens is what drives pre-training.
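A minimal sketch of that computation at a single position, assuming a toy 5-token vocabulary and made-up logits (real models do this over tens of thousands of tokens):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits over a tiny 5-token vocabulary at one position
logits = np.array([2.0, 0.5, 0.1, -1.0, 0.0])
target = 0                      # index of the true next token

probs = softmax(logits)
loss = -np.log(probs[target])   # categorical cross-entropy for this position
print(loss)                     # ~0.44: moderate confidence in the right token
```

Note the loss depends only on the probability assigned to the correct token, exactly the "key property" listed in the table.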
## Common Interview Questions
- Why use cross-entropy instead of MSE for classification? — MSE assumes a Gaussian error distribution. Classification outputs are probabilities; cross-entropy is derived from maximum likelihood estimation under a Bernoulli/categorical distribution. It also produces stronger gradients when the model is confidently wrong.
- What happens if you use MSE for binary classification? — The loss surface becomes non-convex, gradients near 0 and 1 vanish (sigmoid saturation), and training becomes slow or unstable.
- When would you choose MAE over MSE? — When the dataset has outliers that should not dominate training (e.g., sensor readings with occasional spikes).
- What is the difference between contrastive and triplet loss? — Contrastive loss works on pairs with binary labels (same/different). Triplet loss works on triplets and enforces a relative margin, which gives stronger signal about the embedding space geometry.
- Where does KL divergence appear in deep learning? — In VAEs (the regularization term keeps the latent distribution close to a standard normal), knowledge distillation (matching teacher and student output distributions), and RLHF (a KL penalty keeps the fine-tuned policy close to the reference model).
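The gradient claim in the first two answers can be checked numerically. Under the standard derivations, for a sigmoid output p = σ(z) and label y, BCE gives dL/dz = p − y, while MSE gives dL/dz = 2(p − y)·p·(1 − p), which collapses when the sigmoid saturates. A sketch of that comparison:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grads(z, y):
    p = sigmoid(z)
    bce_grad = p - y                       # d(BCE)/dz: never saturates
    mse_grad = 2 * (p - y) * p * (1 - p)   # d(MSE)/dz: scaled by sigmoid slope
    return bce_grad, mse_grad

# Confidently wrong: true label is 1, but the logit is strongly negative
bce_g, mse_g = grads(-8.0, y=1.0)
print(bce_g)   # close to -1: strong corrective gradient
print(mse_g)   # nearly 0: sigmoid saturation kills the learning signal
```

This is exactly the failure mode the second answer describes: with MSE the model receives almost no gradient precisely when it is most wrong.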
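The triplet-loss answer can be sketched in a few lines; the 2-D embeddings and the margin value are made up for illustration:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Squared Euclidean distances in embedding space
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    # Zero once the negative is at least `margin` farther than the positive
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])    # same identity: close to the anchor
n = np.array([-1.0, 0.5])   # different identity: far from the anchor
print(triplet_loss(a, p, n))  # 0.0: relative ordering already satisfied
```

Note the loss only cares about the *relative* distances, not their absolute values, which is the "relative margin" point in the answer above.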
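The KL-divergence answer, and the asymmetry noted in the table, can also be verified directly; the two distributions below are made up, standing in for a teacher's and a student's softmax outputs:

```python
import numpy as np

def kl(p, q):
    # KL(p || q) = sum p * log(p / q); assumes strictly positive entries
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])   # e.g. a teacher's output distribution
q = np.array([0.5, 0.3, 0.2])   # e.g. a student's output distribution

print(kl(p, q))   # > 0: student diverges from teacher
print(kl(q, p))   # a different value: KL is not symmetric
print(kl(p, p))   # 0.0: identical distributions
```

In distillation the direction matters: KL(teacher ‖ student) and KL(student ‖ teacher) penalize different kinds of mismatch, which is why the table flags the asymmetry.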