## Quick Comparison
| Loss Function | Typical uses | Task | Output type | Key property |
|---|---|---|---|---|
| MSE (Mean Squared Error) | RNNs for time series, CNNs for depth/bounding box regression, any continuous output | Regression | Continuous value | Penalizes large errors heavily; sensitive to outliers |
| MAE (Mean Absolute Error) | Forecasting models (price, demand) where outliers exist and shouldn't dominate | Regression | Continuous value | Treats all errors equally; robust to outliers |
| Huber | Object detection (bounding box regression in YOLO, Faster R-CNN) | Regression | Continuous value | MSE near zero, MAE for large errors; best of both |
| Binary Cross-Entropy | Text classification (spam/not spam), sentiment analysis, CNNs for binary image tasks | Binary classification | Sigmoid probability | Heavily penalizes confident wrong predictions |
| Categorical Cross-Entropy | LLMs (next token prediction), CNNs for image classification, multi-class NLP tasks | Multi-class classification | Softmax probabilities | Only cares about probability given to the correct class |
| Hinge | SVMs, classical text classification before the deep learning era | Binary classification (SVM) | Raw score | Zero once the margin is satisfied; creates a margin around the decision boundary |
| KL Divergence | VAEs, knowledge distillation (teacher→student LLMs), RLHF in LLMs | Distribution matching | Probability distributions | Not symmetric; zero only when distributions are identical |
| Contrastive | Siamese networks, early sentence embedding models, face verification | Metric learning (pairs) | Embedding vectors | Pulls similar pairs together, pushes dissimilar beyond margin |
| Triplet | FaceNet, sentence transformers, recommendation system embeddings | Metric learning (triplets) | Embedding vectors | Enforces relative ordering: anchor closer to positive than negative |
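The three regression rows can be made concrete with a small NumPy sketch (the residual values are made up for illustration): the same predictions scored by MSE, MAE, and Huber show how differently each treats a single outlier.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    r = y_true - y_pred
    quad = 0.5 * r ** 2                      # MSE-like near zero
    lin = delta * (np.abs(r) - 0.5 * delta)  # MAE-like in the tails
    return np.mean(np.where(np.abs(r) <= delta, quad, lin))

y_true = np.array([1.0, 2.0, 3.0, 100.0])    # last point is an outlier
y_pred = np.array([1.1, 1.9, 3.2, 4.0])

print(mse(y_true, y_pred))    # ~2304: dominated by the squared outlier
print(mae(y_true, y_pred))    # ~24.1: outlier contributes only linearly
print(huber(y_true, y_pred))  # ~23.9: quadratic near zero, linear on the outlier
```

The printed values make the "sensitive to outliers" column tangible: one bad point inflates MSE by two orders of magnitude while MAE and Huber barely differ.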
What loss function do LLMs use? — Categorical Cross-Entropy (next-token prediction). At each position the model outputs a probability distribution over the entire vocabulary via softmax, and cross-entropy measures how much probability was assigned to the correct next token. Minimizing this over billions of tokens is what drives pre-training.
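A minimal sketch of that computation at a single position, assuming a toy 5-token vocabulary and made-up logits (real models do this over tens of thousands of tokens):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits over a tiny 5-token vocabulary at one position
logits = np.array([2.0, 0.5, 0.1, -1.0, 0.0])
target = 0                      # index of the true next token

probs = softmax(logits)
loss = -np.log(probs[target])   # categorical cross-entropy for this position
print(loss)                     # ~0.44: moderate confidence in the right token
```

Note the loss depends only on the probability assigned to the correct token, exactly the "key property" listed in the table.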
## Common Interview Questions
- Why use cross-entropy instead of MSE for classification? — MSE assumes a Gaussian error distribution. Classification outputs are probabilities; cross-entropy is derived from maximum likelihood estimation under a Bernoulli/categorical distribution. It also produces stronger gradients when the model is confidently wrong.
- What happens if you use MSE for binary classification? — The loss surface becomes non-convex, gradients near 0 and 1 vanish (sigmoid saturation), and training becomes slow or unstable.
- When would you choose MAE over MSE? — When the dataset has outliers that should not dominate training (e.g., sensor readings with occasional spikes).
- What is the difference between contrastive and triplet loss? — Contrastive loss works on pairs with binary labels (same/different). Triplet loss works on triplets and enforces a relative margin, which gives stronger signal about the embedding space geometry.
- Where does KL divergence appear in deep learning? — In VAEs (the regularization term keeps the latent distribution close to a standard normal), knowledge distillation (matching teacher and student output distributions), and RLHF (a KL penalty keeps the fine-tuned policy close to the reference model).
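The gradient claim in the first two answers can be checked numerically. Under the standard derivations, for a sigmoid output p = σ(z) and label y, BCE gives dL/dz = p − y, while MSE gives dL/dz = 2(p − y)·p·(1 − p), which collapses when the sigmoid saturates. A sketch of that comparison:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grads(z, y):
    p = sigmoid(z)
    bce_grad = p - y                       # d(BCE)/dz: never saturates
    mse_grad = 2 * (p - y) * p * (1 - p)   # d(MSE)/dz: scaled by sigmoid slope
    return bce_grad, mse_grad

# Confidently wrong: true label is 1, but the logit is strongly negative
bce_g, mse_g = grads(-8.0, y=1.0)
print(bce_g)   # close to -1: strong corrective gradient
print(mse_g)   # nearly 0: sigmoid saturation kills the learning signal
```

This is exactly the failure mode the second answer describes: with MSE the model receives almost no gradient precisely when it is most wrong.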
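The triplet-loss answer can be sketched in a few lines; the 2-D embeddings and the margin value are made up for illustration:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Squared Euclidean distances in embedding space
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    # Zero once the negative is at least `margin` farther than the positive
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])    # same identity: close to the anchor
n = np.array([-1.0, 0.5])   # different identity: far from the anchor
print(triplet_loss(a, p, n))  # 0.0: relative ordering already satisfied
```

Note the loss only cares about the *relative* distances, not their absolute values, which is the "relative margin" point in the answer above.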
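The KL-divergence answer, and the asymmetry noted in the table, can also be verified directly; the two distributions below are made up, standing in for a teacher's and a student's softmax outputs:

```python
import numpy as np

def kl(p, q):
    # KL(p || q) = sum p * log(p / q); assumes strictly positive entries
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])   # e.g. a teacher's output distribution
q = np.array([0.5, 0.3, 0.2])   # e.g. a student's output distribution

print(kl(p, q))   # > 0: student diverges from teacher
print(kl(q, p))   # a different value: KL is not symmetric
print(kl(p, p))   # 0.0: identical distributions
```

In distillation the direction matters: KL(teacher ‖ student) and KL(student ‖ teacher) penalize different kinds of mismatch, which is why the table flags the asymmetry.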