Subject 32

Loss Functions

A loss function measures how far a model's prediction is from the ground truth. The training process minimizes this value. Choosing the wrong loss function can make a model learn the wrong thing entirely, even if everything else is correct.
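To make "the training process minimizes this value" concrete, here is a minimal sketch (toy data and learning rate are made up for illustration): gradient descent on MSE for a one-parameter model y = w·x drives the loss toward zero and w toward the true slope.

```python
# Fit y = w * x by minimizing MSE with plain gradient descent.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # true relationship: y = 2x

def mse(w):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

w = 0.0
for _ in range(100):
    # d/dw of mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= 0.05 * grad  # learning rate 0.05, an arbitrary choice

# w converges toward 2.0 and mse(w) toward 0
```

The optimizer never sees the "right answer" directly; it only follows the slope of the loss, which is why the choice of loss determines what the model actually learns.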

Quick Comparison

| Loss Function | Typical uses | Task | Output type | Key property |
|---|---|---|---|---|
| MSE (Mean Squared Error) | RNNs for time series, CNNs for depth/bounding-box regression, any continuous output | Regression | Continuous value | Penalizes large errors heavily; sensitive to outliers |
| MAE (Mean Absolute Error) | Forecasting models (price, demand) where outliers exist and shouldn't dominate | Regression | Continuous value | Treats all errors equally; robust to outliers |
| Huber | Object detection (bounding-box regression in YOLO, Faster R-CNN) | Regression | Continuous value | MSE near zero, MAE for large errors; best of both |
| Binary Cross-Entropy | Text classification (spam/not spam), sentiment analysis, CNNs for binary image tasks | Binary classification | Sigmoid probability | Heavily penalizes confident wrong predictions |
| Categorical Cross-Entropy | LLMs (next-token prediction), CNNs for image classification, multi-class NLP tasks | Multi-class classification | Softmax probabilities | Only cares about the probability given to the correct class |
| Hinge | SVMs, classical text classification before the deep learning era | Binary classification (SVM) | Raw score | Zero once the margin is satisfied; creates a decision-boundary margin |
| KL Divergence | VAEs, knowledge distillation (teacher→student LLMs), RLHF in LLMs | Distribution matching | Probability distributions | Not symmetric; zero only when distributions are identical |
| Contrastive | Siamese networks, early sentence-embedding models, face verification | Metric learning (pairs) | Embedding vectors | Pulls similar pairs together, pushes dissimilar pairs beyond a margin |
| Triplet | FaceNet, sentence transformers, recommendation-system embeddings | Metric learning (triplets) | Embedding vectors | Enforces relative ordering: anchor closer to positive than to negative |
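The outlier-sensitivity contrast between the three regression losses above can be seen directly. A sketch with made-up error values (one deliberate outlier); the Huber formula below uses the standard delta threshold, quadratic inside it and linear beyond:

```python
# Compare MSE, MAE, and Huber on residuals containing one large outlier.
def mse(errs):
    return sum(e ** 2 for e in errs) / len(errs)

def mae(errs):
    return sum(abs(e) for e in errs) / len(errs)

def huber(errs, delta=1.0):
    # Quadratic for |e| <= delta (like MSE near zero),
    # linear beyond delta (like MAE for large errors).
    return sum(
        0.5 * e ** 2 if abs(e) <= delta else delta * (abs(e) - 0.5 * delta)
        for e in errs
    ) / len(errs)

errors = [0.1, -0.2, 0.1, 10.0]  # one large outlier
# mse(errors)   ≈ 25.0  — dominated by the outlier's squared term (10^2 = 100)
# mae(errors)   ≈ 2.6   — the outlier contributes only linearly
# huber(errors) ≈ 2.38  — quadratic on the small errors, linear on the outlier
```

This is why the table recommends MAE or Huber when outliers exist and shouldn't dominate: squaring makes a single bad example swamp the gradient.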

What loss function do LLMs use? — Categorical Cross-Entropy (next-token prediction). At each position the model outputs a probability distribution over the entire vocabulary via softmax, and cross-entropy measures how much probability was assigned to the correct next token. Minimizing this over billions of tokens is what drives pre-training.
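A sketch of that computation with a toy four-word vocabulary and made-up logits (real models do this over tens of thousands of tokens at once):

```python
import math

def softmax(logits):
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["the", "cat", "sat", "mat"]
logits = [2.0, 0.5, 1.0, -1.0]               # model's raw scores for the next token
target = vocab.index("cat")                  # suppose "cat" is the true next token

probs = softmax(logits)
loss = -math.log(probs[target])              # cross-entropy: -log p(correct token)
```

Note that the loss depends only on the probability assigned to the correct token, matching the "key property" in the table; probability spread among wrong tokens matters only insofar as it was taken from the correct one.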

Common Interview Questions