1. What is Deep Learning?
Deep Learning is a subset of ML using neural networks with multiple layers. "Deep" refers to the number of layers. Each layer learns increasingly abstract representations of the data.
Input -> [Low-level features] -> [Mid-level features] -> [High-level features] -> Output
Image -> [Edges, corners]     -> [Textures, shapes]   -> [Objects, faces]      -> "Cat"
2. Neural Network Building Blocks
2.1 The Neuron (Perceptron)
inputs (x1, x2, ..., xn)
|
v
weighted sum: z = w1*x1 + w2*x2 + ... + wn*xn + b
|
v
activation function: a = f(z)
|
v
output
Each neuron: takes inputs, multiplies each by a weight, adds a bias, then passes through an activation function.
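The weighted-sum-plus-activation step above can be sketched in a few lines of NumPy (the weights and bias here are arbitrary illustration values):

```python
import numpy as np

def neuron(x, w, b):
    z = np.dot(w, x) + b        # weighted sum: w1*x1 + ... + wn*xn + b
    return max(0.0, z)          # activation function (ReLU here)

x = np.array([1.0, 2.0, 3.0])   # inputs
w = np.array([0.5, -0.2, 0.1])  # weights (illustrative)
b = 0.4                         # bias
a = neuron(x, w, b)             # z = 0.5 - 0.4 + 0.3 + 0.4 = 0.8
```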
2.2 Activation Functions
Without activation functions, stacking layers would just be a linear transformation. Activations add non-linearity, allowing networks to learn complex patterns.
| Function | Formula | Output Range | Use Case |
|---|---|---|---|
| ReLU | max(0, x) | [0, ∞) | Default for hidden layers |
| Sigmoid | 1 / (1 + e^-x) | (0, 1) | Binary output, gates |
| Tanh | (e^x − e^-x) / (e^x + e^-x) | (−1, 1) | When negative values matter |
| Softmax | e^xi / Σ e^xj | (0, 1), sums to 1 | Multi-class output layer |
| GELU | x · P(X ≤ x) | (≈−0.17, ∞) | Transformers (smoother ReLU) |
| Leaky ReLU | max(0.01x, x) | (−∞, ∞) | Prevents "dying ReLU" |
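A few of the functions from the table, implemented directly in NumPy (softmax subtracts the max before exponentiating, a standard numerical-stability trick):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x):
    return np.maximum(0.01 * x, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()
```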
2.3 Layers
import torch.nn as nn
# Fully connected (Dense) - every neuron connects to every neuron in next layer
nn.Linear(in_features=128, out_features=64)
# Convolutional - slides filter over input, detects local patterns
nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
# Recurrent - has memory of previous inputs
nn.LSTM(input_size=128, hidden_size=256, num_layers=2)
# Normalization - stabilizes training
nn.BatchNorm2d(32) # Normalize activations per batch
nn.LayerNorm(768) # Normalize per sample (used in transformers)
# Regularization
nn.Dropout(0.2) # Randomly zero 20% of activations during training
3. Backpropagation & Gradient Descent
3.1 How Neural Networks Learn
1. FORWARD PASS: Input flows through the network, producing a prediction
2. LOSS: Compare prediction to actual label
3. BACKWARD PASS: Compute gradients (how much each weight contributed to the error)
4. UPDATE WEIGHTS: Adjust weights in the direction that reduces loss
5. REPEAT for many batches and epochs
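The five steps above map directly onto a PyTorch training loop; a minimal sketch with a toy linear model and random data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X, y = torch.randn(32, 10), torch.randn(32, 1)  # toy data

losses = []
for epoch in range(20):
    pred = model(X)               # 1. forward pass
    loss = criterion(pred, y)     # 2. loss
    optimizer.zero_grad()
    loss.backward()               # 3. backward pass: compute gradients
    optimizer.step()              # 4. update weights
    losses.append(loss.item())    # 5. repeat
```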
3.2 Gradient Descent Variants
| Variant | Batch Size | Pros | Cons |
|---|---|---|---|
| Batch GD | Entire dataset | Stable | Slow, needs lots of memory |
| Stochastic GD (SGD) | 1 sample | Fast updates | Noisy, unstable |
| Mini-batch GD | 32–256 samples | Best of both | Standard choice |
3.3 Optimizers
SGD with Momentum: Accumulates velocity in consistent gradient directions. Like a ball rolling downhill.
Adam (Adaptive Moment Estimation): Combines momentum + per-parameter learning rates. The default choice. Adapts learning rate for each parameter, works well with sparse gradients, less sensitive to learning rate choice.
AdamW: Adam with decoupled weight decay. Better generalization. Used in transformers.
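The "ball rolling downhill" intuition for momentum can be sketched on a one-dimensional quadratic (the learning rate and momentum values are illustrative):

```python
def grad(w):
    return 2 * w   # gradient of f(w) = w^2, minimum at w = 0

w, v = 5.0, 0.0
lr, beta = 0.1, 0.9
for _ in range(200):
    v = beta * v + grad(w)   # accumulate velocity in consistent gradient directions
    w = w - lr * v           # step along the velocity
# w has rolled to (near) the minimum at 0
```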
3.4 Learning Rate
The most important hyperparameter.
- Too high: loss diverges, training is unstable
- Too low: training is very slow, may get stuck in local minima
- Just right: loss decreases smoothly

Schedulers:
- Step decay: reduce by a factor every N epochs
- Cosine annealing: smooth cosine decrease
- Warmup: start low, increase, then decrease (used in transformers)
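Cosine annealing is built into PyTorch; a minimal sketch (a real loop would compute gradients before `optimizer.step()`):

```python
import torch

param = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

lrs = []
for _ in range(100):
    optimizer.step()       # gradient step would happen here
    scheduler.step()       # decay the LR along a cosine curve
    lrs.append(optimizer.param_groups[0]['lr'])
# lrs falls smoothly from ~0.1 toward 0 over T_max steps
```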
4. Convolutional Neural Networks (CNNs)
4.1 Key Idea
Instead of looking at all pixels at once (fully connected), CNNs look at local patches using sliding filters. This captures spatial patterns efficiently.
4.2 Core Operations
Convolution: A small filter (kernel) slides over the input; at each position it computes the sum of element-wise products.

Input (5x5):        Filter (3x3):
[1 0 1 0 1]         [1 0 1]
[0 1 0 1 0]    *    [0 1 0]    =    Output (3x3)
[1 0 1 0 1]         [1 0 1]
[0 1 0 1 0]
[1 0 1 0 1]

Pooling: Downsamples feature maps, reduces computation, adds translation invariance.
- Max pooling: take the maximum value in each window (most common)
- Average pooling: take the average value in each window

Stride: How many pixels the filter moves each step. Stride=2 halves spatial dimensions.
Padding: Adding zeros around the input to control output size. padding='same' keeps the same dimensions.
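The sizes above follow the standard convolution output-size formula; a small helper to check them:

```python
def conv_out(n, k, stride=1, padding=0):
    # output size = floor((n + 2*padding - k) / stride) + 1
    return (n + 2 * padding - k) // stride + 1

conv_out(5, 3)                         # 5x5 input, 3x3 filter -> 3
conv_out(224, 3, padding=1)            # padding=1 acts as 'same' for 3x3 -> 224
conv_out(224, 3, stride=2, padding=1)  # stride 2 halves spatial dims -> 112
```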
4.3 CNN Architecture Pattern
Input Image (3x224x224)
|
[Conv2d -> BatchNorm -> ReLU -> MaxPool] x N (Feature extraction)
|
Flatten
|
[Linear -> ReLU -> Dropout] x M (Classification)
|
Linear -> Softmax (Output probabilities)
4.4 Famous CNN Architectures
| Architecture | Year | Key Innovation | Depth |
|---|---|---|---|
| LeNet-5 | 1998 | First practical CNN | 5 layers |
| AlexNet | 2012 | GPU training, ReLU, Dropout | 8 layers |
| VGG | 2014 | Small 3×3 filters, deeper | 16–19 layers |
| GoogLeNet | 2014 | Inception modules (parallel filters) | 22 layers |
| ResNet | 2015 | Skip connections (residual learning) | 50–152 layers |
| EfficientNet | 2019 | Compound scaling | Variable |
4.5 ResNet - The Most Important CNN Innovation
Problem: Very deep networks are hard to train (vanishing gradients).
Solution: Skip connections (residual connections). The network learns the residual
(difference) instead of the full mapping.
Input -> [Conv -> BN -> ReLU -> Conv -> BN] -> (+) -> ReLU -> Output
  |                                             ^
  +---------- (identity shortcut) -------------+
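A minimal residual block in PyTorch, following the diagram above (channel count and input size are illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = self.bn2(self.conv2(torch.relu(self.bn1(self.conv1(x)))))
        return torch.relu(out + x)   # identity shortcut: add the input back

x = torch.randn(1, 32, 16, 16)
y = ResidualBlock(32)(x)             # shape is preserved by the block
```

If the block learns nothing useful, the shortcut still passes the input through unchanged, which is what makes very deep stacks trainable.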
4.6 PyTorch CNN Example
class SimpleCNN(nn.Module):
def __init__(self, num_classes=10):
super().__init__()
self.features = nn.Sequential(
nn.Conv2d(3, 32, kernel_size=3, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(32, 64, kernel_size=3, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(64, 128, kernel_size=3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(),
nn.AdaptiveAvgPool2d(1),
)
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(128, num_classes),
)
def forward(self, x):
return self.classifier(self.features(x))
4.7 Transfer Learning
Instead of training from scratch, use a pre-trained model and fine-tune it. Lower layers learn general features (edges, textures) that transfer across tasks. Only the top layers need task-specific training.
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models
model = models.resnet50(weights='IMAGENET1K_V2')
# Freeze all layers
for param in model.parameters():
param.requires_grad = False
# Replace final layer for your task
model.fc = nn.Linear(2048, num_your_classes)
# Only train the new final layer
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)
5. Transformers
5.1 The Problem with RNNs/LSTMs
- Process sequences one token at a time -> can't parallelize
- Long-range dependencies are hard to capture
- Vanishing gradient problem for long sequences
5.2 Architecture (2017 - "Attention Is All You Need")
Key innovation: Self-Attention allows each token to attend to ALL other tokens in parallel.
Input Text -> [Encoder] -> Context Representation -> [Decoder] -> Output Text

Encoder (N=6 identical layers):
1. Multi-Head Self-Attention
2. Add & Normalize (residual connection)
3. Feed-Forward Network
4. Add & Normalize

Decoder (N=6 identical layers):
1. Masked Multi-Head Self-Attention (can't see future tokens)
2. Add & Normalize
3. Cross-Attention (attend to encoder output)
4. Add & Normalize
5. Feed-Forward Network
6. Add & Normalize
5.3 Self-Attention Mechanism
For each token, compute:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

Multi-Head Attention: Run attention multiple times in parallel with different learned projections. Each "head" can attend to different types of relationships.
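The attention formula above, implemented directly in NumPy with toy sizes (4 tokens, d_k = 8):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = attention(Q, K, V)                 # one output vector per token
```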
5.4 Positional Encoding
Since attention has no notion of order, positional information is added explicitly.
- Original paper: Sinusoidal functions
- Modern: Learned positional embeddings
- RoPE (Rotary Position Embedding): Used in LLaMA, GPT-NeoX
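The original sinusoidal encoding can be sketched as follows (d_model assumed even; even dimensions get sine, odd dimensions cosine):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(50, 16)   # one d_model-dim vector per position, added to token embeddings
```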
5.5 Transformer Variants
| Type | Architecture | Example Models | Use Case |
|---|---|---|---|
| Encoder-only | Just encoder | BERT, RoBERTa | Classification, NER, Q&A |
| Decoder-only | Just decoder | GPT, LLaMA, Claude | Text generation, chat |
| Encoder-Decoder | Both | T5, BART | Translation, summarization |
| Vision Transformer | Patches as tokens | ViT, DeiT | Image classification |
5.6 BERT
Pre-training: Masked Language Model (predict masked words) + Next Sentence Prediction.
Bidirectional: sees context from both left and right.
Used for: classification, NER, question answering, sentence similarity.
from transformers import BertTokenizer, BertForSequenceClassification
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
5.7 GPT
Decoder-only: only sees previous tokens (autoregressive).
Pre-training: next token prediction on massive text corpus.
Scaling law: more parameters + more data = better performance.
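The "only sees previous tokens" constraint is enforced with a causal mask applied to the attention scores; a minimal sketch:

```python
import numpy as np

seq_len = 5
# True above the diagonal = positions a token is NOT allowed to attend to
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
# masked positions get -inf before softmax, so their attention weight becomes 0
```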
6. Other Important Concepts
6.1 Generative Models
GANs: Generator creates fake data, Discriminator tries to distinguish real from fake. They improve through competition. Used for image generation and style transfer.
Diffusion Models (DALL-E, Stable Diffusion): Add noise to data gradually, then learn to reverse the process. Generate images by starting from pure noise and denoising step by step. Current state-of-the-art for image generation.
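The "add noise gradually" forward process can be sketched in NumPy (the per-step noise level beta is a hypothetical constant; real models use a schedule):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100)    # stand-in for image data
beta = 0.02                     # noise mixed in per step (hypothetical)
for t in range(1000):
    # keep variance ~1 while gradually replacing signal with noise
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(100)
# after many steps, x is essentially pure Gaussian noise;
# the model is trained to reverse these steps
```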
VAEs: Learn a compressed latent space representation. Can generate new data by sampling from latent space.
6.2 Embeddings
Dense vector representations of discrete items (words, users, products).
# Word embeddings
embedding = nn.Embedding(vocab_size=10000, embedding_dim=300)
# Input: word index [42] -> Output: 300-dimensional vector
# Word2Vec, GloVe - older, word-level, static
# BERT, Sentence-BERT - contextual, sentence-level
# OpenAI ada-002, Cohere - modern, used for RAG
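Retrieval over embeddings (as in RAG) typically ranks candidates by cosine similarity; a minimal sketch:

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
sim = cosine_sim(a, b)           # close to 1.0: vectors point the same way
```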
6.3 Common Training Techniques
- Learning rate warmup: Start with low LR, gradually increase
- Gradient clipping: Prevent exploding gradients by capping gradient norm
- Mixed precision training: Use FP16 for speed, FP32 for stability
- Gradient accumulation: Simulate larger batch sizes on limited GPU memory
- Early stopping: Stop when validation loss stops improving
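Gradient accumulation and gradient clipping from the list above, sketched with a toy model (sizes, `accum_steps`, and `max_norm` are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

accum_steps = 4                   # simulates a 4x larger batch
optimizer.zero_grad()
for _ in range(accum_steps):
    loss = F.mse_loss(model(x), y) / accum_steps  # scale so gradients average
    loss.backward()                               # gradients accumulate in .grad
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()                  # one update for the accumulated batch
```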
# Mixed precision training (PyTorch)
scaler = torch.cuda.amp.GradScaler()
optimizer.zero_grad()
with torch.cuda.amp.autocast():      # run forward pass in FP16 where safe
    output = model(input)
    loss = criterion(output, target)
scaler.scale(loss).backward()        # scale loss to avoid FP16 gradient underflow
scaler.step(optimizer)               # unscale gradients, then optimizer step
scaler.update()
7. Computer Vision Tasks
7.1 Image Processing with OpenCV
import cv2
img = cv2.imread('image.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
resized = cv2.resize(img, (224, 224))
edges = cv2.Canny(gray, 100, 200)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
_, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)
7.2 Common CV Tasks
| Task | Description | Models |
|---|---|---|
| Classification | What's in the image? | ResNet, ViT |
| Object Detection | Where are objects + what are they? | YOLO, Faster R-CNN |
| Semantic Segmentation | Label every pixel | U-Net, DeepLab |
| Instance Segmentation | Separate each object instance | Mask R-CNN |
| Image Generation | Create new images | Diffusion, GAN |
8. Key Formulas
Loss Functions:
MSE = (1/n) * Σ (y_pred - y_true)²
Cross-Entropy = -Σ y_true * log(y_pred)
Binary CE = -(y*log(p) + (1-y)*log(1-p))

Attention: Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) * V
Softmax: softmax(xi) = e^xi / Σ e^xj
ReLU: f(x) = max(0, x)
Sigmoid: f(x) = 1 / (1 + e^(-x))
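Sanity-checking the two loss formulas in NumPy:

```python
import numpy as np

def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy(y_true, y_pred):
    return -np.sum(y_true * np.log(y_pred))

mse(np.array([1.0, 2.0]), np.array([1.0, 4.0]))        # (0 + 4) / 2 = 2.0
cross_entropy(np.array([0.0, 1.0]), np.array([0.2, 0.8]))  # -log(0.8)
```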