1. What is Deep Learning?
Deep Learning is a subset of ML using neural networks with multiple layers. "Deep" refers to the number of layers. Each layer learns increasingly abstract representations of the data.
Input -> [Low-level features] -> [Mid-level features] -> [High-level features] -> Output
Image -> [Edges, corners]     -> [Textures, shapes]   -> [Objects, faces]      -> "Cat"
2. Neural Network Building Blocks
2.1 The Neuron (Perceptron)
inputs (x1, x2, ..., xn)
|
v
weighted sum: z = w1*x1 + w2*x2 + ... + wn*xn + b
|
v
activation function: a = f(z)
|
v
output
Each neuron: takes inputs, multiplies each by a weight, adds a bias, then passes through an activation function.
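The weighted-sum-plus-activation step above can be sketched in a few lines of NumPy (the weights and bias here are arbitrary illustration values):

```python
import numpy as np

def neuron(x, w, b):
    z = np.dot(w, x) + b        # weighted sum: w1*x1 + ... + wn*xn + b
    return max(0.0, z)          # activation function (ReLU here)

x = np.array([1.0, 2.0, 3.0])   # inputs
w = np.array([0.5, -0.2, 0.1])  # weights (illustrative)
b = 0.4                         # bias
a = neuron(x, w, b)             # z = 0.5 - 0.4 + 0.3 + 0.4 = 0.8
```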
2.2 Activation Functions
Without activation functions, stacking layers would just be a linear transformation. Activations add non-linearity, allowing networks to learn complex patterns.
| Function | Formula | Output Range | Use Case |
|---|---|---|---|
| ReLU | max(0, x) | [0, ∞) | Default for hidden layers |
| Sigmoid | 1 / (1 + e^-x) | (0, 1) | Binary output, gates |
| Tanh | (e^x − e^-x) / (e^x + e^-x) | (−1, 1) | When negative values matter |
| Softmax | e^xi / Σ e^xj | (0, 1), sums to 1 | Multi-class output layer |
| GELU | x · P(X ≤ x) | (≈−0.17, ∞) | Transformers (smoother ReLU) |
| Leaky ReLU | max(0.01x, x) | (−∞, ∞) | Prevents "dying ReLU" |
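A few of the functions from the table, implemented directly in NumPy (softmax subtracts the max before exponentiating, a standard numerical-stability trick):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x):
    return np.maximum(0.01 * x, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()
```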
2.3 Layers
import torch.nn as nn
# Fully connected (Dense) - every neuron connects to every neuron in next layer
nn.Linear(in_features=128, out_features=64)
# Convolutional - slides filter over input, detects local patterns
nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
# Recurrent - has memory of previous inputs
nn.LSTM(input_size=128, hidden_size=256, num_layers=2)
# Normalization - stabilizes training
nn.BatchNorm2d(32) # Normalize activations per batch
nn.LayerNorm(768) # Normalize per sample (used in transformers)
# Regularization
nn.Dropout(0.2) # Randomly zero 20% of activations during training
3. Backpropagation & Gradient Descent
3.1 How Neural Networks Learn
1. FORWARD PASS: Input flows through the network, producing a prediction
2. LOSS: Compare prediction to actual label
3. BACKWARD PASS: Compute gradients (how much each weight contributed to the error)
4. UPDATE WEIGHTS: Adjust weights in the direction that reduces loss
5. REPEAT for many batches and epochs
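The five steps above map directly onto a PyTorch training loop; a minimal sketch with a toy linear model and random data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X, y = torch.randn(32, 10), torch.randn(32, 1)  # toy data

losses = []
for epoch in range(20):
    pred = model(X)               # 1. forward pass
    loss = criterion(pred, y)     # 2. loss
    optimizer.zero_grad()
    loss.backward()               # 3. backward pass: compute gradients
    optimizer.step()              # 4. update weights
    losses.append(loss.item())    # 5. repeat
```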
3.2 Gradient Descent Variants
| Variant | Batch Size | Pros | Cons |
|---|---|---|---|
| Batch GD | Entire dataset | Stable | Slow, needs lots of memory |
| Stochastic GD (SGD) | 1 sample | Fast updates | Noisy, unstable |
| Mini-batch GD | 32–256 samples | Best of both | Standard choice |
3.3 Optimizers
SGD with Momentum: Accumulates velocity in consistent gradient directions. Like a ball rolling downhill.
Adam (Adaptive Moment Estimation): Combines momentum + per-parameter learning rates. The default choice. Adapts learning rate for each parameter, works well with sparse gradients, less sensitive to learning rate choice.
AdamW: Adam with decoupled weight decay. Better generalization. Used in transformers.
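The "ball rolling downhill" intuition for momentum can be sketched on a one-dimensional quadratic (the learning rate and momentum values are illustrative):

```python
def grad(w):
    return 2 * w   # gradient of f(w) = w^2, minimum at w = 0

w, v = 5.0, 0.0
lr, beta = 0.1, 0.9
for _ in range(200):
    v = beta * v + grad(w)   # accumulate velocity in consistent gradient directions
    w = w - lr * v           # step along the velocity
# w has rolled to (near) the minimum at 0
```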
3.4 Learning Rate
The most important hyperparameter.
- Too high: loss diverges, training is unstable
- Too low: training is very slow, may get stuck in local minima
- Just right: loss decreases smoothly

Schedulers:
- Step decay: reduce by a factor every N epochs
- Cosine annealing: smooth cosine decrease
- Warmup: start low, increase, then decrease (used in transformers)
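Cosine annealing is built into PyTorch; a minimal sketch (a real loop would compute gradients before `optimizer.step()`):

```python
import torch

param = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

lrs = []
for _ in range(100):
    optimizer.step()       # gradient step would happen here
    scheduler.step()       # decay the LR along a cosine curve
    lrs.append(optimizer.param_groups[0]['lr'])
# lrs falls smoothly from ~0.1 toward 0 over T_max steps
```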
4. Convolutional Neural Networks (CNNs)
4.1 Key Idea
Instead of looking at all pixels at once (fully connected), CNNs look at local patches using sliding filters. This captures spatial patterns efficiently.
4.2 Core Operations
Convolution: A small filter (kernel) slides over the input; at each position it computes the sum of element-wise products.

Input (5x5):        Filter (3x3):
[1 0 1 0 1]         [1 0 1]
[0 1 0 1 0]    *    [0 1 0]    =    Output (3x3)
[1 0 1 0 1]         [1 0 1]
[0 1 0 1 0]
[1 0 1 0 1]

Pooling: Downsamples feature maps, reduces computation, adds translation invariance.
- Max pooling: take the maximum value in each window (most common)
- Average pooling: take the average value in each window

Stride: How many pixels the filter moves each step. Stride=2 halves spatial dimensions.
Padding: Adding zeros around the input to control output size. padding='same' keeps the same dimensions.
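The sizes above follow the standard convolution output-size formula; a small helper to check them:

```python
def conv_out(n, k, stride=1, padding=0):
    # output size = floor((n + 2*padding - k) / stride) + 1
    return (n + 2 * padding - k) // stride + 1

conv_out(5, 3)                         # 5x5 input, 3x3 filter -> 3
conv_out(224, 3, padding=1)            # padding=1 acts as 'same' for 3x3 -> 224
conv_out(224, 3, stride=2, padding=1)  # stride 2 halves spatial dims -> 112
```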
4.3 CNN Architecture Pattern
Input Image (3x224x224)
|
[Conv2d -> BatchNorm -> ReLU -> MaxPool] x N (Feature extraction)
|
Flatten
|
[Linear -> ReLU -> Dropout] x M (Classification)
|
Linear -> Softmax (Output probabilities)
4.4 Famous CNN Architectures
| Architecture | Year | Key Innovation | Depth |
|---|---|---|---|
| LeNet-5 | 1998 | First practical CNN | 5 layers |
| AlexNet | 2012 | GPU training, ReLU, Dropout | 8 layers |
| VGG | 2014 | Small 3×3 filters, deeper | 16–19 layers |
| GoogLeNet | 2014 | Inception modules (parallel filters) | 22 layers |
| ResNet | 2015 | Skip connections (residual learning) | 50–152 layers |
| EfficientNet | 2019 | Compound scaling | Variable |
4.5 ResNet - The Most Important CNN Innovation
Problem: Very deep networks are hard to train (vanishing gradients).
Solution: Skip connections (residual connections). The network learns the residual
(difference) instead of the full mapping.
Input -> [Conv -> BN -> ReLU -> Conv -> BN] -> (+) -> ReLU -> Output
  |                                             ^
  +---------- (identity shortcut) -------------+
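A minimal residual block in PyTorch, following the diagram above (channel count and input size are illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = self.bn2(self.conv2(torch.relu(self.bn1(self.conv1(x)))))
        return torch.relu(out + x)   # identity shortcut: add the input back

x = torch.randn(1, 32, 16, 16)
y = ResidualBlock(32)(x)             # shape is preserved by the block
```

If the block learns nothing useful, the shortcut still passes the input through unchanged, which is what makes very deep stacks trainable.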
4.6 PyTorch CNN Example
class SimpleCNN(nn.Module):
def __init__(self, num_classes=10):
super().__init__()
self.features = nn.Sequential(
nn.Conv2d(3, 32, kernel_size=3, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(32, 64, kernel_size=3, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(64, 128, kernel_size=3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(),
nn.AdaptiveAvgPool2d(1),
)
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(128, num_classes),
)
def forward(self, x):
return self.classifier(self.features(x))
4.7 Transfer Learning
Instead of training from scratch, use a pre-trained model and fine-tune it. Lower layers learn general features (edges, textures) that transfer across tasks. Only the top layers need task-specific training.
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models
model = models.resnet50(weights='IMAGENET1K_V2')
# Freeze all layers
for param in model.parameters():
param.requires_grad = False
# Replace final layer for your task
model.fc = nn.Linear(2048, num_your_classes)
# Only train the new final layer
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)
5. Transformers
5.1 The Problem with RNNs/LSTMs
- Process sequences one token at a time -> can't parallelize
- Long-range dependencies are hard to capture
- Vanishing gradient problem for long sequences
5.2 Architecture (2017 - "Attention Is All You Need")
Key innovation: Self-Attention allows each token to attend to ALL other tokens in parallel.
Input Text -> [Encoder] -> Context Representation -> [Decoder] -> Output Text

Encoder (N=6 identical layers):
1. Multi-Head Self-Attention
2. Add & Normalize (residual connection)
3. Feed-Forward Network
4. Add & Normalize

Decoder (N=6 identical layers):
1. Masked Multi-Head Self-Attention (can't see future tokens)
2. Add & Normalize
3. Cross-Attention (attend to encoder output)
4. Add & Normalize
5. Feed-Forward Network
6. Add & Normalize
5.3 Self-Attention Mechanism
For each token, compute:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

Multi-Head Attention: Run attention multiple times in parallel with different learned projections. Each "head" can attend to different types of relationships.
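The attention formula above, implemented directly in NumPy with toy sizes (4 tokens, d_k = 8):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = attention(Q, K, V)                 # one output vector per token
```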
5.4 Positional Encoding
Since attention has no notion of order, positional information is added explicitly.
- Original paper: Sinusoidal functions
- Modern: Learned positional embeddings
- RoPE (Rotary Position Embedding): Used in LLaMA, GPT-NeoX
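The original sinusoidal encoding can be sketched as follows (d_model assumed even; even dimensions get sine, odd dimensions cosine):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(50, 16)   # one d_model-dim vector per position, added to token embeddings
```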
5.5 Transformer Variants
| Type | Architecture | Example Models | Use Case |
|---|---|---|---|
| Encoder-only | Just encoder | BERT, RoBERTa | Classification, NER, Q&A |
| Decoder-only | Just decoder | GPT, LLaMA, Claude | Text generation, chat |
| Encoder-Decoder | Both | T5, BART | Translation, summarization |
| Vision Transformer | Patches as tokens | ViT, DeiT | Image classification |
5.6 BERT
Pre-training: Masked Language Model (predict masked words) + Next Sentence Prediction.
Bidirectional: sees context from both left and right.
Used for: classification, NER, question answering, sentence similarity.
from transformers import BertTokenizer, BertForSequenceClassification
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
5.7 GPT
Decoder-only: only sees previous tokens (autoregressive).
Pre-training: next token prediction on massive text corpus.
Scaling law: more parameters + more data = better performance.
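The "only sees previous tokens" constraint is enforced with a causal mask applied to the attention scores; a minimal sketch:

```python
import numpy as np

seq_len = 5
# True above the diagonal = positions a token is NOT allowed to attend to
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
# masked positions get -inf before softmax, so their attention weight becomes 0
```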
6. Other Important Concepts
6.1 Generative Models
GANs: Generator creates fake data, Discriminator tries to distinguish real from fake. They improve through competition. Used for image generation and style transfer.
Diffusion Models (DALL-E, Stable Diffusion): Add noise to data gradually, then learn to reverse the process. Generate images by starting from pure noise and denoising step by step. Current state-of-the-art for image generation.
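The "add noise gradually" forward process can be sketched in NumPy (the per-step noise level beta is a hypothetical constant; real models use a schedule):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100)    # stand-in for image data
beta = 0.02                     # noise mixed in per step (hypothetical)
for t in range(1000):
    # keep variance ~1 while gradually replacing signal with noise
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(100)
# after many steps, x is essentially pure Gaussian noise;
# the model is trained to reverse these steps
```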
VAEs: Learn a compressed latent space representation. Can generate new data by sampling from latent space.
6.2 Embeddings
Dense vector representations of discrete items (words, users, products).
# Word embeddings
embedding = nn.Embedding(vocab_size=10000, embedding_dim=300)
# Input: word index [42] -> Output: 300-dimensional vector
# Word2Vec, GloVe - older, word-level, static
# BERT, Sentence-BERT - contextual, sentence-level
# OpenAI ada-002, Cohere - modern, used for RAG
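Retrieval over embeddings (as in RAG) typically ranks candidates by cosine similarity; a minimal sketch:

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
sim = cosine_sim(a, b)           # close to 1.0: vectors point the same way
```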
6.3 Common Training Techniques
- Learning rate warmup: Start with low LR, gradually increase
- Gradient clipping: Prevent exploding gradients by capping gradient norm
- Mixed precision training: Use FP16 for speed, FP32 for stability
- Gradient accumulation: Simulate larger batch sizes on limited GPU memory
- Early stopping: Stop when validation loss stops improving
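Gradient accumulation and gradient clipping from the list above, sketched with a toy model (sizes, `accum_steps`, and `max_norm` are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

accum_steps = 4                   # simulates a 4x larger batch
optimizer.zero_grad()
for _ in range(accum_steps):
    loss = F.mse_loss(model(x), y) / accum_steps  # scale so gradients average
    loss.backward()                               # gradients accumulate in .grad
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()                  # one update for the accumulated batch
```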
# Mixed precision training (PyTorch)
scaler = torch.cuda.amp.GradScaler()
optimizer.zero_grad()
with torch.cuda.amp.autocast():      # run forward pass in FP16 where safe
    output = model(input)
    loss = criterion(output, target)
scaler.scale(loss).backward()        # scale loss to avoid FP16 gradient underflow
scaler.step(optimizer)               # unscale gradients, then optimizer step
scaler.update()
7. Computer Vision Tasks
7.1 Image Processing with OpenCV
import cv2
img = cv2.imread('image.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
resized = cv2.resize(img, (224, 224))
edges = cv2.Canny(gray, 100, 200)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
_, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)
7.2 Common CV Tasks
| Task | Description | Models |
|---|---|---|
| Classification | What's in the image? | ResNet, ViT |
| Object Detection | Where are objects + what are they? | YOLO, Faster R-CNN |
| Semantic Segmentation | Label every pixel | U-Net, DeepLab |
| Instance Segmentation | Separate each object instance | Mask R-CNN |
| Image Generation | Create new images | Diffusion, GAN |
8. Key Formulas
Loss Functions:
MSE = (1/n) * Σ (y_pred - y_true)²
Cross-Entropy = -Σ y_true * log(y_pred)
Binary CE = -(y*log(p) + (1-y)*log(1-p))

Attention: Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) * V
Softmax: softmax(xi) = e^xi / Σ e^xj
ReLU: f(x) = max(0, x)
Sigmoid: f(x) = 1 / (1 + e^(-x))
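Sanity-checking the two loss formulas in NumPy:

```python
import numpy as np

def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy(y_true, y_pred):
    return -np.sum(y_true * np.log(y_pred))

mse(np.array([1.0, 2.0]), np.array([1.0, 4.0]))        # (0 + 4) / 2 = 2.0
cross_entropy(np.array([0.0, 1.0]), np.array([0.2, 0.8]))  # -log(0.8)
```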