Subject 33

Deep Learning Fundamentals

Neural networks, CNNs, Transformers, and the training mechanics behind them. The foundation for understanding every modern AI system.

1. What is Deep Learning?

Deep Learning is a subset of ML using neural networks with multiple layers. "Deep" refers to the number of layers. Each layer learns increasingly abstract representations of the data.

Input -> [Low-level features] -> [Mid-level features] -> [High-level features] -> Output
Image -> [Edges, corners]     -> [Textures, shapes]   -> [Objects, faces]      -> "Cat"

2. Neural Network Building Blocks

2.1 The Neuron (Perceptron)

inputs (x1, x2, ..., xn)
    |
    v
weighted sum: z = w1*x1 + w2*x2 + ... + wn*xn + b
    |
    v
activation function: a = f(z)
    |
    v
output

Each neuron: takes inputs, multiplies each by a weight, adds a bias, then passes through an activation function.
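As a minimal sketch of the diagram above (the function name `neuron` and the example numbers are illustrative, not from a library):

```python
import math

def neuron(inputs, weights, bias):
    """One neuron: weighted sum of inputs, plus bias, through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 / (1 + math.exp(-z))  # sigmoid activation

# Example: 3 inputs -> z = 0.4 - 0.1 - 0.1 + 0.05 = 0.25
out = neuron([1.0, 0.5, -1.0], [0.4, -0.2, 0.1], bias=0.05)
print(round(out, 4))  # 0.5622
```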

2.2 Activation Functions

Without activation functions, stacking layers would just be a linear transformation. Activations add non-linearity, allowing networks to learn complex patterns.

Function    | Formula                      | Output Range       | Use Case
ReLU        | max(0, x)                    | [0, ∞)             | Default for hidden layers
Sigmoid     | 1 / (1 + e^-x)               | (0, 1)             | Binary output, gates
Tanh        | (e^x − e^-x) / (e^x + e^-x)  | (−1, 1)            | When negative values matter
Softmax     | e^xi / Σ e^xj                | (0, 1), sums to 1  | Multi-class output layer
GELU        | x · Φ(x)  (Φ = normal CDF)   | (≈−0.17, ∞)        | Transformers (smoother ReLU)
Leaky ReLU  | max(0.01x, x)                | (−∞, ∞)            | Prevents "dying ReLU"
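The table's formulas translate directly into code. A quick sketch in plain Python (the max-subtraction in softmax is a standard numerical-stability trick, not part of the formula):

```python
import math

def relu(x):       return max(0.0, x)
def leaky_relu(x): return max(0.01 * x, x)
def sigmoid(x):    return 1 / (1 + math.exp(-x))

def softmax(xs):
    m = max(xs)                            # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

print(relu(-2.0), relu(3.0))    # 0.0 3.0
print(leaky_relu(-1.0))         # -0.01
print(sum(softmax([1.0, 2.0, 3.0])))  # sums to 1
```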

2.3 Layers

import torch.nn as nn

# Fully connected (Dense) - every neuron connects to every neuron in next layer
nn.Linear(in_features=128, out_features=64)

# Convolutional - slides filter over input, detects local patterns
nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)

# Recurrent - has memory of previous inputs
nn.LSTM(input_size=128, hidden_size=256, num_layers=2)

# Normalization - stabilizes training
nn.BatchNorm2d(32)       # Normalize activations per batch
nn.LayerNorm(768)        # Normalize per sample (used in transformers)

# Regularization
nn.Dropout(0.2)          # Randomly zero 20% of activations during training

3. Backpropagation & Gradient Descent

3.1 How Neural Networks Learn

1. FORWARD PASS:    Input flows through network, producing prediction
2. LOSS:            Compare prediction to actual label
3. BACKWARD PASS:   Compute gradients (how much each weight contributed to error)
4. UPDATE WEIGHTS:  Adjust weights in direction that reduces loss
5. REPEAT for many batches and epochs
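The five steps above map onto a standard PyTorch training loop. A minimal sketch with a toy model and random data standing in for a real pipeline:

```python
import torch
import torch.nn as nn

# Toy setup: a linear classifier on random data, purely illustrative
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(3):
    for _ in range(5):                     # mini-batches
        x = torch.randn(32, 10)            # batch of inputs
        y = torch.randint(0, 2, (32,))     # labels
        optimizer.zero_grad()              # clear old gradients
        pred = model(x)                    # 1. forward pass
        loss = criterion(pred, y)          # 2. loss
        loss.backward()                    # 3. backward pass (gradients)
        optimizer.step()                   # 4. weight update
```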

3.2 Gradient Descent Variants

Variant             | Batch Size      | Pros          | Cons
Batch GD            | Entire dataset  | Stable        | Slow, needs lots of memory
Stochastic GD (SGD) | 1 sample        | Fast updates  | Noisy, unstable
Mini-batch GD       | 32–256 samples  | Best of both  | Few; the standard choice

3.3 Optimizers

SGD with Momentum: Accumulates velocity in consistent gradient directions. Like a ball rolling downhill.

Adam (Adaptive Moment Estimation): Combines momentum + per-parameter learning rates. The default choice. Adapts learning rate for each parameter, works well with sparse gradients, less sensitive to learning rate choice.

AdamW: Adam with decoupled weight decay. Better generalization. Used in transformers.

3.4 Learning Rate

The most important hyperparameter.

Too high:   Loss diverges, training is unstable
Too low:    Training is very slow, may get stuck in local minima
Just right: Loss decreases smoothly

Schedulers:
  Step decay:       Reduce by factor every N epochs
  Cosine annealing: Smooth cosine decrease
  Warmup:           Start low, increase, then decrease (used in transformers)
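Schedulers in PyTorch wrap the optimizer and adjust its learning rate over epochs. A sketch of step decay (cosine annealing works the same way via `CosineAnnealingLR`):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Step decay: multiply lr by 0.1 every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(20):
    # ... training step would go here ...
    optimizer.step()       # step the optimizer before the scheduler
    scheduler.step()

print(optimizer.param_groups[0]['lr'])  # 0.1 -> 0.01 -> 0.001
```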

4. Convolutional Neural Networks (CNNs)

4.1 Key Idea

Instead of looking at all pixels at once (fully connected), CNNs look at local patches using sliding filters. This captures spatial patterns efficiently.

4.2 Core Operations

Convolution: A small filter (kernel) slides over the input, computing dot products.

Input (5x5):          Filter (3x3):       Output (3x3):
[1 0 1 0 1]           [1 0 1]             Slide filter across input,
[0 1 0 1 0]           [0 1 0]             compute sum of element-wise
[1 0 1 0 1]    *      [1 0 1]      =      products at each position
[0 1 0 1 0]
[1 0 1 0 1]

Pooling:   Downsamples feature maps, reduces computation, adds translation invariance.
  Max pooling:     Take maximum value in each window (most common)
  Average pooling: Take average value in each window

Stride:    How many pixels the filter moves each step. Stride=2 halves spatial dimensions.
Padding:   Adding zeros around input to control output size. padding='same' keeps same dimensions.
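Kernel size, stride, and padding together determine the output size via the standard formula out = floor((in + 2·padding − kernel) / stride) + 1, sketched here as a small helper:

```python
def conv_output_size(in_size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution, per dimension."""
    return (in_size + 2 * padding - kernel) // stride + 1

print(conv_output_size(5, 3))                         # 3   (matches the 5x5 * 3x3 example)
print(conv_output_size(224, 3, stride=1, padding=1))  # 224 ('same' padding keeps size)
print(conv_output_size(224, 3, stride=2, padding=1))  # 112 (stride 2 halves dimensions)
```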

4.3 CNN Architecture Pattern

Input Image (3x224x224)
    |
[Conv2d -> BatchNorm -> ReLU -> MaxPool]  x N   (Feature extraction)
    |
Flatten
    |
[Linear -> ReLU -> Dropout]  x M                (Classification)
    |
Linear -> Softmax                               (Output probabilities)

4.4 Famous CNN Architectures

Architecture | Year | Key Innovation                        | Depth
LeNet-5      | 1998 | First practical CNN                   | 5 layers
AlexNet      | 2012 | GPU training, ReLU, Dropout           | 8 layers
VGG          | 2014 | Small 3×3 filters, deeper             | 16–19 layers
GoogLeNet    | 2014 | Inception modules (parallel filters)  | 22 layers
ResNet       | 2015 | Skip connections (residual learning)  | 50–152 layers
EfficientNet | 2019 | Compound scaling                      | Variable

4.5 ResNet: The Most Important CNN Innovation

Problem: Very deep networks are hard to train (vanishing gradients).
Solution: Skip connections (residual connections). The network learns the residual (difference) instead of the full mapping.

Input -> [Conv -> BN -> ReLU -> Conv -> BN] -> + -> ReLU -> Output
  |                                            ^
  +-------- (identity shortcut) ---------------+
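The diagram above translates into a small module. A sketch of a basic residual block (class name and channel count are illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic ResNet block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)  # identity shortcut

x = torch.randn(1, 64, 32, 32)
y = ResidualBlock(64)(x)
print(y.shape)  # same shape as the input
```

Because the shortcut is the identity, a block can always fall back to passing the input through unchanged, which is what makes 100+ layer networks trainable.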

4.6 PyTorch CNN Example

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

4.7 Transfer Learning

Instead of training from scratch, use a pre-trained model and fine-tune it. Lower layers learn general features (edges, textures) that transfer across tasks. Only the top layers need task-specific training.

import torch.optim as optim
import torchvision.models as models

model = models.resnet50(weights='IMAGENET1K_V2')

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace final layer for your task
model.fc = nn.Linear(2048, num_your_classes)

# Only train the new final layer
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)

5. Transformers

5.1 The Problem with RNNs/LSTMs

RNNs and LSTMs process tokens one at a time, so computation cannot be parallelized across a sequence, and long-range dependencies are hard to learn because information must survive many sequential steps (vanishing gradients).

5.2 Architecture (2017, "Attention Is All You Need")

Key innovation: Self-Attention allows each token to attend to ALL other tokens in parallel.

Input Text -> [Encoder] -> Context Representation -> [Decoder] -> Output Text

Encoder (N=6 identical layers):
  1. Multi-Head Self-Attention
  2. Add & Normalize (residual connection)
  3. Feed-Forward Network
  4. Add & Normalize

Decoder (N=6 identical layers):
  1. Masked Multi-Head Self-Attention (can't see future tokens)
  2. Add & Normalize
  3. Cross-Attention (attend to encoder output)
  4. Add & Normalize
  5. Feed-Forward Network
  6. Add & Normalize

5.3 Self-Attention Mechanism

For each token, compute:
  Query (Q): "What am I looking for?"
  Key   (K): "What do I contain?"
  Value (V): "What information do I provide?"

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

Multi-Head Attention: Run attention multiple times in parallel with different
learned projections. Each "head" can attend to different types of relationships.
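The attention formula above is a few lines of tensor code. A single-head sketch (shapes and the function name are illustrative):

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (seq, seq) token similarities
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                             # weighted mix of values

# 4 tokens, d_k = 8
Q, K, V = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per token
```

Multi-head attention simply runs this with h different learned projections of Q, K, V and concatenates the results.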

5.4 Positional Encoding

Since attention has no notion of order, positional information is added explicitly.
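The original paper uses fixed sinusoidal encodings (many later models instead learn position embeddings). A sketch of the sinusoidal version:

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from the original Transformer paper."""
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16): one vector per position, added to token embeddings
```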

5.5 Transformer Variants

Type               | Architecture      | Example Models     | Use Case
Encoder-only       | Just encoder      | BERT, RoBERTa      | Classification, NER, Q&A
Decoder-only       | Just decoder      | GPT, LLaMA, Claude | Text generation, chat
Encoder-Decoder    | Both              | T5, BART           | Translation, summarization
Vision Transformer | Patches as tokens | ViT, DeiT          | Image classification

5.6 BERT

Pre-training: Masked Language Model (predict masked words) + Next Sentence Prediction.
Bidirectional: sees context from both left and right.
Used for: classification, NER, question answering, sentence similarity.

from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

5.7 GPT

Decoder-only: only sees previous tokens (autoregressive).
Pre-training: next token prediction on massive text corpus.
Scaling law: more parameters + more data = better performance.
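Autoregressive generation is just a loop: predict one token, append it, repeat. A toy sketch where a stand-in function replaces the real transformer (the toy vocabulary and seeding are purely illustrative):

```python
import random

def toy_next_token(context):
    """Stand-in for a real language model. A real GPT would run the full
    context through the transformer and sample from a softmax over the
    vocabulary; here we just pick deterministically from a toy vocab."""
    vocab = ["the", "cat", "sat", "on", "mat", "."]
    random.seed(len(context))
    return random.choice(vocab)

def generate(prompt, max_new_tokens=5):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        tokens.append(toy_next_token(tokens))  # each step sees ALL prior tokens
    return " ".join(tokens)

print(generate("the cat"))
```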

6. Other Important Concepts

6.1 Generative Models

GANs: Generator creates fake data, Discriminator tries to distinguish real from fake. They improve through competition. Used for image generation and style transfer.

Diffusion Models (DALL-E, Stable Diffusion): Add noise to data gradually, then learn to reverse the process. Generate images by starting from pure noise and denoising step by step. Current state-of-the-art for image generation.

VAEs: Learn a compressed latent space representation. Can generate new data by sampling from latent space.

6.2 Embeddings

Dense vector representations of discrete items (words, users, products).

# Word embeddings
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=300)
# Input: word index [42] -> Output: 300-dimensional vector

# Word2Vec, GloVe         - older, word-level, static
# BERT, Sentence-BERT     - contextual, sentence-level
# OpenAI ada-002, Cohere  - modern, used for RAG

6.3 Common Training Techniques

# Mixed precision training (PyTorch); assumes model, optimizer, criterion,
# input, and target are already defined
import torch

scaler = torch.cuda.amp.GradScaler()

optimizer.zero_grad()
with torch.cuda.amp.autocast():   # run the forward pass in float16 where safe
    output = model(input)
    loss = criterion(output, target)
scaler.scale(loss).backward()     # scale loss to avoid float16 gradient underflow
scaler.step(optimizer)            # unscale gradients, then take optimizer step
scaler.update()

7. Computer Vision Tasks

7.1 Image Processing with OpenCV

import cv2

img = cv2.imread('image.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
resized = cv2.resize(img, (224, 224))
edges = cv2.Canny(gray, 100, 200)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
_, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

7.2 Common CV Tasks

Task Description Models
Classification What's in the image? ResNet, ViT
Object Detection Where are objects + what are they? YOLO, Faster R-CNN
Semantic Segmentation Label every pixel U-Net, DeepLab
Instance Segmentation Separate each object instance Mask R-CNN
Image Generation Create new images Diffusion, GAN

8. Key Formulas

Loss Functions:
  MSE             = (1/n) * Σ (y_pred - y_true)²
  Cross-Entropy   = -Σ y_true * log(y_pred)
  Binary CE       = -(y*log(p) + (1-y)*log(1-p))
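As a quick numeric check of the loss formulas above (function names are illustrative):

```python
import math

def mse(y_pred, y_true):
    """Mean squared error over a batch."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

def binary_cross_entropy(p, y):
    """BCE for one prediction p in (0, 1) and label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(mse([2.0, 1.0], [1.0, 1.0]))              # 0.5
print(round(binary_cross_entropy(0.9, 1), 4))   # 0.1054 (confident, low loss)
print(round(binary_cross_entropy(0.1, 1), 4))   # 2.3026 (confident but wrong)
```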

Attention:
  Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) * V

Softmax:
  softmax(xi)      = e^xi / Σ e^xj

ReLU:
  f(x)             = max(0, x)

Sigmoid:
  f(x)             = 1 / (1 + e^(-x))