My Notes

Personal study notes covering LLM fine-tuning, CNNs, RNNs, Transformers, attention mechanisms, and core neural network concepts including weights, bias, activation functions, and memory calculations.

Neural Network Fundamentals

Weights and Bias

Weights (W)

Weights determine how important each input feature is.

A neural network layer computes:

z = w₁x₁ + w₂x₂ + w₃x₃ + ...

Where x₁, x₂, x₃ are inputs and w₁, w₂, w₃ are weights. Each input is multiplied by its weight.

Bias (b)

The bias is an additional value added to the weighted sum.

z = (w₁x₁ + w₂x₂ + ... + wₙxₙ) + b

Bias shifts the output up or down before the activation function is applied, allowing the model to fit data more flexibly.
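The weighted sum plus bias can be sketched in a few lines of NumPy (the values here are illustrative, not from any real model):

```python
import numpy as np

# Inputs and learned parameters (illustrative values)
x = np.array([1.0, 2.0, 3.0])      # input features x1..x3
w = np.array([0.5, -0.2, 0.1])     # weights w1..w3
b = 0.4                            # bias

# Weighted sum plus bias: z = w1*x1 + w2*x2 + w3*x3 + b
z = np.dot(w, x) + b
print(z)  # ≈ 0.8
```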

Activation Functions

Introduce non-linearity: they determine whether a neuron should "fire" (activate) based on weighted input, enabling networks to learn complex, non-linear data patterns.

Softmax

Softmax is usually applied in the last layer of a model.

It converts raw scores (logits) into probabilities.

Properties

  1. Each output lies between 0 and 1
  2. All outputs sum to 1, so they can be read as a probability distribution
  3. Larger logits map to larger probabilities (order is preserved)

Sigmoid

The sigmoid function squashes values into a range between 0 and 1.

Commonly used for:

  1. Binary classification output layers (probability of the positive class)
  2. Gating units inside LSTMs and GRUs
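Both activation functions above can be sketched in NumPy (function names are mine; the softmax subtracts the max logit first, a standard trick for numerical stability):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    # Subtract the max logit for numerical stability, then normalise to sum to 1
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

print(sigmoid(0.0))                      # 0.5
probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs.sum())                       # sums to 1 (within float precision)
```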

Training Concepts

Loss Functions

A loss function measures how far the model's predictions are from the true values. The goal of training is to minimise the loss.

Learning Rate

The learning rate is a hyperparameter that controls how much the model's weights are updated during each step of training.

new_weight = old_weight − learning_rate × gradient

The most common optimiser used to manage the learning rate during training is Adam (Adaptive Moment Estimation), which automatically adjusts the learning rate per parameter based on past gradients.
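The plain (non-adaptive) update rule above can be sketched directly; `sgd_step` is a hypothetical helper name, and Adam would additionally track running gradient statistics:

```python
import numpy as np

def sgd_step(weights, gradients, learning_rate=0.1):
    # new_weight = old_weight - learning_rate * gradient
    return weights - learning_rate * gradients

w = np.array([1.0, -2.0])
g = np.array([0.5, -0.5])
w = sgd_step(w, g, learning_rate=0.1)
print(w)  # ≈ [0.95, -1.95]
```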

Vanishing Gradient vs Exploding Gradient

During backpropagation, gradients are multiplied through each layer. Depending on the magnitude of these gradients, two problems can arise:

Vanishing Gradient

Gradients become extremely small as they propagate back through many layers, causing early layers to stop learning.

Exploding Gradient

Gradients become extremely large, causing weights to update by huge amounts and making training unstable (loss may spike or become NaN).

Quick Comparison

                    Vanishing Gradient           Exploding Gradient
Gradients           → 0 (shrink to zero)         → ∞ (grow unbounded)
Effect              Early layers stop learning   Training becomes unstable
Common cause        Sigmoid/Tanh activations     Large weight matrices
Key fix             ReLU, LSTM/GRU, ResNets      Gradient clipping
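A toy illustration of why both problems arise: a gradient repeatedly multiplied by a per-layer factor shrinks toward zero when the factor is below 1 and blows up when it is above 1 (the function name and factors are illustrative):

```python
def backprop_magnitude(factor, num_layers, grad=1.0):
    # A gradient is multiplied by roughly one factor per layer during backprop.
    # |factor| < 1 -> vanishing; |factor| > 1 -> exploding.
    for _ in range(num_layers):
        grad *= factor
    return grad

print(backprop_magnitude(0.5, 50))   # ~8.9e-16 -> vanishing
print(backprop_magnitude(1.5, 50))   # ~6.4e+08 -> exploding
```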

Model Architectures

Convolutional Neural Networks (CNNs)

CNNs are commonly used for computer vision tasks.

Used For

  1. Image classification
  2. Object detection
  3. Image segmentation

Core Components

  1. Convolutional layers – learn local filters that produce feature maps
  2. Pooling layers – downsample while keeping the dominant features
  3. Fully connected layers – final classification head

Popular Architectures

LeNet, AlexNet, VGG, ResNet, EfficientNet

Sequence Models / RNN

Recurrent Neural Networks (RNN)

RNNs have memory, allowing them to process sequential data.

Common Applications

  1. Language modelling and text generation
  2. Speech recognition
  3. Time-series forecasting

Key Idea

We unroll the network across time steps, while keeping the same weights and biases shared at every step of the sequence.
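A minimal NumPy sketch of this unrolling: the same `W_x`, `W_h`, and `b` are reused at every time step (all sizes and names here are illustrative):

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    # The SAME W_x, W_h, b are applied at every time step (shared weights)
    h = np.zeros(W_h.shape[0])
    for x in inputs:                      # unrolled across time steps
        h = np.tanh(W_x @ x + W_h @ h + b)
    return h                              # final hidden state

rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))   # input-to-hidden weights (toy sizes)
W_h = rng.normal(size=(4, 4))   # hidden-to-hidden weights
b = np.zeros(4)
sequence = [rng.normal(size=3) for _ in range(5)]
h_final = rnn_forward(sequence, W_x, W_h, b)
print(h_final.shape)  # (4,)
```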

Key Limitations

  1. Vanishing / Exploding Gradients – during backpropagation through time (BPTT), gradients are multiplied at each time step; over long sequences they shrink to near-zero (vanishing) or grow unboundedly (exploding), making it hard to learn long-range dependencies
  2. Short-Term Memory – vanilla RNNs struggle to retain information from early time steps, so context from the beginning of a long sequence is effectively lost by the end
  3. Sequential (Non-Parallelisable) Processing – each time step depends on the previous hidden state, so RNNs must be processed one step at a time and cannot leverage GPU parallelism the way Transformers can, making them slow to train on long sequences

LSTM and GRU architectures address limitations 1 and 2 with gating mechanisms, but the fundamental bottleneck of sequential processing remained. These limitations directly motivated the landmark 2017 paper "Attention Is All You Need" (Vaswani et al.), which introduced the Transformer architecture, replacing recurrence entirely with self-attention and addressing all three limitations.

Gradient and Backpropagation

Reference: https://www.youtube.com/watch?v=LHXXI4-IEns

Seq2Seq Models

A sequence-to-sequence (seq2seq) model maps a variable-length input sequence to a variable-length output sequence. It was the dominant architecture for tasks like machine translation, text summarisation, and dialogue generation before Transformers.

The Original Encoder–Decoder Architecture

The classic seq2seq model (Sutskever et al., 2014) consists of two RNNs (typically LSTMs):

  1. Encoder – reads the entire input sequence token by token and compresses it into a single fixed-size vector called the context vector (the final hidden state)
  2. Decoder – takes the context vector and generates the output sequence one token at a time, using each previously generated token as input for the next step

Input: "I love NLP"

Encoder RNN
  "I" → h₁ → "love" → h₂ → "NLP" → h₃  ──► context vector (c)

Decoder RNN
  c → "J'" → "adore" → "le" → "TAL" → <EOS>

The Bottleneck Problem

The entire input sequence is squeezed into one fixed-size context vector. For long sentences this vector cannot retain all the information, causing the model to forget early parts of the input. Translation quality dropped sharply as sentence length increased.

Long input sentence
  ┌──────────────────────────────────────────┐
  │ word₁  word₂  word₃  ...  word₅₀  word₅₁ │
  └───────────────────┬──────────────────────┘
                      ▼
           Single context vector (c)   ← information bottleneck
                      ▼
               Decoder output

The Fix: Attention Mechanism (Bahdanau et al., 2015)

Instead of relying on a single context vector, the attention mechanism lets the decoder look back at all encoder hidden states at every decoding step and decide which parts of the input to focus on.

Encoder hidden states:  h₁   h₂   h₃   h₄   h₅
                         ↑    ↑    ↑    ↑    ↑
                        0.05 0.10 0.60 0.20 0.05   ← attention weights (sum to 1)
                         │    │    │    │    │
                         └────┴────┴────┴────┘
                                  ▼
                         Weighted context vector  → fed into decoder at this step
  1. Score – at each decoder step, compute an alignment score between the current decoder state and every encoder hidden state
  2. Normalise – pass the scores through softmax to get attention weights
  3. Weight – multiply each encoder hidden state by its attention weight and sum them into a new, step-specific context vector
  4. Decode – feed this context vector (along with the previous token) into the decoder to produce the next output token
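The four steps above can be sketched in NumPy. This is a simplified sketch: it uses a dot-product alignment score for brevity, whereas Bahdanau et al. used a small learned MLP; all names and values are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    # 1. Score: alignment between the decoder state and each encoder state
    #    (dot product here for simplicity; Bahdanau used a small MLP)
    scores = encoder_states @ decoder_state
    # 2. Normalise: softmax turns scores into attention weights summing to 1
    weights = softmax(scores)
    # 3. Weight: weighted sum of encoder states -> step-specific context vector
    context = weights @ encoder_states
    return context, weights

encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])   # h1..h3 (toy values)
decoder_state = np.array([1.0, 1.0])
context, weights = attention_context(decoder_state, encoder_states)
print(weights.sum())  # 1.0 -> a valid probability distribution over inputs
```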

Why Attention Solved the Bottleneck

  1. No fixed-size compression – the decoder can access every encoder hidden state, so long inputs are no longer squeezed into one vector
  2. Step-specific context – each output token gets its own context vector focused on the most relevant input tokens
  3. Better long-range performance – translation quality no longer degrades sharply with sentence length

From RNN Attention to Self-Attention (Transformers)

RNN-based attention still had to process the encoder sequentially, one token at a time. The Transformer (Vaswani et al., 2017) replaced the RNN entirely with self-attention, allowing all tokens to attend to each other in parallel, solving the speed bottleneck while keeping the benefits of attention.

Seq2Seq Evolution:

Vanilla Seq2Seq (2014)    Fixed context vector            → bottleneck on long sequences
+ Attention (2015)        Dynamic context per step        → solved information loss
Transformer (2017)        Self-attention, fully parallel  → solved speed and scalability

Transformers

Positional Encoding

Positional encoding adds a position-dependent vector to each token's embedding, giving the model information about token order (self-attention on its own is order-agnostic).
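The sinusoidal scheme from the original Transformer paper can be sketched as follows (a minimal version assuming an even d_model; the function name is mine):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16)
# These vectors are ADDED to the token embeddings, giving each position
# a unique, smoothly varying signature
```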

BERT vs GPT

BERT: Autoencoding encoder-only model that reads the entire sequence bidirectionally, excelling at understanding tasks (classification, NER, question answering). Trained with Masked Language Modelling (MLM): randomly masks tokens in the input and trains the model to predict them using the full surrounding context.

GPT: Autoregressive decoder-only model that processes tokens left-to-right, suited for text generation. Trained with causal language modelling: predicts the next token using only prior context.

Attention Mechanisms

Attention

Attention allows a model to capture relationships between tokens in a sequence.

This is particularly important in encoder–decoder architectures such as machine translation systems.

Self-Attention

Self-attention allows a model to process an entire sequence simultaneously and learn dependencies between all tokens.

This is powerful for tasks that require understanding context across the whole sequence, such as:

  1. Machine translation
  2. Text summarisation
  3. Question answering

[Diagram: self-attention – Query, Key, Value linear projections, MatMul, scale, and softmax steps; Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ)·V]
(From: Learning Self-Attention with Neural Networks)
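Scaled dot-product self-attention can be sketched in NumPy (a toy single-head version; weight matrices are random here purely for illustration, where a real model would learn them):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # Project the SAME input into Query, Key, and Value spaces
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    # Scaled dot-product: softmax(Q K^T / sqrt(d_k)) V
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))        # 5 tokens, model dimension 8 (toy sizes)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (5, 8) -> one contextualised vector per token
```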

Multi-Head Attention

Multi-head attention runs the attention mechanism multiple times in parallel, each time with different learned weight matrices. Each parallel run is called a head.

Each head can learn to focus on a different type of relationship in the sequence: for example, one head might attend to syntactic structure while another attends to semantic similarity.

The outputs of all heads are concatenated and projected into a final representation.

Input
  |
  +---> Head 1 (Q1, K1, V1) ---> Attention output 1
  +---> Head 2 (Q2, K2, V2) ---> Attention output 2
  +---> Head 3 (Q3, K3, V3) ---> Attention output 3
  |
  +--> Concatenate all outputs --> Linear projection --> Final output

Why it matters: a single attention head can only learn one way to relate tokens. Multiple heads let the model capture several different relationships simultaneously, which is key to the power of Transformers.
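The split-attend-concatenate-project flow in the diagram above can be sketched as follows (projections are random stand-ins for learned weights, and each head gets its own):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads):
    seq_len, d_model = X.shape
    head_dim = d_model // num_heads          # num_heads must divide d_model
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(num_heads):
        # Each head has its OWN projections (random here for illustration)
        W_q, W_k, W_v = (rng.normal(size=(d_model, head_dim)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(head_dim))
        heads.append(weights @ V)            # attention output for this head
    concat = np.concatenate(heads, axis=-1)  # (seq_len, d_model)
    W_o = rng.normal(size=(d_model, d_model))
    return concat @ W_o                      # final linear projection

X = np.random.default_rng(1).normal(size=(4, 12))
out = multi_head_attention(X, num_heads=3)
print(out.shape)  # (4, 12)
```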

Text Representation & Retrieval

Embeddings & Text Representation Techniques

Embeddings are mathematical representations of words, phrases, or tokens as vectors in a continuous space, capturing their semantic meaning and relationships with other words. Before modern embeddings, several simpler techniques were used.

One-Hot Encoding

Each word is represented as a vector of zeros with a single 1 at the word's index in the vocabulary.

Vocabulary: [cat, dog, fish]

cat  →  [1, 0, 0]
dog  →  [0, 1, 0]
fish →  [0, 0, 1]

Problem: no sense of similarity – cat and dog are just as "different" as cat and fish. Vector size grows with vocabulary (sparse, high-dimensional).

N-Grams

Instead of single words, capture sequences of N consecutive words to preserve some context.

Sentence: "I love NLP"

Unigrams (1-gram): [I, love, NLP]
Bigrams  (2-gram): [I love, love NLP]
Trigrams (3-gram): [I love NLP]

Why useful: captures local word order and phrases. "not good" as a bigram is very different from "good" alone.

Problem: vocabulary explodes with larger N, still no semantic meaning.

Bag of Words (BoW)

Represents a document as a vector of word counts, ignoring order entirely.

Vocabulary: [I, love, NLP, AI]

"I love NLP"  →  [1, 1, 1, 0]
"I love AI"   →  [1, 1, 0, 1]

Problem: no word order, no context, all words treated equally.

TF-IDF

Improves BoW by weighting each word by how frequent it is in a document versus how common it is across all documents. Words that appear often in one document but rarely across the corpus get high scores.

Problem: still no semantic meaning, synonyms treated as unrelated.
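A minimal pure-Python TF-IDF sketch (using the simplest tf and idf variants; real libraries apply smoothing and normalisation):

```python
import math

def tf_idf(docs):
    # tf: raw count of the term in the doc
    # idf: log(N / number of docs containing the term)
    n = len(docs)
    vocab = sorted({w for d in docs for w in d.split()})
    df = {w: sum(w in d.split() for d in docs) for w in vocab}
    vectors = []
    for d in docs:
        words = d.split()
        vectors.append({w: words.count(w) * math.log(n / df[w]) for w in vocab})
    return vectors

docs = ["the cat sat", "the dog sat", "the cat ran"]
vecs = tf_idf(docs)
print(vecs[0]["the"])       # 0.0 -> "the" is in every doc, idf = log(1) = 0
print(vecs[0]["cat"] > 0)   # True -> rarer terms get positive weight
```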

Word Embeddings (Word2Vec, GloVe)

Dense low-dimensional vectors where similar words are close in vector space. Captures semantic relationships learned from large corpora.

King − Man + Woman ≈ Queen

Word2Vec Architectures

  1. CBOW (Continuous Bag of Words) – predicts a target word from its surrounding context words
  2. Skip-gram – predicts the surrounding context words from a target word

Both use a shallow neural network (single hidden layer) and are trained on large text corpora using techniques like negative sampling to make training efficient.

Limitations of Word2Vec

  1. One vector per word – "bank" has the same embedding in every sentence, regardless of context
  2. Out-of-vocabulary words get no embedding at all
  3. Cannot capture word order or sentence-level meaning

Contextual Embeddings (BERT, GPT)

Each word gets a different vector depending on its surrounding context. "bank" in a financial sentence gets a different embedding than "bank" in a nature sentence.

These are the embeddings used in modern Transformers and LLMs.

Evolution Summary

One-Hot Encoding   →  sparse, no similarity
N-Grams            →  adds local context, still sparse
Bag of Words       →  word counts, ignores order
TF-IDF             →  weighted counts, still no semantics
Word2Vec / GloVe   →  dense, semantic, but context-free
BERT / GPT         →  dense, semantic, context-aware

Chunking Methods

Chunking is the process of splitting large documents into smaller pieces before embedding and retrieval. Common methods include fixed-size chunking (usually with overlap), sentence- or paragraph-based chunking, and semantic chunking.
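The simplest method, fixed-size chunking with overlap, can be sketched as follows (character-based for brevity; real pipelines usually chunk by tokens or sentences):

```python
def chunk_text(text, chunk_size=50, overlap=10):
    # Consecutive chunks share `overlap` characters, so content cut at a
    # boundary still appears whole in at least one chunk.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 120
chunks = chunk_text(doc, chunk_size=50, overlap=10)
print(len(chunks))               # 3
print([len(c) for c in chunks])  # [50, 50, 40]
```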

Hyperparameters

What Are Hyperparameters?

Hyperparameters are configuration values set before training begins. Unlike model parameters (weights and biases), they are not learned from data β€” you choose them manually or via search.

Training Hyperparameters

Learning Rate

Controls how large each weight update step is. The single most impactful hyperparameter.

Too high  → overshoots the optimum, loss diverges
Too low   → learns very slowly, may get stuck
Typical range: 1e-5 to 1e-1  (e.g. 3e-4 for Adam)

Batch Size

Number of training samples processed before the model's weights are updated.

Epochs

One epoch = one full pass through the entire training dataset.

Optimizer

Algorithm used to update weights. Common choices:

  1. SGD – vanilla stochastic gradient descent
  2. SGD with momentum – smooths updates using past gradients
  3. Adam – adaptive learning rate per parameter (the most common default)
  4. AdamW – Adam with decoupled weight decay, standard for Transformers

Weight Decay (L2 Regularisation)

Penalises large weights to reduce overfitting. Added directly to the loss.

loss_total = loss + λ × Σ(w²)   where λ is the weight decay coefficient

Dropout Rate

Fraction of neurons randomly set to zero during each training step, forcing the network not to rely on any single neuron.

Typical range: 0.1 – 0.5
0.0 = no dropout (disabled)
0.5 = 50% of neurons dropped each step
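A sketch of "inverted" dropout, the variant used in practice (the function name is mine; the fixed seed is only for reproducibility):

```python
import numpy as np

def dropout(x, rate, training=True):
    if not training or rate == 0.0:
        return x          # dropout is disabled at inference time
    # Inverted dropout: zero out `rate` of the activations and scale the
    # survivors by 1/(1-rate) so the expected activation is unchanged
    mask = np.random.default_rng(0).random(x.shape) >= rate
    return x * mask / (1.0 - rate)

x = np.ones((1000,))
out = dropout(x, rate=0.5)
print((out == 0).mean())  # roughly half the activations are dropped
```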

Gradient Clipping

Caps gradients at a maximum norm before the weight update, preventing exploding gradients.

Common value: max_norm = 1.0
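Clipping by norm can be sketched in a few lines (a single-tensor version; frameworks clip the global norm across all parameters):

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    # If the gradient's L2 norm exceeds max_norm, rescale it down to max_norm;
    # the direction is preserved, only the magnitude is capped
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])              # norm = 5.0
clipped = clip_by_norm(g, max_norm=1.0)
print(np.linalg.norm(clipped))        # ≈ 1.0
```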

Architecture Hyperparameters

Number of Layers (Depth)

How many stacked layers the network has. Deeper networks can learn more complex features but are harder to train and prone to vanishing gradients.

Hidden Size / Model Dimension (d_model)

Width of each layer β€” the size of the internal representation vectors.

GPT-2 small:  d_model = 768
GPT-3:        d_model = 12,288

Number of Attention Heads

In Transformer models, how many parallel self-attention heads to run. Each head attends to different relationships. Must divide evenly into d_model.

head_dim = d_model / num_heads
e.g. 768 / 12 = 64 dimensions per head

Context Length (Sequence Length)

Maximum number of tokens the model can process at once. Attention scales quadratically with sequence length (O(n²)), so longer contexts are expensive.

Feedforward Dimension

Size of the inner layer in the Transformer's feedforward block. Typically 4× d_model.

Inference / Generation Hyperparameters

Temperature

Scales the logits before softmax to control randomness of output.

Temperature = 0.0  → greedy (always picks highest probability token)
Temperature = 1.0  → standard distribution (default)
Temperature > 1.0  → more random / creative
Temperature < 1.0  → more focused / deterministic

Top-k Sampling

At each step, only the k highest-probability tokens are considered. The model samples from these k candidates.

k = 1   → greedy decoding
k = 50  → common default

Top-p (Nucleus) Sampling

Instead of a fixed k, consider the smallest set of tokens whose cumulative probability exceeds p. Adapts dynamically to the distribution shape.

p = 0.9  → sample from tokens covering 90% of the probability mass
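The three sampling controls above can be combined in one sketch (the `sample` helper is mine, a simplified version of what generation libraries do internally; the fixed seed is only for reproducibility):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample(logits, temperature=1.0, top_k=None, top_p=None, seed=0):
    # Temperature: scale logits before softmax (lower = more deterministic)
    if temperature == 0.0:
        return int(np.argmax(logits))            # greedy decoding
    probs = softmax(np.asarray(logits) / temperature)
    order = np.argsort(probs)[::-1]              # tokens, most probable first
    if top_k is not None:
        order = order[:top_k]                    # keep the k best candidates
    if top_p is not None:
        # Smallest prefix whose cumulative probability exceeds p (nucleus)
        cum = np.cumsum(probs[order])
        order = order[:int(np.searchsorted(cum, top_p) + 1)]
    kept = probs[order] / probs[order].sum()     # renormalise survivors
    return int(np.random.default_rng(seed).choice(order, p=kept))

logits = [2.0, 1.0, 0.1, -1.0]
print(sample(logits, temperature=0.0))                     # 0 (greedy)
print(sample(logits, temperature=1.0, top_k=2) in (0, 1))  # True
```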

Max New Tokens

Hard limit on how many tokens the model generates in one call, preventing runaway outputs.

Repetition Penalty

Reduces the probability of tokens that have already appeared, discouraging repetitive output. Values > 1.0 penalise repetition.

Quick Reference

Hyperparameter          Typical Range / Values       What it controls
────────────────────────────────────────────────────────────────────────
Learning rate           1e-5 – 1e-1                  Step size of weight updates
Batch size              16 – 512                     Samples per update step
Epochs                  1 – 100+                     Full passes through data
Dropout                 0.1 – 0.5                    Regularisation strength
Weight decay (λ)        1e-4 – 1e-1                  Penalty on large weights
Gradient clip           0.5 – 5.0                    Max gradient norm
d_model                 128 – 12,288                 Layer width
Num heads               4 – 96                       Parallel attention heads
Context length          512 – 128,000 tokens         Max input length
Temperature             0.0 – 2.0                    Output randomness
Top-k                   1 – 100                      Candidate token pool
Top-p                   0.5 – 1.0                    Nucleus probability mass

LLM Training & Deployment

LLM Fine-Tuning Methods

FP16 Memory Calculation

FP16 (16-bit floating point) uses 2 bytes per parameter.

Example – 7B parameter model:

7B × 2 bytes = 14 GB

So a 7B parameter model in FP16 requires about 14 GB of RAM.
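The calculation generalises to other precisions (weights only; activations, optimiser state, and KV cache come on top):

```python
def model_memory_gb(num_params_billions, bytes_per_param=2):
    # FP16 = 2 bytes/param, FP32 = 4, INT8 = 1
    return num_params_billions * 1e9 * bytes_per_param / 1e9

print(model_memory_gb(7))      # 14.0 GB in FP16
print(model_memory_gb(7, 4))   # 28.0 GB in FP32
```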

KV Cache (Key-Value Cache)

During autoregressive generation, a Transformer decoder produces tokens one at a time. At each step, the self-attention mechanism needs the Key and Value vectors of every previous token to compute attention. Without caching, the model would have to recompute K and V for the entire sequence from scratch at every single step.

The Problem Without KV Cache

Generating token 5:
  Must attend to tokens 1, 2, 3, 4
  → recompute K and V for ALL of them

Generating token 6:
  Must attend to tokens 1, 2, 3, 4, 5
  → recompute K and V for ALL of them AGAIN

This is redundant – K and V for tokens 1–4 haven't changed!

How KV Cache Works

The KV cache stores the Key and Value matrices from all previous tokens so they are computed only once and reused at every subsequent step.

Step 1: process token 1 → compute K₁, V₁ → store in cache
Step 2: process token 2 → compute K₂, V₂ → append to cache → attend using [K₁K₂], [V₁V₂]
Step 3: process token 3 → compute K₃, V₃ → append to cache → attend using [K₁K₂K₃], [V₁V₂V₃]
...
Only the NEW token's Q, K, V are computed – past K, V are read from cache

Why It Matters

  1. Speed – per-token cost drops from O(n²) (recompute K, V for the whole sequence) to O(n) (attend over cached K, V)
  2. Makes long generations practical – without the cache, each new token would be slower than the last
  3. Trade-off – the cache consumes GPU memory that grows linearly with sequence length

KV Cache Memory Formula

KV cache size = 2 × num_layers × num_heads × head_dim × seq_len × bytes_per_param

Example – 7B model (32 layers, 32 heads, head_dim=128, FP16, 4096 tokens):
  = 2 × 32 × 32 × 128 × 4096 × 2 bytes
  = ~2 GB
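The formula drops straight into code, reproducing the 7B-class example above (the configuration numbers are the same illustrative ones, not tied to a specific released model):

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_param=2):
    # 2x for Keys AND Values, stored per layer, per head, per token
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_param

# 7B-class model: 32 layers, 32 heads, head_dim 128, FP16, 4096-token context
size = kv_cache_bytes(32, 32, 128, 4096)
print(size / 2**30)  # 2.0 GiB
```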

For long context windows (e.g. 128K tokens), the KV cache alone can exceed the model weights in memory.

Optimisations

  1. Multi-Query Attention (MQA) – all heads share a single K/V projection, shrinking the cache
  2. Grouped-Query Attention (GQA) – groups of heads share K/V, a middle ground used by many recent LLMs
  3. KV cache quantisation – store cached K/V at lower precision (e.g. 8-bit)
  4. PagedAttention – manages the cache in fixed-size blocks to reduce memory fragmentation (used in vLLM)

Quick Summary

Without KV cache    Recompute all K, V at every step      Slow (O(n²) per token)
With KV cache       Store and reuse past K, V             Fast (O(n) per token)
Trade-off           Uses extra GPU memory                 Grows with sequence length