Neural Network Fundamentals
Weights and Bias
Weights (W)
Weights determine how important each input feature is.
A neural network layer computes:
z = w₁x₁ + w₂x₂ + w₃x₃ + ...
Where x₁, x₂, x₃ are inputs and w₁, w₂, w₃ are weights. Each input is multiplied by its own weight.
Bias (b)
The bias is an additional value added to the weighted sum.
z = (w₁x₁ + w₂x₂ + ... + wₙxₙ) + b
Bias shifts the weighted sum up or down before the activation function is applied, allowing the model to fit data more flexibly (for example, a neuron can produce a non-zero output even when all inputs are zero).
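A minimal sketch of the weighted sum plus bias for a single neuron. The function name pre_activation and the example numbers are illustrative assumptions, not part of the notes above.

```python
def pre_activation(weights, inputs, bias):
    """Return z = w1*x1 + ... + wn*xn + b for one neuron."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    # Multiply each input by its weight, sum the products, then add the bias.
    return sum(w * x for w, x in zip(weights, inputs)) + bias


z = pre_activation([0.5, -1.0, 2.0], [1.0, 2.0, 3.0], bias=0.5)
print(z)  # 0.5 - 2.0 + 6.0 + 0.5 = 5.0
```

The value z is what would then be passed to an activation function.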
Activation Functions
Activation functions introduce non-linearity: they transform a neuron's weighted input and determine how strongly it "fires" (activates), enabling networks to learn complex, non-linear patterns in data. Common choices:
- Sigmoid
- ReLU
- Tanh (Hyperbolic Tangent)
- Leaky ReLU
Softmax
Softmax is usually applied in the last layer of a model.
It converts raw scores (logits) into probabilities.
Properties
- The output values sum to 1
- The class (or token) with the largest logit gets the highest probability; the larger its margin over the other logits, the closer that probability is to 1
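A small sketch of softmax in plain Python. Subtracting the maximum logit before exponentiating is a common numerical-stability trick and does not change the result; the example logits are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # shift by the max to avoid overflow in exp()
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)       # roughly [0.659, 0.242, 0.099]
print(sum(probs))  # approximately 1.0
```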
Sigmoid
The sigmoid function squashes values into a range between 0 and 1.
Commonly used for:
- Binary classification
- Probability outputs
Training Concepts
Loss Functions
A loss function measures how far the model's predictions are from the true values. The goal of training is to minimise the loss.
- Mean Squared Error (MSE): used for regression; penalises large errors heavily
- Mean Absolute Error (MAE): used for regression; more robust to outliers than MSE
- Binary Cross-Entropy: used for binary classification; measures difference between predicted probability and true label
- Categorical Cross-Entropy: used for multi-class classification; compares predicted probability distribution to the true class
- Sparse Categorical Cross-Entropy: same as categorical cross-entropy but accepts integer class labels instead of one-hot vectors
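The losses listed above can be sketched in a few lines of plain Python. The function names, the eps clamp used to avoid log(0), and the sample values are illustrative assumptions; deep learning libraries ship their own, more careful, implementations.

```python
import math


def mse(y_true, y_pred):
    """Mean Squared Error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average BCE for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() never sees 0
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


def sparse_categorical_cross_entropy(label, probs, eps=1e-12):
    """Cross-entropy for one sample given an integer class label."""
    return -math.log(max(probs[label], eps))


# One large error (3 vs 6) dominates MSE far more than MAE.
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 6.0]))  # 3.0
print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 6.0]))  # 1.0
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
print(sparse_categorical_cross_entropy(2, [0.1, 0.2, 0.7]))
```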
Learning Rate
The learning rate is a hyperparameter that controls how much the model's weights are updated during each step of training.
new_weight = old_weight − learning_rate × gradient
- Too high: the model overshoots the optimal weights, and training becomes unstable or diverges
- Too low: the model learns very slowly and may get stuck in local minima
- Just right: the model converges efficiently to a good solution
A widely used optimiser is Adam (Adaptive Moment Estimation), which adapts the effective step size for each parameter using running averages of past gradients and their squares.
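To make the effect of the learning rate concrete, here is a toy sketch that applies the update rule to a hypothetical one-parameter loss f(w) = (w - 3)², whose gradient is 2(w - 3). The loss function, the step count, and the three learning rates are assumptions chosen only for illustration.

```python
def gradient_descent(lr, steps=50, w=0.0):
    """Minimise f(w) = (w - 3)^2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # derivative of (w - 3)^2
        w = w - lr * grad       # new_weight = old_weight - lr * gradient
    return w


print(gradient_descent(lr=0.1))    # ends very close to the optimum w = 3
print(gradient_descent(lr=0.001))  # too low: still far from 3 after 50 steps
print(gradient_descent(lr=1.1))    # too high: each step overshoots and diverges
```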
Vanishing Gradient vs Exploding Gradient
During backpropagation, gradients are multiplied through each layer. Depending on the magnitude of these gradients, two problems can arise:
Vanishing Gradient
Gradients become extremely small as they propagate back through many layers, causing early layers to stop learning.
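- Common with sigmoid and tanh activation functions: the sigmoid derivative is at most 0.25 and the tanh derivative is at most 1 (and much smaller when the unit saturates), so the product across many layers shrinks towards zero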
- Especially problematic in deep networks and vanilla RNNs processing long sequences
- Solutions: use ReLU activations, batch normalisation, residual connections (skip connections), or gated architectures like LSTM / GRU
Exploding Gradient
Gradients become extremely large, causing weights to update by huge amounts and making training unstable (loss may spike or become NaN).
- Common in deep networks and RNNs when weight matrices have large values
- Solutions: gradient clipping (cap gradients at a maximum value), proper weight initialisation, batch normalisation
Quick Comparison
               Vanishing Gradient            Exploding Gradient
Gradients      → 0 (shrink to zero)          → ∞ (grow unbounded)
Effect         Early layers stop learning    Training becomes unstable
Common cause   Sigmoid/Tanh activations      Large weight matrices
Key fix        ReLU, LSTM/GRU, ResNets       Gradient clipping
Model Architectures
Convolutional Neural Networks (CNNs)
CNNs are commonly used for computer vision tasks.
Used For
- Image classification
- Object detection
- Medical imaging
Core Components
- Convolution Layers: extract features from images
- Pooling Layers: reduce spatial size and computation
- Activation Functions: introduce non-linearity
- Fully Connected Layers: produce final predictions
Popular Architectures
- ResNet
- VGG
- EfficientNet
Sequence Models / RNN
Recurrent Neural Networks (RNN)
RNNs have memory, allowing them to process sequential data.
Common Applications
- Machine translation
- Autocomplete
- Sentiment analysis
- Named Entity Recognition (NER)
- Time series prediction
Key Idea
The network is unrolled across time steps: the same weights and biases are reused at every step, and a hidden state carries information from one step to the next.
Key Limitations
- Vanishing / Exploding Gradients: during backpropagation through time (BPTT), gradients are multiplied at each time step; over long sequences they shrink to near-zero (vanishing) or grow unboundedly (exploding), making it hard to learn long-range dependencies
- Short-Term Memory: vanilla RNNs struggle to retain information from early time steps, so context from the beginning of a long sequence is effectively lost by the end
- Sequential (Non-Parallelisable) Processing: each time step depends on the previous hidden state, so RNNs must be processed one step at a time and cannot be parallelised across time steps on a GPU the way Transformers can, making them slow to train on long sequences
LSTM and GRU architectures mitigate the first two limitations with gating mechanisms, but the fundamental bottleneck of sequential processing remained. These limitations directly motivated the landmark 2017 paper "Attention Is All You Need" (Vaswani et al.), which introduced the Transformer architecture, replacing recurrence entirely with self-attention and addressing all three limitations.
Gradient and Backpropagation
Reference: https://www.youtube.com/watch?v=LHXXI4-IEns
Seq2Seq Models
A sequence-to-sequence (seq2seq) model maps a variable-length input sequence to a variable-length output sequence. It was the dominant architecture for tasks like machine translation, text summarisation, and dialogue generation before Transformers.
The Original Encoder-Decoder Architecture
The classic seq2seq model (Sutskever et al., 2014) consists of two RNNs (typically LSTMs):
Input: "I love NLP"

Encoder RNN
"I" → h₁ → "love" → h₂ → "NLP" → h₃ ──► context vector (c)

Decoder RNN
c → "J'" → "adore" → "le" → "TAL" → <EOS>

- Encoder: reads the entire input sequence token by token and compresses it into a single fixed-size vector called the context vector (the final hidden state)
- Decoder: takes the context vector and generates the output sequence one token at a time, using each previously generated token as input for the next step
The Bottleneck Problem
The entire input sequence is squeezed into one fixed-size context vector. For long sentences this vector cannot retain all the information, causing the model to forget early parts of the input. Translation quality dropped sharply as sentence length increased.
Long input sentence
┌──────────────────────────────────────────┐
│ word₁  word₂  word₃  ...  wordₙ          │
└──────────────────────┬───────────────────┘
                       ▼
Single context vector (c)  ← information bottleneck
                       ▼
Decoder output
The Fix: Attention Mechanism (Bahdanau et al., 2015)
Instead of relying on a single context vector, the attention mechanism lets the decoder look back at all encoder hidden states at every decoding step and decide which parts of the input to focus on.
Encoder hidden states:   h₁    h₂    h₃    h₄    h₅
                         │     │     │     │     │
                        0.05  0.10  0.60  0.20  0.05   ← attention weights (sum to 1)
                         │     │     │     │     │
                         └─────┴─────┼─────┴─────┘
                                     ▼
Weighted context vector → fed into decoder at this step
- Score: at each decoder step, compute an alignment score between the current decoder state and every encoder hidden state
- Normalise: pass the scores through softmax to get attention weights
- Weight: multiply each encoder hidden state by its attention weight and sum them into a new, step-specific context vector
- Decode: feed this context vector (along with the previous token) into the decoder to produce the next output token
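The first three steps above (score, normalise, weight) can be sketched as follows. This is a simplified illustration: it uses a dot-product score as an assumption, whereas Bahdanau et al. used a small learned feed-forward network to compute scores. The function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_states):
    """Return (attention weights, context vector) for one decoder step."""
    # Score: dot product between the decoder state and each encoder state
    # (an assumed scoring function, chosen for brevity).
    scores = [
        sum(d * h for d, h in zip(decoder_state, hs)) for hs in encoder_states
    ]
    # Normalise: softmax turns the scores into weights that sum to 1.
    weights = softmax(scores)
    # Weight: weighted sum of the encoder states, one value per dimension.
    dim = len(encoder_states[0])
    context = [
        sum(w * hs[i] for w, hs in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


weights, context = attention_context(
    [1.0, 0.0], [[0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
)
print(weights)  # the second encoder state gets the largest weight
print(context)
```

The resulting context vector would then be fed to the decoder together with the previous token (the Decode step), which is omitted here.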
Why Attention Solved the Bottleneck
- No more single vector: the decoder gets a fresh, relevant context vector at every step instead of one compressed summary
- Long-range access: the model can attend to any part of the input regardless of sequence length
- Interpretable: attention weights show which input tokens the model focuses on for each output token (useful for debugging translations)
From RNN Attention to Self-Attention (Transformers)
RNN-based attention still had to process the encoder sequentially, one token at a time. The Transformer (Vaswani et al., 2017) replaced the RNN entirely with self-attention, allowing all tokens to attend to each other in parallel, which removed the speed bottleneck while keeping the benefits of attention.
Seq2Seq Evolution:
Vanilla Seq2Seq (2014)   Fixed context vector → bottleneck on long sequences
+ Attention (2015)       Dynamic context per step → solved information loss
Transformer (2017)       Self-attention, fully parallel → solved speed and scalability
Transformers
Positional Encoding
Positional encoding adds a position-dependent vector to each token's embedding, allowing the model to understand the order of tokens in a sequence. This is needed because self-attention on its own has no built-in notion of token order.
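One common scheme is the fixed sinusoidal encoding from "Attention Is All You Need"; the notes above do not say which scheme is meant, so treating it as sinusoidal is an assumption (many models use learned or rotary position embeddings instead). A minimal sketch:

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd ones."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index, so the exponent is i / d_model.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


pe = positional_encoding(seq_len=3, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0] because sin(0) = 0 and cos(0) = 1
```

Each row would be added element-wise to the embedding of the token at that position.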
BERT vs GPT
BERT: Autoencoding encoder-only model that reads the entire sequence bidirectionally, excelling at understanding tasks (classification, NER, question answering). Trained with Masked Language Modelling (MLM), which randomly masks tokens in the input and trains the model to predict them using the full surrounding context.
GPT: Autoregressive decoder-only model that processes tokens left-to-right, suited for text generation. Trained with causal language modelling, which predicts the next token using only prior context.
Attention Mechanisms
Attention
Attention allows a model to capture relationships between tokens in a sequence.
This is particularly important in encoder-decoder architectures such as machine translation systems.
Self-Attention
Self-attention allows a model to process an entire sequence simultaneously and learn dependencies between all tokens.
This is powerful for tasks that require understanding context across the whole sequence, such as:
- Text generation
- Question answering
- Summarisation
Multi-Head Attention
Multi-head attention runs the attention mechanism multiple times in parallel, each time with different learned weight matrices. Each parallel run is called a head.
Each head can learn to focus on a different type of relationship in the sequence; for example, one head might attend to syntactic structure while another attends to semantic similarity.
The outputs of all heads are concatenated and projected into a final representation.
Input
  |
  +---> Head 1 (Q1, K1, V1) ---> Attention output 1
  +---> Head 2 (Q2, K2, V2) ---> Attention output 2
  +---> Head 3 (Q3, K3, V3) ---> Attention output 3
  |
  +--> Concatenate all outputs --> Linear projection --> Final output
Why it matters: a single attention head can only learn one way to relate tokens. Multiple heads let the model capture several different relationships simultaneously, which is key to the power of Transformers.
Text Representation & Retrieval
Embeddings & Text Representation Techniques
Embeddings are numerical vector representations of words, phrases, or tokens in a continuous vector space, capturing their semantic meaning and relationships with other words. Before modern embeddings, several simpler techniques were used.
One-Hot Encoding
Each word is represented as a vector of zeros with a single 1 at the word's index in the vocabulary.
Vocabulary: [cat, dog, fish]
cat  → [1, 0, 0]
dog  → [0, 1, 0]
fish → [0, 0, 1]
Problem: no notion of similarity; cat and dog are just as "different" as cat and fish. Vector size grows with vocabulary (sparse, high-dimensional).
N-Grams
Instead of single words, capture sequences of N consecutive words to preserve some context.
Sentence: "I love NLP"
Unigrams (1-gram): [I, love, NLP]
Bigrams (2-gram):  [I love, love NLP]
Trigrams (3-gram): [I love NLP]
Why useful: captures local word order and phrases. "not good" as a bigram is very different from "good" alone.
Problem: vocabulary explodes with larger N, still no semantic meaning.
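A short sketch of n-gram extraction over whitespace-separated tokens. The function name ngrams and the whitespace tokenisation are simplifying assumptions.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I love NLP".split()
print(ngrams(tokens, 1))  # ['I', 'love', 'NLP']
print(ngrams(tokens, 2))  # ['I love', 'love NLP']
print(ngrams(tokens, 3))  # ['I love NLP']
```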
Bag of Words (BoW)
Represents a document as a vector of word counts, ignoring order entirely.
Vocabulary: [I, love, NLP, AI]
"I love NLP" → [1, 1, 1, 0]
"I love AI"  → [1, 1, 0, 1]
Problem: no word order, no context, all words treated equally.
TF-IDF
Improves BoW by weighting words based on how frequent they are in a document vs. how common they are across all documents. Words that appear often in one document but rarely in the rest of the corpus get high scores.
Problem: still no semantic meaning, synonyms treated as unrelated.
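A minimal sketch of one common, unsmoothed TF-IDF variant: term frequency is count / document length and inverse document frequency is log(N / document frequency). Libraries often use smoothed or normalised variants, so exact numbers differ; the function name and the lowercase whitespace tokenisation are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenised:
        df.update(set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)  # these raw counts are the Bag of Words vector
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


scores = tf_idf(["I love NLP", "I love AI"])
print(scores[0])  # "i" and "love" score 0.0 (in every doc); "nlp" scores > 0
```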
Word Embeddings (Word2Vec, GloVe)
Dense low-dimensional vectors where similar words are close in vector space. Captures semantic relationships learned from large corpora.
King − Man + Woman ≈ Queen
Word2Vec Architectures
- CBOW (Continuous Bag of Words): predicts the target word from surrounding context words. Faster to train and works better for frequent words.
- Skip-gram: predicts surrounding context words from the target word. Works better for rare words and smaller datasets.
Both use a shallow neural network (single hidden layer) and are trained on large text corpora using techniques like negative sampling to make training efficient.
Limitations of Word2Vec
- Static embeddings: one fixed vector per word regardless of context. "bank" (river) and "bank" (money) get the same vector (polysemy problem).
- Out-of-vocabulary (OOV): cannot handle words not seen during training. Misspellings or new words have no representation.
- No subword information: treats each word as an atomic unit, so it cannot leverage morphology (e.g., "unhappiness" shares nothing with "happy"). FastText addresses this by using character n-grams.
- Requires large corpora: needs substantial training data to produce high-quality embeddings; performs poorly on small or domain-specific datasets.
- No sentence/document-level meaning: only produces word-level embeddings; doesn't capture phrase or sentence semantics directly.
- Encodes societal biases: reflects biases present in the training data (e.g., gender or racial stereotypes in analogy tasks).
Contextual Embeddings (BERT, GPT)
Each word gets a different vector depending on its surrounding context. "bank" in a financial sentence gets a different embedding than "bank" in a nature sentence.
These are the embeddings used in modern Transformers and LLMs.
Evolution Summary
One-Hot Encoding   → sparse, no similarity
N-Grams            → adds local context, still sparse
Bag of Words       → word counts, ignores order
TF-IDF             → weighted counts, still no semantics
Word2Vec / GloVe   → dense, semantic, but context-free
BERT / GPT         → dense, semantic, context-aware
Chunking Methods
Chunking is the process of splitting large documents into smaller pieces before embedding and retrieval.
- Fixed-size Chunking: splits text into chunks of a set character/token length, regardless of content
- Content-aware Chunking: splits at natural boundaries like sentences or paragraphs to preserve meaning
- Recursive Character Level Chunking: recursively splits on a hierarchy of separators (e.g. paragraphs → sentences → words) until chunks are small enough
- Document Structure-based Chunking: uses the document's own structure (headings, sections, markdown) to define chunk boundaries
- Contextual Chunking with LLMs: uses an LLM to intelligently determine the best split points based on semantic context
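As an illustration of the simplest method in the list, here is a sketch of fixed-size chunking by character count. The optional overlap parameter is an assumption added for illustration (overlapping chunks are a common way to avoid cutting context at boundaries); it is not described in the notes above.

```python
def fixed_size_chunks(text, chunk_size, overlap=0):
    """Split text into chunks of chunk_size characters.

    Consecutive chunks share `overlap` characters (hypothetical parameter).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last chunk is not just repeated overlap.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


print(fixed_size_chunks("abcdefghij", 4))             # ['abcd', 'efgh', 'ij']
print(fixed_size_chunks("abcdefghij", 4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

Because the split ignores content, words and sentences can be cut in half, which is the weakness the content-aware methods address.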
Hyperparameters
What Are Hyperparameters?
Hyperparameters are configuration values set before training begins. Unlike model parameters (weights and biases), they are not learned from data; you choose them manually or via search.
Training Hyperparameters
Learning Rate
Controls how large each weight update step is. Often the single most impactful hyperparameter.
Too high → overshoots optimum, loss diverges
Too low → learns very slowly, may get stuck
Typical range: 1e-5 to 1e-1 (e.g. 3e-4 for Adam)
Batch Size
Number of training samples processed before the model's weights are updated.
- Larger batch: more stable gradient estimates and better GPU utilisation, but needs more memory and may generalise worse
- Smaller batch: noisier updates, which can act as regularisation, but lower hardware throughput, so each epoch takes longer
- Typical values: 16, 32, 64, 128, 256
Epochs
One epoch = one full pass through the entire training dataset.
- Too few: underfitting
- Too many: overfitting (use early stopping to mitigate)
Optimizer
Algorithm used to update weights. Common choices:
- SGD: simple, often needs careful tuning
- Adam: adaptive learning rate per parameter; most common default
- AdamW: Adam with decoupled weight decay; preferred for Transformers
Weight Decay (L2 Regularisation)
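Penalises large weights to reduce overfitting. Classic L2 regularisation adds the penalty to the loss; decoupled weight decay (as in AdamW) instead shrinks the weights directly in the update step. The two are equivalent for plain SGD (up to a rescaling of λ) but differ for adaptive optimisers such as Adam.
loss_total = loss + λ × Σ(w²)    where λ is the weight decay coefficient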
Dropout Rate
Fraction of neurons randomly set to zero during each training step, forcing the network not to rely on any single neuron.
Typical range: 0.1 to 0.5
0.0 = no dropout (disabled)
0.5 = 50% of neurons dropped each step
Gradient Clipping
Caps gradients at a maximum norm before the weight update, preventing exploding gradients.
Common value: max_norm = 1.0
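A minimal sketch of clipping by global norm on a flat list of gradient values: if the L2 norm exceeds max_norm, every gradient is scaled by the same factor so the norm equals max_norm and the direction is preserved. The function name is hypothetical; frameworks provide their own utilities for this.

```python
import math


def clip_by_norm(grads, max_norm=1.0):
    """Rescale grads so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]


print(clip_by_norm([3.0, 4.0], max_norm=1.0))  # norm 5 -> approximately [0.6, 0.8]
print(clip_by_norm([0.3, 0.4], max_norm=1.0))  # norm 0.5 -> unchanged
```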
Architecture Hyperparameters
Number of Layers (Depth)
How many stacked layers the network has. Deeper networks can learn more complex features but are harder to train and prone to vanishing gradients.
Hidden Size / Model Dimension (d_model)
Width of each layer β the size of the internal representation vectors.
GPT-2 small: d_model = 768
GPT-3:       d_model = 12,288
Number of Attention Heads
In Transformer models, how many parallel self-attention heads to run. Each head can attend to different relationships. d_model must be evenly divisible by the number of heads.
head_dim = d_model / num_heads
e.g. 768 / 12 = 64 dimensions per head
Context Length (Sequence Length)
Maximum number of tokens the model can process at once. Attention scales quadratically with sequence length (O(n²)), so longer contexts are expensive.
Feedforward Dimension
Size of the inner layer in the Transformer's feedforward block. Typically 4 × d_model.
Inference / Generation Hyperparameters
Temperature
Scales the logits before softmax to control randomness of output.
Temperature = 0.0 → greedy (always picks highest probability token)
Temperature = 1.0 → standard distribution (default)
Temperature > 1.0 → more random / creative
Temperature < 1.0 → more focused / deterministic
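A sketch of how temperature rescales logits before softmax. Dividing by zero is undefined, so this illustration rejects temperature 0 and leaves that case to a plain argmax; that handling, the function name, and the example logits are assumptions.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy")
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.5)  # more peaked
base = softmax_with_temperature(logits, 1.0)   # unchanged distribution
flat = softmax_with_temperature(logits, 2.0)   # closer to uniform
print(max(sharp), max(base), max(flat))
```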
Top-k Sampling
At each step, only the k highest-probability tokens are considered. The model samples from these k candidates.
k = 1 → greedy decoding
k = 50 → common default
Top-p (Nucleus) Sampling
Instead of a fixed k, consider the smallest set of tokens whose cumulative probability exceeds p. Adapts dynamically to the distribution shape.
p = 0.9 → sample from tokens covering 90% of the probability mass
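The candidate-selection step of top-k and top-p can be sketched as two filters over a probability list. Both return the surviving token indices with renormalised probabilities; the actual random draw is omitted. The function names, the example distribution, and the choice to stop once the cumulative mass reaches p (rather than strictly exceeds it) are assumptions.

```python
def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))    # keeps tokens 0 and 1
print(top_p_filter(probs, 0.9))  # keeps tokens 0, 1 and 2 (mass 0.95)
```

With a peaked distribution top-p keeps few tokens, and with a flat one it keeps many, which is the dynamic behaviour a fixed k lacks.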
Max New Tokens
Hard limit on how many tokens the model generates in one call, preventing runaway outputs.
Repetition Penalty
Reduces the probability of tokens that have already appeared, discouraging repetitive output. Values > 1.0 penalise repetition.
Quick Reference
Hyperparameter      Typical Range / Values     What it controls
──────────────────────────────────────────────────────────────────────
Learning rate       1e-5 to 1e-1               Step size of weight updates
Batch size          16 to 512                  Samples per update step
Epochs              1 to 100+                  Full passes through data
Dropout             0.1 to 0.5                 Regularisation strength
Weight decay (λ)    1e-4 to 1e-1               Penalty on large weights
Gradient clip       0.5 to 5.0                 Max gradient norm
d_model             128 to 12,288              Layer width
Num heads           4 to 96                    Parallel attention heads
Context length      512 to 128,000 tokens      Max input length
Temperature         0.0 to 2.0                 Output randomness
Top-k               1 to 100                   Candidate token pool
Top-p               0.5 to 1.0                 Nucleus probability mass
LLM Training & Deployment
LLM Fine-Tuning Methods
- Full Fine-Tuning
- LoRA / PEFT
- Prompt Tuning
FP16 Memory Calculation
FP16 (16-bit floating point) uses 2 bytes per parameter.
Example β 7B parameter model:
7B × 2 bytes = 14 GB
So a 7B parameter model in FP16 requires about 14 GB of memory for the weights alone; activations and the KV cache need additional memory on top.
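The same arithmetic as a tiny helper, using decimal gigabytes (1 GB = 1e9 bytes) to match the figure above. The function name is hypothetical; the 4-byte and 1-byte cases correspond to FP32 and INT8 storage.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory needed to hold the weights only, in decimal GB."""
    return num_params * bytes_per_param / 1e9


print(weight_memory_gb(7e9))                     # 14.0 (FP16, 2 bytes)
print(weight_memory_gb(7e9, bytes_per_param=4))  # 28.0 (FP32, 4 bytes)
print(weight_memory_gb(7e9, bytes_per_param=1))  # 7.0  (INT8, 1 byte)
```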
KV Cache (Key-Value Cache)
During autoregressive generation, a Transformer decoder produces tokens one at a time. At each step, the self-attention mechanism needs the Key and Value vectors of every previous token to compute attention. Without caching, the model would have to recompute K and V for the entire sequence from scratch at every single step.
The Problem Without KV Cache
Generating token 5:
  Must attend to tokens 1, 2, 3, 4
  → recompute K and V for ALL of them

Generating token 6:
  Must attend to tokens 1, 2, 3, 4, 5
  → recompute K and V for ALL of them AGAIN

This is redundant: K and V for tokens 1-4 haven't changed!
How KV Cache Works
The KV cache stores the Key and Value matrices from all previous tokens so they are computed only once and reused at every subsequent step.
Step 1: process token 1 → compute K₁, V₁ → store in cache
Step 2: process token 2 → compute K₂, V₂ → append to cache → attend using [K₁K₂], [V₁V₂]
Step 3: process token 3 → compute K₃, V₃ → append to cache → attend using [K₁K₂K₃], [V₁V₂V₃]
...
Only the NEW token's Q, K, V are computed; past K, V are read from cache
Why It Matters
- Massive speedup: without the cache, every step recomputes K and V and redoes attention over the whole prefix (O(n²) work per new token); with the cache, each new step is O(n), since only the new token's Q attends to the cached K/V
- Trades memory for speed: the cache consumes GPU memory that grows with sequence length, number of layers, and number of attention heads
KV Cache Memory Formula
KV cache size = 2 × num_layers × num_heads × head_dim × seq_len × bytes_per_param

Example: 7B model (32 layers, 32 heads, head_dim=128, FP16, 4096 tokens):
= 2 × 32 × 32 × 128 × 4096 × 2 bytes
= ~2 GB
For long context windows (e.g. 128K tokens), the KV cache alone can exceed the model weights in memory.
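The formula translates directly into a helper for a single sequence (batch size 1). The function name is hypothetical; the leading factor 2 accounts for storing both K and V. Sizes are printed in GiB (2**30 bytes).

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_param=2):
    """KV cache size in bytes for one sequence."""
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_param


# The 7B-style configuration from the example above, FP16.
print(kv_cache_bytes(32, 32, 128, 4096) / 2**30)        # 2.0 GiB
# Same model at a 128K-token context (131072 tokens).
print(kv_cache_bytes(32, 32, 128, 128 * 1024) / 2**30)  # 64.0 GiB
```

At 128K tokens the cache for this configuration is several times larger than the roughly 14 GB of FP16 weights, which is why the optimisations that follow matter.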
Optimisations
- Multi-Query Attention (MQA): all heads share a single set of K, V projections, drastically reducing cache size
- Grouped-Query Attention (GQA): a middle ground where groups of heads share K, V (used in LLaMA 2 70B, Mistral)
- Quantised KV cache: store cached K, V in lower precision (e.g. INT8) to reduce memory
- Sliding window attention: only cache the most recent N tokens instead of the full history (used in Mistral)
Quick Summary
Without KV cache   Recompute all K, V at every step    Slow (O(n²) per token)
With KV cache      Store and reuse past K, V           Fast (O(n) per token)
Trade-off          Uses extra GPU memory               Grows with sequence length