Beginner
What an RNN is doing
An RNN reuses the same computation at every time step. At step $t$, it reads the current token $x_t$ and the previous hidden state $h_{t-1}$, then produces a new hidden state $h_t$. That hidden state acts like a running summary of what the model has seen so far.
- Vanilla RNN: the simplest recurrent form, but fragile on long dependencies.
- LSTM: adds a cell state and gates that control what to keep, write, and expose.
- GRU: a simpler gated variant that combines some of the LSTM controls.
- Seq2Seq: uses one recurrent network to read an input sequence and another to generate an output sequence.
x1     x2     x3
 |      |      |
 v      v      v
 h1 --> h2 --> h3 --> ...    recurrent state carries context forward

Same cell parameters are reused at every step.
hidden_state = "initial"
for token in ["I", "love", "NLP"]:
    # Fold each new token into the running summary.
    hidden_state = f"update({hidden_state}, {token})"
print(hidden_state)  # update(update(update(initial, I), love), NLP)
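The same update can be written numerically. Below is a minimal sketch of one vanilla RNN cell in NumPy; the sizes, random weights, and the name `rnn_step` are illustrative choices for this module, not any particular library's API:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 3  # toy embedding size and hidden size
W_x = rng.normal(scale=0.1, size=(d_h, d_in))  # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(d_h, d_h))   # hidden-to-hidden weights
b = np.zeros(d_h)

def rnn_step(x_t, h_prev):
    """One vanilla RNN update: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(d_h)   # initial hidden state
for _ in range(3):  # three toy "token" vectors
    x = rng.normal(size=d_in)
    h = rnn_step(x, h)  # same W_x, W_h, b reused at every step
print(h)
```

Note that only `h` survives between steps: whatever the model wants to remember must fit into that one vector.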
Why vanilla RNNs struggle
In theory, an RNN can carry information from far back in the sequence. In practice, training through many time steps is hard because gradients tend to shrink or explode as they flow backward through the chain. That is why plain RNNs often remember recent context better than distant context.
- Vanishing gradients: useful old signals decay before they can shape learning.
- Exploding gradients: updates become unstable unless clipped or carefully tuned.
- Sequential dependence: each step waits for the previous step, which slows training and inference.
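The gradient problem is easy to see with a scalar recurrence $h_t = w \, h_{t-1}$: the backward pass multiplies the gradient by $w$ once per step, so 50 steps raise $w$ to the 50th power. The scalar setup below is an illustration; real RNN Jacobians are matrices, but the same geometric growth or decay applies:

```python
def grad_through_time(steps, w):
    """Backprop through the scalar recurrence h_t = w * h_{t-1}:
    every step multiplies the gradient by the same weight w."""
    g = 1.0
    for _ in range(steps):
        g *= w
    return g

print(grad_through_time(50, 0.9))  # ~0.005: the distant signal has faded
print(grad_through_time(50, 1.1))  # ~117: the update has blown up
```

A weight only slightly below or above 1 is enough to make a 50-step-old signal nearly invisible or destructively large.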
Why LSTMs and GRUs mattered
LSTMs and GRUs were major improvements because they introduced gates. These gates learn when to forget old information, when to write new information, and when to expose stored information to the next layer or output. That makes long-range learning much more reliable than with a plain recurrent unit.
| Model | Main idea | Strength | Weakness |
|---|---|---|---|
| Vanilla RNN | Single hidden state update | Simple and lightweight | Poor long-term memory |
| LSTM | Cell state plus forget, input, and output gates | Better long-range retention | More parameters and slower step cost |
| GRU | Reset and update gates | Simpler than LSTM, often competitive | Less explicit memory control than LSTM |
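To make gating concrete, here is a minimal single-step GRU update in NumPy. The random weights, equal input and hidden sizes, and the name `gru_step` are conveniences for this sketch, and biases are omitted for brevity:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
d = 3  # toy size; input and hidden dims kept equal for brevity
Wz, Uz = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # update-gate weights
Wr, Ur = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # reset-gate weights
Wh, Uh = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # candidate-state weights

def gru_step(x, h):
    z = sigmoid(Wz @ x + Uz @ h)             # update gate: how much to rewrite
    r = sigmoid(Wr @ x + Ur @ h)             # reset gate: how much old state to read
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))  # candidate new content
    return (1 - z) * h + z * h_cand          # blend old state with new content

h = np.zeros(d)
for _ in range(3):
    h = gru_step(rng.normal(size=d), h)
print(h)
```

The last line of `gru_step` is the key idea: the new state is an elementwise interpolation between the old state and the candidate, so the model can learn to leave parts of its memory untouched.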
What Seq2Seq adds
Seq2Seq turns recurrence into a full input-output pipeline. The encoder reads the source sequence and produces a representation. The decoder then generates the target sequence token by token until it emits an end token.
Source tokens -> Encoder RNN -> context representation
|
v
Start token -> Decoder RNN -> y1 -> y2 -> y3 -> ... -> <EOS>
Real-world example: early neural machine translation systems encoded an English sentence and decoded a French translation. The same pattern was also used for headline generation, dialogue generation, and variants of speech transcription.
Advanced
Core recurrence and training dynamics
A recurrent layer can be written abstractly as $h_t = f(x_t, h_{t-1})$. Training uses backpropagation through time (BPTT): the network is unrolled across steps, and gradients are propagated backward through the full chain. That means optimization difficulty grows with sequence length, especially when relevant evidence appears far before the prediction it must support.
- Truncated BPTT limits how many steps gradients flow backward to reduce cost.
- Gradient clipping is commonly used to contain exploding updates.
- Teacher forcing feeds the true previous token into the decoder during training to stabilize learning.
- Exposure bias appears at inference time because the decoder must condition on its own past predictions instead of ground-truth tokens.
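The gradient clipping mentioned above can be sketched in a few lines. This shows clipping by global norm, one common variant; the function name is ours:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm
    is at most max_norm; leave them untouched otherwise."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads, total

grads = [np.full(4, 10.0), np.full(2, -10.0)]  # deliberately huge gradients
clipped, norm_before = clip_by_global_norm(grads)
norm_after = np.sqrt(sum(np.sum(g * g) for g in clipped))
print(norm_before, norm_after)  # the direction is kept, only the scale shrinks
```

Because every array is scaled by the same factor, clipping preserves the gradient's direction while bounding the step size.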
Encoder-decoder bottleneck
Early Seq2Seq systems often compressed the entire source sequence into a single fixed-length context vector. That works for short or moderate inputs, but becomes a bottleneck when the source is long, information-dense, or requires fine-grained alignment between source and target positions.
Without attention:
  Source sentence -> Encoder -> one context vector -> Decoder
  Problem: all source details must be squeezed into one representation.
Attention inside Seq2Seq
Attention improved Seq2Seq by letting the decoder consult all encoder states instead of relying only on one final vector. For each output step, the decoder scores source positions, builds a weighted context vector, and uses that context to predict the next target token. This gives a soft alignment between source and target tokens.
- The decoder can focus on different source words at different output steps.
- Long sentences become easier because information is no longer forced into one vector.
- Attention weights often become interpretable alignment maps.
Encoder states:  h1   h2   h3   h4  ...  hn
                  ^    ^    ^    ^        ^
                  |    |    |    |        |
Decoder step t computes attention weights over all source positions
and builds a context vector specialized for the next token.
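Those weights can be computed concretely. The sketch below uses dot-product scoring, one common choice among several scoring functions; the shapes and names are illustrative:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(3)
encoder_states = rng.normal(size=(5, 4))  # h1..h5, each 4-dimensional
decoder_state = rng.normal(size=4)        # decoder hidden state at step t

scores = encoder_states @ decoder_state   # one score per source position
weights = softmax(scores)                 # soft alignment over the source
context = weights @ encoder_states        # weighted context vector for this step

print(weights.round(2), context.shape)
```

The weights sum to 1, so the context vector is a convex combination of encoder states, recomputed fresh for every output step.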
Training and decoding workflow
def seq2seq_training_step(source_tokens, target_tokens):
    # encode, decode_step, and cross_entropy stand in for the model's
    # actual components; this is the shape of the loop, not a full model.
    encoder_states = encode(source_tokens)
    decoder_input = "<BOS>"
    loss = 0
    for gold_token in target_tokens:
        prediction = decode_step(decoder_input, encoder_states)
        loss += cross_entropy(prediction, gold_token)
        decoder_input = gold_token  # teacher forcing: condition on the gold token
    return loss
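For contrast, here is a stripped-down version of that loop that actually runs. The encoder is omitted and the "decoder" is just a random matrix mapping the previous token to next-token logits, so only the teacher-forcing mechanics are real; the vocabulary and names are invented for this sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
vocab = ["<BOS>", "i", "love", "nlp", "<EOS>"]
idx = {w: i for i, w in enumerate(vocab)}
W = rng.normal(scale=0.5, size=(len(vocab), len(vocab)))  # toy "decoder" weights

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def decode_step(prev_token):
    return softmax(W[idx[prev_token]])  # next-token distribution given prev token

def training_step(target_tokens):
    loss, prev = 0.0, "<BOS>"
    for gold in target_tokens:
        probs = decode_step(prev)
        loss += -np.log(probs[idx[gold]])  # cross-entropy against the gold token
        prev = gold                        # teacher forcing: feed the gold token back
    return loss

print(training_step(["i", "love", "nlp", "<EOS>"]))
```

Notice that the loop never uses the model's own predictions as input; that is exactly what changes at inference time.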
At inference time, the decoder does not have the gold previous token. It must feed back its own prediction, which means early mistakes can cascade.
| Stage | Common choice | Why it matters |
|---|---|---|
| Training | Teacher forcing | Faster, more stable optimization |
| Inference | Autoregressive decoding | Model must live with its own past outputs |
| Search | Greedy or beam search | Affects output quality and compute cost |
Greedy decoding vs beam search
Greedy decoding picks the highest-probability next token at each step. It is fast, but can miss a better global sequence. Beam search keeps several partial hypotheses alive, trading more compute for better sequence-level results. In translation-era Seq2Seq systems, beam search was a standard decoding upgrade.
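The difference shows up even on a tiny hand-built table of next-token probabilities. The table below is deliberately constructed so the greedy first pick ("A") leads to a worse overall sequence than the runner-up ("B"); all tokens and names are invented for the sketch:

```python
import math

# Next-token probabilities keyed by the previous token.
table = {
    "<BOS>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.3, "y": 0.3, "<EOS>": 0.4},
    "B": {"x": 0.9, "<EOS>": 0.1},
    "x": {"<EOS>": 1.0},
    "y": {"<EOS>": 1.0},
}

def greedy(max_len=5):
    seq, logp, tok = [], 0.0, "<BOS>"
    for _ in range(max_len):
        tok, p = max(table[tok].items(), key=lambda kv: kv[1])
        logp += math.log(p)
        seq.append(tok)
        if tok == "<EOS>":
            break
    return seq, logp

def beam(width=2, max_len=5):
    beams, done = [(["<BOS>"], 0.0)], []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            for tok, p in table[seq[-1]].items():
                cand = (seq + [tok], logp + math.log(p))
                (done if tok == "<EOS>" else candidates).append(cand)
        beams = sorted(candidates, key=lambda c: -c[1])[:width]
        if not beams:
            break
    best, best_logp = max(done, key=lambda c: c[1])
    return best[1:], best_logp  # drop the <BOS> marker

print(greedy())  # ends at A <EOS> (probability 0.6 * 0.4 = 0.24)
print(beam())    # recovers B x <EOS> (probability 0.4 * 0.9 * 1.0 = 0.36)
```

Greedy locks in "A" because it is locally best, while the beam keeps "B" alive long enough to discover the higher-probability sequence.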
Important distinction: this module focuses on recurrent Seq2Seq systems. Attention here is a component added to encoder-decoder RNNs, not a full replacement for recurrence.
Where recurrent Seq2Seq still makes sense
- Small educational projects where you want to understand sequence generation mechanics directly.
- Resource-constrained or latency-sensitive setups with short sequences and compact models.
- Streaming-style tasks where left-to-right state updates are operationally natural.
- Historical understanding, since many later architectures solved problems first exposed clearly by RNN-based Seq2Seq.
RNNs and Seq2Seq are no longer the dominant general NLP architecture, but they remain important because they make sequence memory, alignment, decoding, and training pathologies very explicit.
pseudocode = [
    "encode source tokens into recurrent states",
    "initialize decoder with <BOS>",
    "predict next token autoregressively",
    "optionally score all encoder states with attention",
    "stop when <EOS> is generated",
]
print(pseudocode)
To-do list
Learn
- Understand the difference between vanilla RNNs, LSTMs, and GRUs.
- Learn how hidden state, cell state, and gating relate to sequence memory.
- Study encoder-decoder Seq2Seq data flow from source tokens to generated targets.
- Understand vanishing and exploding gradients and how BPTT creates them.
- Learn why fixed-length context vectors became a bottleneck and how attention improved Seq2Seq.
- Know the difference between teacher forcing during training and autoregressive decoding at inference.
Practice
- Trace a short sentence through a toy RNN and write down each hidden-state update.
- Sketch an LSTM cell and label forget, input, and output gates from memory.
- Walk through a Seq2Seq translation example and identify encoder inputs, decoder inputs, and targets.
- Compare greedy decoding and beam search on a tiny next-token probability table.
- Inspect where a fixed-context encoder-decoder would fail on a long source sentence.
Build
- Implement or run a toy character-level RNN and log hidden states over time.
- Build a miniature encoder-decoder model for a tiny translation or sequence-reversal dataset.
- Add attention to the Seq2Seq baseline and compare outputs on longer examples.
- Track training loss with and without teacher forcing and note the behavior difference.
- Write a short note on when recurrent Seq2Seq is adequate and when its bottlenecks dominate.