Subject 20

RNNs and Seq2Seq

Recurrent neural networks process ordered inputs one time step at a time, carrying forward a hidden state as memory. Seq2Seq models use recurrent encoders and decoders to map one sequence into another, which made them a foundational architecture for early neural machine translation, summarization, and speech pipelines.

Beginner

What an RNN is doing

An RNN reuses the same computation at every time step. At step $t$, it reads the current token $x_t$ and the previous hidden state $h_{t-1}$, then produces a new hidden state $h_t$. That hidden state acts like a running summary of what the model has seen so far.

x1    x2    x3
 |     |     |
 v     v     v
h1 -> h2 -> h3 -> ...    recurrent state carries context forward

Same cell parameters are reused at every step.
hidden_state = "initial"
for token in ["I", "love", "NLP"]:
    hidden_state = f"update({hidden_state}, {token})"
print(hidden_state)
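The same loop with real numbers: a minimal NumPy sketch of one common vanilla-RNN parameterization, $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$. The sizes and random weights here are arbitrary, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3                      # toy dimensions (illustrative)
W_x = rng.normal(size=(d_hid, d_in))    # input-to-hidden weights
W_h = rng.normal(size=(d_hid, d_hid))   # hidden-to-hidden weights
b = np.zeros(d_hid)

h = np.zeros(d_hid)                     # initial hidden state
for x in rng.normal(size=(3, d_in)):    # three "tokens" as random vectors
    h = np.tanh(W_x @ x + W_h @ h + b)  # same parameters reused at every step
    print(h)
```

Note that the only thing that changes across steps is the hidden state; `W_x`, `W_h`, and `b` are shared.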

Why vanilla RNNs struggle

In theory, an RNN can carry information from far back in the sequence. In practice, training through many time steps is hard because gradients tend to shrink or explode as they flow backward through the chain. That is why plain RNNs often remember recent context better than distant context.

Why LSTMs and GRUs mattered

LSTMs and GRUs were major improvements because they introduced gates. These gates learn when to forget old information, when to write new information, and when to expose stored information to the next layer or output. That makes long-range learning much more reliable than with a plain recurrent unit.

Model       | Main idea                                    | Strength                             | Weakness
Vanilla RNN | Single hidden-state update                   | Simple and lightweight               | Poor long-term memory
LSTM        | Cell state plus forget, input, output gates  | Better long-range retention          | More parameters and slower step cost
GRU         | Reset and update gates                       | Simpler than LSTM, often competitive | Less explicit memory control than LSTM
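As a sketch of how the LSTM's gates interact, here is a single LSTM step in NumPy, with the four gate pre-activations stacked into one matrix. The sizes and random weights are arbitrary; this is illustrative, not an optimized implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step. W maps [x; h] to the four stacked gate pre-activations."""
    z = W @ np.concatenate([x, h]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget, input, output gates
    g = np.tanh(g)                                # candidate cell update
    c = f * c + i * g                             # gated write into the cell state
    h = o * np.tanh(c)                            # gated read out of the cell state
    return h, c

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3
W = rng.normal(size=(4 * d_hid, d_in + d_hid))
b = np.zeros(4 * d_hid)

h, c = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x, h, c, W, b)
```

The key difference from the vanilla update is the additive cell-state path `c = f * c + i * g`, which lets gradients flow across many steps without passing through a squashing nonlinearity each time.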

What Seq2Seq adds

Seq2Seq turns recurrence into a full input-output pipeline. The encoder reads the source sequence and produces a representation. The decoder then generates the target sequence token by token until it emits an end token.

Source tokens -> Encoder RNN -> context representation
                                   |
                                   v
Start token -> Decoder RNN -> y1 -> y2 -> y3 -> ... -> <EOS>

Real-world example: early neural machine translation systems encoded an English sentence and decoded a French translation. The same pattern was also used for headline generation, dialogue generation, and speech transcription variants.
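A string-level sketch of this pipeline, in the same spirit as the earlier toy loop. The `enc` and `dec` labels are placeholders for model calls, not real functions.

```python
# Encoder: fold the source tokens into one context representation.
context = "h0"
for token in ["I", "love", "NLP"]:
    context = f"enc({context},{token})"

# Decoder: generate outputs one at a time, feeding each back in.
outputs, prev = [], "<BOS>"
for _ in range(3):  # pretend the model emits <EOS> after three steps
    prev = f"dec({context},{prev})"  # next token depends on context + previous output
    outputs.append(prev)

print(context)
print(outputs)
```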

Advanced

Core recurrence and training dynamics

A recurrent layer can be written abstractly as $h_t = f(x_t, h_{t-1})$. Training uses backpropagation through time (BPTT): the network is unrolled across steps, and gradients are propagated backward through the full chain. That means optimization difficulty grows with sequence length, especially when relevant evidence appears far before the prediction it must support.
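A scalar caricature of the difficulty: the gradient that BPTT propagates through $T$ steps contains a product of $T$ per-step factors, so factors slightly below 1 vanish and factors slightly above 1 explode. The factor values here are arbitrary.

```python
def bptt_gradient_factor(w, T):
    # Scalar stand-in for the product of per-step Jacobians over T steps.
    g = 1.0
    for _ in range(T):
        g *= w
    return g

for w in (0.9, 1.1):
    print(w, bptt_gradient_factor(w, 50))
# 0.9**50 ≈ 5e-3 (vanishing), 1.1**50 ≈ 117 (exploding)
```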

Encoder-decoder bottleneck

Early Seq2Seq systems often compressed the entire source sequence into a single fixed-length context vector. That works for short or moderate inputs, but becomes a bottleneck when the source is long, information-dense, or requires fine-grained alignment between source and target positions.

Without attention
Source sentence -> Encoder -> one context vector -> Decoder

Problem:
all source details must be squeezed into one representation

Attention inside Seq2Seq

Attention improved Seq2Seq by letting the decoder consult all encoder states instead of relying only on one final vector. For each output step, the decoder scores source positions, builds a weighted context vector, and uses that context to predict the next target token. This gives a soft alignment between source and target tokens.

Encoder states: h1  h2  h3  h4  ...  hn
                    ^   ^
                    |   |
Decoder step t computes attention weights over all source positions
and builds a context vector specialized for the next token.
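One decoding step of this in NumPy, using dot-product scoring (just one choice of scorer; sizes and random states here are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_src, d = 5, 3
encoder_states = rng.normal(size=(n_src, d))  # h1..hn
decoder_state = rng.normal(size=d)            # decoder hidden state at step t

scores = encoder_states @ decoder_state       # one score per source position
weights = softmax(scores)                     # soft alignment over the source
context = weights @ encoder_states            # weighted context for this step
```

The attention weights are recomputed at every decoder step, which is exactly what lets each output token draw on a different part of the source.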

Training and decoding workflow

def seq2seq_training_step(source_tokens, target_tokens):
    # encode, decode_step, and cross_entropy stand in for model components
    encoder_states = encode(source_tokens)
    decoder_input = "<BOS>"
    loss = 0.0

    for gold_token in target_tokens:
        prediction = decode_step(decoder_input, encoder_states)
        loss += cross_entropy(prediction, gold_token)
        decoder_input = gold_token  # teacher forcing: feed the gold token, not the model's guess

    return loss

At inference time, the decoder does not have the gold previous token. It must feed back its own prediction, which means early mistakes can cascade.
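The inference-time counterpart of the training sketch above. The `decode_step` argument is a stand-in for the model call; here a scripted toy version just demonstrates the control flow.

```python
def seq2seq_generate(encoder_states, decode_step, max_len=10):
    """Autoregressive decoding: each prediction becomes the next input."""
    output, prev = [], "<BOS>"
    for _ in range(max_len):
        prev = decode_step(prev, encoder_states)  # model's own guess, not the gold token
        if prev == "<EOS>":
            break
        output.append(prev)
    return output

# Toy decode_step: emits a fixed sequence to show the feedback loop.
script = iter(["le", "chat", "dort", "<EOS>"])
print(seq2seq_generate(None, lambda prev, enc: next(script)))
# ['le', 'chat', 'dort']
```

Because each `prev` feeds the next call, a wrong early token changes every later input, which is exactly how early mistakes cascade.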

Stage     | Common choice           | Why it matters
Training  | Teacher forcing         | Faster, more stable optimization
Inference | Autoregressive decoding | Model must live with its own past outputs
Search    | Greedy or beam search   | Affects output quality and compute cost

Greedy decoding vs beam search

Greedy decoding picks the highest-probability next token at each step. It is fast, but can miss a better global sequence. Beam search keeps several partial hypotheses alive, trading more compute for better sequence-level results. In the translation systems of the Seq2Seq era, beam search was a standard decoding upgrade.
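A tiny worked comparison on a hypothetical next-token table (probabilities here depend only on the previous token, and the numbers are invented), chosen so that greedy decoding commits early to the wrong branch:

```python
import math

# Hypothetical next-token distributions: P(next | previous token).
table = {
    "<BOS>": {"A": 0.6, "B": 0.4},
    "A": {"C": 0.5, "D": 0.5},
    "B": {"C": 0.9, "D": 0.1},
}

def greedy(steps=2):
    seq, prev, logp = [], "<BOS>", 0.0
    for _ in range(steps):
        tok, p = max(table[prev].items(), key=lambda kv: kv[1])
        seq.append(tok)
        logp += math.log(p)
        prev = tok
    return seq, logp

def beam(width=2, steps=2):
    beams = [(["<BOS>"], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for tok, p in table[seq[-1]].items():
                candidates.append((seq + [tok], logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    seq, logp = beams[0]
    return seq[1:], logp

print(greedy())  # commits to A first; total probability 0.6 * 0.5 = 0.30
print(beam())    # keeps B alive and finds B, C; total probability 0.4 * 0.9 = 0.36
```

Greedy locks in "A" because it looks best locally, while a width-2 beam keeps "B" alive long enough to find the higher-probability sequence.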

Important distinction: this module focuses on recurrent Seq2Seq systems. Attention here is a component added to encoder-decoder RNNs, not a full replacement for recurrence.

Where recurrent Seq2Seq still makes sense

RNNs and Seq2Seq are no longer the dominant general NLP architecture, but they remain important because they make sequence memory, alignment, decoding, and training pathologies very explicit.

pseudocode = [
    "encode source tokens into recurrent states",
    "initialize decoder with <BOS>",
    "predict next token autoregressively",
    "optionally score all encoder states with attention",
    "stop when <EOS> is generated",
]
print(pseudocode)

To-do list

Learn

  • Understand the difference between vanilla RNNs, LSTMs, and GRUs.
  • Learn how hidden state, cell state, and gating relate to sequence memory.
  • Study encoder-decoder Seq2Seq data flow from source tokens to generated targets.
  • Understand vanishing and exploding gradients and how BPTT creates them.
  • Learn why fixed-length context vectors became a bottleneck and how attention improved Seq2Seq.
  • Know the difference between teacher forcing during training and autoregressive decoding at inference.

Practice

  • Trace a short sentence through a toy RNN and write down each hidden-state update.
  • Sketch an LSTM cell and label forget, input, and output gates from memory.
  • Walk through a Seq2Seq translation example and identify encoder inputs, decoder inputs, and targets.
  • Compare greedy decoding and beam search on a tiny next-token probability table.
  • Inspect where a fixed-context encoder-decoder would fail on a long source sentence.

Build

  • Implement or run a toy character-level RNN and log hidden states over time.
  • Build a miniature encoder-decoder model for a tiny translation or sequence-reversal dataset.
  • Add attention to the Seq2Seq baseline and compare outputs on longer examples.
  • Track training loss with and without teacher forcing and note the behavior difference.
  • Write a short note on when recurrent Seq2Seq is adequate and when its bottlenecks dominate.