Beginner
What makes a problem sequential
A problem becomes sequential when the order of items matters, not just the items themselves. In language, the same words can mean different things depending on position, local context, and what came before.
- Order matters: "dog bites man" and "man bites dog" contain the same words but different meaning.
- Context matters: the word "bank" means something different in "river bank" versus "bank loan".
- Length varies: text inputs are usually not fixed-size vectors, so models must handle short and long sequences.
Input sequence: x1 x2 x3 ... xn

Meaning depends on both token identity and position. The model must decide:

- what information to keep
- what information to ignore
- how each position affects the output
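The order-sensitivity point can be checked with a tiny sketch: viewed as unordered bags of words, the two sentences above are identical, but viewed as sequences they differ.

```python
from collections import Counter

a = "dog bites man".split()
b = "man bites dog".split()

# Bag-of-words view: same words, same counts, so the two look identical.
print(Counter(a) == Counter(b))  # True

# Sequence view: position is kept, so the two are different.
print(a == b)  # False
```

Any model that only sees word counts cannot distinguish these inputs; a sequence model can.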
Common sequence task types
Not all sequence models produce the same kind of output. A useful way to think about them is by how inputs and outputs align.
| Task type | Input | Output | Example |
|---|---|---|---|
| Sequence classification | Whole sequence | One label | Sentiment analysis |
| Sequence labeling | Whole sequence | One label per token | Named-entity recognition, POS tagging |
| Language modeling | Prefix or context | Next token or missing token | Autocomplete, masked-word prediction |
| Sequence generation | Input sequence or prompt | New sequence | Summarization, translation, dialogue |
```python
tokens = ["the", "cat", "sat"]
for i, token in enumerate(tokens):
    print(i, token)  # each token comes with a position

task_views = {
    "sequence_classification": "one label for all tokens",
    "sequence_labeling": "one label per token",
    "next_token_prediction": "predict what comes next",
}
print(task_views)
```
Why sequence modeling is harder than simple classification
In ordinary tabular prediction, each example arrives as a fixed set of fields. In sequence problems, the model must build a representation on the fly while reading the input, balancing local clues and longer-range context at the same time.
- Nearby words often carry syntax and short-range meaning.
- Distant words may determine topic, reference, or agreement.
- Padding, truncation, and very long inputs can distort what the model sees.
Real-world example: sentiment classification may depend on a late negation, while sequence labeling may depend on a named entity introduced several tokens earlier.
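The late-negation case can be made concrete with a toy sketch (the word lists and the truncation length here are hypothetical, chosen only for illustration): a classifier that reads a truncated window misses a negation that arrives near the end.

```python
# Toy keyword "classifier" -- not a real model, just an illustration
# of how truncation can hide a late negation.
POSITIVE = {"great", "good", "love"}
NEGATORS = {"not", "never"}

def naive_sentiment(tokens, max_len=None):
    """Count positive words; flip the sign if a negator appears
    anywhere in the (possibly truncated) window."""
    window = tokens[:max_len] if max_len is not None else tokens
    score = sum(t in POSITIVE for t in window)
    if any(t in NEGATORS for t in window):
        score = -score
    return "positive" if score > 0 else "negative"

review = "the plot was great but the acting was not".split()
print(naive_sentiment(review))             # sees the negation -> negative
print(naive_sentiment(review, max_len=5))  # truncated -> positive
```

The prediction flips purely because of where the window ends, which is exactly the kind of distortion the bullet on padding and truncation warns about.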
Advanced
Formal view of a sequence model
Let an input sequence be $x_1, x_2, ..., x_n$. A sequence model learns a function that maps that ordered input to outputs such as a single label, a token-aligned label sequence, token probabilities, or a generated target sequence. The main question is not just what to predict, but which parts of the context each prediction should depend on.
- Causal dependence: predict position $t$ using only earlier positions.
- Bidirectional dependence: predict using both left and right context.
- Aligned output: output length matches input length.
- Unaligned output: input and output lengths differ or output is generated step by step.
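The causal versus bidirectional distinction can be sketched as a visibility rule: which positions are allowed to inform the prediction at position $t$ (0-indexed here, purely for illustration).

```python
def visible_positions(n, t, causal=True):
    """Positions that may inform the prediction at position t
    in a sequence of length n (0-indexed)."""
    return list(range(t + 1)) if causal else list(range(n))

n = 5
print(visible_positions(n, 2, causal=True))   # [0, 1, 2]
print(visible_positions(n, 2, causal=False))  # [0, 1, 2, 3, 4]
```

A causal model restricted to earlier positions can generate text left to right; a bidirectional model sees the whole input, which suits labeling and understanding tasks but not step-by-step generation.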
Input: x1 x2 x3 x4 ... xn
Output: y (one label), or y1 y2 y3 y4 ... yn (aligned), or y1 y2 ... ym (unaligned)

The design choice is the dependency pattern:

- one label for the whole input
- one label per input position
- one output sequence conditioned on the input sequence
Key design axes
Most sequence-model families differ along a small set of design questions. Thinking in these dimensions helps you compare models without jumping too quickly into architecture details.
| Design axis | Main question | Why it matters |
|---|---|---|
| Context direction | Can the model use only past tokens, or both past and future tokens? | Determines suitability for generation versus understanding tasks |
| State representation | How is earlier information summarized or preserved? | Affects long-range memory and compression quality |
| Alignment | Does each input position map to one output position? | Separates labeling tasks from generation tasks |
| Decoding strategy | Are outputs predicted all at once or sequentially? | Changes latency, error propagation, and search complexity |
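As a sketch of how the axes apply, here are two familiar setups described along the four dimensions (these descriptors are informal labels for illustration, not a standard taxonomy):

```python
# Placing two task setups on the design axes from the table above.
ner_tagger = {
    "context_direction": "bidirectional",
    "state": "full per-token representations",
    "alignment": "one output per input position",
    "decoding": "all positions at once",
}
autocomplete = {
    "context_direction": "causal",
    "state": "summary of the prefix",
    "alignment": "unaligned, generated step by step",
    "decoding": "sequential",
}

for name, axes in [("NER tagger", ner_tagger), ("autocomplete", autocomplete)]:
    print(name, axes)
```

Filling in this kind of checklist for a new task often narrows the candidate model families before any architecture is chosen.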
Training objectives and evaluation framing
Sequence models are often trained with token-level or sequence-level objectives. The right objective depends on the task formulation.
- Classification: optimize one label for the full sequence.
- Labeling: optimize one prediction per position, often with masking for padding tokens.
- Language modeling: optimize token prediction under conditional probability.
- Generation: optimize target tokens conditioned on source context and previously generated tokens.
```python
def next_token_probability(history):
    return f"P(next | {' '.join(history)})"

def sequence_label_probability(tokens, labels):
    return f"P({labels} | {tokens})"

print(next_token_probability(["the", "cat"]))
print(sequence_label_probability(["John", "works", "there"], ["B-PER", "O", "O"]))
```
Evaluation should match the output structure. Accuracy can be fine for simple sequence classification, but token-level tagging often needs span-aware metrics, and generative tasks may need sequence-level quality measures. A model can look strong on local predictions while still failing to preserve global consistency.
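The gap between token-level and span-aware evaluation can be shown with a toy BIO example (the `spans` helper below is a simplified sketch, not a full BIO parser): missing one entity token barely dents token accuracy but destroys the span.

```python
gold = ["B-PER", "I-PER", "O", "O", "O"]
pred = ["B-PER", "O",     "O", "O", "O"]  # second entity token missed

# Token-level accuracy still looks respectable.
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(token_accuracy)  # 0.8

def spans(labels):
    """Extract (start, end) spans from simple BIO labels (sketch)."""
    out, start = [], None
    for i, lab in enumerate(labels + ["O"]):  # sentinel closes open spans
        if lab.startswith("B"):
            if start is not None:
                out.append((start, i))
            start = i
        elif lab == "O" and start is not None:
            out.append((start, i))
            start = None
    return out

# Span-level comparison: the predicted entity span is simply wrong.
print(spans(gold) == spans(pred))  # False
```

Four of five tokens are correct, yet the one named entity is not recovered, which is why tagging tasks are usually scored at the span level.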
Common failure modes
- Long-range dependency failure: the model overweights nearby tokens and misses distant evidence.
- Context compression failure: too much information is forced into a weak internal summary.
- Length sensitivity: behavior degrades on much longer sequences than seen during training.
- Error propagation: in autoregressive settings, early mistakes can corrupt later outputs.
- Boundary and padding mistakes: special tokens or truncation rules distort predictions.
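Error propagation in particular is easy to demonstrate with a deliberately trivial autoregressive sketch: each step conditions only on the previous output, so a single early mistake contaminates everything after it.

```python
# Toy autoregressive process: the "model" just repeats its last output.
# flip_at injects one error to show how it propagates.
def generate(seed, steps, flip_at=None):
    out = [seed]
    for t in range(1, steps):
        nxt = out[-1]      # conditioned only on the previous output
        if t == flip_at:
            nxt = "X"      # one early mistake...
        out.append(nxt)
    return out

print(generate("a", 5))             # ['a', 'a', 'a', 'a', 'a']
print(generate("a", 5, flip_at=1))  # ['a', 'X', 'X', 'X', 'X']
```

Real models condition on richer context, but the mechanism is the same: once a wrong token enters the history, every later prediction is conditioned on it.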
Scope boundary: this module explains the sequence-modeling problem class. The next modules handle the major architecture families used to solve it, including recurrent models, Seq2Seq systems, attention, and transformers.
Why this topic matters
If you understand sequence models at the formulation level, later architecture choices become easier to reason about. You can ask whether a task needs left-to-right prediction, token alignment, bidirectional context, or free-form generation before deciding which model family fits best.
To-do list
Learn
- Understand the difference between sequence classification, sequence labeling, language modeling, and sequence generation.
- Learn why token order, local context, and distant context all matter in language tasks.
- Study causal versus bidirectional context and when each is appropriate.
- Understand aligned outputs versus generated outputs.
- Learn the main failure modes: long-range dependency issues, compression loss, and length sensitivity.
Practice
- Map ten NLP tasks to the correct sequence-model formulation and justify each choice.
- Write down simple conditional probabilities for next-token and token-label predictions.
- Compare a bag-of-words view with an ordered-sequence view on the same sentence pair.
- Identify which tasks need only past context and which need both left and right context.
- Inspect examples where truncation or padding would change the model's behavior.
Build
- Create a tiny simulator that shows how changing one earlier token changes later predictions.
- Build a small dataset for sequence classification and another for token labeling.
- Write a short comparison sheet linking task structure to the required output format.
- Prepare a set of short and long example sequences to test context sensitivity later.
- Document one real application each for classification, labeling, language modeling, and generation.