Subject 27

Scaling laws in NLP

Scaling laws describe the regular, predictable way language model loss changes as model size, data volume, and training compute increase. They matter because they turn model growth from guesswork into a budgeting and planning problem.

Beginner

Scaling laws say that if you train language models in a reasonably consistent way, performance usually improves along a smooth curve as you increase size and training resources. The headline is not simply "bigger is better". The real lesson is that parameters, data, and compute must scale together if you want efficient gains.

What actually scales?

Power-law intuition

One of the classic findings is that loss often falls roughly like a power law. Informally, every large increase in resources produces a smaller but still somewhat predictable improvement. That is why a 10x increase in model size does not produce a 10x quality jump, yet it can still be strategically worth it.

Very rough intuition

More parameters  -> lower loss
More data        -> lower loss
More compute     -> lower loss

But each extra step usually buys less than the previous one.
That is diminishing returns, not random behavior.
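That shape can be sketched in a few lines of Python. The constants below are invented purely for intuition, not fitted to any real model; the point is that under a power law, each 10x step in resources multiplies loss by the same fixed factor, so absolute gains shrink while the trend stays predictable.

```python
# A made-up power law: loss = c * scale^(-alpha). Constants are invented
# for intuition only. Each 10x step multiplies loss by the same factor,
# so absolute improvements shrink even though the trend stays smooth.
c, alpha = 3.0, 0.1

for scale in [1, 10, 100, 1000]:
    loss = c * scale ** (-alpha)
    print(f"{scale:>5}x resources -> loss {loss:.3f}")
```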

Why the topic matters

Simple example

# Hypothetical pilot runs: parameters and tokens in billions, plus final loss.
# The numbers are illustrative, not results from any published model.
runs = [
    {"params_b": 0.3, "tokens_b": 20, "loss": 2.10},
    {"params_b": 1.3, "tokens_b": 50, "loss": 1.82},
    {"params_b": 6.7, "tokens_b": 200, "loss": 1.55},
]

for run in runs:
    print(f"{run['params_b']}B params, {run['tokens_b']}B tokens -> loss {run['loss']}")

The pattern to notice is not the exact numbers. It is the shape: when resources grow in a coordinated way, loss tends to improve smoothly enough that future runs can often be forecast from smaller pilot runs.
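One way to make that forecasting idea concrete is a minimal log-log fit. This sketch assumes the illustrative pilot numbers from the snippet above and a pure power-law relation between parameters and loss; the 30B target is a hypothetical extrapolation point.

```python
import math

# Fit loss ~ c * params^(-alpha) to illustrative pilot runs via ordinary
# least squares in log-log space, then extrapolate to a larger run.
params_b = [0.3, 1.3, 6.7]
loss = [2.10, 1.82, 1.55]

xs = [math.log(p) for p in params_b]
ys = [math.log(l) for l in loss]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar

# Extrapolate to a hypothetical 30B-parameter run.
pred = math.exp(intercept + slope * math.log(30))
print(f"forecast loss at 30B params: {pred:.2f}")
```

A real forecast would fit more runs and carry uncertainty bands, but the mechanics are the same: straight line in log-log space, extended past the pilot scale.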

Important scope: this module is about empirical relationships between size, data, compute, and loss. It is not the module for transformer internals, LLM applications, retrieval systems, or deployment infrastructure.

Advanced

Advanced discussion focuses on how to allocate a fixed compute budget. Early large-scale work showed that language model loss follows clean power-law trends across model size, dataset size, and compute over many orders of magnitude. Later work revised the practical recipe: many famous models were too large for the amount of data they saw, meaning they were undertrained relative to their compute budget.

Canonical scaling-law view

Three common regimes

1. Parameter-limited
   Model is too small to absorb more data efficiently.

2. Data-limited
   Model is large, but token budget is too small, so capacity is left unused.

3. Compute-limited
   You know both would help, but available training FLOPs force a trade-off.

Compute-optimal training in plain language

If you double parameters but do not materially increase tokens, the model may be too large for the amount of learning signal it receives. If you only add data while keeping the model too small, the model may not have enough capacity to exploit that data. Compute-optimal training tries to balance these two inefficiencies.

A commonly cited later takeaway is that model size and token count should often grow together much more than earlier practice assumed. This shifted the field away from extreme parameter growth with relatively fixed data budgets.
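A minimal sketch of that balancing act, assuming the commonly cited C ≈ 6 · N · D estimate of training FLOPs (N parameters, D tokens) and a rough "~20 tokens per parameter" heuristic. Both are rules of thumb, not guarantees, and the 1e23 budget is just an example input.

```python
# Split a fixed FLOPs budget between parameters and tokens under the
# rough accounting identity C ~= 6 * N * D and a fixed tokens-per-param
# ratio. Both the identity and the ratio are heuristics, not exact laws.
def compute_optimal_split(flops_budget, tokens_per_param=20):
    # Solve C = 6 * N * (r * N) for N, where r = tokens_per_param.
    n = (flops_budget / (6 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

n, d = compute_optimal_split(1e23)
print(f"params ~= {n:.2e}, tokens ~= {d:.2e}")
```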

A compact mathematical sketch

# Stylized loss decomposition, not a production formula:
# loss = irreducible floor + model-size term + data-size term.
# The constants a, b, alpha, beta are illustrative, not fitted values.

def estimated_loss(irreducible, params, tokens, a=150, b=1.0, alpha=0.34, beta=0.28):
    model_term = a * (params ** (-alpha))  # shrinks as parameters grow
    data_term = b * (tokens ** (-beta))    # shrinks as tokens grow
    return irreducible + model_term + data_term

print(round(estimated_loss(1.45, params=70e9, tokens=1.4e12), 4))

The important concept is the shape of the formula, not the constants. As parameters and tokens grow, the added benefit shrinks, and the curve approaches an irreducible floor set by the task and data distribution.
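The approach to the floor can be seen by repeatedly doubling parameters at a fixed token budget. This sketch restates the stylized decomposition so it runs on its own; the constants remain illustrative.

```python
# Repeated parameter doubling at a fixed token budget: each doubling
# buys a smaller absolute loss reduction, and the total approaches the
# irreducible floor. Constants are illustrative, not fitted values.
def estimated_loss(irreducible, params, tokens, a=150, b=1.0, alpha=0.34, beta=0.28):
    return irreducible + a * params ** (-alpha) + b * tokens ** (-beta)

prev = estimated_loss(1.45, params=1e9, tokens=1e12)
for k in range(1, 5):
    cur = estimated_loss(1.45, params=2 ** k * 1e9, tokens=1e12)
    print(f"{2 ** k}x params -> loss {cur:.4f}, gain over previous {prev - cur:.4f}")
    prev = cur
```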

What scaling laws are good for

Limits and caveats

Do not collapse this topic into "larger models always win." The real content of scaling laws is about rates of improvement, optimal allocation, and where additional spend stops being efficient.

Planning question for a fixed compute budget

Option A: much larger model + too few tokens   -> likely undertrained
Option B: balanced model + balanced token set  -> often compute-optimal
Option C: small model + massive token set      -> likely capacity-limited

Scaling-law analysis helps choose B on purpose rather than by guesswork.
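A hedged sketch of that comparison, combining the C ≈ 6 · N · D accounting identity with a stylized loss formula whose constants loosely echo published Chinchilla-style fits. Every number here, including the budget and the three model sizes, is an illustrative assumption.

```python
# Compare three parameter choices under one fixed FLOPs budget. Tokens
# for each option are whatever the budget affords at that model size
# (C ~= 6 * N * D). Constants loosely echo published fits; treat them
# as illustrative rather than authoritative.
budget = 1e22  # total training FLOPs (illustrative)

def stylized_loss(params, tokens, irreducible=1.69, a=406.0, b=411.0,
                  alpha=0.34, beta=0.28):
    return irreducible + a * params ** (-alpha) + b * tokens ** (-beta)

options = {
    "A: oversized, few tokens": 50e9,
    "B: balanced             ": 5e9,
    "C: small, many tokens   ": 0.5e9,
}
for name, n_params in options.items():
    n_tokens = budget / (6 * n_params)  # tokens affordable at this size
    print(f"{name}: loss ~= {stylized_loss(n_params, n_tokens):.3f}")
```

With these constants, the balanced option B comes out ahead of both extremes, which is the qualitative point of the exercise.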

To-do list

Learn

  • Understand why language model loss often follows power-law-like behavior over scale.
  • Learn the difference between parameter-limited, data-limited, and compute-limited regimes.
  • Study the shift from early scaling-law interpretations to compute-optimal training guidance.
  • Understand the idea of irreducible loss and diminishing marginal returns.

Practice

  • Plot loss against model size on a log-log chart using synthetic or published sample points.
  • Compare two hypothetical runs and decide which one looks undertrained.
  • Write a short explanation of why more parameters without more tokens can waste compute.
  • Estimate how much improvement you expect from the next 2x scale increase and what uncertainty remains.
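As a starting point for the second practice item, a tokens-per-parameter check is a crude but common first filter for spotting undertrained runs. The two runs and the 20:1 threshold below are illustrative assumptions, not published results.

```python
# Flag a run as likely undertrained using a rough "~20 tokens per
# parameter" heuristic. Runs and threshold are illustrative assumptions.
runs = {
    "run_1": {"params": 30e9, "tokens": 100e9},
    "run_2": {"params": 7e9, "tokens": 300e9},
}

for name, r in runs.items():
    ratio = r["tokens"] / r["params"]
    verdict = "likely undertrained" if ratio < 20 else "reasonably trained"
    print(f"{name}: {ratio:.1f} tokens/param -> {verdict}")
```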

Build

  • Create a tiny scaling-law calculator that accepts parameters, tokens, and a fixed compute budget.
  • Build a note comparing an oversized-undertrained run with a smaller-better-trained run.
  • Document assumptions that would make a scaling-law forecast fail in practice.
  • Prepare a one-page recommendation for how to allocate the next training budget increase.