Subject 12

Text preprocessing and normalization

Text preprocessing and normalization convert messy real-world text into a cleaner, more consistent form so later NLP steps are easier, safer, and more reliable. It is one of the least flashy parts of the pipeline, but it often determines whether the rest of the system sees signal or noise.

Beginner

Real text is rarely clean. It can contain extra spaces, copied HTML, broken punctuation, OCR artifacts, different Unicode forms for what looks like the same character, repeated symbols, smart quotes, mixed line breaks, or invisible control characters. Preprocessing is the stage where you decide which of those differences are accidental noise and which are meaningful information.

What preprocessing is trying to do

The goal is not to make text look pretty. The goal is to make equivalent inputs behave more consistently for the task you care about. Two visually different strings may represent the same thing to your model, search engine, or classifier, and normalization helps collapse that unnecessary variation.
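As a concrete illustration (a Python sketch using the standard unicodedata module), the two spellings of "café" below look identical on screen but compare unequal until normalized:

```python
import unicodedata

a = "caf\u00e9"    # "café" with a precomposed é (one code point)
b = "cafe\u0301"   # "café" as e + combining acute accent (two code points)

print(a == b)  # False: same appearance, different code points
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True
```

Which normalization form you pick matters less than applying it consistently across the whole pipeline.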

Common cleanup operations

  • Trim and collapse whitespace: "  hello   world  " -> "hello world". Caution: may destroy layout that matters in code, tables, or addresses.
  • Unicode normalization: full-width "ＡＢＣ" -> "ABC". Caution: compatibility normalization can change presentation details.
  • Case normalization: "Invoice" -> "invoice". Caution: can hurt NER, acronyms, and legal or biomedical text.
  • HTML cleanup: "<b>Sale</b>" -> "Sale". Caution: some tags encode structure that may still matter.
  • Control-character removal: strip zero-width or non-printing characters. Caution: some languages and tools treat such markers as meaningful (for example, zero-width joiners in emoji sequences and some scripts).
A minimal helper combining these operations:

import re
import unicodedata

def normalize_text(text):
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u00a0", " ")
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize_text("  Full-width ＡＢＣ   and   extra spaces  "))

Raw text
  -> remove accidental formatting noise
  -> standardize characters and spacing
  -> keep meaning-bearing content intact
  -> hand cleaner text to later pipeline stages

Real-world example: OCR-scanned invoices often contain broken spacing, full-width characters, odd dashes, and invisible formatting artifacts. A modest normalization pass can materially improve matching, deduplication, and field extraction.

Simple rule: normalize what is accidental, preserve what may carry meaning. If you are not sure, keep both the raw and normalized forms.
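That rule can be made concrete by storing both forms side by side. A minimal sketch, with hypothetical normalize and make_record helpers:

```python
import re
import unicodedata

def normalize(text):
    # NFKC folds compatibility characters (e.g. non-breaking spaces become spaces)
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()

def make_record(raw):
    # keep the raw evidence alongside the cleaned form
    return {"raw": raw, "normalized": normalize(raw)}

print(make_record("  Invoice\u00a0No.  42  ")["normalized"])  # Invoice No. 42
```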

Intermediate

Useful preprocessing is task-aware. The right policy for search is not always the right policy for named entity recognition, classification, or document auditing. Strong practitioners think in terms of normalization policy: what should be standardized, what should be preserved, and why.

Task-specific normalization choices

  • Keyword search. Usually helpful: whitespace cleanup, Unicode normalization, case folding, quote normalization. Often dangerous: dropping symbols that distinguish product codes or IDs.
  • Named entity recognition. Usually helpful: Unicode cleanup, consistent spacing, controlled OCR repair. Often dangerous: aggressive lowercasing or punctuation removal.
  • Sentiment or classification. Usually helpful: basic cleanup, repeated-whitespace removal, noisy markup stripping. Often dangerous: removing negation markers, emojis, or repeated emphasis.
  • Deduplication. Usually helpful: case normalization, whitespace normalization, stable punctuation rules. Often dangerous: changing text so much that near-duplicates become false matches.
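The deduplication risk is easy to demonstrate. Below, a deliberately over-aggressive (hypothetical) cleaner keeps only alphanumerics and collapses two distinct-looking strings into the same key:

```python
def aggressive_key(text):
    # drops every symbol and space; too destructive for IDs and product codes
    return "".join(ch.lower() for ch in text if ch.isalnum())

print(aggressive_key("SKU-1234/A"))   # sku1234a
print(aggressive_key("SKU 1234 A"))   # sku1234a -> false duplicate
```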

Unicode and text-format subtleties

Unicode normalization matters because the same human-visible text can be encoded in different ways. Canonical forms (NFC/NFD) only unify equivalent encodings of the same character, while compatibility forms (NFKC/NFKD) also fold presentation variants such as full-width characters and ligatures. That folding is useful for many pipelines, but it can be too aggressive when exact rendering matters.
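A short Python illustration of the difference, using the standard unicodedata module:

```python
import unicodedata

decomposed = "e\u0301"   # e + combining acute accent
ligature = "\ufb01"      # the single "fi" ligature code point

# Canonical normalization (NFC) unifies equivalent encodings of the same text...
print(unicodedata.normalize("NFC", decomposed) == "\u00e9")  # True

# ...but leaves compatibility characters like the ligature alone.
print(unicodedata.normalize("NFC", ligature) == ligature)    # True

# Compatibility normalization (NFKC) also folds presentation variants.
print(unicodedata.normalize("NFKC", ligature))               # fi
print(unicodedata.normalize("NFKC", "ＡＢＣ"))                # ABC
```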

Pipeline design pattern

Raw document
  -> preserve original text
  -> apply deterministic cleanup rules
  -> produce normalized text for matching or modeling
  -> log preprocessing version and key transforms
  -> keep enough traceability to explain differences
A normalization function for indexing that follows this pattern:

import html
import re
import unicodedata

def normalize_for_indexing(text, lowercase=True):
    cleaned = html.unescape(text)
    cleaned = unicodedata.normalize("NFKC", cleaned)
    cleaned = cleaned.replace("\u00a0", " ")
    cleaned = re.sub(r"<[^>]+>", " ", cleaned)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    if lowercase:
        cleaned = cleaned.lower()
    return cleaned

raw_text = "<p>Order № 123</p>\nShipped on 03/08/2026"

record = {
    "raw": raw_text,
    "normalized": normalize_for_indexing(raw_text),
    "preprocess_version": "v2-indexing"
}

print(record)

Keep this module separate from tokenization, multilingual handling, retrieval design, or embedding models. Preprocessing and normalization are about cleaning and standardizing text before those later choices, not replacing them.

Advanced

At advanced maturity, preprocessing becomes a data-contract and quality-control problem. You want a pipeline that is deterministic, explainable, testable, and stable across time. Small text-cleaning changes can silently change model accuracy, break annotation offsets, alter deduplication rates, or reduce audit trust if the raw evidence is no longer recoverable.

Properties of a strong preprocessing pipeline

  • Deterministic: the same input always yields the same output, run after run.
  • Explainable: every transform is a named, documented rule rather than an ad hoc regex.
  • Testable: unit tests cover whitespace, Unicode, markup stripping, and idempotence.
  • Versioned: outputs are tagged with the rule set that produced them, so historical results stay reproducible.

Failure modes to watch for

  • Silent accuracy shifts after a seemingly harmless cleaning change.
  • Broken annotation offsets when normalization changes string lengths.
  • Shifted deduplication rates from over- or under-normalization.
  • Lost audit trust when the raw evidence is no longer recoverable.

Raw input
  -> preserve original
  -> apply task-specific normalization profile
  -> validate on downstream metrics
  -> log version, rules, and known trade-offs
  -> ship normalized text plus traceability metadata
Named profiles make the policy explicit per consumer (reusing normalize_for_indexing from above):

def preprocess_record(text, profile_name):
    profiles = {
        "display_safe": {"lowercase": False},
        "search_index": {"lowercase": True},
    }

    profile = profiles[profile_name]
    normalized = normalize_for_indexing(text, lowercase=profile["lowercase"])

    return {
        "raw": text,
        "normalized": normalized,
        "profile": profile_name,
        "preprocess_version": "v3"
    }

print(preprocess_record("Order № 123 — Shipped on 03/08/2026", "search_index"))

In practice, strong teams treat preprocessing as part of the system contract. They define the rules, test them on realistic text, measure downstream impact, and retain enough provenance to explain exactly why two text strings matched, differed, or failed.
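One cheap but effective contract test is idempotence: running the pipeline twice should change nothing. A sketch, with a hypothetical normalize function standing in for your real pipeline:

```python
import re
import unicodedata

def normalize(text):
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()

def check_idempotent(samples):
    # already-normalized text should pass through unchanged
    for s in samples:
        once = normalize(s)
        assert normalize(once) == once, f"not idempotent for {s!r}"
    return True

print(check_idempotent(["  a\u00a0 b ", "\ufb01le  name", "plain"]))  # True
```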

To-do list

Learn

  • Understand the difference between cleaning accidental noise and removing meaningful text.
  • Learn whitespace, casing, HTML cleanup, control-character removal, and Unicode normalization trade-offs.
  • Study when NFC/NFD versus NFKC/NFKD style normalization is appropriate.
  • Learn why preprocessing policies should differ across search, classification, NER, and deduplication.
  • Understand versioning, traceability, and offset-alignment risks in text pipelines.

Practice

  • Create several normalization policies and test them on OCR text, HTML snippets, chat text, and copied PDFs.
  • Compare raw-text, lowercased, and Unicode-normalized pipelines on a small matching or classification task.
  • Inspect examples where punctuation, emoji, or capitalization change the meaning and should be preserved.
  • Build before-and-after examples showing where aggressive cleanup creates false matches or lost entities.
  • Check whether your preprocessing remains stable when run twice and whether offsets still align.

Build

  • Create a reusable preprocessing module with configurable profiles such as display-safe and search-index.
  • Add unit tests for whitespace handling, Unicode normalization, HTML stripping, and idempotence.
  • Log preprocessing versions and retain raw plus normalized text in your document records.
  • Build a small evaluation script that compares downstream results before and after normalization changes.
  • Document a normalization policy for one domain corpus and justify each rule in plain language.