Machine Translation

Beginner

Translation is harder than replacing words one by one. Good translation preserves meaning, style, grammar, and sometimes cultural or domain-specific intent.

Languages have different word orders and idioms.
One source phrase may require multiple target words, or the reverse.
Context determines the right translation for ambiguous terms.
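
To make the word-order point concrete, here is a minimal sketch of naive word-for-word substitution. The three-entry lexicon and the `word_for_word` helper are illustrative assumptions, not part of any real system.

```python
# Toy English-to-French lexicon; the entries are illustrative assumptions.
LEXICON = {
    "i": "je",
    "miss": "manque",
    "you": "tu",
}


def word_for_word(sentence: str) -> str:
    """Replace each word independently, keeping the source word order."""
    words = sentence.lower().rstrip(".!?").split()
    # Unknown words are passed through in angle brackets.
    return " ".join(LEXICON.get(w, f"<{w}>") for w in words)


print(word_for_word("I miss you."))  # je manque tu
print("Tu me manques.")              # the idiomatic French rendering
```

The literal output keeps English word order and roles, while idiomatic French swaps subject and object, so no per-word lookup can produce it.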

Real-world example: translating legal or medical content requires more than fluency; terminology must be precise and consistent.

# "bank" is ambiguous: the sentence alone does not say which sense is meant.
source = "The bank is closed."
possible_meanings = ["financial institution", "river bank"]
print(source, possible_meanings)
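
One way to see how surrounding context can resolve the ambiguity in "The bank is closed." is a toy cue-word lookup. This is only a sketch: the cue sets and the `guess_sense` helper are hypothetical, and real systems learn such associations from data rather than from hand-written lists.

```python
# Hypothetical cue words for each sense of "bank".
SENSE_CUES = {
    "financial institution": {"money", "loan", "account", "deposit"},
    "river bank": {"river", "water", "fishing", "shore"},
}


def guess_sense(context: str) -> str:
    """Pick the sense whose cue words overlap most with the context."""
    tokens = set(context.lower().replace(".", " ").split())
    scores = {sense: len(tokens & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no cue words at all, the sentence stays ambiguous.
    return best if scores[best] > 0 else "unknown"


print(guess_sense("I need a loan. The bank is closed."))
print(guess_sense("We went fishing by the river. The bank is closed."))
print(guess_sense("The bank is closed."))
```

The last call returns "unknown", which mirrors the point above: without context, the sentence alone does not determine the translation.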

Advanced

Machine translation systems evolved from statistical phrase-based models to recurrent neural sequence-to-sequence (Seq2Seq) models with attention, and then to Transformers. Engineering concerns include domain adaptation, terminology constraints, low-resource languages, document-level context, and evaluation that goes beyond surface-overlap metrics such as BLEU.

Constrained decoding can preserve required terminology.
Back-translation helps low-resource settings by creating synthetic parallel data.
Document-level translation improves consistency across sentences.
Human evaluation remains important for adequacy and fluency.
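
Back-translation from the list above can be sketched as follows. The idea is to take monolingual text in the target language, translate it back into the source language with a reverse model, and pair the synthetic source with the real target. Here `toy_fr_to_en` is a hypothetical stand-in (a lookup table) for a trained French-to-English model.

```python
from typing import Callable, List, Tuple


def back_translate(
    target_sentences: List[str],
    target_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build (synthetic_source, real_target) pairs from monolingual target text."""
    return [(target_to_source(t), t) for t in target_sentences]


def toy_fr_to_en(sentence: str) -> str:
    """Hypothetical reverse model; a real one would be a trained MT system."""
    table = {
        "la banque est fermée.": "the bank is closed.",
        "le contrat est signé.": "the contract is signed.",
    }
    return table.get(sentence.lower(), "<unk>")


monolingual_fr = ["La banque est fermée.", "Le contrat est signé."]
synthetic_pairs = back_translate(monolingual_fr, toy_fr_to_en)
for src, tgt in synthetic_pairs:
    print(f"{src!r} -> {tgt!r}")
```

The resulting pairs would be added to the training data of the English-to-French model. The target side is genuine human text, which is why the approach tends to help even when the synthetic source side is noisy.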

Source text -> tokenize -> encoder -> decoder with a search strategy (e.g. beam search) -> detokenize -> target text -> evaluation
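
The decoding step of this pipeline can be illustrated with a simplified beam search over a toy model. `TOY_MODEL` is a hypothetical table of next-token probabilities standing in for a real decoder, and the sketch omits refinements such as length normalization.

```python
import math
from typing import Dict, List, Tuple

Hypothesis = Tuple[Tuple[str, ...], float]  # (tokens, cumulative log-probability)

# Hypothetical next-token distributions, keyed by the prefix generated so far.
TOY_MODEL: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"la": 0.6, "le": 0.4},
    ("la",): {"banque": 0.5, "rive": 0.5},
    ("le",): {"bord": 0.9, "</s>": 0.1},
    ("la", "banque"): {"</s>": 1.0},
    ("la", "rive"): {"</s>": 1.0},
    ("le", "bord"): {"</s>": 1.0},
}


def beam_search(beam_size: int, max_len: int = 5) -> List[Hypothesis]:
    """Return finished hypotheses, best first."""
    beams: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for prefix, score in beams:
            for token, prob in TOY_MODEL.get(prefix, {}).items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the `beam_size` highest-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == "</s>":
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)


for size in (1, 2):
    tokens, score = beam_search(size)[0]
    print(size, " ".join(tokens), round(math.exp(score), 2))
```

With `beam_size=1` the search is greedy and commits to "la" early, ending at probability 0.3. With `beam_size=2` it keeps "le" alive and finds "le bord", whose overall probability of 0.36 is higher.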

# Illustrative English-to-French glossary: source term -> required target term.
terminology = {"claim": "réclamation", "policy": "politique"}
print(terminology)
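
A glossary like this can be put to work in a simple post-hoc check. Note that this is not constrained decoding itself: it only verifies an existing translation and flags terms that were not rendered as required. The `missing_terms` helper and the sample sentences are assumptions for illustration.

```python
from typing import Dict, List


def missing_terms(source: str, translation: str, glossary: Dict[str, str]) -> List[str]:
    """Return glossary source terms whose required target term is absent."""
    src = source.lower()
    out = translation.lower()
    return [
        s_term
        for s_term, t_term in glossary.items()
        # Plain substring matching; a real check would handle inflection.
        if s_term.lower() in src and t_term.lower() not in out
    ]


glossary = {"claim": "réclamation", "policy": "politique"}
source_text = "The claim was filed under the policy."
candidate = "La demande a été déposée conformément à la politique."

print(missing_terms(source_text, candidate, glossary))  # ['claim']
```

A flagged term could trigger a retranslation or a human review. Enforcing the term during decoding is the stronger option, at the cost of a more involved search.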

Translation is a good case study because it exposes alignment, attention, decoding, and evaluation issues in one task.

To-do list

Learn

Understand why translation needs context and alignment.
Learn the historical progression from phrase-based MT to transformers.
Study low-resource translation challenges.
Understand terminology control and document-level consistency.

Practice

Inspect translations where literal word substitution fails.
Compare outputs from a generic model and a domain-adapted system.
Evaluate short translations with BLEU and human judgment.
Test ambiguous source sentences and analyze disambiguation failures.
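
For the BLEU exercise, a simplified sentence-level version is enough to get started. The sketch below assumes a single reference, whitespace tokenization, n-grams up to length 2, and no smoothing, so its scores will not match those of standard tools.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Single-reference, sentence-level BLEU without smoothing."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty lowers the score of candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the bank is closed"
print(round(simple_bleu("the bank is closed", reference), 3))      # 1.0
print(round(simple_bleu("the shore is closed", reference), 3))     # 0.5
print(round(simple_bleu("the bank is not closed", reference), 3))
```

The last candidate reverses the meaning yet still shares most of its n-grams with the reference, which is one reason to pair BLEU with human judgment.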

Build

Create a simple translation demo using a pretrained model.
Add a glossary or terminology constraint mechanism.
Build an evaluation notebook comparing outputs across domains.
Write notes on where translation quality breaks down.