Beginner
Translation is harder than replacing words one by one. Good translation preserves meaning, style, grammar, and sometimes cultural or domain-specific intent.
- Languages have different word orders and idioms.
- One source phrase may require multiple target words, or the reverse.
- Context determines the right translation for ambiguous terms.
Real-world example: translating legal or medical content requires more than fluency; terminology must be precise and consistent.
# "bank" is ambiguous until the surrounding context picks one sense.
source = "The bank is closed."
possible_meanings = ["financial institution", "river bank"]
print(source, possible_meanings)
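A toy sketch of why word-by-word substitution fails. The English-to-French lexicon `LEXICON` and the helper `word_by_word` are hypothetical, invented here for illustration only.

```python
# Hypothetical toy lexicon, invented for illustration only.
LEXICON = {
    "the": "le",
    "bank": "banque",
    "is": "est",
    "closed": "fermé",
    "river": "rivière",
}


def word_by_word(sentence: str) -> str:
    """Replace each word independently, ignoring context and agreement."""
    words = sentence.lower().rstrip(".").split()
    # Unknown words are passed through unchanged.
    return " ".join(LEXICON.get(w, w) for w in words)


literal = word_by_word("The bank is closed.")
print(literal)  # le banque est fermé
```

A fluent rendering would be "La banque est fermée.": the article and the participle must agree with the feminine noun, which a per-word lookup cannot know. The same lookup also maps "river bank" to "rivière banque", because it has no way to choose a different sense of "bank".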
Advanced
Machine translation systems evolved from statistical phrase-based models to neural sequence-to-sequence (Seq2Seq) models, first recurrent networks with attention and then Transformers. Engineering concerns include domain adaptation, terminology constraints, low-resource languages, document-level context, and evaluation that goes beyond surface overlap metrics such as BLEU.
- Constrained decoding can preserve required terminology.
- Back-translation helps low-resource settings by creating synthetic parallel data.
- Document-level translation improves consistency across sentences.
- Human evaluation remains important for adequacy and fluency.
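The back-translation idea from the list above can be sketched as follows. This is a minimal illustration, not a training recipe: `back_translate` is a hypothetical helper, and `toy_french_to_english` is a lookup-table stand-in for a trained target-to-source model.

```python
from typing import Callable, List, Tuple


def back_translate(
    target_sentences: List[str],
    target_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each real target sentence with a machine-generated source sentence."""
    return [(target_to_source(t), t) for t in target_sentences]


# Stand-in for a trained reverse-direction model (assumption: a real system
# would call an actual French-to-English translation model here).
def toy_french_to_english(sentence: str) -> str:
    table = {"La banque est fermée.": "The bank is closed."}
    return table.get(sentence, "<unk>")


monolingual_french = ["La banque est fermée."]
synthetic_pairs = back_translate(monolingual_french, toy_french_to_english)
print(synthetic_pairs)
```

The synthetic side is the source; the target side stays human-written, so a forward model trained on these pairs still learns to produce genuine target-language text.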
Source text -> tokenize -> encoder -> decoder -> beam search / decoding -> target text -> evaluation
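The decoding step in the pipeline above can be illustrated with a toy beam search. The table `NEXT_TOKEN_PROBS` is a hypothetical stand-in for a decoder's next-token distribution; its tokens and probabilities are invented so that greedy decoding and a wider beam disagree.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
NEXT_TOKEN_PROBS: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"la": 0.6, "le": 0.4},
    ("la",): {"banque": 0.5, "rive": 0.5},
    ("le",): {"bord": 0.9, "banc": 0.1},
    ("la", "banque"): {"</s>": 1.0},
    ("la", "rive"): {"</s>": 1.0},
    ("le", "bord"): {"</s>": 1.0},
    ("le", "banc"): {"</s>": 1.0},
}

Hypothesis = Tuple[float, Tuple[str, ...]]  # (log-probability, tokens)


def beam_search(beam_width: int, max_len: int = 5) -> List[Hypothesis]:
    """Return finished hypotheses, best first."""
    beams: List[Hypothesis] = [(0.0, ())]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for score, prefix in beams:
            for token, p in NEXT_TOKEN_PROBS.get(prefix, {}).items():
                candidates.append((score + math.log(p), prefix + (token,)))
        if not candidates:
            break
        # Keep only the top `beam_width` candidates by log-probability.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:
            if cand[1][-1] == "</s>":
                finished.append(cand)
            else:
                beams.append(cand)
        if not beams:
            break
    return sorted(finished, key=lambda c: c[0], reverse=True)


greedy = beam_search(beam_width=1)[0]
wider = beam_search(beam_width=2)[0]
print(greedy[1])  # ('la', 'banque', '</s>')
print(wider[1])   # ('le', 'bord', '</s>')
```

With a beam of 1 the search commits to the locally best first token and ends with sequence probability 0.3; a beam of 2 keeps the second-best first token alive and finds a sequence with probability 0.36. The sketch omits length normalization, which matters once hypotheses differ in length.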
# Glossary: source term -> required target (French) term.
terminology = {"claim": "réclamation", "policy": "politique"}
print(terminology)
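A glossary like this can at least be checked after the fact. The sketch below is a post-hoc check, not constrained decoding itself; `missing_terms` is a hypothetical helper, and its substring matching is deliberately crude (it would also match "claim" inside "claims").

```python
from typing import Dict, List


def missing_terms(source: str, translation: str, glossary: Dict[str, str]) -> List[str]:
    """Return glossary source terms that occur in `source` but whose
    required target term is absent from `translation`."""
    src = source.lower()
    tgt = translation.lower()
    return [s for s, t in glossary.items() if s in src and t.lower() not in tgt]


glossary = {"claim": "réclamation", "policy": "politique"}
source_text = "The claim was rejected under this policy."
candidate = "La demande a été rejetée en vertu de cette politique."

violations = missing_terms(source_text, candidate, glossary)
print(violations)  # ['claim']
```

A check like this flags outputs for review or re-decoding. Enforcing the term during generation requires a decoder that supports lexical constraints.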
Translation is a good case study because it exposes alignment, attention, decoding, and evaluation issues in one task.
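The evaluation issue can be made concrete with clipped n-gram precision, the building block of BLEU. This is a simplified sketch under stated assumptions: whitespace tokenization, a single reference, and no brevity penalty. The helper names `ngrams` and `clipped_precision` are hypothetical.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams found in the reference, with each
    n-gram's credit capped at its count in the reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


reference = "the bank is closed"
paraphrase = "the bank is shut"      # adequate and fluent
word_salad = "closed the is bank"    # right words, wrong order

print(clipped_precision(paraphrase, reference, n=1))  # 0.75
print(clipped_precision(word_salad, reference, n=1))  # 1.0
print(clipped_precision(word_salad, reference, n=2))  # 0.0
```

At the unigram level the scrambled sentence outscores a valid paraphrase. Higher-order n-grams catch the scrambled order but still penalize the synonym, which is one reason overlap scores are usually paired with human judgment.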
To-do list
Learn
- Understand why translation needs context and alignment.
- Learn the historical progression from phrase-based MT to transformers.
- Study low-resource translation challenges.
- Understand terminology control and document-level consistency.
Practice
- Inspect translations where literal word substitution fails.
- Compare outputs from a generic model and a domain-adapted system.
- Evaluate short translations with BLEU and human judgment.
- Test ambiguous source sentences and analyze disambiguation failures.
Build
- Create a simple translation demo using a pretrained model.
- Add a glossary or terminology constraint mechanism.
- Build an evaluation notebook comparing outputs across domains.
- Write notes on where translation quality breaks down.