Subject 14

Core NLP concepts

How machines extract meaning from text and represent language numerically — from counting words to understanding semantic relationships.

1. Named Entity Recognition (NER)

NER identifies and classifies real-world entities mentioned in text.

Common entity types

Entity Type    Example
PERSON         Elon Musk
ORGANIZATION   Tesla
LOCATION       USA
DATE           January 2024
MONEY          $500 million
PERCENTAGE     12%

Example

Sentence:  "Elon Musk is the CEO of Tesla and lives in the USA."

NER output:
  Elon Musk  →  PERSON
  Tesla      →  ORGANIZATION
  USA        →  LOCATION
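
The output above can be reproduced with a toy dictionary-based tagger. This is a sketch only: the entity lists below are hypothetical, and real NER systems (e.g. spaCy or a fine-tuned transformer) use trained statistical models, not lookup tables.

```python
# Toy dictionary-based entity tagger. Real NER uses statistical models;
# this lookup approach only illustrates the input/output shape of the task.
GAZETTEER = {  # hypothetical entity lists for this example
    "PERSON": {"Elon Musk"},
    "ORGANIZATION": {"Tesla"},
    "LOCATION": {"USA"},
}

def tag_entities(text):
    """Return (span, label) pairs for every gazetteer entry found in text."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            if name in text:
                found.append((name, label))
    return found

print(tag_entities("Elon Musk is the CEO of Tesla and lives in the USA."))
# [('Elon Musk', 'PERSON'), ('Tesla', 'ORGANIZATION'), ('USA', 'LOCATION')]
```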

Why NER matters

NER turns unstructured text into structured data, which powers search engines, chatbots, information extraction, and document screening.

2. Bag of Words (BoW)

Bag of Words is one of the simplest techniques for converting text into numbers. Word order and grammar are ignored — only word frequency matters.

Example

Sentence 1: "I love NLP"
Sentence 2: "I love AI"

Vocabulary: [I, love, NLP, AI]

Sentence 1  →  [1, 1, 1, 0]
Sentence 2  →  [1, 1, 0, 1]
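
The vectors above can be produced with a minimal BoW implementation in plain Python (no libraries), as a sketch: build the vocabulary in first-seen order, then count each vocabulary word per sentence.

```python
# Minimal Bag of Words: vocabulary in first-seen order, then raw counts.
def bag_of_words(sentences):
    vocab = []
    for s in sentences:
        for w in s.split():
            if w not in vocab:
                vocab.append(w)          # keep first-seen order
    vectors = [[s.split().count(w) for w in vocab] for s in sentences]
    return vocab, vectors

vocab, vectors = bag_of_words(["I love NLP", "I love AI"])
print(vocab)    # ['I', 'love', 'NLP', 'AI']
print(vectors)  # [[1, 1, 1, 0], [1, 1, 0, 1]]
```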
Advantages                      Limitations
Very easy to implement          No understanding of context
Works well for small datasets   No semantic meaning
Useful as a baseline model      Treats all words as equally important

3. TF-IDF (Term Frequency – Inverse Document Frequency)

TF-IDF improves Bag of Words by assigning importance scores to words. Words frequent in a document but rare across all documents score higher.

TF-IDF = TF × IDF

TF(t, d)  =  (Number of times term t appears in document d)
             / (Total number of terms in document d)

IDF(t)    =  log( Total number of documents
                  / Number of documents containing term t )

Intuition: A word like "the" appears everywhere — low IDF, low score. A word like "neural" appears rarely — high IDF, high score when present.
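
The formulas above translate directly into plain Python. This sketch uses the raw log(N / df) variant shown here; libraries like scikit-learn apply smoothing, so their scores will differ slightly.

```python
import math

# TF-IDF exactly as defined above: TF = count/len, IDF = log(N / df).
def tf_idf(documents):
    docs = [d.split() for d in documents]
    n = len(docs)
    vocab = {w for d in docs for w in d}
    idf = {w: math.log(n / sum(w in d for d in docs)) for w in vocab}
    return [{w: d.count(w) / len(d) * idf[w] for w in d} for d in docs]

scores = tf_idf(["the cat sat", "the dog ran", "the cat ran"])
# "the" appears in every document, so IDF = log(3/3) = 0 -> score 0
print(scores[0]["the"])  # 0.0
print(scores[0]["cat"] > 0)  # True: "cat" is rarer, so it scores higher
```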

Better than BoW because

It down-weights words that appear in almost every document and highlights words that distinguish one document from another.

Works well for

Search and ranking, document similarity, spam detection, and keyword extraction.

Limitations

Still ignores word order and context, produces large sparse vectors, and cannot recognize synonyms or semantic similarity.

4. Word2Vec

Word2Vec represents words as dense numerical vectors that capture meaning and context. Words used in similar contexts get similar vectors.

Famous arithmetic examples:
  King − Man + Woman ≈ Queen
  Paris − France + Italy ≈ Rome

CBOW (Continuous Bag of Words)

Predicts a target word using surrounding context words.

Sentence: "Raj went to school yesterday"   (window size: 1)

  Input: [Raj, to]        →  Output: went
  Input: [went, school]   →  Output: to
  Input: [to, yesterday]  →  Output: school

How it works:
  1. Context words converted to one-hot vectors
  2. Vectors are summed or averaged
  3. Passed through hidden layer
  4. Model predicts the target word
  5. Error calculated, weights updated via backpropagation
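
Step 1's pairing of context words with each target (before the one-hot encoding) can be sketched in plain Python. Edge words with incomplete context windows are skipped, matching the three pairs listed above.

```python
# Generate CBOW training examples: (context words, target word).
def cbow_pairs(tokens, window=1):
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j] for j in range(i - window, i + window + 1)
                   if j != i and 0 <= j < len(tokens)]
        if len(context) == 2 * window:   # skip edge words with partial context
            pairs.append((context, target))
    return pairs

for context, target in cbow_pairs("Raj went to school yesterday".split()):
    print(context, "->", target)
# ['Raj', 'to'] -> went
# ['went', 'school'] -> to
# ['to', 'yesterday'] -> school
```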

Skip-Gram

Predicts surrounding context words from a target word — the reverse of CBOW.

Sentence: "Raj went to school yesterday"   (window size: 1)

  Target: went   →  Training pairs: (went→Raj), (went→to)
  Target: to     →  Training pairs: (to→went), (to→school)
  Target: school →  Training pairs: (school→to), (school→yesterday)

How it works:
  1. Target word converted to one-hot vector
  2. Passed through hidden layer
  3. Model predicts each context word
  4. Error calculated, weights updated via backpropagation

  👉 The hidden layer weights become the word embeddings
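
The four steps above can be sketched as a tiny NumPy training loop over the example sentence. This is illustrative only: real word2vec uses negative sampling instead of a full softmax for efficiency, and the embedding size, learning rate, and epoch count here are arbitrary choices for the demo.

```python
import numpy as np

tokens = "Raj went to school yesterday".split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                      # vocabulary size, embedding size

# (target, context) index pairs with window size 1
pairs = [(idx[tokens[i]], idx[tokens[j]])
         for i in range(len(tokens))
         for j in (i - 1, i + 1) if 0 <= j < len(tokens)]

rng = np.random.default_rng(0)
W_in = rng.normal(0.0, 0.1, (V, D))       # input weights = the embeddings
W_out = rng.normal(0.0, 0.1, (D, V))      # output (prediction) weights

for _ in range(100):
    for t, c in pairs:
        h = W_in[t]                        # hidden layer = target's embedding
        scores = h @ W_out
        p = np.exp(scores - scores.max())
        p /= p.sum()                       # softmax over the vocabulary
        p[c] -= 1.0                        # cross-entropy gradient wrt scores
        grad_h = W_out @ p                 # backprop to the hidden layer
        W_out -= 0.1 * np.outer(h, p)
        W_in[t] -= 0.1 * grad_h

embeddings = W_in                          # hidden layer weights = embeddings
print(embeddings.shape)  # (5, 8): one D-dimensional vector per vocabulary word
```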
Advantages                             Limitations
Captures semantic relationships        Same word has one vector regardless of context
Dense and meaningful embeddings        bank (river) and bank (money) get the same vector
Useful for clustering and similarity   Addressed by contextual models like BERT

5. When to Use Each Technique

Technique      Use when
Bag of Words   Building simple text classifiers or baseline NLP models
TF-IDF         Search systems, document similarity, spam detection
Word2Vec       Semantic similarity, recommendation systems, text clustering

These techniques show the evolution of NLP: from counting words → weighting word importance → understanding semantic meaning. They form the foundation for modern NLP and Generative AI systems.

6. Linguistic Fundamentals

Level        Focus                                  Example
Syntax       Grammatical structure of sentences     Dependency parsing, constituency parsing
Semantics    Meaning of words and sentences         Word sense disambiguation, contextual interpretation
Pragmatics   Speaker intent in real-world context   "Can you open the window?" is a request, not an ability test

7. Core NLP Tasks

Task                  Description
Tokenization          Breaking text into words, phrases, or symbols
POS Tagging           Classifying words into grammatical categories (noun, verb, etc.)
NER                   Identifying names of people, places, organizations
Parsing               Understanding structural relationships between words
Sentiment Analysis    Evaluating emotional tone (positive, negative, neutral)
Machine Translation   Converting text from one language to another

8. The NLP Pipeline

Step 1: Text Preprocessing
        Tokenization, stop-word removal, stemming, lemmatization
            |
Step 2: Feature Extraction
        TF-IDF, word embeddings, or statistical models
            |
Step 3: Model Training
        Supervised or unsupervised learning
            |
Step 4: Parsing & Semantic Analysis
        Understanding sentence structure and meaning
            |
Step 5: Inference & Decision Making
        Translation, summarization, question answering
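
Step 1 of the pipeline can be sketched in plain Python. The stop-word list and the suffix-stripping "stemmer" below are deliberately crude simplifications for illustration; real pipelines use libraries such as NLTK or spaCy for these steps.

```python
import re

# Sketch of Step 1: lowercasing, tokenization, stop-word removal, crude stemming.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to"}  # tiny demo list

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())            # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]     # drop stop words
    # Crude "stemmer": strip a trailing "ing" (a real stemmer is rule-based,
    # e.g. Porter, and handles far more cases).
    return [t[:-3] if t.endswith("ing") else t for t in tokens]

print(preprocess("The model is running and jumping over words"))
# ['model', 'runn', 'jump', 'over', 'words']  (note the crude stem of "running")
```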

9. Modern Deep Learning Approaches

Approach               Description                                          Examples
Word Embeddings        Capture semantic meaning of words as dense vectors   Word2Vec, GloVe, FastText
Sequence-to-Sequence   Process text where order matters                     RNNs, LSTMs, GRUs
Transformers           Understand deep context via attention mechanisms     BERT, GPT, T5

10. Evaluation & Ethical Considerations