50 multiple-choice questions based on your study notes.
Question 1
What do weights determine in a neural network?
Question 2
What does the bias term do in a neural network layer?
Question 3
What is the purpose of activation functions in a neural network?
Question 4
What does the softmax function do?
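For reference while checking your answer, here is a minimal pure-Python sketch of softmax (an illustration, not taken from the notes): it exponentiates the logits and normalizes them into a probability distribution that sums to 1.

```python
import math

def softmax(logits):
    """Exponentiate each logit, then normalize so the outputs sum to 1."""
    m = max(logits)  # subtract the max for numerical stability; result is unchanged
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# The largest logit receives the largest probability, and the list sums to 1.
```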
Question 5
The sigmoid function squashes values into what range?
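A quick numerical check of the sigmoid's range (a minimal sketch): no matter how extreme the input, the output stays strictly between 0 and 1, with sigmoid(0) = 0.5.

```python
import math

def sigmoid(x):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

low, mid, high = sigmoid(-10.0), sigmoid(0.0), sigmoid(10.0)
```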
Question 6
Which loss function is used for binary classification?
Question 7
Which regression loss function is more robust to outliers?
Question 8
What happens when the learning rate is too high?
Question 9
What is Adam in the context of training neural networks?
Question 10
What does a lower temperature do to model predictions?
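The temperature effect is easy to see numerically (a minimal sketch using a hand-rolled softmax): dividing logits by a temperature below 1 sharpens the distribution, while a temperature above 1 flattens it.

```python
import math

def softmax_with_temperature(logits, temperature):
    # T < 1 sharpens the distribution (more confident/deterministic);
    # T > 1 flattens it (more random sampling).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.5)  # lower temperature
flat = softmax_with_temperature(logits, 2.0)   # higher temperature
```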
Question 11
CNNs are commonly used for which type of tasks?
Question 12
What is the role of pooling layers in a CNN?
Question 13
What key property allows RNNs to process sequential data?
Question 14
What does positional encoding do in a Transformer?
Question 15
BERT is pre-trained with which method?
Question 16
What does self-attention allow a model to do?
Question 17
In multi-head attention, what happens to the outputs of all heads?
Question 18
What is the main problem with one-hot encoding?
Question 19
What is the main limitation of Word2Vec / GloVe embeddings?
Question 20
What makes contextual embeddings (BERT, GPT) different from Word2Vec?
Question 21
What does Bag of Words (BoW) ignore?
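A small sketch makes the BoW limitation concrete: because it only counts words, two sentences with opposite meanings but the same words become identical vectors (word order is discarded).

```python
from collections import Counter

def bag_of_words(text):
    # Count word occurrences only; all word-order information is lost.
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
# a == b: BoW cannot distinguish these two sentences.
```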
Question 22
How does TF-IDF improve on Bag of Words?
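A minimal TF-IDF sketch (one common textbook formulation; library implementations such as scikit-learn add smoothing): a term that appears in every document gets an IDF of 0, so ubiquitous words like "the" are down-weighted while rare, informative words score higher.

```python
import math

def tf_idf(term, doc, corpus):
    # Term frequency: how often the term appears in this document.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: terms that are rare across the corpus score higher.
    n_containing = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / n_containing)
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "slept"],
]
common_score = tf_idf("the", corpus[0], corpus)  # "the" is in every document
rare_score = tf_idf("dog", corpus[1], corpus)    # "dog" is in only one
```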
Question 23
What is chunking in the context of retrieval systems?
Question 24
Which of the following is a parameter-efficient LLM fine-tuning method?
Question 25
How many bytes per parameter does FP16 use?
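The arithmetic behind this question, as a sketch: FP16 stores each parameter in 16 bits, i.e. 2 bytes, so (for example) a 13B-parameter model needs about 26 GB in FP16 versus 52 GB in FP32.

```python
# FP16 = 16 bits per parameter = 2 bytes per parameter.
bytes_per_param = 16 // 8
params = 13_000_000_000  # e.g. a 13B-parameter model
fp16_gb = params * bytes_per_param / 1e9  # decimal gigabytes
```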
Question 26
Attention is particularly important in which type of architecture?
Question 27
Given the sentence "I love NLP", what are the bigrams?
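A few lines of Python show the bigram extraction this question asks about: each token is paired with its immediate successor, giving ("I", "love") and ("love", "NLP").

```python
def bigrams(sentence):
    tokens = sentence.split()
    # Pair each token with the token that follows it.
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

result = bigrams("I love NLP")  # [('I', 'love'), ('love', 'NLP')]
```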
Question 28
How does Sparse Categorical Cross-Entropy differ from Categorical Cross-Entropy?
Question 29
Why is multi-head attention more powerful than a single attention head?
Question 30
What does Recursive Character Level Chunking do?
Question 31
What causes the vanishing gradient problem?
Question 32
What is the primary solution for the exploding gradient problem?
Question 33
Which activation functions are most likely to cause the vanishing gradient problem?
Question 34
Which of the following is NOT a key limitation of vanilla RNNs?
Question 35
What do LSTM and GRU architectures solve compared to vanilla RNNs?
Question 36
What is the main bottleneck of the original seq2seq encoder-decoder architecture?
Question 37
How does the attention mechanism (Bahdanau, 2015) fix the seq2seq bottleneck?
Question 38
What does the KV cache store during autoregressive generation?
Question 39
What is the main trade-off of using a KV cache?
Question 40
What does Grouped-Query Attention (GQA) do to reduce KV cache memory?
Question 41
What effect does using a smaller batch size have on training?
Question 42
What does dropout do during training?
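A minimal sketch of "inverted" dropout (the common formulation): during training each unit is zeroed with probability p and the survivors are scaled by 1/(1-p) so the expected activation is unchanged; at inference the layer passes values through untouched.

```python
import random

def dropout(values, p, training=True):
    # Training: zero each unit with probability p, scale survivors by 1/(1-p)
    # so the expected sum of activations stays the same.
    if not training:
        return list(values)  # inference: no-op
    keep = 1.0 - p
    return [v / keep if random.random() < keep else 0.0 for v in values]

random.seed(0)
out = dropout([1.0] * 1000, p=0.5)
zeroed = sum(1 for v in out if v == 0.0)  # roughly half the units
```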
Question 43
In a fraud detection system, 99.5% of transactions are legitimate. If a model predicts "not fraud" for every transaction, what is its accuracy?
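The arithmetic behind this question, sketched on a hypothetical 10,000-transaction sample: predicting "not fraud" everywhere is correct for every legitimate transaction, yielding 99.5% accuracy while catching zero fraud, which is why accuracy alone is misleading on imbalanced data.

```python
# Hypothetical sample of 10,000 transactions, 0.5% fraudulent.
total = 10_000
fraud = int(total * 0.005)   # 50 fraudulent transactions
legitimate = total - fraud   # 9,950 legitimate transactions

# Always predicting "not fraud" is right on every legitimate transaction.
accuracy = legitimate / total
frauds_caught = 0  # and wrong on every fraudulent one
```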
Question 44
In fraud detection, which is typically more costly to get wrong — a false positive (flagging a legitimate transaction) or a false negative (missing actual fraud)?
Question 45
What is Retrieval-Augmented Generation (RAG)?
Question 46
When should you use RAG instead of fine-tuning?
Question 47
What is a hallucination in the context of LLMs?
Question 48
Named Entity Recognition (NER) is used in banking KYC processes to:
Question 49
A 13B parameter model in FP32 (4 bytes per parameter) requires 52 GB. How much memory does the same model need in INT8 (1 byte per parameter)?
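The memory arithmetic, worked out: INT8 uses 1 byte per parameter versus FP32's 4, so the same 13B-parameter model shrinks from 52 GB to 13 GB, a 4x reduction.

```python
params = 13_000_000_000

fp32_gb = params * 4 / 1e9  # 4 bytes per parameter, as stated in the question
int8_gb = params * 1 / 1e9  # 1 byte per parameter
```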
Question 50
How does LoRA achieve parameter-efficient fine-tuning?
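A parameter-count sketch of the LoRA idea (illustrative dimensions, assuming a square 4096x4096 weight and rank r = 8): the frozen weight W (d x k) gets a trainable low-rank update B @ A with B of shape (d x r) and A of shape (r x k), so only d*r + r*k parameters are trained instead of d*k.

```python
# LoRA freezes the original weight W (d x k) and trains a low-rank
# update B @ A, where B is (d x r) and A is (r x k) with r << min(d, k).
d, k, r = 4096, 4096, 8  # hypothetical layer dimensions and rank

full_params = d * k          # trainable parameters in full fine-tuning
lora_params = d * r + r * k  # trainable parameters under LoRA
reduction = full_params // lora_params
```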