50 multiple-choice questions based on your study notes.
Question 1
What do weights determine in a neural network?
Question 2
What does the bias term do in a neural network layer?
Question 3
What is the purpose of activation functions in a neural network?
Question 4
What does the softmax function do?
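For reference while checking your answer, here is a minimal pure-Python sketch of softmax (an illustration, not taken from the notes): it exponentiates the logits and normalizes them into a probability distribution that sums to 1.

```python
import math

def softmax(logits):
    """Exponentiate each logit, then normalize so the outputs sum to 1."""
    m = max(logits)  # subtract the max for numerical stability; result is unchanged
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# The largest logit receives the largest probability, and the list sums to 1.
```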
Question 5
The sigmoid function squashes values into what range?
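A quick numerical check of the sigmoid's range (a minimal sketch): no matter how extreme the input, the output stays strictly between 0 and 1, with sigmoid(0) = 0.5.

```python
import math

def sigmoid(x):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

low, mid, high = sigmoid(-10.0), sigmoid(0.0), sigmoid(10.0)
```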
Question 6
Which loss function is used for binary classification?
Question 7
Which regression loss function is more robust to outliers?
Question 8
What happens when the learning rate is too high?
Question 9
What is Adam in the context of training neural networks?
Question 10
What does a lower temperature do to model predictions?
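The temperature effect is easy to see numerically (a minimal sketch using a hand-rolled softmax): dividing logits by a temperature below 1 sharpens the distribution, while a temperature above 1 flattens it.

```python
import math

def softmax_with_temperature(logits, temperature):
    # T < 1 sharpens the distribution (more confident/deterministic);
    # T > 1 flattens it (more random sampling).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.5)  # lower temperature
flat = softmax_with_temperature(logits, 2.0)   # higher temperature
```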
Question 11
CNNs are commonly used for which type of tasks?
Question 12
What is the role of pooling layers in a CNN?
Question 13
What key property allows RNNs to process sequential data?
Question 14
What does positional encoding do in a Transformer?
Question 15
BERT is pre-trained with which method?
Question 16
What does self-attention allow a model to do?
Question 17
In multi-head attention, what happens to the outputs of all heads?
Question 18
What is the main problem with one-hot encoding?
Question 19
What is the main limitation of Word2Vec / GloVe embeddings?
Question 20
What makes contextual embeddings (BERT, GPT) different from Word2Vec?
Question 21
What does Bag of Words (BoW) ignore?
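A small sketch makes the BoW limitation concrete: because it only counts words, two sentences with opposite meanings but the same words become identical vectors (word order is discarded).

```python
from collections import Counter

def bag_of_words(text):
    # Count word occurrences only; all word-order information is lost.
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
# a == b: BoW cannot distinguish these two sentences.
```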
Question 22
How does TF-IDF improve on Bag of Words?
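A minimal TF-IDF sketch (one common textbook formulation; library implementations such as scikit-learn add smoothing): a term that appears in every document gets an IDF of 0, so ubiquitous words like "the" are down-weighted while rare, informative words score higher.

```python
import math

def tf_idf(term, doc, corpus):
    # Term frequency: how often the term appears in this document.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: terms that are rare across the corpus score higher.
    n_containing = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / n_containing)
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "slept"],
]
common_score = tf_idf("the", corpus[0], corpus)  # "the" is in every document
rare_score = tf_idf("dog", corpus[1], corpus)    # "dog" is in only one
```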
Question 23
What is chunking in the context of retrieval systems?
Question 24
Which of the following is a parameter-efficient LLM fine-tuning method?
Question 25
How many bytes per parameter does FP16 use?
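The arithmetic behind this question, as a sketch: FP16 stores each parameter in 16 bits, i.e. 2 bytes, so (for example) a 13B-parameter model needs about 26 GB in FP16 versus 52 GB in FP32.

```python
# FP16 = 16 bits per parameter = 2 bytes per parameter.
bytes_per_param = 16 // 8
params = 13_000_000_000  # e.g. a 13B-parameter model
fp16_gb = params * bytes_per_param / 1e9  # decimal gigabytes
```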
Question 26
Attention is particularly important in which type of architecture?
Question 27
Given the sentence "I love NLP", what are the bigrams?
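A few lines of Python show the bigram extraction this question asks about: each token is paired with its immediate successor, giving ("I", "love") and ("love", "NLP").

```python
def bigrams(sentence):
    tokens = sentence.split()
    # Pair each token with the token that follows it.
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

result = bigrams("I love NLP")  # [('I', 'love'), ('love', 'NLP')]
```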
Question 28
How does Sparse Categorical Cross-Entropy differ from Categorical Cross-Entropy?
Question 29
Why is multi-head attention more powerful than a single attention head?
Question 30
What does Recursive Character Level Chunking do?
Question 31
What causes the vanishing gradient problem?
Question 32
What is the primary solution for the exploding gradient problem?
Question 33
Which activation functions are most likely to cause the vanishing gradient problem?
Question 34
Which of the following is NOT a key limitation of vanilla RNNs?
Question 35
What do LSTM and GRU architectures solve compared to vanilla RNNs?
Question 36
What is the main bottleneck of the original seq2seq encoder-decoder architecture?
Question 37
How does the attention mechanism (Bahdanau, 2015) fix the seq2seq bottleneck?
Question 38
What does the KV cache store during autoregressive generation?
Question 39
What is the main trade-off of using a KV cache?
Question 40
What does Grouped-Query Attention (GQA) do to reduce KV cache memory?
Question 41
What effect does using a smaller batch size have on training?
Question 42
What does dropout do during training?
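A minimal sketch of "inverted" dropout (the common formulation): during training each unit is zeroed with probability p and the survivors are scaled by 1/(1-p) so the expected activation is unchanged; at inference the layer passes values through untouched.

```python
import random

def dropout(values, p, training=True):
    # Training: zero each unit with probability p, scale survivors by 1/(1-p)
    # so the expected sum of activations stays the same.
    if not training:
        return list(values)  # inference: no-op
    keep = 1.0 - p
    return [v / keep if random.random() < keep else 0.0 for v in values]

random.seed(0)
out = dropout([1.0] * 1000, p=0.5)
zeroed = sum(1 for v in out if v == 0.0)  # roughly half the units
```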
Question 43
In a fraud detection system, 99.5% of transactions are legitimate. If a model predicts "not fraud" for every transaction, what is its accuracy?
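The arithmetic behind this question, sketched on a hypothetical 10,000-transaction sample: predicting "not fraud" everywhere is correct for every legitimate transaction, yielding 99.5% accuracy while catching zero fraud, which is why accuracy alone is misleading on imbalanced data.

```python
# Hypothetical sample of 10,000 transactions, 0.5% fraudulent.
total = 10_000
fraud = int(total * 0.005)   # 50 fraudulent transactions
legitimate = total - fraud   # 9,950 legitimate transactions

# Always predicting "not fraud" is right on every legitimate transaction.
accuracy = legitimate / total
frauds_caught = 0  # and wrong on every fraudulent one
```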
Question 44
In fraud detection, which is typically more costly to get wrong — a false positive (flagging a legitimate transaction) or a false negative (missing actual fraud)?
Question 45
What is Retrieval-Augmented Generation (RAG)?
Question 46
When should you use RAG instead of fine-tuning?
Question 47
What is a hallucination in the context of LLMs?
Question 48
Named Entity Recognition (NER) is used in banking KYC processes to:
Question 49
A 13B parameter model in FP32 (4 bytes per parameter) requires 52 GB. How much memory does the same model need in INT8 (1 byte per parameter)?
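The memory arithmetic, worked out: INT8 uses 1 byte per parameter versus FP32's 4, so the same 13B-parameter model shrinks from 52 GB to 13 GB, a 4x reduction.

```python
params = 13_000_000_000

fp32_gb = params * 4 / 1e9  # 4 bytes per parameter, as stated in the question
int8_gb = params * 1 / 1e9  # 1 byte per parameter
```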
Question 50
How does LoRA achieve parameter-efficient fine-tuning?
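A parameter-count sketch of the LoRA idea (illustrative dimensions, assuming a square 4096x4096 weight and rank r = 8): the frozen weight W (d x k) gets a trainable low-rank update B @ A with B of shape (d x r) and A of shape (r x k), so only d*r + r*k parameters are trained instead of d*k.

```python
# LoRA freezes the original weight W (d x k) and trains a low-rank
# update B @ A, where B is (d x r) and A is (r x k) with r << min(d, k).
d, k, r = 4096, 4096, 8  # hypothetical layer dimensions and rank

full_params = d * k          # trainable parameters in full fine-tuning
lora_params = d * r + r * k  # trainable parameters under LoRA
reduction = full_params // lora_params
```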