
Artificial Intelligence & Machine Learning MCQ

Test your AI & Machine Learning knowledge with 100 multiple choice questions covering fundamentals to advanced concepts, with instant feedback and explanations.

100 Questions · 40 Beginner · 40 Intermediate · 20 Advanced

How This Practice Test Works

Every question below expands right on this page — click a question to reveal its four options, pick the one you think is correct, and you'll get instant feedback along with the correct answer and a short explanation of the reasoning. Questions are grouped by difficulty, so start with the 40 beginner questions to confirm your fundamentals, work through the 40 intermediate ones, and finish with the 20 advanced questions that mirror what exams and technical screenings actually ask. There's no sign-up, no timer, and no limit — retake the test as often as you like.

Curated by Tech Baithak Editorial Team · Last updated: June 2026

What is machine learning?

Correct Answer

A branch of AI enabling systems to learn from data and improve at tasks without being explicitly programmed

Explanation

ML systems learn patterns from training data to make predictions or decisions on new data. Instead of hand-coding rules, patterns are automatically discovered from examples.

What is the difference between supervised and unsupervised learning?

Correct Answer

Supervised learning trains on labeled data (input-output pairs); unsupervised learning finds patterns in unlabeled data

Explanation

Supervised: spam detection (emails labeled spam/not). Unsupervised: clustering customers by purchasing behavior without predefined groups. Semi-supervised and reinforcement learning are other categories.

What is overfitting in machine learning?

Correct Answer

When a model learns the training data too specifically, including noise, performing poorly on new unseen data

Explanation

Overfitting: high training accuracy, low test accuracy. The model memorizes training examples rather than learning generalizable patterns. Prevented by regularization, dropout, cross-validation, more data, or simpler models.

What is a neural network?

Correct Answer

A computational model inspired by biological neurons, consisting of interconnected layers of nodes that learn representations from data

Explanation

Neural networks have input, hidden, and output layers. Each neuron applies a weighted sum + activation function. Deep neural networks (many hidden layers) learn hierarchical representations.

What is gradient descent?

Correct Answer

An optimization algorithm iteratively adjusting model parameters in the direction that minimizes the loss function

Explanation

Gradient descent computes the gradient of the loss with respect to parameters and updates them: θ = θ - α∇L. α is the learning rate. Variants: batch, stochastic (SGD), mini-batch.
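
The update rule θ = θ - α∇L can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical one-parameter loss L(θ) = (θ - 3)² chosen only because its minimum (θ = 3) is known in advance; the function names are illustrative and not from any library.

```python
def gradient_descent(grad, theta0, learning_rate=0.1, steps=100):
    """Repeatedly apply theta = theta - learning_rate * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta


# Hypothetical loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
def example_grad(theta):
    return 2.0 * (theta - 3.0)


theta_min = gradient_descent(example_grad, theta0=0.0)
```

With this learning rate the distance to the minimum shrinks by a constant factor each step; a learning rate that is too large would instead make the iterates diverge.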

What is the purpose of a training, validation, and test set?

Correct Answer

Training: fit model parameters. Validation: tune hyperparameters and detect overfitting. Test: final unbiased performance evaluation (never touched during development)

Explanation

The test set must not be used during development to get an unbiased estimate. Using test data to make decisions contaminates it, creating optimistic biased estimates.

What is a decision tree?

Correct Answer

A supervised learning model that splits data using feature-based conditions in a tree structure to make predictions

Explanation

Decision trees split at each node based on the feature best separating classes/reducing variance (Gini impurity, information gain, MSE). Easy to interpret but prone to overfitting without pruning.

What is k-means clustering?

Correct Answer

An unsupervised clustering algorithm partitioning data into k clusters by iteratively assigning points to the nearest centroid and recomputing centroids

Explanation

K-means: initialize k centroids, assign each point to the nearest centroid, recompute centroids, and repeat until convergence. Limitations: assumes roughly spherical clusters and is sensitive to initialization and outliers.
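
The assign-then-recompute loop described above can be sketched as follows. This is a simplified version for 2-D points with caller-supplied starting centroids; the toy data and the choice to keep an empty cluster's old centroid are assumptions made for the example.

```python
import math


def kmeans(points, centroids, max_iters=100):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iters):
        # Assignment step: index of the nearest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(
                    (sum(p[0] for p in members) / len(members),
                     sum(p[1] for p in members) / len(members))
                )
            else:
                new_centroids.append(old)  # assumption: keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, assignments


# Toy data: two visually separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
```

Because the result depends on the starting centroids, practical implementations usually run several random initializations (or a smarter seeding scheme) and keep the best one.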

What is linear regression?

Correct Answer

A supervised learning algorithm modeling the relationship between input features and a continuous output variable using a linear function

Explanation

Linear regression: y = w₀ + w₁x₁ + ... + wₙxₙ. Minimizes mean squared error. Assumes linear relationship, normal errors, and no multicollinearity. Ridge/Lasso add regularization.

What is logistic regression?

Correct Answer

A classification algorithm using the logistic (sigmoid) function to output probabilities for binary classification

Explanation

Logistic regression outputs P(y=1|x) = σ(wᵀx) where σ is sigmoid. Despite the name, it is a classification algorithm. Trained with cross-entropy loss. Interpretable coefficients as log-odds.

What is a Random Forest?

Correct Answer

An ensemble method training multiple decision trees on random subsets of data and features, aggregating their predictions for better accuracy and robustness

Explanation

Random Forests use bagging (bootstrap aggregation) and random feature subsets, reducing overfitting of individual trees. Aggregation: majority vote (classification) or average (regression). Feature importance is a useful output.

What is the bias-variance tradeoff?

Correct Answer

The tradeoff between underfitting (high bias, oversimplified) and overfitting (high variance, oversensitive to training data)

Explanation

High bias: model too simple, misses patterns (underfitting). High variance: model too complex, overfits. Total error = bias² + variance + irreducible noise. Regularization, ensemble methods, and more data help balance this.

What is cross-validation?

Correct Answer

A technique evaluating model performance by splitting data into k folds, training on k-1 and testing on the held-out fold, averaging across all k runs

Explanation

k-fold CV gives a more reliable estimate of generalization performance. Stratified k-fold preserves class distribution. Leave-one-out (LOO) uses each sample as the test set once.
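
To make the fold mechanics concrete, here is a small sketch that only generates the index splits. It assumes contiguous, unshuffled folds for readability; in practice the data is usually shuffled (or stratified) first, and `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
```

Each sample lands in exactly one test fold, so averaging the k test scores uses every sample for evaluation once.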

What is precision and recall?

Correct Answer

Precision = TP/(TP+FP), Recall = TP/(TP+FN) — tradeoff between accuracy and completeness

Explanation

Precision: of all predicted positive, how many are actually positive (TP/(TP+FP)). Recall (sensitivity): of all actual positive, how many were found (TP/(TP+FN)). F1 = harmonic mean of both.
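
The formulas above can be checked on a small example. This sketch assumes binary labels where 1 means positive, and returns 0.0 when a denominator is zero (one common convention, not the only one); the label lists are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```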

What is the F1 score?

Correct Answer

The harmonic mean of precision and recall: 2*(P*R)/(P+R), balancing both metrics equally

Explanation

F1 score is useful when classes are imbalanced and both false positives and false negatives matter. High F1 requires both high precision and high recall. F-β generalizes by weighting recall β times more than precision.

What is feature engineering?

Correct Answer

The process of transforming raw data into informative features that improve model performance

Explanation

Feature engineering includes normalization, encoding categoricals, creating interaction terms, log-transforming skewed features, and domain-specific transformations. Good features often matter more than model choice.

What is a support vector machine (SVM)?

Correct Answer

A supervised learning algorithm finding the hyperplane maximizing margin between classes, supporting kernel tricks for non-linear boundaries

Explanation

SVM finds the maximum-margin hyperplane. Support vectors are the closest training points. Kernel trick (RBF, polynomial) maps data to higher dimensions for non-linear separation. Effective in high-dimensional spaces.

What is deep learning?

Correct Answer

A subset of machine learning using neural networks with many hidden layers to automatically learn hierarchical representations from raw data

Explanation

Deep learning (CNNs, RNNs, Transformers) learns features automatically from raw data (pixels, text, audio). Enabled by: large datasets (ImageNet), GPUs, and algorithmic advances (ReLU, batch norm, attention).

What is a convolutional neural network (CNN)?

Correct Answer

A neural network architecture using convolution operations to automatically learn spatial hierarchies of features, excelling at image recognition

Explanation

CNNs use convolutional layers (learn local feature detectors), pooling layers (spatial downsampling), and fully connected layers. Local connectivity and weight sharing make them efficient for images.

What is a recurrent neural network (RNN)?

Correct Answer

A neural network with connections forming cycles, enabling processing of sequential data by maintaining hidden state across time steps

Explanation

RNNs process sequences by passing hidden state h_t = f(h_{t-1}, x_t) through time. Challenges: vanishing/exploding gradients. LSTMs and GRUs solve this with gating mechanisms.

What is transfer learning?

Correct Answer

Reusing a model pre-trained on a large dataset as a starting point for a new task, saving training time and requiring less task-specific data

Explanation

Pre-trained models (ImageNet-trained CNNs, BERT, GPT) capture general knowledge. Fine-tuning adapts them to specific tasks with small datasets. This can cut training time substantially (for example, from days to hours).

What is natural language processing (NLP)?

Correct Answer

A field of AI enabling computers to understand, interpret, and generate human language

Explanation

NLP tasks: sentiment analysis, named entity recognition, machine translation, question answering, text summarization. Transformers (BERT, GPT) have revolutionized NLP with self-attention mechanisms.

What is reinforcement learning?

Correct Answer

A learning paradigm where an agent learns to take actions in an environment to maximize cumulative reward through trial and error

Explanation

RL components: agent, environment, state, action, reward. The agent learns a policy π(a|s) mapping states to actions. Applications: game playing (AlphaGo), robotics, recommendation systems, autonomous driving.

What is the purpose of regularization in ML?

Correct Answer

Adding a penalty term to the loss function to constrain model complexity and prevent overfitting

Explanation

L1 regularization (Lasso): adds λ|w| — promotes sparsity (many weights = 0). L2 regularization (Ridge): adds λw² — shrinks weights toward 0. Elastic Net combines both. Dropout is regularization for neural networks.
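
As a rough illustration of the penalty terms, the sketch below computes them for a small weight vector. The function names and the two-coefficient form of Elastic Net are assumptions for the example; libraries parameterize Elastic Net in different ways.

```python
def l1_penalty(weights, lam):
    """Lasso-style penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge-style penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def elastic_net_penalty(weights, lam1, lam2):
    """One simple Elastic Net form: an L1 term plus an L2 term."""
    return l1_penalty(weights, lam1) + l2_penalty(weights, lam2)


weights = [0.5, -2.0, 0.0]
l1 = l1_penalty(weights, lam=0.1)
l2 = l2_penalty(weights, lam=0.1)
```

The penalty is added to the data loss during training, so larger values of λ push the optimizer toward smaller (L2) or sparser (L1) weights.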

What is a confusion matrix?

Correct Answer

A table showing counts of true positives, true negatives, false positives, and false negatives for a classification model

Explanation

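Confusion matrix rows = actual class, columns = predicted class (a common convention). TP: correct positives. TN: correct negatives. FP: false alarms. FN: missed detections. Precision, recall, F1, and accuracy are computed from it; each decision threshold yields one matrix and therefore one point on the ROC curve.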

What is dimensionality reduction?

Correct Answer

Techniques reducing the number of features while preserving important information, addressing the curse of dimensionality

Explanation

PCA finds lower-dimensional projections maximizing variance. t-SNE/UMAP are for visualization. Autoencoders learn compressed representations. Reduces overfitting, computational cost, and storage requirements.

What is the curse of dimensionality?

Correct Answer

As feature dimensions increase, data becomes increasingly sparse and distances lose meaning, making ML algorithms less effective

Explanation

In high dimensions: distances between points converge, volume increases exponentially (more data needed), nearest neighbor becomes unreliable. Dimensionality reduction and feature selection help combat this.

What is a generative model vs a discriminative model?

Correct Answer

Generative models learn P(X,Y) and can generate new data; discriminative models learn P(Y|X) directly to classify or predict

Explanation

Discriminative (logistic regression, SVM, neural net classifiers): learn decision boundary P(Y|X). Generative (Naive Bayes, VAE, GAN, LLMs): model the data distribution, enabling generation of new samples.

What is data augmentation?

Correct Answer

Artificially increasing training data diversity by applying transformations (flipping, rotation, cropping, noise) without collecting new data

Explanation

Data augmentation (for images: random crop, flip, color jitter; for text: back translation, synonym replacement) reduces overfitting and improves generalization when data is limited.

What is the ROC curve and AUC?

Correct Answer

ROC (Receiver Operating Characteristic) plots true positive rate vs. false positive rate; AUC (Area Under Curve) summarizes performance across all thresholds

Explanation

ROC curve: TPR (sensitivity) vs FPR (1-specificity) at different thresholds. AUC=1: perfect. AUC=0.5: random. AUC measures ranking quality — probability that a random positive example is ranked higher than a random negative.
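
The ranking interpretation of AUC can be computed directly by comparing every positive score with every negative score. This is a brute-force sketch for small inputs, counting ties as half a win; the scores are made up for illustration.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of samples; production code typically sorts the scores once instead.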

What is a hyperparameter?

Correct Answer

A configuration value set before training that controls the learning process (learning rate, number of layers, regularization strength)

Explanation

Model parameters (weights) are learned from data. Hyperparameters (learning rate, batch size, number of trees, k in k-means) are set by the practitioner and tuned via grid search, random search, or Bayesian optimization.

What is batch normalization?

Correct Answer

A technique normalizing layer activations within each mini-batch, accelerating training and allowing higher learning rates

Explanation

Batch normalization normalizes activations to have zero mean and unit variance within each mini-batch, then applies learned scale (γ) and shift (β). Stabilizes training, acts as regularizer, and allows faster convergence.

What is dropout in neural networks?

Correct Answer

A regularization technique randomly "dropping" neurons during training to prevent co-adaptation and reduce overfitting

Explanation

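Dropout: randomly set each neuron output to 0 with probability p during training. Forces the network to learn redundant representations. In the original formulation, outputs are scaled by (1-p) at inference; most modern frameworks instead use inverted dropout, scaling the kept activations by 1/(1-p) during training so that nothing changes at inference. Acts as an ensemble of thinned networks.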
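
A minimal sketch of dropout follows. Note that it uses the "inverted" variant common in modern frameworks, which rescales kept activations by 1/(1-p) during training rather than scaling at inference; the function name and the `rng` parameter are assumptions for this example.

```python
import random


def dropout(activations, p, training, rng=random):
    """Inverted dropout: zero each value with probability p while training."""
    if not training or p == 0.0:
        return list(activations)  # inference: activations pass through unchanged
    keep = 1.0 - p
    # Kept values are divided by the keep probability so the expected value
    # of each activation is the same in training and at inference.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # seeded generator so the sketch is reproducible
train_out = dropout([1.0] * 1000, p=0.5, training=True, rng=rng)
eval_out = dropout([1.0, 2.0, 3.0], p=0.5, training=False)
```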

What is the attention mechanism in deep learning?

Correct Answer

A mechanism allowing models to dynamically weight the importance of different parts of the input when producing each output

Explanation

Attention (Bahdanau, 2014) allows models to focus on relevant input parts per output token. Self-attention (Transformers) computes relationships within the same sequence, enabling long-range dependency modeling.

What is GPT?

Correct Answer

Generative Pre-trained Transformer — a large autoregressive language model trained to predict the next token, enabling text generation

Explanation

GPT models (OpenAI) use decoder-only Transformers trained on large text corpora via next-token prediction. Fine-tuning and RLHF align them for following instructions. GPT-4 is an example; Claude (Anthropic) and Gemini (Google) are comparable large language models from other vendors.

What is a loss function?

Correct Answer

A function measuring the difference between predicted and actual values, guiding parameter updates during training

Explanation

Cross-entropy loss for classification: -Σy_i log(ŷ_i). MSE for regression: (y-ŷ)². Hinge loss for SVM. The gradient of the loss guides gradient descent updates to minimize prediction error.

What is an activation function?

Correct Answer

A non-linear function applied to each neuron's output, enabling neural networks to learn complex non-linear mappings

Explanation

Without activation functions, neural networks are just linear transformations. ReLU (max(0,x)) is most common. Sigmoid (output 0-1) for binary output. Softmax for multi-class output. GELU, Swish used in Transformers.

What is the difference between classification and regression?

Correct Answer

Classification predicts discrete class labels; regression predicts continuous numeric values

Explanation

Classification: email spam/not, image = cat/dog/car. Regression: predict house price, temperature. Both are supervised. Logistic regression is classification despite the name. Some problems can be formulated as either.

What is the softmax function?

Correct Answer

A function converting a vector of raw scores into a probability distribution over classes, where all probabilities sum to 1

Explanation

softmax(z_i) = exp(z_i) / Σ exp(z_j). Amplifies the largest logit and suppresses others. Used in the output layer for multi-class classification with cross-entropy loss.
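
The formula can be implemented directly, with one practical adjustment: subtracting the maximum logit before exponentiating avoids overflow and leaves the result unchanged. A small sketch, with illustrative logits:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # shift by max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
```

Without the shift, a call such as `softmax([1000.0, 1000.0])` would overflow in `math.exp`; with it, the call returns equal probabilities.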

What is a knowledge graph?

Correct Answer

A structured knowledge base representing entities and their relationships as nodes and edges, enabling semantic reasoning

Explanation

Knowledge graphs (Google Knowledge Graph, Wikidata, Freebase) represent facts as (subject, relation, object) triples. Used for question answering, recommendation, entity disambiguation, and grounding LLMs.

What is the transformer architecture and why did it revolutionize AI?

Correct Answer

An architecture using self-attention mechanisms to process sequences in parallel, capturing long-range dependencies efficiently and scaling to billions of parameters

Explanation

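Transformers (Vaswani et al., 2017) replaced RNNs for most NLP tasks. Self-attention processes all positions of a sequence in parallel (no sequential dependency between time steps), at a cost that grows as O(n²) in sequence length n. Scales well with data and compute. Foundation for BERT, GPT, T5, and most modern large language models.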

What is BERT and how does it differ from GPT?

Correct Answer

BERT is a bidirectional encoder (reads context in both directions) for understanding; GPT is a unidirectional decoder for generation

Explanation

BERT (encoder-only, masked LM + NSP) learns bidirectional context — useful for classification, NER, QA. GPT (decoder-only, autoregressive) predicts next token — useful for generation. T5 uses encoder-decoder.

What is fine-tuning a pre-trained model?

Correct Answer

Continuing training a pre-trained model on a smaller task-specific dataset to adapt general knowledge to a specific task with fewer data requirements

Explanation

Fine-tuning: initialize with pre-trained weights, train on task data (often with a small learning rate). Works because pre-trained representations transfer well. PEFT methods (LoRA, Adapters) fine-tune efficiently with few parameters.

What is a GAN (Generative Adversarial Network)?

Correct Answer

A framework with two competing networks: a generator creating fake data and a discriminator distinguishing real from fake, trained adversarially

Explanation

GAN (Goodfellow 2014): generator G maximizes discriminator D's error; D minimizes it. At equilibrium, G generates realistic data. Used for image synthesis (StyleGAN), image-to-image translation (pix2pix), deepfakes.

What is a VAE (Variational Autoencoder)?

Correct Answer

A generative model learning a probabilistic latent space representation using an encoder-decoder architecture with a KL-divergence regularization term

Explanation

VAE encodes each input to a distribution (μ, σ) rather than a single point. A latent vector is sampled from this distribution and passed to the decoder for reconstruction. KL divergence regularizes the latent space. Enables generation (sample from latent space) and smooth interpolation.

What is word2vec?

Correct Answer

A neural model learning dense vector representations of words where semantically similar words have similar vectors

Explanation

Word2vec (Mikolov 2013) uses CBOW or skip-gram training to learn 100-300 dimensional word embeddings. Famous property: king - man + woman ≈ queen, showing that the vectors capture semantic relationships. Succeeded by contextual embeddings (BERT, GPT).

What is the vanishing gradient problem?

Correct Answer

During backpropagation, gradients shrink exponentially through layers, making it hard to train early layers in deep networks

Explanation

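Sigmoid and tanh saturate, yielding small gradients (the sigmoid derivative is at most 0.25). Multiplying many values below 1 across layers drives the gradient toward 0. Solutions: ReLU activations, residual connections (ResNet), batch normalization, LSTM gates, careful initialization. (Gradient clipping addresses the opposite problem, exploding gradients.)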

What is LSTM and how does it address the vanishing gradient problem?

Correct Answer

LSTM uses input, forget, and output gates to control information flow through a cell state, creating highways for gradients and preserving long-range dependencies

Explanation

LSTM (Hochreiter & Schmidhuber, 1997): the cell state C_t is updated through near-linear interactions, so gradients can flow back with far less vanishing. Gates control what to remember/forget/output. Handles long sequences (100-1000 steps) much better than vanilla RNNs.

What is object detection and how does it differ from image classification?

Correct Answer

Image classification assigns one label to the whole image; object detection localizes and classifies multiple objects with bounding boxes

Explanation

Object detection (YOLO, DETR, Faster R-CNN) outputs class + bounding box coordinates for each object. Image segmentation (semantic/instance) classifies every pixel. Classification: whole image → one label.

What is the purpose of residual connections (skip connections) in deep networks?

Correct Answer

To allow gradients to flow directly through skip paths (F(x)+x), enabling training of very deep networks (100+ layers) by mitigating vanishing gradients

Explanation

ResNet (He et al., 2015): H(x) = F(x) + x. The skip connection allows gradients to bypass layers. This enabled training networks with 100-1000 layers, achieving breakthrough results on ImageNet.

What is RLHF (Reinforcement Learning from Human Feedback)?

Correct Answer

A training approach where human preferences between model outputs are used to train a reward model, which guides RL fine-tuning for better alignment with human values

Explanation

RLHF (InstructGPT, ChatGPT): collect human comparisons of outputs → train reward model → use PPO to maximize reward. Dramatically improves instruction following, helpfulness, and reducing harmful outputs.

What is prompt engineering?

Correct Answer

Designing input prompts to guide large language models toward desired outputs without modifying model weights

Explanation

Prompt engineering techniques: zero-shot, few-shot (examples in prompt), chain-of-thought (reasoning steps), ReAct (reason+act). Crucial for getting desired behavior from LLMs without fine-tuning.

What is RAG (Retrieval-Augmented Generation)?

Correct Answer

An approach combining a retrieval system with a language model to ground responses in relevant external documents, reducing hallucinations

Explanation

RAG: retrieve relevant documents from a vector database using semantic search, prepend them to the prompt. The LLM conditions on retrieved context. Reduces hallucination, enables knowledge updates without retraining.

What is the difference between precision-recall and ROC curves?

Correct Answer

PR curves focus on positive class performance (better for imbalanced datasets); ROC curves show true vs false positive rates across all thresholds

Explanation

With severe class imbalance, ROC AUC can be misleadingly optimistic (many TN inflate FPR denominator). PR AUC focuses on the minority positive class. For fraud detection (rare events), PR is more informative.

What is gradient clipping?

Correct Answer

Limiting the gradient magnitude to a maximum threshold to prevent exploding gradients, especially in RNN training

Explanation

Exploding gradients: gradient norm grows exponentially in deep/RNN networks. Gradient clipping: if ||g|| > threshold, g = g * threshold/||g||. Essential for training RNNs and very deep networks.
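
The clipping rule above translates almost line for line into code. This sketch treats the gradient as a flat list of floats and clips by its L2 norm; the function name is illustrative.

```python
import math


def clip_by_norm(grad, threshold):
    """Rescale grad so its L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return list(grad)  # already within the threshold: left unchanged


clipped = clip_by_norm([3.0, 4.0], threshold=1.0)    # norm 5 is scaled down to 1
unchanged = clip_by_norm([0.3, 0.4], threshold=1.0)  # norm 0.5 is left alone
```

Clipping by norm preserves the gradient's direction and only shortens its length, which is why it is usually preferred over clipping each component independently.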

What is knowledge distillation?

Correct Answer

Training a smaller "student" model to mimic a larger "teacher" model's soft probability outputs, transferring knowledge into a compact model

Explanation

Distillation (Hinton 2015): student learns from teacher's soft labels (temperature-scaled softmax) carrying inter-class similarity info. The student can approach teacher performance with far fewer parameters.

What is neural architecture search (NAS)?

Correct Answer

Automated techniques finding optimal neural network architectures through reinforcement learning, evolutionary algorithms, or differentiable search methods

Explanation

NAS (AutoML): automates network design (layer types, connections, widths). EfficientNet, NASNet, MobileNetV3 were NAS-discovered. DARTS (differentiable NAS) makes search end-to-end differentiable and much faster.

What is a recommendation system?

Correct Answer

A system predicting user preferences to suggest relevant items, using collaborative filtering, content-based, or hybrid approaches

Explanation

Collaborative filtering: find similar users (user-based) or items (item-based). Matrix factorization (SVD, ALS) decomposes user-item matrix. Deep learning (YouTube, Netflix): two-tower retrieval + ranking models.

What is concept drift in machine learning?

Correct Answer

A change over time in the statistical relationship between the input features and the target variable, causing a deployed model's performance to degrade

Explanation

Concept drift: P(Y|X) changes over time (e.g., spam content evolves, economic conditions shift). Requires monitoring (data drift, prediction drift), retraining triggers, and online learning or periodic retraining.

What is federated learning?

Correct Answer

A distributed ML paradigm training models across many devices without sharing raw data, sending only model updates to a central aggregator

Explanation

Federated learning (Google, Apple): models train locally on-device (privacy preserved), send gradients/weights to server for aggregation. Used in keyboard prediction, medical imaging (hospitals can't share patient data).

What is a vector database and why is it used in AI applications?

Correct Answer

A database optimized for storing and searching high-dimensional embedding vectors using approximate nearest neighbor algorithms for semantic similarity search

Explanation

Vector databases (Pinecone, Weaviate, Chroma, pgvector) store embeddings and enable semantic search. Used in RAG for document retrieval, recommendation systems, and image search with HNSW or IVF indexing.

What is LoRA (Low-Rank Adaptation)?

Correct Answer

A parameter-efficient fine-tuning method adding low-rank matrices to transformer layers, adapting LLMs with far fewer trainable parameters

Explanation

LoRA (Hu et al., 2021): instead of fine-tuning all weights W, add ΔW = BA where B (d×r), A (r×d), r<<d. Train only A and B (often on the order of 0.1% to 1% of the parameters, depending on r and which layers are adapted). Enables LLM fine-tuning on consumer GPUs.
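
A quick parameter count shows where the savings come from. The sketch below considers a single square d×d weight matrix with hypothetical values d = 4096 and r = 8; the overall fraction for a whole model depends on which matrices are adapted.

```python
def lora_param_counts(d, r):
    """Compare trainable parameters for full fine-tuning vs. LoRA on one d x d matrix."""
    full = d * d          # every entry of W is trainable
    lora = d * r + r * d  # B is d x r and A is r x d
    return full, lora, lora / full


# Hypothetical sizes chosen for illustration only.
full, lora, ratio = lora_param_counts(d=4096, r=8)
```

For these sizes the low-rank pair holds 65,536 trainable values against 16,777,216 in the full matrix, which is about 0.4% of it.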

What is the difference between model accuracy and fairness in ML?

Correct Answer

Accuracy measures overall prediction correctness; fairness concerns whether the model performs equally across demographic groups and avoids discriminatory outcomes

Explanation

A model can be accurate overall but unfair, for example with higher error rates for minority groups. Fairness criteria such as demographic parity and equalized odds often conflict with each other and with accuracy. Crucial for hiring, lending, and criminal justice AI.

What is mean squared error (MSE) vs mean absolute error (MAE)?

Correct Answer

MSE penalizes large errors more heavily (squaring amplifies outliers); MAE treats all errors equally, being more robust to outliers

Explanation

MSE = mean((y-ŷ)²): sensitive to outliers (squared amplification). MAE = mean(|y-ŷ|): more robust. Huber loss combines both: quadratic for small errors, linear for large. Choice depends on tolerance for outliers.
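
The effect of a single outlier on each loss is easy to see numerically. In this sketch the data is invented so that three predictions are exact and one is off by 10; the Huber threshold `delta=1.0` is an assumed default.

```python
def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def huber(y, y_hat, delta=1.0):
    """Quadratic for errors up to delta, linear beyond it."""
    total = 0.0
    for a, b in zip(y, y_hat):
        e = abs(a - b)
        total += 0.5 * e * e if e <= delta else delta * (e - 0.5 * delta)
    return total / len(y)


y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.0, 2.0, 3.0, 14.0]  # one prediction is off by 10
losses = (mse(y, y_hat), mae(y, y_hat), huber(y, y_hat))
```

Here the one bad prediction contributes 100 to the squared-error sum but only 10 to the absolute-error sum, so MSE comes out ten times larger than MAE.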

What is model explainability (XAI)?

Correct Answer

Techniques making ML model predictions interpretable and understandable to humans, crucial for trust, debugging, and regulatory compliance

Explanation

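Common techniques: LIME (local linear approximations), SHAP (Shapley values for feature attribution), attention visualization, and saliency maps. Explainability is relevant to GDPR (often described as granting a right to explanation, though its legal scope is debated) and to healthcare and financial regulations.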

When should you choose a decision tree over logistic regression for a classification task?

Correct Answer

When the relationship between features and the target is non-linear and you need an interpretable model that handles mixed feature types without scaling

Explanation

Decision trees naturally capture non-linear interactions and handle categorical and numeric features without scaling, while logistic regression assumes a roughly linear decision boundary in feature space. The right choice depends on the data's structure and interpretability needs.

What is the elbow method used for in clustering?

Correct Answer

Plotting within-cluster sum of squares against the number of clusters k and choosing the k where the improvement sharply diminishes, forming an elbow shape

Explanation

The elbow method helps pick a reasonable k for k-means by plotting inertia (within-cluster variance) versus k. The point where adding more clusters yields diminishing returns ("the elbow") is a good candidate for k. The silhouette score is a complementary technique.

What is stratified sampling and why is it used when splitting a dataset?

Correct Answer

Splitting data so that each subset preserves the same proportion of class labels as the original dataset, preventing skewed train/test distributions

Explanation

Stratified sampling keeps the class ratio consistent across train, validation, and test sets, which is especially important for imbalanced datasets — otherwise a random split could leave a set with too few minority-class examples to evaluate fairly.
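
One simple way to implement a stratified split is to shuffle and split each class separately. The sketch below does that for a single train/test split; the helper name, the seed, and the 90/10 toy labels are assumptions for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # same fraction from every class
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Imbalanced toy labels: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
```

A purely random 20% split of these labels could easily contain zero or one positive; splitting per class guarantees two.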

Why is one-hot encoding commonly used for categorical features in machine learning?

Correct Answer

It converts categories into binary indicator columns, avoiding a false sense of ordinal relationship that plain integer encoding would imply to algorithms that assume numeric ordering

Explanation

Assigning integers like 1, 2, 3 to unordered categories implies an ordering or magnitude that does not exist. One-hot encoding creates a separate binary column per category, letting the model treat them as independent without artificial ordinal relationships.
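
A tiny sketch of the encoding itself, assuming the category order is simply alphabetical (real encoders let you control this) and using made-up color values:

```python
def one_hot_encode(values):
    """Map each categorical value to a binary indicator row."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one column is set per value
        encoded.append(row)
    return categories, encoded


categories, encoded = one_hot_encode(["red", "green", "blue", "green"])
```

Every row has the same Euclidean distance to every row of a different category, so no category appears "closer" to another the way 1 is closer to 2 than to 3.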

What is the main practical tradeoff when increasing the depth of a decision tree?

Correct Answer

Deeper trees can capture more complex patterns but are increasingly likely to overfit the training data and lose generalization to new data

Explanation

As depth increases, a tree can fit training data (and its noise) more closely, increasing variance and the risk of overfitting. Techniques like pruning, setting a maximum depth, or requiring a minimum samples per leaf help balance fit and generalization.

In scikit-learn-style workflows, what is the purpose of a Pipeline object?

Correct Answer

It chains preprocessing steps and an estimator into a single object so transformations are applied consistently during both training and inference, preventing data leakage

Explanation

A pipeline bundles steps like scaling, encoding, and the final estimator so that fit/transform are applied in the same order on both training and test data, which prevents accidentally leaking information from the test set into preprocessing statistics.

What is early stopping during neural network training?

Correct Answer

Monitoring performance on a validation set and halting training once it stops improving, to prevent the model from overfitting to the training data

Explanation

Early stopping tracks a validation metric (such as validation loss) and stops training once it stagnates or worsens for a set number of epochs (patience). This acts as an implicit regularizer, halting training before the model starts memorizing the training set.
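
The patience logic can be sketched independently of any training framework. Here the per-epoch validation losses are supplied as a plain list (invented values), and the hypothetical helper reports when training would stop and which epoch had the best loss.

```python
def early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch  # patience exhausted: stop here
    return len(val_losses) - 1, best_epoch  # ran through every epoch


val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.50]
stop_epoch, best_epoch = early_stopping(val_losses, patience=3)
```

With a patience of 3 the run stops at epoch 5 and never sees the later improvement at epoch 6, which illustrates why patience is itself a hyperparameter worth tuning. In practice the weights from `best_epoch` are usually restored after stopping.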

What is the difference between bagging and boosting as ensemble techniques?

Correct Answer

Bagging trains models independently in parallel on bootstrapped samples to reduce variance; boosting trains models sequentially, each focusing on correcting the errors of the previous ones, primarily reducing bias

Explanation

Bagging (e.g., Random Forest) reduces variance by averaging independent models trained on bootstrapped subsets. Boosting (e.g., AdaBoost, Gradient Boosting, XGBoost) builds models sequentially, with each new model emphasizing the mistakes of the ensemble so far, typically reducing bias.

Why might you use the Adam optimizer instead of plain stochastic gradient descent (SGD) when training a neural network?

Correct Answer

Adam adapts the learning rate for each parameter using estimates of first and second moments of the gradients, often leading to faster convergence with less manual tuning than plain SGD

Explanation

Adam combines momentum (using a moving average of past gradients) with per-parameter adaptive learning rates (using a moving average of squared gradients), which often makes training more stable and faster to converge than vanilla SGD, especially on noisy or sparse gradients.
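
The two moving averages can be seen in a short scalar sketch of the Adam update. The defaults below follow commonly used values, while `adam_minimize` and the toy objective are assumptions made for illustration.

```python
import math


def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a scalar function with Adam, given its gradient function."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # first moment: moving average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # second moment: moving average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Dividing by the square root of `v_hat` is what makes the step size adaptive: in a multi-parameter model each weight gets its own `m` and `v`.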

What is data leakage in a machine learning workflow?

Correct Answer

When information from outside the training dataset — often from the validation or test set, or from the future — improperly influences model training, producing unrealistically optimistic performance estimates

Explanation

Data leakage happens when features or preprocessing statistics are derived using information that would not be available at prediction time (such as fitting a scaler on the full dataset before splitting). The result is a model that looks great in evaluation but performs poorly in production.

What is the purpose of a learning rate schedule (e.g., step decay or cosine annealing) during training?

Correct Answer

To systematically adjust the learning rate over the course of training — often starting higher for faster initial progress and decreasing later for fine-grained convergence near a minimum

Explanation

A learning rate schedule changes the step size used in gradient updates over time. A common pattern is to start with a larger learning rate to make rapid progress and gradually reduce it so the optimizer can settle into a minimum without overshooting.
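
Both schedules named in the question reduce to small formulas. The sketch below is one plausible way to write them; the function names and default values are illustrative, not taken from any particular framework.

```python
import math


def step_decay(epoch, base_lr=0.1, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return base_lr * drop ** (epoch // epochs_per_drop)


def cosine_annealing(epoch, total_epochs, base_lr=0.1, min_lr=0.0):
    """Decay smoothly from base_lr to min_lr along half a cosine wave."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Step decay changes the rate in discrete jumps, while cosine annealing lowers it continuously and reaches `min_lr` at the final epoch.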

What is tokenization in the context of NLP and language models?

Correct Answer

The process of splitting raw text into smaller units (words, subwords, or characters) that a model can map to numerical IDs and process

Explanation

Tokenization converts text into discrete units — often subword pieces via algorithms like Byte-Pair Encoding (BPE) or WordPiece — which are then mapped to integer IDs and embedded. Subword tokenization balances vocabulary size with the ability to represent rare or unseen words.
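
To make the subword idea concrete, here is a toy sketch of a single Byte-Pair Encoding merge: count adjacent symbol pairs and fuse the most frequent one. Production tokenizers repeat this thousands of times and handle word frequencies and byte-level details that are omitted here; the corpus and helper names are made up.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in a symbol list with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


# Toy corpus, starting from character-level symbols.
corpus = [list("hug"), list("hugs"), list("bug"), list("pug")]
pair = most_frequent_pair(corpus)
corpus = [merge_pair(word, pair) for word in corpus]
```

After one merge the pair `("u", "g")` becomes the single symbol `"ug"`, so a word not in the corpus, such as "mug", could still be represented as `["m", "ug"]`.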

What is the purpose of a learning curve (training size vs. error) when diagnosing a model?

Correct Answer

It plots training and validation error as a function of training set size, helping diagnose whether a model suffers from high bias (both errors plateau high) or high variance (large gap between training and validation error)

Explanation

Learning curves reveal whether more data would help: a persistent gap between low training error and high validation error suggests overfitting (high variance), while both errors converging to a high value suggests the model is too simple (high bias) and needs more capacity or better features.

Why is min-max scaling or standardization often applied before training algorithms like k-NN or SVM?

Correct Answer

Because these algorithms rely on distance or dot-product calculations, and features on larger numeric scales would otherwise dominate the result regardless of their actual importance

Explanation

Algorithms built on distances or dot products (k-NN, k-means, SVMs with common kernels) compute similarity using feature magnitudes. If one feature ranges from 0 to 1 and another from 0 to 100,000, the larger-scale feature will dominate the computation, so scaling features to comparable ranges (standardization or min-max scaling) is essential.
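
A small numeric sketch shows the effect. The rows are made-up values with one feature in [0, 1] and one in [0, 100000]; `min_max_scale` is a hand-written helper, not a library call, and in a real workflow its minimum and maximum would be computed on the training split only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_scale(rows):
    """Rescale each column to [0, 1] using that column's min and max."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Columns: a ratio in [0, 1] and an income-like feature in [0, 100000].
rows = [[0.1, 50_000.0], [0.9, 50_500.0], [0.1, 52_000.0], [0.0, 0.0], [1.0, 100_000.0]]
raw_d01, raw_d02 = euclidean(rows[0], rows[1]), euclidean(rows[0], rows[2])
scaled = min_max_scale(rows)
scaled_d01, scaled_d02 = euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2])
```

On the raw values row 2 looks farther from row 0 than row 1 does, purely because of the large-scale column. After scaling the ordering flips, because the big difference in the first feature now counts.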

What is the practical difference between using grid search and random search for hyperparameter tuning?

Correct Answer

Grid search exhaustively evaluates every combination on a predefined grid, while random search samples a fixed number of random combinations, often finding good results faster when only a few hyperparameters truly matter

Explanation

Grid search becomes computationally expensive as the number of hyperparameters grows, since the number of combinations multiplies. Random search samples combinations randomly and often finds comparably good settings more efficiently, because typically only a few hyperparameters have a large effect on performance (Bergstra and Bengio, 2012).
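
The difference in cost is easy to see by generating the candidates each strategy would evaluate. The search space below is hypothetical, and the helpers only build configurations; scoring each one with cross-validation is left out.

```python
import itertools
import random

# Hypothetical search space; names and values are illustrative only.
space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "max_depth": [2, 4, 6, 8],
    "subsample": [0.6, 0.8, 1.0],
}


def grid_candidates(space):
    """Every combination on the grid."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in itertools.product(*space.values())]


def random_candidates(space, n_iter, seed=0):
    """A fixed budget of independently sampled combinations."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_iter)]


grid = grid_candidates(space)                   # 3 * 4 * 3 = 36 configurations
sampled = random_candidates(space, n_iter=10)   # budget chosen up front
```

Adding a fourth hyperparameter with five values would push the grid to 180 configurations, while the random-search budget stays at whatever `n_iter` is set to.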

What is self-supervised learning and how is it used in modern AI?

Correct Answer

Learning representations from unlabeled data by creating supervised signals from the data itself (masked prediction, contrastive learning), enabling pre-training at scale

Explanation

SSL (BERT: masked token prediction; GPT: next token prediction; CLIP: image-text contrastive; MAE: masked image patches) enables pre-training on internet-scale unlabeled data, learning powerful representations.

What is the scaling law in large language models?

Correct Answer

Empirical power-law relationships showing that model performance improves predictably with compute, parameters, and data — enabling principled decisions about training runs

Explanation

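Kaplan et al. (2020): when the other factors are not the bottleneck, loss follows separate power laws, L(N) ∝ N^-0.076 and L(D) ∝ D^-0.095 (N = non-embedding parameters, D = training tokens). Chinchilla (Hoffmann et al., 2022): for compute-optimal training, N and D should be scaled in roughly equal proportion, which works out to about 20 tokens per parameter. These laws guide how the compute budget of a large training run is split between model size and data.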

What is the attention mechanism complexity and how does sparse attention address it?

Correct Answer

Standard self-attention is O(n²) in sequence length, making long sequences expensive. Sparse attention restricts each token to local and global attention patterns: Longformer and BigBird reduce the cost to O(n), while the earlier Sparse Transformer reduces it to O(n√n)

Explanation

O(n²) attention quadruples cost when doubling sequence length. FlashAttention (IO-aware) keeps O(n²) compute but cuts memory traffic and memory footprint by computing attention in tiles, without materializing the full n×n score matrix. Sparse methods genuinely reduce compute for long sequences.
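
As a rough illustration of the scaling argument, the sketch below counts attention score entries for full attention and for a sliding-window pattern. The window is only one ingredient of Longformer-style sparse attention (global tokens and random blocks are left out), and the function names are illustrative.

```python
def full_attention_pairs(n):
    """Every token attends to every token: n * n score entries."""
    return n * n


def sliding_window_pairs(n, window):
    """Each token attends only to neighbours within `window` positions on either side."""
    return sum(min(n - 1, i + window) - max(0, i - window) + 1 for i in range(n))
```

Doubling `n` multiplies `full_attention_pairs` by four, while for a fixed window `sliding_window_pairs` roughly doubles, which is the linear growth that makes long sequences affordable.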

What is the mixture of experts (MoE) architecture?

Correct Answer

A neural network architecture routing inputs to a subset of "expert" feed-forward networks, enabling sparse activation of a much larger total parameter count

Explanation

MoE (Switch Transformer, Mixtral): a learned router sends each token to its top-k experts (sparse activation), so total params >> active params per token. Enables scaling to hundreds of billions of parameters or more while keeping per-token compute manageable.
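
Top-k routing can be sketched with scalar functions standing in for the expert feed-forward networks. Renormalizing the gate weights over the selected experts is one common choice rather than the only one, and every name and value below is an assumption made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    gates = top_k_route(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


# Four toy "experts" (scalar functions standing in for feed-forward networks).
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
router_logits = [0.1, 2.0, 2.0, -1.0]
gates = top_k_route(router_logits, k=2)
output = moe_forward(3.0, experts, router_logits, k=2)
```

Only two of the four experts are evaluated for this input, which is why the parameters active per token stay far below the total parameter count.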

What is mechanistic interpretability in LLMs?

Correct Answer

A research program reverse-engineering the specific circuits and algorithms implemented by transformer weights to understand exactly how models compute their outputs

Explanation

Mechanistic interpretability (Anthropic, EleutherAI) identifies circuits: induction heads, copy suppression heads, indirect object identification circuits. The aim is to understand transformer computations as fully as one would when reverse-engineering software.

What is the difference between in-context learning and fine-tuning in LLMs?

Correct Answer

In-context learning (few-shot prompting) adapts behavior without parameter updates; fine-tuning updates model weights on task-specific data

Explanation

ICL (Brown et al., 2020): task examples in the prompt guide behavior without updating weights. Surprising emergent ability of large models. Fine-tuning updates weights: better at specific tasks but requires compute and can degrade other capabilities.

What is Constitutional AI (CAI) and how does it improve alignment?

Correct Answer

Anthropic's technique for self-critiquing model outputs against a list of principles and using the critiques to generate better responses, reducing dependence on human labelers

Explanation

CAI (Bai et al., 2022): SL-CAI critiques and revises its own outputs against a "constitution" of principles. RL-CAI uses AI feedback instead of human feedback. Scales alignment without proportional human labeling cost.

What is a neural ordinary differential equation (Neural ODE)?

Correct Answer

A model parameterizing the derivative of the hidden state as a neural network, solving an ODE to get the output — enabling continuous-depth models and memory efficiency

Explanation

Neural ODE (Chen et al., 2018): instead of discrete layers, define dh/dt = f(h,t,θ) and solve it with an ODE solver. Memory cost is O(1) in depth via the adjoint method. Used for continuous normalizing flows, time-series modeling, and dynamical systems.
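
The forward pass can be sketched with a fixed-step Euler solver. This is a simplification: practical implementations use adaptive solvers and the adjoint method for gradients, neither of which appears here, and `dynamics` is a one-parameter stand-in for the learned network.

```python
import math


def euler_solve(f, h0, t0, t1, theta, steps=1000):
    """Integrate dh/dt = f(h, t, theta) from t0 to t1 with fixed-step Euler."""
    h, t = h0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * f(h, t, theta)  # one Euler step
        t += dt
    return h


def dynamics(h, t, theta):
    """Stand-in for the learned network: a one-parameter linear vector field."""
    return theta * h


# With theta = -1 the exact solution at t = 1 is exp(-1).
h1 = euler_solve(dynamics, h0=1.0, t0=0.0, t1=1.0, theta=-1.0)
exact = math.exp(-1.0)
```

A residual layer `h + f(h)` has the same form as one Euler step with `dt = 1`, which is the connection between residual networks and continuous-depth models.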

What is the difference between model parallelism and data parallelism in distributed training?

Correct Answer

Data parallelism: each device trains on different data batches with replicated models; model parallelism: splits the model itself across devices for models too large for one GPU

Explanation

Data parallelism (DDP, FSDP): scale to large datasets. Model parallelism (Tensor parallelism, pipeline parallelism): required when a model (e.g., 70B params) doesn't fit in a single GPU's memory. Megatron-LM uses both.

What is activation engineering/steering in LLMs?

Correct Answer

Directly modifying intermediate activations during inference to steer model behavior (adding a direction to the residual stream to induce concepts like "banana" or "French")

Explanation

Activation steering (Representation Engineering, Zou et al.): identify directions in activation space corresponding to concepts and add them during the forward pass. This changes model behavior without fine-tuning. Related to mechanistic interpretability.

What is the alignment problem in AI?

Correct Answer

The challenge of ensuring advanced AI systems reliably pursue goals that are beneficial to humans, even as capabilities scale far beyond current AI

Explanation

Alignment: an AI system might optimize a proxy metric perfectly while causing unintended harm (Goodhart's law). RLHF, Constitutional AI, interpretability, and scalable oversight are approaches. Key concern for AGI safety research.

What is multi-modal learning in AI?

Correct Answer

Training models that process and relate multiple data modalities (text, images, audio, video) within a unified architecture

Explanation

Multi-modal models (GPT-4V, CLIP, Gemini, Flamingo) learn joint representations across modalities. CLIP learns image-text alignment via contrastive learning. Enables zero-shot image classification, image captioning, visual QA.

What is emergent ability in large language models?

Correct Answer

Capabilities that appear abruptly at certain model scales and weren't predicted by extrapolating performance from smaller models — suggesting qualitative transitions in capability

Explanation

Wei et al. (2022): abilities like chain-of-thought reasoning, arithmetic, and multi-step reasoning appear suddenly above certain model scales. Debate: are these genuinely emergent or artifacts of evaluation metrics? Has major implications for AI forecasting.

What is the temperature parameter in LLM sampling?

Correct Answer

A parameter controlling randomness in token sampling by scaling logits before softmax — lower temperature → more deterministic, higher → more diverse/random outputs

Explanation

Temperature T: logits/T before softmax. T→0 → greedy (always the highest-probability token; T=0 itself is implemented as argmax, since dividing by zero is undefined). T=1 → sample from the model's unmodified distribution. T>1 → flatter distribution, more diverse but less coherent. Nucleus (top-p) and top-k sampling offer alternatives.
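
The effect of T on the distribution can be checked with a few lines. The logits are made up, and `softmax_with_temperature` is a hand-written helper rather than a library function.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat T -> 0 as argmax")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # mass concentrates on the top token
base = softmax_with_temperature(logits, 1.0)   # the unmodified distribution
flat = softmax_with_temperature(logits, 2.0)   # mass spreads across tokens
```

The probability of the top token falls as T rises, from about 0.87 at T=0.5 to about 0.51 at T=2, while the ranking of the tokens never changes.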

What is speculative decoding in LLMs?

Correct Answer

An inference optimization where a small draft model generates tokens quickly, which the large target model verifies in parallel, achieving speedup without quality loss

Explanation

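Speculative decoding (Leviathan et al., 2022): a draft model generates k tokens, and the target model scores all of them in parallel in one forward pass. Each draft token is accepted or rejected by a rejection-sampling rule that preserves the target distribution; tokens are kept up to the first rejection, where a replacement is sampled from the target model. Reported speedups are roughly 2-3x on typical text.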
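
The draft-then-verify loop can be sketched for the greedy case, where a draft token is accepted only if it equals the target model's own next token. This is a simplification of the sampling-based acceptance rule, and the two "models" are hypothetical functions over integer tokens.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding.

    draft_next / target_next map a token list to that model's next token.
    Returns the tokens appended in this round.
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The target model checks each position. A real implementation scores
    #    all k positions in a single batched forward pass; here we loop.
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. All k accepted: the target pass also yields one bonus token.
    accepted.append(target_next(ctx))
    return accepted


# Toy models (hypothetical): the target counts upward; the draft agrees
# except that it gets the token after 2 wrong.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: 0 if ctx[-1] == 2 else ctx[-1] + 1

out = speculative_step([0], draft, target, k=4)
```

Every returned token is one the target model would have produced by itself, so the output is unchanged; the gain comes from emitting several tokens per target pass when the draft is usually right.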

What is the difference between RLHF and RLAIF (as used in Constitutional AI)?

Correct Answer

RLHF uses human preferences to train a reward model; RLAIF (RL from AI feedback) uses AI-generated preferences, enabling scale without proportional human annotation cost

Explanation

RLHF: human raters compare outputs → reward model → PPO optimization. RLAIF: the AI itself rates outputs against constitutional principles → reward model → RL. Scales better but depends on AI judgment quality.

What is the difference between zero-shot, one-shot, and few-shot prompting?

Correct Answer

Zero-shot: no examples in prompt. One-shot: one example. Few-shot: several examples. More examples generally improve performance by demonstrating the task format and expected output style

Explanation

Few-shot prompting (Brown et al., 2020): including k demonstrations in the prompt dramatically improves performance on novel tasks without weight updates. The LLM infers the task from examples via in-context learning.

What is catastrophic forgetting in neural networks?

Correct Answer

The tendency of neural networks to forget previously learned tasks when trained sequentially on new tasks, as new weights overwrite old task knowledge

Explanation

Catastrophic forgetting (McCloskey and Cohen, 1989) is the main challenge in continual/lifelong learning. Solutions: EWC (penalize changes to important weights), replay buffers (mix old and new data), modular networks, PackNet (parameter isolation).

Mathematically, why does L2 regularization (Ridge) tend to shrink weights smoothly toward zero rather than setting them exactly to zero, unlike L1 (Lasso)?

Correct Answer

Because the L2 penalty's gradient is proportional to the weight itself, so its pull weakens as a weight approaches zero, whereas the L1 penalty's gradient has constant magnitude and can drive small weights all the way to exactly zero, producing sparsity

Explanation

The derivative of the L2 penalty term (λw²) with respect to w is 2λw — proportional to w, so the shrinkage effect diminishes near zero. The derivative of the L1 penalty (λ|w|) is a constant ±λ, which keeps pushing small weights toward zero until they reach it exactly, producing sparse solutions useful for feature selection.
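
The contrast can be seen by applying each penalty repeatedly to a single weight, with the data loss left out so that only the shrinkage is visible. For L1 the sketch uses the soft-thresholding (proximal) update, a standard way to handle the kink at zero, rather than a raw subgradient step; the function names and constants are illustrative.

```python
def l2_gradient_step(w, lam, lr):
    """Gradient step on lam * w**2: scales w by the constant factor (1 - 2 * lr * lam)."""
    return w - lr * 2 * lam * w


def l1_proximal_step(w, lam, lr):
    """Soft-thresholding: move w toward zero by lr * lam, clipping at zero."""
    shrink = lr * lam
    if w > shrink:
        return w - shrink
    if w < -shrink:
        return w + shrink
    return 0.0


w_l2 = w_l1 = 0.5
for _ in range(100):
    w_l2 = l2_gradient_step(w_l2, lam=0.1, lr=0.1)
    w_l1 = l1_proximal_step(w_l1, lam=0.1, lr=0.1)
```

After 100 steps the L2 weight has decayed geometrically but is still positive, while the L1 weight has been moved by a fixed amount per step and sits at exactly zero.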

In the bias-variance decomposition of expected prediction error, what does the irreducible error term represent, and why can no model eliminate it?

Correct Answer

The variance inherent in the data-generating process itself (label noise or unmeasured factors), which sets a lower bound on achievable error regardless of model choice or amount of data

Explanation

Expected squared error decomposes into bias² + variance + irreducible error (often written as σ², the noise variance in the true relationship between inputs and outputs). Because this noise is intrinsic to the data-generating process, it represents a theoretical floor on performance that no amount of additional data, model capacity, or tuning can remove.