Mathematically, why does L2 regularization (Ridge) tend to shrink weights smoothly toward zero rather than setting them exactly to zero, unlike L1 (Lasso)?
The correct answer is B) Because the L2 penalty's gradient is proportional to the weight itself, so its pull weakens as a weight approaches zero, whereas the L1 penalty's gradient has constant magnitude and can drive small weights all the way to exactly zero, producing sparsity.
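The derivative of the L2 penalty term (λw²) with respect to w is 2λw, which is proportional to w, so the shrinkage force fades as w approaches zero and the weight becomes small without landing exactly on zero. The derivative of the L1 penalty (λ|w|) is λ·sign(w), which has constant magnitude λ for any w ≠ 0 (at w = 0 the penalty is not differentiable, and any value in [−λ, λ] is a valid subgradient). That constant pull keeps pushing small weights toward zero until they reach it exactly, producing sparse solutions useful for feature selection.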
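The difference can be seen numerically with a minimal sketch that applies only the penalty to a single weight, with no data-fitting loss. The function names, the starting weight, and the values of lam and lr are illustrative assumptions. The L1 update here is a proximal (soft-thresholding) step, the form commonly used by Lasso solvers, rather than a plain subgradient step, which would tend to oscillate around zero instead of settling on it.

```python
def l2_step(w, lam, lr):
    # Gradient step on the penalty lam * w**2, whose gradient is 2 * lam * w.
    # The step size scales with w, so the update shrinks as w shrinks.
    return w - lr * 2.0 * lam * w


def l1_step(w, lam, lr):
    # Proximal (soft-thresholding) step on the penalty lam * |w|.
    # The pull has constant size lr * lam; once |w| falls below it,
    # the weight is set to exactly zero.
    shrink = lr * lam
    if w > shrink:
        return w - shrink
    if w < -shrink:
        return w + shrink
    return 0.0


def run(step, w0, lam=0.1, lr=0.5, n_steps=50):
    # Apply the same penalty-only update repeatedly to one weight.
    w = w0
    for _ in range(n_steps):
        w = step(w, lam, lr)
    return w


w_l2 = run(l2_step, 1.0)  # multiplied by 0.9 each step: small but nonzero
w_l1 = run(l1_step, 1.0)  # reduced by 0.05 each step, then clamped to 0.0
print("L2:", w_l2)
print("L1:", w_l1)
```

With these assumed settings the L2 weight decays geometrically and stays positive, while the L1 weight reaches exactly 0.0 well before the loop ends. In a real model the data-fitting gradient is added to each step, so only weights whose data gradient is weaker than the L1 pull end up at zero.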