What is BERT and how is it pre-trained?

Answer

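BERT (Bidirectional Encoder Representations from Transformers) is a Transformer encoder that learns deep bidirectional representations: every layer conditions on both the left and the right context of each token. It is pre-trained on large unlabeled corpora (BooksCorpus and English Wikipedia in the original paper) with two self-supervised tasks. In Masked Language Modeling (MLM), 15% of input tokens are selected for prediction; of these, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged, and the model predicts the original token at each selected position. In Next Sentence Prediction (NSP), the model classifies whether the second sentence of a pair actually follows the first in the source text or was sampled at random. After pre-training, BERT is fine-tuned on a specific NLP task (question answering, classification, NER) by adding a small task-specific output head and updating all weights end to end.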
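
The MLM corruption step can be sketched in a few lines of Python. This is a minimal illustration, not BERT's actual data pipeline: the function name `mask_tokens`, the toy vocabulary, and the whitespace-level tokens are assumptions made for the example. It applies the 80/10/10 replacement split from the original BERT recipe ([MASK] / random token / unchanged), but selects each token independently with probability 0.15, whereas the reference implementation samples a fixed number of positions per sequence.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}  # never selected for prediction


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Build one MLM example (hypothetical helper, for illustration only).

    Returns (corrupted, labels). labels[i] is the original token at every
    selected position and None everywhere else, so the loss is computed
    only on selected positions.
    """
    if rng is None:
        rng = random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Skip special tokens and positions that are not selected.
        if token in SPECIAL_TOKENS or rng.random() >= select_prob:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN          # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


# Toy usage: the sentence and vocabulary are made up for the example.
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
```

Keeping some selected tokens unchanged or swapping in random ones means the model cannot rely on seeing [MASK] at every position it must predict, which matters because [MASK] never appears during fine-tuning.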