What is the Transformer architecture?

Answer

The Transformer (introduced in "Attention Is All You Need", 2017) is an architecture based entirely on self-attention mechanisms, replacing recurrence and convolution. It consists of an encoder (maps input to context representations) and decoder (generates output). Each encoder/decoder block has multi-head self-attention and feed-forward layers with residual connections and layer normalization. Transformers process all tokens in parallel (unlike RNNs) and capture long-range dependencies efficiently. They are the foundation of BERT, GPT, T5, and virtually all modern NLP models.