What is data leakage in machine learning?

Answer

Data leakage occurs when information that would not be available at prediction time (typically from the test set or from the future) influences how the model is built, resulting in overly optimistic performance estimates that do not generalize to new data. Common examples: fitting a scaler on the full dataset before splitting, including future information in features (look-ahead bias), or having test samples appear in the training set. To prevent the first kind, split the data before anything else, fit every preprocessing step (scaling, imputation, encoding) on the training set only, and apply the fitted transformers unchanged to the validation and test sets. The other two call for splitting time-ordered data chronologically and for removing duplicate samples across the split.
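
The scaler example can be made concrete with a minimal sketch. It uses only the Python standard library; the helper names `fit_scaler` and `transform` and the toy feature values are made up for illustration and are not taken from any library. It contrasts standardization parameters learned from the full dataset (leaky) with parameters learned from the training portion alone.

```python
import statistics

def fit_scaler(values):
    """Learn standardization parameters (mean, std) from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Guard against a constant feature, which would give a zero std.
    return mean, (std if std > 0 else 1.0)

def transform(values, scaler):
    """Apply previously fitted parameters without re-estimating them."""
    mean, std = scaler
    return [(v - mean) / std for v in values]

# Hypothetical single feature; the last two rows are held out as the test set.
data = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0, 40.0, 45.0]
train, test = data[:8], data[8:]

# Leaky: the statistics are computed over train AND test rows,
# so the test set shapes the features the model is trained on.
leaky_scaler = fit_scaler(data)

# Leak-free: fit on the training rows only, then reuse those
# parameters for both the training and the test rows.
clean_scaler = fit_scaler(train)
train_scaled = transform(train, clean_scaler)
test_scaled = transform(test, clean_scaler)

print("mean seen by leaky scaler:", leaky_scaler[0])
print("mean seen by clean scaler:", clean_scaler[0])
```

The two means differ because the leaky scaler has already seen the held-out values. With scikit-learn, the same discipline is usually achieved by wrapping the preprocessing steps and the estimator in a `Pipeline`, so that cross-validation refits the transformers on each training fold.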