What is a data lake vs a data warehouse?
Why Interviewers Ask This
This is a classic screening question for System Design roles. Hiring managers ask it early in interviews to gauge your baseline understanding and to see whether you can communicate technical concepts clearly.
Answer
Data Warehouse: a centralized repository for structured, processed, and curated data, optimized for analytical queries (OLAP, Online Analytical Processing). Data is cleaned and transformed before it is stored (ETL: Extract, Transform, Load) in a highly organized schema such as a star schema or snowflake schema. This is schema-on-write: the schema is enforced when data is loaded. Query performance is excellent for predefined analytics. Examples: Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse. Suited for: business intelligence, dashboards, financial reporting.
Data Lake: a centralized repository that stores raw, unprocessed data in its native format, whether structured, semi-structured, or unstructured (CSV, JSON, Parquet, images, logs, video). It uses a schema-on-read approach: the schema is applied when the data is queried, not when it is stored. Storage is much cheaper, typically object storage or a distributed file system (S3, HDFS). Examples: AWS S3 + Glue/Athena, Azure Data Lake, Databricks. Suited for: machine learning, data exploration, storing everything for future unknown use cases.
Data Lakehouse: a hybrid architecture combining the storage flexibility of data lakes with the performance and ACID transactions of data warehouses. Examples: Delta Lake (Databricks), Apache Iceberg, Apache Hudi.
Key differences: a data warehouse is structured, schema-on-write, fast for analytics, and expensive. A data lake is raw, schema-on-read, flexible, and cheap, but it can be slow to query unless the data is kept in query-friendly formats (for example, a columnar format such as Parquet). Many modern architectures use both, with data flowing from the lake (raw ingestion) to the warehouse (refined analytics).
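To make the schema-on-write versus schema-on-read distinction concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not how any particular warehouse or lake product works: an in-memory SQLite table stands in for the warehouse, a list of raw JSON strings stands in for files in a lake, and the names RAW_EVENTS, fact_orders, load_warehouse, and query_lake are hypothetical.

```python
import json
import sqlite3

# Hypothetical raw events as they might land in a lake.
# The second record has an extra field; the third has a malformed amount.
RAW_EVENTS = [
    '{"order_id": 1, "region": "EU", "amount": "19.99"}',
    '{"order_id": 2, "region": "US", "amount": "5.00", "coupon": "SPRING"}',
    '{"order_id": 3, "region": "EU", "amount": "n/a"}',
]


def load_warehouse(raw_lines):
    """Schema-on-write: validate and transform each record before storing it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact_orders ("
        "order_id INTEGER PRIMARY KEY, "
        "region TEXT NOT NULL, "
        "amount REAL NOT NULL)"
    )
    rejected = []
    for line in raw_lines:
        record = json.loads(line)
        try:
            row = (
                int(record["order_id"]),
                str(record["region"]),
                float(record["amount"]),
            )
        except (KeyError, TypeError, ValueError):
            rejected.append(line)  # bad data never reaches the table
            continue
        conn.execute("INSERT INTO fact_orders VALUES (?, ?, ?)", row)
    conn.commit()
    return conn, rejected


def query_lake(raw_lines, region):
    """Schema-on-read: raw lines stay untouched; fields are interpreted per query."""
    total = 0.0
    for line in raw_lines:
        record = json.loads(line)
        if record.get("region") != region:
            continue
        try:
            total += float(record.get("amount", 0))
        except (TypeError, ValueError):
            pass  # each reader decides how to treat malformed values
    return total


warehouse, rejected_rows = load_warehouse(RAW_EVENTS)
eu_total_warehouse = warehouse.execute(
    "SELECT SUM(amount) FROM fact_orders WHERE region = 'EU'"
).fetchone()[0]
eu_total_lake = query_lake(RAW_EVENTS, "EU")
```

Both paths give the same EU total here, but the cost falls in different places. The warehouse path pays for cleaning once at load time and rejects the malformed record up front. The lake path keeps every record, including the unexpected coupon field, and leaves each query to handle bad values itself. Running load_warehouse over the raw events also mirrors the lake-to-warehouse flow described in the answer.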
Pro Tip
Before answering, structure your response: one-line definition → real-world analogy → concrete example from a project. This makes even complex System Design answers easy to follow.