What is a data lake vs a data warehouse?

Q: What is a data lake vs a data warehouse?

Data Warehouse: a centralized repository for structured, processed, and curated data optimized for analytical queries (OLAP — Online Analytical Processing). Data is cleaned, transformed, and loaded (ETL — Extract, Transform, Load) before being stored in a highly organized schema (star schema, snowflake schema). Query performance is excellent for predefined analytics. Examples: Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse. Suited for: business intelligence, dashboards, financial

Answer

Data Warehouse: a centralized repository for structured, processed, and curated data optimized for analytical queries (OLAP — Online Analytical Processing). Data is cleaned, transformed, and loaded (ETL — Extract, Transform, Load) before being stored in a highly organized schema (star schema, snowflake schema). Query performance is excellent for predefined analytics. Examples: Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse. Suited for: business intelligence, dashboards, financial reporting. Data Lake: a centralized repository that stores raw, unprocessed data in its native format (structured, semi-structured, unstructured — CSV, JSON, Parquet, images, logs, video). Uses a schema-on-read approach — apply schema when querying, not when storing. Much cheaper storage (S3, HDFS). Examples: AWS S3 + Glue/Athena, Azure Data Lake, Databricks. Suited for: machine learning, data exploration, storing everything for future unknown use cases. Data Lakehouse: hybrid architecture combining the storage flexibility of data lakes with the performance and ACID transactions of data warehouses. Examples: Delta Lake (Databricks), Apache Iceberg, Apache Hudi. Key differences: Data warehouse = structured + schema-on-write + fast analytics + expensive; Data lake = raw + schema-on-read + flexible + cheap + can be slow to query without proper formats. Most modern architectures use both, with data flowing from lake (raw ingestion) to warehouse (refined analytics).

Answer

More System Design Questions