Big Data & Data Engineering MCQ

Test your Big Data & Data Engineering knowledge with 100 multiple-choice questions that range from fundamentals to advanced concepts, each with instant feedback and an explanation.

100 Questions: 40 Beginner, 40 Intermediate, 20 Advanced

What is "Big Data" most commonly defined by?

Which of the following are commonly called the "3 Vs" of Big Data?

What is HDFS?

What is the primary purpose of MapReduce?

What is the main difference between batch processing and stream processing?

What is the key difference between a data warehouse and a data lake?

What does ETL stand for in data engineering?

How does ELT differ from ETL?

Which best describes "structured data"?

Which of these is an example of semi-structured data?

What is a key characteristic of NoSQL databases?

What does "schema-on-read" mean?

What is a "data pipeline"?

What is the main difference between OLTP and OLAP systems?

Which file format is columnar and commonly used for analytical workloads?

What is "data partitioning" in the context of data storage?

Why is SQL important for data engineers?

What is "data ingestion"?

In a data lake, why is raw data often retained even after processing?

What is distributed computing?

What is Apache Spark primarily used for?

In a computing cluster, what is a "node"?

What does "data quality" generally refer to?

In a relational database, what is the purpose of a primary key?

What is the purpose of a database index?

In data modeling, what is a "fact table"?

What is a "dimension table" used for in a star schema?

What is the main goal of database normalization?

Why might a data engineer intentionally "denormalize" data in an analytics warehouse?

What does "data governance" primarily address?

What is "metadata"?

What is a "data catalog" used for?

What is "Change Data Capture" (CDC)?

In data pipelines, what does "idempotent" mean?

What is "data serialization"?

Why are compression formats like Snappy or Gzip used in big data systems?

What is the purpose of a scheduling tool like cron in data pipelines?

What is "data lineage"?

Which of these is a popular cloud data warehouse service?

Which description most accurately captures the role of a "data engineer"?

What is the main difference between a Spark RDD and a DataFrame?

In Spark, what is the difference between a "transformation" and an "action"?

What is "lazy evaluation" in Spark, and why is it beneficial?

What is a "shuffle" in distributed data processing, and why is it costly?

What is a "broadcast join" and when is it useful?

In Apache Kafka, what is a "topic"?

What is the relationship between Kafka "partitions" and "offsets"?

In a workflow orchestration tool like Apache Airflow, what is a "DAG"?

Why is it important for pipeline tasks in Airflow to be idempotent?

What does a SQL "window function" allow you to do that a normal aggregate (GROUP BY) does not?

In dimensional modeling, what is a "Slowly Changing Dimension" (SCD) Type 2?

Why might a table be both "partitioned" by date and "bucketed" by a column like user_id?

What is a key benefit of columnar formats like Parquet for analytical queries?

What is "schema evolution" in formats like Avro, and why does it matter?

What is the purpose of data validation frameworks (e.g., Great Expectations) in a pipeline?

What is the "medallion architecture" (bronze, silver, gold layers) in a data lakehouse?

What does the CAP theorem state about distributed systems?

What does "eventual consistency" mean in distributed databases?

What role does Apache ZooKeeper traditionally play in distributed systems like older Kafka or Hadoop clusters?

In Spark architecture, what is the relationship between the "driver" and "executors"?

When should you use Spark's "cache()" or "persist()" on a DataFrame?

What is "data skew" in a distributed join or aggregation, and why is it a problem?

What does "exactly-once" processing semantics mean in a streaming system?

What is a "watermark" used for in stream processing (e.g., Spark Structured Streaming, Flink)?

What is "micro-batching" in stream processing frameworks like Spark Structured Streaming?

How does Apache Flink's streaming model generally differ from Spark's traditional micro-batch model?

What is a "materialized view" in a data warehouse, and why use one?

How might a Slowly Changing Dimension Type 2 implementation typically be structured in a table?

What is the core idea behind a "data mesh" architecture?

What problem does a "schema registry" (e.g., used with Kafka and Avro) solve?

What is "backpressure" in a streaming pipeline?

Why is it generally recommended to avoid SELECT * in large-scale analytical queries?

What is the purpose of "checkpointing" in streaming applications?

What is a key advantage of using a cloud object store (e.g., S3, GCS, ADLS) as the foundation for a data lake?

What is the difference between "vertical scaling" and "horizontal scaling" for data systems?

In a streaming join between two Kafka topics, why is "event time" often preferred over "processing time"?

What is the purpose of "data deduplication" in a pipeline?

In Hadoop, what is the role of YARN (Yet Another Resource Negotiator)?

In Spark, what is the difference between "repartition()" and "coalesce()"?

What is the benefit of "partition pruning" when querying a partitioned table?

What is the role of Spark's Catalyst optimizer?

What does Spark's Tungsten execution engine primarily improve?

What is Spark's Adaptive Query Execution (AQE) designed to do?

How does Kafka achieve "exactly-once" semantics across producer and consumer with transactions?

What happens during Kafka consumer group "rebalancing"?

What is the key architectural difference between "Lambda" and "Kappa" architectures for big data processing?

What core problem do "lakehouse" table formats like Delta Lake, Apache Iceberg, and Apache Hudi solve for data lakes?

How do table formats like Delta Lake or Iceberg typically implement "time travel" queries?

What is "Z-ordering" (multi-dimensional clustering) used for in lakehouse tables?

Why is "compaction" important for streaming-ingested data lake tables?

When does a distributed query engine typically choose a "shuffle hash join" vs. a "sort-merge join"?

What is "predicate pushdown" and why is it especially effective with columnar file formats?

What is "dictionary encoding" in columnar storage, and when is it most effective?

In an end-to-end exactly-once pipeline spanning a source, processing engine, and sink, what is typically the hardest part to guarantee?

How does Debezium typically implement Change Data Capture for relational databases?

What is "salting" as a technique to address data skew in a distributed join?

In streaming checkpointing, why is it important that checkpoint storage be durable and consistent with the state being saved?

When evolving a schema by adding a new non-nullable column to an existing large table in a lakehouse format, what is a key consideration?

What is "cost-based optimization" (CBO) in a query engine, and what does it rely on?

Why might increasing the number of Spark shuffle partitions help with out-of-memory errors during a large aggregation, but also potentially hurt performance if set too high?