Big Data & Data Engineering MCQ
Test your Big Data & Data Engineering knowledge with 100 multiple-choice questions that range from fundamentals to advanced concepts, with instant feedback and explanations.
1. What is "Big Data" most commonly defined by?
2. Which of the following are commonly called the "3 Vs" of Big Data?
3. What is HDFS?
4. What is the primary purpose of MapReduce?
5. What is the main difference between batch processing and stream processing?
6. What is the key difference between a data warehouse and a data lake?
7. What does ETL stand for in data engineering?
8. How does ELT differ from ETL?
9. Which best describes "structured data"?
10. Which of these is an example of semi-structured data?
11. What is a key characteristic of NoSQL databases?
12. What does "schema-on-read" mean?
13. What is a "data pipeline"?
14. What is the main difference between OLTP and OLAP systems?
15. Which file format is columnar and commonly used for analytical workloads?
16. What is "data partitioning" in the context of data storage?
17. Why is SQL important for data engineers?
18. What is "data ingestion"?
19. In a data lake, why is raw data often retained even after processing?
20. What is distributed computing?
21. What is Apache Spark primarily used for?
22. In a computing cluster, what is a "node"?
23. What does "data quality" generally refer to?
24. In a relational database, what is the purpose of a primary key?
25. What is the purpose of a database index?
26. In data modeling, what is a "fact table"?
27. What is a "dimension table" used for in a star schema?
28. What is the main goal of database normalization?
29. Why might a data engineer intentionally "denormalize" data in an analytics warehouse?
30. What does "data governance" primarily address?
31. What is "metadata"?
32. What is a "data catalog" used for?
33. What is "Change Data Capture" (CDC)?
34. In data pipelines, what does "idempotent" mean?
35. What is "data serialization"?
36. Why are compression formats like Snappy or Gzip used in big data systems?
37. What is the purpose of a scheduling tool like cron in data pipelines?
38. What is "data lineage"?
39. Which of these is a popular cloud data warehouse service?
40. What is the role of a "data engineer" most accurately described as?
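As a small illustration of the idea behind question 34 (idempotent pipelines) and question 24 (primary keys), the sketch below loads the same batch twice into an in-memory SQLite table. The table name `orders`, its columns, and the helper names `load_orders` and `count_orders` are hypothetical and chosen only for this example. Because the load is an upsert keyed on the primary key, re-running it should leave the table in the same state.

```python
import sqlite3


def load_orders(conn, rows):
    """Upsert rows keyed on order_id so a re-run does not duplicate data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, amount REAL)"
    )
    # INSERT OR REPLACE overwrites any existing row with the same primary key.
    conn.executemany(
        "INSERT OR REPLACE INTO orders (order_id, amount) VALUES (?, ?)",
        rows,
    )
    conn.commit()


def count_orders(conn):
    """Return the number of rows currently in the orders table."""
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


conn = sqlite3.connect(":memory:")
batch = [(1, 9.99), (2, 24.50)]

load_orders(conn, batch)
load_orders(conn, batch)  # simulated retry of the same batch

row_count = count_orders(conn)
```

A plain `INSERT` would instead fail on the second run with a primary-key violation, or duplicate rows if the table had no key, which is the kind of behavior an idempotent design aims to avoid.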
1. What is the main difference between a Spark RDD and a DataFrame?
2. In Spark, what is the difference between a "transformation" and an "action"?
3. What is "lazy evaluation" in Spark, and why is it beneficial?
4. What is a "shuffle" in distributed data processing, and why is it costly?
5. What is a "broadcast join" and when is it useful?
6. In Apache Kafka, what is a "topic"?
7. What is the relationship between Kafka "partitions" and "offsets"?
8. In a workflow orchestration tool like Apache Airflow, what is a "DAG"?
9. Why is it important for pipeline tasks in Airflow to be idempotent?
10. What does a SQL "window function" allow you to do that a normal aggregate (GROUP BY) does not?
11. In dimensional modeling, what is a "Slowly Changing Dimension" (SCD) Type 2?
12. Why might a table be both "partitioned" by date and "bucketed" by a column like user_id?
13. What is a key benefit of columnar formats like Parquet for analytical queries?
14. What is "schema evolution" in formats like Avro, and why does it matter?
15. What is the purpose of data validation frameworks (e.g., Great Expectations) in a pipeline?
16. What is the "medallion architecture" (bronze, silver, gold layers) in a data lakehouse?
17. What does the CAP theorem state about distributed systems?
18. What does "eventual consistency" mean in distributed databases?
19. What role does Apache ZooKeeper traditionally play in distributed systems like older Kafka or Hadoop clusters?
20. In Spark architecture, what is the relationship between the "driver" and "executors"?
21. When should you use Spark's "cache()" or "persist()" on a DataFrame?
22. What is "data skew" in a distributed join or aggregation, and why is it a problem?
23. What does "exactly-once" processing semantics mean in a streaming system?
24. What is a "watermark" used for in stream processing (e.g., Spark Structured Streaming, Flink)?
25. What is "micro-batching" in stream processing frameworks like Spark Structured Streaming?
26. How does Apache Flink's streaming model generally differ from Spark's traditional micro-batch model?
27. What is a "materialized view" in a data warehouse, and why use one?
28. How might a Slowly Changing Dimension Type 2 implementation typically be structured in a table?
29. What is the core idea behind a "data mesh" architecture?
30. What problem does a "schema registry" (e.g., used with Kafka and Avro) solve?
31. What is "backpressure" in a streaming pipeline?
32. Why is it generally recommended to avoid SELECT * in large-scale analytical queries?
33. What is the purpose of "checkpointing" in streaming applications?
34. What is a key advantage of using a cloud object store (e.g., S3, GCS, ADLS) as the foundation for a data lake?
35. What is the difference between "vertical scaling" and "horizontal scaling" for data systems?
36. In a streaming join between two Kafka topics, why is "event time" often preferred over "processing time"?
37. What is the purpose of "data deduplication" in a pipeline?
38. In Hadoop, what is the role of YARN (Yet Another Resource Negotiator)?
39. In Spark, what is the difference between "repartition()" and "coalesce()"?
40. What is the benefit of "partition pruning" when querying a partitioned table?
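The Slowly Changing Dimension Type 2 questions in this set (what it is, and how a table implementing it might be structured) can be made concrete with a short sketch. It assumes one common layout: each row carries `valid_from`, `valid_to`, and an `is_current` flag. Those column names, the `customer_id` and `city` attributes, and the helper `apply_scd2` are hypothetical; real warehouses often implement the same idea with a `MERGE` statement and a surrogate key.

```python
from datetime import date


def apply_scd2(dimension, customer_id, new_city, change_date):
    """Record a change as a new row instead of overwriting history.

    If the tracked attribute changed, the current row is closed
    (valid_to set, is_current cleared) and a new current row is appended.
    """
    for row in dimension:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return dimension  # attribute unchanged, nothing to record
            row["valid_to"] = change_date
            row["is_current"] = False
    dimension.append(
        {
            "customer_id": customer_id,
            "city": new_city,
            "valid_from": change_date,
            "valid_to": None,  # open-ended: this is the current version
            "is_current": True,
        }
    )
    return dimension


dim_customer = []
apply_scd2(dim_customer, 42, "Lisbon", date(2023, 1, 1))
apply_scd2(dim_customer, 42, "Porto", date(2024, 6, 1))
apply_scd2(dim_customer, 42, "Porto", date(2024, 7, 1))  # no change, ignored
```

After these calls the dimension holds two rows for customer 42: a closed "Lisbon" row and a current "Porto" row, so a query can reconstruct which city applied on any given date.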
1. What is the role of Spark's Catalyst optimizer?
2. What does Spark's Tungsten execution engine primarily improve?
3. What is Spark's Adaptive Query Execution (AQE) designed to do?
4. How does Kafka achieve "exactly-once" semantics across producer and consumer with transactions?
5. What happens during Kafka consumer group "rebalancing"?
6. What is the key architectural difference between "Lambda" and "Kappa" architectures for big data processing?
7. What core problem do "lakehouse" table formats like Delta Lake, Apache Iceberg, and Apache Hudi solve for data lakes?
8. How do table formats like Delta Lake or Iceberg typically implement "time travel" queries?
9. What is "Z-ordering" (multi-dimensional clustering) used for in lakehouse tables?
10. Why is "compaction" important for streaming-ingested data lake tables?
11. When does a distributed query engine typically choose a "shuffle hash join" vs. a "sort-merge join"?
12. What is "predicate pushdown" and why is it especially effective with columnar file formats?
13. What is "dictionary encoding" in columnar storage, and when is it most effective?
14. In an end-to-end exactly-once pipeline spanning a source, processing engine, and sink, what is typically the hardest part to guarantee?
15. How does Debezium typically implement Change Data Capture for relational databases?
16. What is "salting" as a technique to address data skew in a distributed join?
17. In streaming checkpointing, why is it important that checkpoint storage be durable and consistent with the state being saved?
18. When evolving a schema by adding a new non-nullable column to an existing large table in a lakehouse format, what is a key consideration?
19. What is "cost-based optimization" (CBO) in a query engine, and what does it rely on?
20. Why might increasing the number of Spark shuffle partitions help with out-of-memory errors during a large aggregation, but also potentially hurt performance if set too high?
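For the question above on dictionary encoding in columnar storage, a toy sketch may help show the mechanics. It is a simplified illustration in plain Python, not the on-disk layout used by Parquet or any other format, and the function names `dictionary_encode` and `dictionary_decode` are hypothetical.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code into a list of distinct values."""
    dictionary = []  # distinct values, in order of first appearance
    index_of = {}    # value -> its position in the dictionary
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


country = ["US", "DE", "US", "US", "FR", "DE", "US"]
dictionary, codes = dictionary_encode(country)
restored = dictionary_decode(dictionary, codes)
```

Here seven strings are stored as three distinct values plus seven small integers. The fewer distinct values a column has relative to its row count, the more this kind of encoding tends to save.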