What is a "shuffle" in distributed data processing, and why is it costly?

Correct Answer

A shuffle redistributes data across partitions and nodes (for example, for a join or groupBy), which involves expensive network and disk I/O.

Explanation

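Operations like groupBy and join require all records with the same key to be in the same partition, and repartition explicitly redistributes records. In each case data must be moved between nodes across the network, and that movement is the shuffle. It is costly because records typically have to be serialized, written to local disk, and transferred over the network before the next stage can start.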
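
The idea can be illustrated with a minimal, framework-free sketch in plain Python. It is a toy simulation, not how any particular engine implements its shuffle: the partitions are in-memory lists, and the names `target_partition`, `shuffle`, and `sum_by_key` are hypothetical helpers introduced only for this example. Every record whose destination differs from its current partition stands in for data that a real cluster would have to send over the network.

```python
import zlib
from collections import defaultdict


def target_partition(key: str, num_partitions: int) -> int:
    """Pick the destination partition for a key (hash partitioning)."""
    # Use a stable hash; Python's built-in hash() for str is salted per process.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(partitions):
    """Route every record to the partition that owns its key.

    Returns the new partitions and the number of records that changed
    partition (a stand-in for network transfer in a real cluster).
    """
    n = len(partitions)
    out = [[] for _ in range(n)]
    moved = 0
    for src, records in enumerate(partitions):
        for key, value in records:
            dst = target_partition(key, n)
            if dst != src:
                moved += 1
            out[dst].append((key, value))
    return out, moved


def sum_by_key(partitions):
    """Aggregate within each partition; correct only after a shuffle."""
    results = []
    for records in partitions:
        totals = defaultdict(int)
        for key, value in records:
            totals[key] += value
        results.append(dict(totals))
    return results


# Before the shuffle, records for the same key are scattered.
before = [
    [("a", 1), ("b", 2), ("c", 3)],
    [("a", 4), ("c", 5)],
    [("b", 6), ("a", 7)],
]

after, moved = shuffle(before)
totals = sum_by_key(after)

print("records moved between partitions:", moved)
print("per-partition totals:", totals)
```

After the shuffle, each key lives in exactly one partition, so a purely local aggregation yields the complete total for that key. The `moved` counter shows why the step is expensive: most records have to leave the partition they started in.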
