Intermediate
Big Data & Data Engineering
Q44 / 100
What is a "shuffle" in distributed data processing, and why is it costly?
Correct Answer
A) A shuffle redistributes data across partitions/nodes (e.g., for joins or groupBy), involving expensive network and disk I/O
Explanation
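Operations such as groupBy and join need every record with the same key to end up in the same partition, and repartition changes how records are assigned to partitions. Because the input records usually start out spread across many nodes, the engine has to redistribute them: each task writes its output grouped by destination partition, typically to local disk, and downstream tasks fetch those pieces over the network. This redistribution is the shuffle. It is costly because it combines serialization, disk I/O, and network transfer, and because the next stage cannot finish until the data it depends on has arrived.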
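The redistribution described above can be sketched as a small single-process simulation. This is an illustration only, not how any particular engine implements its shuffle: the names `target_partition`, `shuffle`, and `sum_by_key` are hypothetical, plain Python lists stand in for partitions on separate nodes, and CRC32 is an assumed stand-in for whatever hash partitioner a real system uses.

```python
from collections import defaultdict
from zlib import crc32


def target_partition(key: str, num_partitions: int) -> int:
    # Deterministic hash partitioner (assumption: CRC32 modulo partition count).
    # Python's built-in hash() is salted per process for strings, so it is avoided here.
    return crc32(key.encode("utf-8")) % num_partitions


def shuffle(partitions):
    """Redistribute (key, value) records so that equal keys share a partition.

    Returns the new partitions and the number of records that changed partition,
    which stands in for the data a real cluster would send over the network.
    """
    n = len(partitions)
    out = [[] for _ in range(n)]
    moved = 0
    for source, records in enumerate(partitions):
        for key, value in records:
            dest = target_partition(key, n)
            out[dest].append((key, value))
            if dest != source:
                moved += 1
    return out, moved


def sum_by_key(partitions):
    """A groupBy-style aggregation: shuffle first, then aggregate locally."""
    shuffled, moved = shuffle(partitions)
    totals = {}
    for records in shuffled:
        local = defaultdict(int)
        for key, value in records:
            local[key] += value
        # Safe to merge directly: after the shuffle each key lives in one partition.
        totals.update(local)
    return totals, moved


# Keys "a", "b", and "c" each start out split across two partitions.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]
totals, moved = sum_by_key(partitions)
print(totals, "records moved:", moved)
```

The `moved` counter is the part that maps to cost: every record counted there would have to be serialized, written out, and transferred between nodes in a real cluster. Since each key here starts in two partitions but can end in only one, at least one record per key must move regardless of the hash function.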