Intermediate Big Data & Data Engineering
Q44 / 100

What is a "shuffle" in distributed data processing, and why is it costly?

Correct Answer

A) A shuffle redistributes data across partitions or nodes (e.g., for joins or groupBy), which involves expensive network and disk I/O.

Explanation

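Operations like groupBy and join require every record with the same key to end up in the same partition, and repartition redistributes records explicitly. Because the input is usually spread across nodes, records must be serialized, sent over the network, and often written to disk along the way. This movement is the shuffle, and that I/O is why it is typically among the most expensive steps in a distributed job.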
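
The idea can be illustrated with a small single-process sketch in plain Python. It is not how any particular engine implements a shuffle: the function names shuffle_by_key and sum_by_key, the modulo partitioner, and the sample data are all assumptions made for illustration. It only counts how many records would have to leave their original partition before a per-key aggregation can run locally.

```python
from collections import defaultdict


def shuffle_by_key(partitions, num_partitions):
    """Redistribute (key, value) records so equal keys share a partition.

    Returns the new partitions and the number of records that changed partition.
    """
    shuffled = [[] for _ in range(num_partitions)]
    moved = 0
    for source_index, records in enumerate(partitions):
        for key, value in records:
            # Hypothetical partitioner: integer key modulo the partition count.
            target_index = key % num_partitions
            if target_index != source_index:
                # In a real cluster this record would be serialized and sent
                # to another node; here we only count it.
                moved += 1
            shuffled[target_index].append((key, value))
    return shuffled, moved


def sum_by_key(partition):
    """Aggregate values per key within one partition (no data movement)."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


# Three input partitions with the same keys scattered across all of them.
partitions = [
    [(0, 10), (1, 20), (2, 30)],
    [(0, 1), (1, 2), (2, 3)],
    [(0, 100), (1, 200), (2, 300)],
]

shuffled, moved = shuffle_by_key(partitions, num_partitions=3)
totals = [sum_by_key(p) for p in shuffled]

print(moved)   # 6 of the 9 records changed partition
print(totals)  # [{0: 111}, {1: 222}, {2: 333}]
```

In this toy input, two thirds of the records move before any summing can happen. On a real cluster each of those moves would mean network transfer and, commonly, intermediate disk writes, which is the cost the answer refers to.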
