What is "data skew" in a distributed join or aggregation, and why is it a problem?

Correct Answer

When data is unevenly distributed across partitions (e.g., one key has far more rows), causing some tasks to take much longer than others

Explanation

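Skew occurs when a few keys account for a disproportionate share of the rows. Because a join or aggregation routes every row with the same key to the same partition, the tasks or nodes handling those hot keys process far more data than the rest, and the stage cannot finish until these stragglers do. It can be mitigated with techniques like salting the hot keys or repartitioning.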
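
To make the salting idea concrete, here is a minimal plain-Python sketch rather than code for any particular engine. It assumes a simple hash partitioner and a synthetic dataset with one hot key; the names NUM_PARTITIONS, NUM_SALTS, partition_of, salted and count_by_key_with_salt are hypothetical and chosen only for illustration.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # assumed number of partitions/tasks
NUM_SALTS = 8       # hypothetical fan-out applied to every key


def partition_of(key: str) -> int:
    # Deterministic hash partitioner (crc32 rather than Python's randomized hash()).
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def partition_sizes(keys):
    # Count how many rows land in each partition.
    sizes = Counter(partition_of(k) for k in keys)
    return [sizes.get(p, 0) for p in range(NUM_PARTITIONS)]


def salted(keys):
    # Append a round-robin salt so each key is split into up to NUM_SALTS sub-keys.
    return [f"{k}#{i % NUM_SALTS}" for i, k in enumerate(keys)]


def count_by_key_with_salt(keys):
    # Stage 1: aggregate per salted key, so the hot key's work is spread out.
    partial = Counter(salted(keys))
    # Stage 2: strip the salt and combine the partial counts per original key.
    final = Counter()
    for salted_key, n in partial.items():
        original, _, _ = salted_key.rpartition("#")
        final[original] += n
    return final


# Synthetic skewed input: one hot key with 9,000 rows, 100 other keys with 10 rows each.
rows = ["hot"] * 9000 + [f"user{i}" for i in range(100) for _ in range(10)]

before = partition_sizes(rows)         # all "hot" rows share one partition
after = partition_sizes(salted(rows))  # "hot" rows spread over several partitions
counts = count_by_key_with_salt(rows)  # same totals as an unsalted count

print("rows per partition before salting:", before)
print("rows per partition after salting: ", after)
```

The trade-off in this sketch is the extra second stage: salting evens out the per-task load but requires combining partial results afterwards. For a join, the other side would also need to be replicated once per salt value so that every salted key still finds its match.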
