What is Cloud Dataflow?

Why Interviewers Ask This

Mid-level Google Cloud Platform (GCP) roles expect working knowledge of Dataflow, not just name recognition. Interviewers ask this to separate candidates who understand how a pipeline is built and executed from those who can only recite a definition.

Answer

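Cloud Dataflow is a fully managed, serverless service for stream and batch data processing that runs Apache Beam pipelines. You write a pipeline with an Apache Beam SDK (Java, Python, or Go), and Dataflow handles distributed execution, autoscaling, and fault tolerance.

Key concepts:

Pipeline: the entire data processing graph, from sources to sinks.
PCollection: a distributed dataset, either bounded (batch) or unbounded (streaming).
Transforms: operations on PCollections, such as ParDo, GroupByKey, Combine, and Flatten.
Windowing: divides an unbounded stream into finite windows by event time. Beam provides fixed (tumbling), sliding, and session windows.
Watermarks: the system's estimate of how far event time has progressed. Data that arrives behind the watermark is late, and triggers together with allowed lateness decide whether it is still processed.

Common patterns: ETL from Cloud Storage to BigQuery, real-time fraud detection on events from Pub/Sub, and log processing.

Because Beam uses one programming model for both modes, the same transform logic can run in batch on historical data or in streaming on real-time data. In practice you usually swap the source and add windowing for the unbounded case rather than rewrite the pipeline.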
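To make windowing and watermarks concrete, here is a minimal sketch in plain Python. It is not the Apache Beam API: the names window_start, count_per_window, WINDOW_SIZE, and max_delay are hypothetical, and the watermark is a simple heuristic (largest event time seen so far, minus a delay allowance). In Dataflow the watermark is estimated for you, and late data is governed by the triggers and allowed lateness you configure.

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds; hypothetical fixed-window width


def window_start(event_time, size=WINDOW_SIZE):
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % size)


def count_per_window(events, size=WINDOW_SIZE, max_delay=0):
    """Count events per (key, window start) and collect late events.

    events: iterable of (key, event_time) pairs in arrival order.
    Assumption: the watermark is the largest event time seen so far
    minus max_delay. An event is late if its window had already closed
    (window end <= watermark) by the time the event arrived.
    """
    counts = defaultdict(int)
    late = []
    watermark = float("-inf")
    for key, event_time in events:
        start = window_start(event_time, size)
        if start + size <= watermark:
            late.append((key, event_time))
        else:
            counts[(key, start)] += 1
        watermark = max(watermark, event_time - max_delay)
    return dict(counts), late


# Events in arrival order; the last one arrives long after its window closed.
events = [
    ("card_1", 5),
    ("card_1", 10),
    ("card_2", 30),
    ("card_1", 61),
    ("card_1", 125),
    ("card_1", 20),
]

counts, late = count_per_window(events)
# Tolerating 120 seconds of delay keeps the straggler in its window.
tolerant_counts, tolerant_late = count_per_window(events, max_delay=120)
```

With no delay allowance, the final event (event time 20) is reported as late because the watermark has already passed the end of its window. With max_delay=120 the same event is still counted in the window starting at 0. This is the trade-off interviewers tend to probe: waiting longer gives more complete results, but each window's output is emitted later.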

Common Mistake

Candidates often recite a textbook definition and stop there. Interviewers are more impressed when you relate the concept to a specific problem you solved in a real Google Cloud Platform (GCP) project, including why you chose Dataflow and how you handled windowing or late data.