Data Engineering

Apache Beam & Dataflow

PCollections, transforms (ParDo, GroupByKey), windowing, triggers, watermarks, Dataflow runner, autoscaling, templates

20 interview questions · Senior
1

What is a PCollection in Apache Beam?

Answer

A PCollection is the primary data abstraction in Apache Beam. It represents a distributed, potentially unbounded dataset that can be processed in parallel. Unlike regular collections, a PCollection is immutable, meaning each transform creates a new PCollection rather than modifying the original.
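As a rough illustration of the immutability point, here is a plain-Python toy model. It is not the Beam SDK: ToyCollection and its apply method are hypothetical names invented for this sketch. It only shows that a transform returns a new collection and leaves its input untouched, and it does not attempt to model distribution or unbounded data.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class ToyCollection:
    """Hypothetical stand-in for a PCollection: an immutable bag of elements."""

    elements: Tuple[int, ...]

    def apply(self, fn: Callable[[int], Iterable[int]]) -> "ToyCollection":
        # Build a brand-new collection; self.elements is never modified.
        out = []
        for element in self.elements:
            out.extend(fn(element))
        return ToyCollection(tuple(out))


numbers = ToyCollection((1, 2, 3))
doubled = numbers.apply(lambda x: [x * 2])

print(numbers.elements)  # the input collection is unchanged
print(doubled.elements)  # the transform produced a separate collection
```

Because the dataclass is frozen, an attempt to reassign numbers.elements raises an error, which loosely mirrors the idea that the only way to "change" a PCollection is to derive a new one from it.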

2

What is the main difference between a bounded and unbounded PCollection?

Answer

A bounded PCollection represents a finite dataset (such as a file or a table), while an unbounded one represents a potentially infinite stream (such as streaming events). This distinction affects how Beam processes the data: a bounded PCollection can be processed as a classic batch job, while aggregations over an unbounded PCollection (for example GroupByKey) need a non-global window or a trigger so that results can be emitted while data keeps arriving.
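To make the role of windowing concrete, the sketch below counts events per fixed window over an endless source. It is a plain-Python toy, not the Beam API: windowed_counts is a hypothetical helper, and it assumes timestamps arrive in order, which real streams do not guarantee (Beam relies on watermarks and triggers to deal with late and out-of-order data).

```python
import itertools
from typing import Iterable, Iterator, Tuple


def windowed_counts(timestamps: Iterable[int], size: int) -> Iterator[Tuple[int, int]]:
    """Yield (window_start, count) each time the in-order stream leaves a window."""
    current_start = None
    count = 0
    for ts in timestamps:
        start = ts - (ts % size)  # start of the fixed window containing ts
        if current_start is not None and start != current_start:
            # The stream moved on to a later window, so the previous one is complete.
            yield current_start, count
            count = 0
        current_start = start
        count += 1
    # Only reached for a bounded input: flush the last open window.
    if current_start is not None:
        yield current_start, count


# Unbounded source: one event every 25 seconds, forever.
endless = itertools.count(start=0, step=25)
first_three = list(itertools.islice(windowed_counts(endless, size=60), 3))
print(first_three)

# Bounded source: the same function simply runs to completion.
print(list(windowed_counts([1, 12, 59, 60, 125], size=60)))
```

A single global count over the endless source would never finish. Splitting the stream into 60-second windows gives each aggregation a finite scope, so results can be emitted while data is still arriving.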

3

What is the role of the ParDo transform in Apache Beam?

Answer

ParDo (Parallel Do) is the most flexible transform in Apache Beam. It applies a user-defined function (DoFn) to each element of a PCollection in parallel. ParDo can produce zero, one, or multiple output elements for each input element, making it suitable for filtering, mapping, and flat-mapping.
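The zero, one, or many outputs behaviour can be sketched with a single-process toy in plain Python. This is not the Beam SDK: par_do, keep_long, to_upper, and split_words are hypothetical names for this example, and nothing here runs in parallel. In the Beam Python SDK the corresponding shape is a beam.DoFn subclass whose process method yields outputs, applied with beam.ParDo.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def par_do(elements: Iterable[T], do_fn: Callable[[T], Iterable[U]]) -> List[U]:
    """Toy stand-in for ParDo: call do_fn once per element and collect all outputs."""
    outputs: List[U] = []
    for element in elements:
        outputs.extend(do_fn(element))
    return outputs


def keep_long(line: str) -> Iterator[str]:
    # Filtering: zero or one output per input.
    if len(line) > 5:
        yield line


def to_upper(line: str) -> Iterator[str]:
    # Mapping: exactly one output per input.
    yield line.upper()


def split_words(line: str) -> Iterator[str]:
    # Flat-mapping: any number of outputs per input.
    yield from line.split()


lines = ["hello beam", "hi", "data flow rocks"]
filtered = par_do(lines, keep_long)
mapped = par_do(lines, to_upper)
words = par_do(lines, split_words)

print(filtered)
print(mapped)
print(words)
```

Each function sees one element at a time and keeps no shared state, which is the property that lets a real runner spread the calls across many workers.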

4

How do you use side inputs in a ParDo transform?

5

What is the difference between GroupByKey and CoGroupByKey in Apache Beam?
