Data Engineering

PySpark - Large-Scale Processing

SparkSession, RDD vs DataFrame, transformations, actions, partitioning, broadcast variables, UDFs, Spark SQL, caching

20 interview questions · Senior
1

What is the main entry point for creating a PySpark application?

Answer

SparkSession is the unified entry point introduced in Spark 2.0. It consolidates the older SQLContext and HiveContext into a single object and wraps the SparkContext, which remains available as spark.sparkContext for RDD-level work. Through SparkSession you create DataFrames, execute SQL queries, and configure the Spark application in one place.

2

What is the fundamental difference between an RDD and a DataFrame in PySpark?

Answer

A DataFrame has a schema with named, typed columns, which lets Spark optimize queries through the Catalyst optimizer. An RDD is a distributed collection of arbitrary objects whose internal structure is opaque to Spark, so the engine cannot apply those optimizations to it.

3

What is the difference between a transformation and an action in PySpark?

Answer

Transformations are lazily evaluated: they only extend the execution plan and trigger no computation. Actions trigger execution of the plan on the cluster and either return a result to the driver or write output to external storage. Because nothing runs until an action is called, Spark can optimize the whole plan before executing it.

4

Among the following operations, which one is a PySpark action?

5

How do you create a DataFrame from a Parquet file in PySpark?
