Question 1

What is the main entry point for creating a PySpark application?

Accepted Answer

SparkSession is the unified entry point introduced in Spark 2.0. It consolidates the older SQLContext and HiveContext into a single object and wraps the underlying SparkContext, which remains accessible through spark.sparkContext. Through a SparkSession you create DataFrames, run SQL queries, and configure the Spark application in one place.

Question 2

What is the fundamental difference between an RDD and a DataFrame in PySpark?

Accepted Answer

A DataFrame has a schema of named, typed columns, which lets Spark optimize queries through the Catalyst optimizer. An RDD is a distributed collection of arbitrary objects whose internal structure is opaque to Spark, so Spark cannot apply the same optimizations.

Question 3

What is the difference between a transformation and an action in PySpark?

Accepted Answer

Transformations are lazily evaluated: they extend an execution plan without triggering computation. Actions trigger execution of the plan on the cluster and either return a result to the driver or write output to external storage. This distinction lets Spark optimize the whole plan before running it.

PySpark - Large-Scale Processing

Among the following operations, which one is a PySpark action?

How to create a DataFrame from a Parquet file in PySpark?
