
PySpark - Large-Scale Processing
SparkSession, RDD vs DataFrame, transformations, actions, partitioning, broadcast variables, UDFs, Spark SQL, caching
1. What is the main entry point for creating a PySpark application?
Answer
SparkSession is the unified entry point introduced in Spark 2.0. It consolidates the older SQLContext and HiveContext into a single object and wraps the SparkContext, which remains accessible as spark.sparkContext. Through SparkSession you create DataFrames, run SQL queries, and configure the Spark application in one place.
2. What is the fundamental difference between an RDD and a DataFrame in PySpark?
Answer
A DataFrame has a schema of named, typed columns, which lets Spark optimize queries through the Catalyst optimizer. An RDD is a distributed collection of arbitrary objects with no schema: Spark cannot see inside the records or the functions applied to them, which limits the optimizations it can perform.
3. What is the difference between a transformation and an action in PySpark?
Answer
Transformations are lazily evaluated: they build an execution plan without triggering computation. Actions trigger execution of that plan on the cluster and either return a result to the driver or write output to external storage. This separation lets Spark optimize the whole plan before running it.
Among the following operations, which one is a PySpark action?
How do you create a DataFrame from a Parquet file in PySpark?