Apache Spark 4 đánh dấu một bước tiến quan trọng trong hệ sinh thái xử lý dữ liệu phân tán, mang đến nhiều cải tiến về hiệu năng, khả năng tương thích và trải nghiệm người dùng. Phiên bản này tập trung vào việc tăng cường độ tin cậy thông qua ANSI SQL compliance, mở rộng khả năng xử lý dữ liệu bán cấu trúc với VARIANT data type, và giảm độ trễ trong streaming xuống mức millisecond với Real-Time Mode.

Đối với các data engineer và kỹ sư big data, việc nắm vững Spark 4 không chỉ giúp tối ưu hóa pipeline hiện tại mà còn mở ra nhiều cơ hội nghề nghiệp trong các dự án quy mô lớn. Các tính năng mới như Spark Connect và transformWithState API thay đổi cách thiết kế ứng dụng streaming, trong khi Spark Declarative Pipelines đơn giản hóa việc quản lý ETL pipeline phức tạp.

Bài viết này phân tích chi tiết các tính năng quan trọng nhất của Spark 4, kèm theo ví dụ code thực tế và câu hỏi phỏng vấn thường gặp. Mỗi phần được thiết kế để cung cấp kiến thức sâu về cả lý thuyết và thực hành, giúp người đọc có thể áp dụng ngay vào dự án thực tế hoặc chuẩn bị cho các buổi phỏng vấn kỹ thuật.

Lịch Phát Hành Spark 4

Apache Spark 4.0.0 được phát hành chính thức vào tháng 1 năm 2026 sau hai năm phát triển với hơn 3,000 commits từ cộng đồng toàn cầu. Phiên bản này yêu cầu Java 17 trở lên và tương thích với Python 3.9+, Scala 2.13 và R 4.0+.

ANSI SQL Mode: Tăng Cường Độ Tin Cậy và Tuân Thủ Chuẩn

Spark 4 mặc định bật ANSI SQL mode, đánh dấu sự chuyển đổi quan trọng từ chế độ "lenient" sang "strict" trong xử lý lỗi. Thay vì trả về NULL khi gặp lỗi chuyển đổi kiểu dữ liệu hoặc tràn số, engine giờ đây throw exception ngay lập tức, giúp phát hiện bug sớm hơn trong quá trình phát triển.

Thay đổi này đặc biệt quan trọng trong môi trường production, nơi dữ liệu hỏng có thể dẫn đến kết quả phân tích sai lệch nghiêm trọng. ANSI mode đảm bảo rằng các phép toán số học, chuyển đổi kiểu dữ liệu và so sánh string tuân theo chuẩn SQL-92, tăng khả năng tương thích với các hệ thống database truyền thống như PostgreSQL và Oracle.

Việc migrate code từ Spark 3.x sang ANSI mode đòi hỏi xử lý tường minh các trường hợp lỗi tiềm ẩn. Spark 4 cung cấp họ hàm TRY_* như TRY_CAST, TRY_DIVIDE, TRY_ADD để xử lý exception một cách graceful mà không làm crash job.

python

# ansi_mode_migration.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, try_cast

spark = SparkSession.builder.appName("ANSIMigration").getOrCreate()

# This now throws ArithmeticException in Spark 4 ANSI mode
# SELECT CAST('abc' AS INT)  -- runtime error

# Safe migration: use TRY_CAST to preserve NULL behavior
df = spark.sql("""
    SELECT
        TRY_CAST(raw_value AS INT) AS parsed_value,
        TRY_CAST(amount AS DECIMAL(10,2)) AS safe_amount
    FROM raw_events
    WHERE TRY_CAST(raw_value AS INT) IS NOT NULL
""")

# Alternatively, wrap arithmetic in try_divide / try_add
result = df.withColumn(
    "ratio",
    col("safe_amount") / col("parsed_value")  # throws on zero
)

ANSI mode cũng thay đổi hành vi của string comparison và aggregate functions. Các phép so sánh string giờ đây case-sensitive theo mặc định, và các hàm aggregate như SUM, AVG không còn skip NULL values một cách ngầm định. Điều này yêu cầu review lại logic business trong các query phức tạp.

VARIANT Data Type: Xử Lý JSON Linh Hoạt và Hiệu Quả

VARIANT là kiểu dữ liệu mới cho phép lưu trữ và query JSON semi-structured data trực tiếp trong Spark SQL mà không cần định nghĩa schema cứng nhắc trước. Khác với STRING column chứa JSON text, VARIANT sử dụng binary format tối ưu với metadata shredding, cho phép truy xuất nested fields nhanh hơn đến 10 lần.

Kiến trúc bên trong VARIANT sử dụng columnar encoding cho các primitive types phổ biến và dictionary encoding cho các string lặp lại, giảm storage footprint đáng kể so với text-based JSON. Engine query optimizer có thể pushdown filters và projections xuống VARIANT columns, tận dụng statistics để skip data không liên quan.

Điều này đặc biệt hữu ích cho các use case như event tracking, IoT sensor data và application logging, nơi schema thường xuyên thay đổi hoặc không đồng nhất giữa các sources. VARIANT cho phép ingest data ngay lập tức mà không cần schema evolution planning.

VariantExample.scalascala

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("VariantDemo").getOrCreate()
import spark.implicits._

// Create a table with VARIANT column
spark.sql("""
    CREATE TABLE events (
        event_id BIGINT,
        timestamp TIMESTAMP,
        payload VARIANT
    ) USING PARQUET
""")

// Insert JSON data — automatically converted to VARIANT
spark.sql("""
    INSERT INTO events VALUES
    (1, current_timestamp(),
     PARSE_JSON('{"user_id": 42, "action": "click", "metadata": {"page": "/home", "duration_ms": 320}}')),
    (2, current_timestamp(),
     PARSE_JSON('{"user_id": 87, "action": "purchase", "metadata": {"item_id": "SKU-9001", "amount": 49.99}}'))
""")

// Query nested fields with dot notation — uses shredding for fast access
val clicks = spark.sql("""
    SELECT
        event_id,
        payload:user_id::INT AS user_id,
        payload:action::STRING AS action,
        payload:metadata.page::STRING AS page
    FROM events
    WHERE payload:action::STRING = 'click'
""")
clicks.show()

VARIANT tương thích với các file formats như Parquet, ORC và Delta Lake, cho phép write/read hiệu quả trong data lake architecture. Khi combine với Delta Lake's schema evolution, VARIANT trở thành giải pháp lý tưởng cho Bronze layer trong medallion architecture.

Real-Time Mode: Streaming với Độ Trễ Millisecond

Real-Time Mode (RTM) là tính năng breakthrough trong Spark Structured Streaming, giảm end-to-end latency từ mức seconds xuống sub-second range. Khác với micro-batch mode truyền thống chạy triggers định kỳ, RTM sử dụng continuous processing engine xử lý records ngay khi chúng arrive.

Kiến trúc RTM dựa trên asynchronous task execution và zero-copy data transfer giữa sources và sinks. Engine duy trì low-watermark tracking để đảm bảo exactly-once semantics trong khi giảm thiểu checkpoint overhead. Điều này đạt được thông qua incremental state snapshots thay vì full checkpoint mỗi batch.

Use cases điển hình cho RTM bao gồm fraud detection, stock trading alerts, real-time recommendation systems và IoT monitoring. Trong các scenario này, latency reduction từ 5 seconds xuống 200 milliseconds có thể tạo ra business impact đáng kể.

python

# real_time_streaming.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder \
    .appName("RTMDemo") \
    .config("spark.sql.streaming.mode", "realtime") \
    .getOrCreate()

schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("event_time", TimestampType())
])

# Read from Kafka with Real-Time Mode enabled
raw_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "sensor-readings") \
    .option("startingOffsets", "latest") \
    .load()

parsed = raw_stream \
    .select(from_json(col("value").cast("string"), schema).alias("data")) \
    .select("data.*")

# Write to Kafka sink — sub-second end-to-end
query = parsed \
    .filter(col("temperature") > 80.0) \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("topic", "alerts") \
    .option("checkpointLocation", "/tmp/checkpoints/alerts") \
    .start()

query.awaitTermination()

RTM yêu cầu tuning cẩn thận về resource allocation và backpressure handling. Spark 4 cung cấp adaptive execution cho streaming, tự động điều chỉnh parallelism dựa trên input rate và processing capacity. Monitoring dashboard được nâng cấp với real-time metrics về processing lag và throughput.

Spark Connect: Thin Client Architecture

Spark Connect giới thiệu client-server architecture mới, tách biệt hoàn toàn driver application khỏi Spark cluster. Client chỉ cần lightweight Python/Scala library để gửi logical plans qua gRPC, không còn yêu cầu JVM instance trên client machine.

Kiến trúc này giải quyết pain points lớn trong deployment và scaling. Developer có thể chạy Spark code từ Jupyter notebook, IDE hoặc serverless functions mà không lo về version conflicts, classpath management hay memory overhead. Spark cluster trở thành shared service có thể phục vụ hàng trăm concurrent clients.

Về mặt security, Spark Connect hỗ trợ authentication qua OAuth2, mutual TLS và integration với enterprise identity providers như LDAP và Active Directory. Network traffic được encrypt end-to-end, và access control được enforce tại cluster level thay vì client level.

python

# spark_connect_client.py
from pyspark.sql import SparkSession

# Connect to remote Spark cluster — no local JVM needed
spark = SparkSession.builder \
    .remote("sc://spark-cluster.internal:15002") \
    .getOrCreate()

# Full DataFrame API available through thin client
df = spark.read.parquet("s3a://data-lake/events/2026/")

aggregated = df.groupBy("country", "event_type") \
    .count() \
    .orderBy("count", ascending=False)

aggregated.show(20)

Spark Connect đặc biệt hữu ích trong môi trường Kubernetes, nơi pod restarts và scaling events thường xuyên xảy ra. Client reconnection được xử lý tự động với retry logic và circuit breaker patterns, đảm bảo fault tolerance mà không cần application code phải lo về connection management.

Sẵn sàng chinh phục phỏng vấn Data Engineering?

Luyện tập với mô phỏng tương tác, flashcards và bài kiểm tra kỹ thuật.

Khám phá Data Engineering

transformWithState API: Stateful Streaming Nâng Cao

transformWithState là API mới thay thế flatMapGroupsWithState, cung cấp interface trực quan hơn và hiệu năng cao hơn cho stateful streaming operations. API này cho phép maintain arbitrary state per key với TTL (time-to-live) configuration, tự động cleanup expired state để tránh memory leaks.

Khác với mapGroupsWithState chỉ hỗ trợ single value state, transformWithState cho phép manage multiple state variables với different TTL policies. Engine sử dụng RocksDB làm state backend mặc định, hỗ trợ state size lớn hơn available memory thông qua disk spilling.

API design mới tách biệt rõ ràng giữa state initialization, input processing và timer handling. Điều này làm cho code dễ test hơn và cho phép optimizer áp dụng các optimization techniques như state schema pruning và incremental checkpointing.

SessionTracker.scalascala

import org.apache.spark.sql.streaming._
import org.apache.spark.sql.streaming.StatefulProcessor
import java.time.Duration

case class UserEvent(userId: String, action: String, timestamp: Long)
case class SessionSummary(userId: String, actionCount: Int, durationMs: Long)

class SessionProcessor extends StatefulProcessor[String, UserEvent, SessionSummary] {

  // ValueState with 30-minute TTL — auto-cleanup of idle sessions
  @transient private var sessionStart: ValueState[Long] = _
  @transient private var actionCount: ValueState[Int] = _

  override def init(outputMode: OutputMode, timeMode: TimeMode): Unit = {
    sessionStart = getHandle.getValueState[Long]("start", TTLConfig(Duration.ofMinutes(30)))
    actionCount = getHandle.getValueState[Int]("count", TTLConfig(Duration.ofMinutes(30)))
  }

  override def handleInputRows(
      key: String,
      rows: Iterator[UserEvent],
      timerValues: TimerValues
  ): Iterator[SessionSummary] = {
    rows.flatMap { event =>
      if (!sessionStart.exists()) {
        sessionStart.update(event.timestamp)
        actionCount.update(1)
        Iterator.empty
      } else {
        val count = actionCount.get() + 1
        actionCount.update(count)
        // Emit summary every 10 actions
        if (count % 10 == 0) {
          val duration = event.timestamp - sessionStart.get()
          Iterator(SessionSummary(key, count, duration))
        } else Iterator.empty
      }
    }
  }
}

transformWithState tương thích với all output modes (append, update, complete) và hỗ trợ watermark-based late data handling. Khi combine với Real-Time Mode, API này cho phép build complex event processing systems với độ trễ thấp và state management đáng tin cậy.

Spark Declarative Pipelines: ETL Pipeline Đơn Giản Hóa

Spark Declarative Pipelines (SDP) là framework mới cho phép định nghĩa data pipelines dưới dạng declarative dependencies thay vì imperative code. Developer chỉ cần khai báo input, output và transformation logic, Spark tự động resolve dependencies, optimize execution plan và handle error recovery.

Framework này lấy cảm hứng từ dbt và Apache Airflow nhưng tận dụng Spark's distributed computing capabilities. SDP tự động parallelize independent tasks, reuse intermediate results và skip unchanged computations dựa trên data lineage tracking.

Một lợi ích quan trọng là testability và maintainability. Mỗi table definition là một pure function có thể unit test độc lập. Pipeline evolution trở nên dễ dàng hơn vì dependency graph được infer tự động, giảm thiểu breaking changes khi thêm hoặc modify tables.

python

# declarative_pipeline.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as _sum, current_date

spark = SparkSession.builder.appName("SDP-Demo").getOrCreate()

# Define datasets declaratively — Spark resolves dependencies
@spark.declarative.table("clean_orders")
def clean_orders():
    return spark.read.table("raw_orders") \
        .filter(col("status") != "cancelled") \
        .filter(col("amount") > 0)

@spark.declarative.table("daily_revenue")
def daily_revenue():
    return spark.read.table("clean_orders") \
        .groupBy("order_date", "region") \
        .agg(_sum("amount").alias("total_revenue"))

@spark.declarative.table("revenue_summary")
def revenue_summary():
    return spark.read.table("daily_revenue") \
        .groupBy("region") \
        .agg(_sum("total_revenue").alias("grand_total")) \
        .orderBy("grand_total", ascending=False)

# Execute — Spark builds the DAG, parallelizes, handles retries
spark.declarative.run()

SDP tích hợp với Delta Lake để cung cấp incremental processing tự động. Framework track table versions và chỉ process changed partitions, giảm compute cost đáng kể cho large datasets. Monitoring UI mới visualize pipeline DAG với real-time execution status và bottleneck identification.

Câu Hỏi Phỏng Vấn Apache Spark 4

Câu 1: ANSI SQL mode trong Spark 4 ảnh hưởng như thế nào đến backward compatibility với Spark 3.x, và làm sao để migrate code an toàn?

ANSI mode thay đổi hành vi mặc định từ lenient sang strict error handling. Các operations như CAST, arithmetic và string comparison giờ đây throw exceptions thay vì return NULL hoặc silent overflow. Để migrate an toàn: (1) Enable ANSI mode trong test environment trước, (2) Sử dụng TRY_* functions (TRY_CAST, TRY_DIVIDE) để preserve NULL behavior, (3) Add explicit WHERE clauses để filter invalid data upstream, (4) Review aggregate functions vì NULL handling behavior đã thay đổi.

Migration strategy nên bao gồm comprehensive testing với production-like data volume. Sử dụng Spark's query execution listener để capture failed queries và analyze patterns. Một số edge cases như date parsing và timezone conversion cũng cần special attention vì ANSI mode strict hơn về format validation.

Câu 2: So sánh performance giữa VARIANT data type và nested struct columns trong Spark SQL. Khi nào nên dùng VARIANT?

VARIANT sử dụng binary encoding với metadata shredding, cho phép fast field access mà không cần parse entire JSON. Performance tương đương hoặc tốt hơn struct khi: (1) Schema không cố định hoặc thường xuyên evolve, (2) Chỉ access một phần nhỏ của nested fields, (3) Data có high cardinality trong field names. Struct columns tốt hơn khi: (1) Schema stable và well-defined, (2) Cần strongly-typed operations, (3) Sử dụng complex nested aggregations. VARIANT phù hợp cho Bronze/Silver layers, struct cho Gold layer sau khi schema đã stabilize.

Benchmarks cho thấy VARIANT queries có thể nhanh hơn 5-10x so với JSON string parsing, nhưng vẫn chậm hơn 2-3x so với native struct columns trong optimal cases. Trade-off chính là giữa flexibility và performance.

Câu 3: Real-Time Mode streaming đạt được sub-second latency như thế nào? Có trade-offs gì so với micro-batch mode?

Real-Time Mode sử dụng continuous processing engine với asynchronous task execution và incremental checkpointing. Latency reduction đạt được thông qua: (1) Eliminate batch scheduling overhead, (2) Zero-copy data transfer giữa operators, (3) Adaptive task parallelism dựa trên input rate. Trade-offs: (1) Higher CPU utilization do constant processing, (2) Phức tạp hơn trong fault recovery scenarios, (3) Một số operations như windowing aggregations có limitations, (4) Yêu cầu tuning cẩn thận về backpressure và resource allocation. RTM phù hợp cho use cases cần low latency hơn high throughput.

Production deployment RTM nên monitor metrics như processing lag, task failure rate và checkpoint duration. Có thể cần increase executor memory và tune garbage collection để maintain stable performance dưới high load.

Câu 4: Spark Connect architecture khác gì so với traditional Spark client mode? Security implications là gì?

Spark Connect tách driver thành thin client và remote server. Client gửi logical plans qua gRPC, server execute trên cluster. Benefits: (1) No JVM required on client, (2) Multiple concurrent clients share cluster resources, (3) Client crashes không ảnh hưởng running jobs, (4) Easier version management và dependency isolation. Security: (1) Authentication qua OAuth2/TLS, (2) Network traffic encrypted end-to-end, (3) Fine-grained access control tại server, (4) Audit logging cho all operations. Phù hợp cho notebook environments, serverless functions và multi-tenant platforms.

Implementation considerations bao gồm network latency between client và server, retry logic cho transient failures, và proper resource quotas để prevent single client monopolizing cluster.

Câu 5: transformWithState API cải thiện gì so với flatMapGroupsWithState? Khi nào nên migrate?

transformWithState cung cấp: (1) Multiple state variables với independent TTL, (2) Cleaner API separation giữa init/process/timer logic, (3) Better performance với incremental checkpointing, (4) Native support cho state schema evolution. Migration cần thiết khi: (1) Cần manage complex state với multiple variables, (2) State size grows indefinitely cần TTL-based cleanup, (3) Gặp performance issues với current state backend, (4) Code quá phức tạp với flatMapGroupsWithState. Migration path: (1) Implement StatefulProcessor interface, (2) Define state variables với TTL configs, (3) Test thoroughly vì state format incompatible, (4) Plan for state re-initialization during deployment.

Production experience cho thấy transformWithState giảm checkpoint overhead lên đến 40% trong large state scenarios, đặc biệt khi combine với RocksDB state backend và proper tuning.

Pipeline ETL Hoàn Chỉnh với Apache Spark 4

Ví dụ sau minh họa một production-grade ETL pipeline tận dụng các tính năng mới của Spark 4. Pipeline này implement medallion architecture (Bronze-Silver-Gold) với ANSI mode, VARIANT data type và streaming capabilities.

python

# spark4_etl_pipeline.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, current_timestamp, parse_json

spark = SparkSession.builder \
    .appName("Spark4ETL") \
    .config("spark.sql.ansi.enabled", "true") \
    .getOrCreate()

# Bronze: raw ingestion from Kafka
bronze = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "user-events") \
    .load() \
    .select(
        col("key").cast("string").alias("event_key"),
        parse_json(col("value").cast("string")).alias("payload"),
        col("timestamp").alias("kafka_ts")
    )

# Silver: extract typed fields from VARIANT
silver = bronze.select(
    col("event_key"),
    col("payload:user_id").cast("int").alias("user_id"),
    col("payload:action").cast("string").alias("action"),
    col("payload:metadata.session_id").cast("string").alias("session_id"),
    col("kafka_ts"),
    current_timestamp().alias("processed_at")
).filter(
    col("user_id").isNotNull()  # ANSI mode makes this explicit
)

# Gold: write to Delta Lake with schema enforcement
query = silver.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3a://checkpoints/user-events/") \
    .option("mergeSchema", "true") \
    .toTable("gold.user_events")

query.awaitTermination()

Pipeline này demonstate best practices: (1) ANSI mode enforcement để catch data quality issues sớm, (2) VARIANT cho flexible ingestion trong Bronze layer, (3) Explicit type casting và null filtering trong Silver layer, (4) Delta Lake integration với schema evolution support, (5) Proper checkpoint location để đảm bảo exactly-once processing.

Production deployment nên add monitoring cho checkpoint size, processing lag và data quality metrics. Alerting rules cần setup cho failed batches, schema conflicts và downstream dependencies.

Kết Luận

Apache Spark 4 mang đến những cải tiến đáng kể về reliability, performance và developer experience. Các tính năng chính bao gồm:

ANSI SQL Mode: Tăng cường data quality thông qua strict error handling và SQL standard compliance, yêu cầu explicit error handling với TRY_* functions
VARIANT Data Type: Xử lý JSON semi-structured data hiệu quả với binary encoding và metadata shredding, ideal cho Bronze/Silver layers
Real-Time Mode Streaming: Đạt sub-second latency thông qua continuous processing và incremental checkpointing, phù hợp cho low-latency use cases
Spark Connect: Thin client architecture tách biệt driver khỏi cluster, enable multi-tenant deployments và serverless integration
transformWithState API: Stateful streaming operations với multiple state variables và TTL-based cleanup, thay thế flatMapGroupsWithState
Declarative Pipelines: ETL pipeline management với automatic dependency resolution và incremental processing

Việc upgrade lên Spark 4 đòi hỏi planning cẩn thận, đặc biệt với ANSI mode migration. Các organization nên bắt đầu với non-critical workloads, build comprehensive test suites và monitor performance metrics trong production. ROI từ improved reliability và reduced latency thường justify migration effort trong 3-6 tháng.

Đối với data engineers đang chuẩn bị phỏng vấn, focus vào understanding trade-offs giữa các options, hands-on experience với streaming patterns và ability để design end-to-end pipelines. Spark 4 knowledge đang trở thành requirement trong nhiều job descriptions cho senior data engineering positions.

Bắt đầu luyện tập!

Kiểm tra kiến thức với mô phỏng phỏng vấn và bài kiểm tra kỹ thuật.

Tạo tài khoản miễn phí

Apache Spark 4: Tính Năng Mới và Structured Streaming - Hướng Dẫn Phỏng Vấn

ANSI SQL Mode: Tăng Cường Độ Tin Cậy và Tuân Thủ Chuẩn

VARIANT Data Type: Xử Lý JSON Linh Hoạt và Hiệu Quả

Real-Time Mode: Streaming với Độ Trễ Millisecond

Spark Connect: Thin Client Architecture

Sẵn sàng chinh phục phỏng vấn Data Engineering?

transformWithState API: Stateful Streaming Nâng Cao

Spark Declarative Pipelines: ETL Pipeline Đơn Giản Hóa

Câu Hỏi Phỏng Vấn Apache Spark 4

Pipeline ETL Hoàn Chỉnh với Apache Spark 4

Kết Luận

Bắt đầu luyện tập!

Bài viết liên quan

Apache Kafka cho Kỹ sư Dữ liệu: Streaming, Partitions và Câu hỏi Phỏng vấn

ETL vs ELT năm 2026: Kiến trúc Data Pipeline và So sánh Chi tiết

Apache Spark với Python: Xây dựng Data Pipeline từng bước