In 2026, most data science work still relies on three core Python libraries: NumPy for numerical computation, Pandas for tabular data manipulation, and Scikit-Learn for machine learning. This article walks through each library with working code examples, from loading raw data, cleaning it, and engineering features to training a model and putting it into production.

Versions Used in This Article

All code runs on Python 3.12+, NumPy 2.1, Pandas 2.2, and Scikit-Learn 1.6. Each library's API is stable across minor versions, but the performance improvements in NumPy 2.x make upgrading well worth it when processing large arrays.
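
Before running the examples, it can help to confirm which versions are actually installed. The snippet below is a minimal sketch, assuming the three libraries are already installed in the active environment; the file name `check_versions.py` is only illustrative.

```python
# check_versions.py
import sys

import numpy as np
import pandas as pd
import sklearn

# Collect the interpreter and library versions in one place
versions = {
    "python": sys.version.split()[0],
    "numpy": np.__version__,
    "pandas": pd.__version__,
    "scikit-learn": sklearn.__version__,
}

for name, version in versions.items():
    print(f"{name:>12}: {version}")
```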

NumPy Array Operations for Efficient Computation

NumPy replaces Python lists with n-dimensional arrays stored in a contiguous memory block. The performance difference is substantial: on 1 million elements, a vectorized NumPy operation is typically 50-100x faster than an equivalent plain Python loop, although the exact factor depends on the operation and the hardware. Pandas DataFrames and Scikit-Learn models build on NumPy arrays internally.

python

# numpy_basics.py
import numpy as np

# Create arrays from different sources
prices = np.array([29.99, 49.99, 19.99, 99.99, 39.99])
quantities = np.arange(1, 6)  # [1, 2, 3, 4, 5]

# Vectorized arithmetic — no loops needed
revenue = prices * quantities
print(revenue)  # [29.99, 99.98, 59.97, 399.96, 199.95]

# Statistical aggregations
print(f"Total revenue: ${revenue.sum():.2f}")   # $789.85
print(f"Mean price: ${prices.mean():.2f}")       # $47.99
print(f"Std deviation: ${prices.std():.2f}")     # $27.86 (population std, ddof=0)

# Boolean indexing — filter without loops
premium_mask = prices > 40
premium_items = prices[premium_mask]  # [49.99, 99.99]

Boolean indexing is one of the most common patterns in real data science code. Instead of writing an if statement inside a loop, a boolean mask selects the elements that match a condition in a single operation. The approach is not only faster but also easier to read when applied to datasets with millions of rows.
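
To make the comparison with a loop concrete, the sketch below filters the same array both ways and checks that the results match. The random price data and the 40/100 thresholds are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)          # fixed seed so the run is repeatable
prices = rng.uniform(5, 150, size=100_000)   # hypothetical price data

# Loop version: each element is tested in Python
premium_loop = [p for p in prices if p > 40]

# Mask version: one vectorized comparison, one indexing step
premium_mask = prices > 40
premium_vec = prices[premium_mask]

# Both approaches select the same elements in the same order
assert np.array_equal(np.array(premium_loop), premium_vec)

# Conditions combine with & (and), | (or), ~ (not); the parentheses are required
mid_range = prices[(prices > 40) & (prices < 100)]
print(len(premium_vec), len(mid_range))
```

The parentheses matter because `&` binds more tightly than `>` in Python, so `prices > 40 & prices < 100` would not mean what it appears to.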

Reshaping and Broadcasting in NumPy

Working with two-dimensional data requires an understanding of broadcasting, the mechanism NumPy uses to operate on arrays with different shapes. The broadcasting rules make it possible to compute between a matrix and a vector without explicit tiling.

python

# numpy_reshape.py
import numpy as np

# Monthly sales data: 4 products x 3 months
sales = np.array([
    [120, 150, 130],  # Product A
    [200, 180, 220],  # Product B
    [90, 110, 95],    # Product C
    [300, 280, 310],  # Product D
])

# Column-wise mean (average per month)
monthly_avg = sales.mean(axis=0)  # [177.5, 180.0, 188.75]

# Row-wise sum (total per product)
product_totals = sales.sum(axis=1)  # [400, 600, 295, 890]

# Normalize each product relative to its own max
normalized = sales / sales.max(axis=1, keepdims=True)
# keepdims=True preserves the shape for broadcasting
print(normalized[0])  # [0.8, 1.0, 0.867] — Product A relative to its peak

# Reshape for Scikit-Learn (requires 2D input)
flat_sales = sales.flatten()  # 1D array of 12 values
reshaped = flat_sales.reshape(-1, 1)  # 12x1 column vector

The axis parameter controls the direction of aggregation: axis=0 collapses the rows (operating down each column), while axis=1 collapses the columns (operating across each row). Using keepdims=True prevents shape mismatches during broadcasting, a common mistake among beginners. Scikit-Learn also requires feature input to be a 2D array, so .reshape(-1, 1) is a necessary step before feeding a single feature into a model.
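
The effect of keepdims is easiest to see by comparing shapes. The following sketch reuses the sales matrix from the example above and shows what happens when keepdims is left out.

```python
import numpy as np

sales = np.array([
    [120, 150, 130],
    [200, 180, 220],
    [90, 110, 95],
    [300, 280, 310],
])

row_max_flat = sales.max(axis=1)                 # shape (4,)
row_max_kept = sales.max(axis=1, keepdims=True)  # shape (4, 1)

# (4, 3) / (4, 1): the length-1 axis stretches across the 3 columns
normalized = sales / row_max_kept

# (4, 3) / (4,): trailing dimensions 3 and 4 do not match, so NumPy raises
try:
    sales / row_max_flat
    broadcast_failed = False
except ValueError as exc:
    broadcast_failed = True
    print(f"ValueError: {exc}")

print(row_max_flat.shape, row_max_kept.shape, normalized.shape)
```

The error here is the lucky case. With a square matrix, the flat version would broadcast without complaint and divide each column by the wrong row's maximum, which is why keeping the dimension is the safer habit.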

Manipulating and Cleaning Data with Pandas DataFrames

Pandas wraps NumPy arrays with labeled axes and SQL-like operations, which makes tabular data far more convenient to work with. The DataFrame is the core Pandas data structure: a table in which each column can hold a different data type. In Pandas 2.2, string columns are still stored as object dtype by default; PyArrow-backed strings are opt-in (for example dtype="string[pyarrow]") and use noticeably less memory than object-dtype strings. pandas 3.0 makes a dedicated string dtype the default, backed by PyArrow when it is installed.

python

# pandas_cleaning.py
import pandas as pd
import numpy as np

# Load and inspect raw data
df = pd.read_csv("candidates.csv")
print(df.shape)       # (1500, 8)
print(df.dtypes)      # Check column types
print(df.isna().sum()) # Count missing values per column

# Clean in a reproducible chain
df_clean = (
    df
    .dropna(subset=["salary", "experience_years"])      # Drop rows missing critical fields
    .assign(
        salary=lambda x: x["salary"].clip(lower=20000, upper=500000),  # Cap outliers
        experience_years=lambda x: x["experience_years"].astype(int),
        hired_date=lambda x: pd.to_datetime(x["hired_date"], errors="coerce"),
    )
    .drop_duplicates(subset=["email"])                  # Remove duplicate candidates
    .query("experience_years >= 0")                     # Filter invalid entries
    .reset_index(drop=True)
)

print(f"Cleaned: {len(df)} -> {len(df_clean)} rows")

Method chaining with .assign() makes the transformation steps readable and reproducible. Each step records clearly what changes and why. The lambda x: form inside .assign() refers to the DataFrame as it stands at that point in the chain, which avoids stale-reference problems that can cause hard-to-detect bugs.
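
The difference between a lambda and a direct column reference is subtle enough to deserve a small demonstration. The sketch below uses a hypothetical three-row DataFrame; the `salary_k` column is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"salary": [15000, 80000, 900000]})

# Direct reference: df["salary"] is the ORIGINAL column, evaluated before the chain runs
stale = (
    df
    .assign(salary=df["salary"].clip(lower=20000, upper=500000))
    .assign(salary_k=df["salary"] / 1000)           # still uses the unclipped values
)

# Lambda: x is the DataFrame as it exists at that step of the chain
fresh = (
    df
    .assign(salary=lambda x: x["salary"].clip(lower=20000, upper=500000))
    .assign(salary_k=lambda x: x["salary"] / 1000)  # uses the clipped values
)

print(stale["salary_k"].tolist())  # [15.0, 80.0, 900.0]
print(fresh["salary_k"].tolist())  # [20.0, 80.0, 500.0]
```

Neither version raises an error, which is what makes the stale reference dangerous: the derived column simply ends up inconsistent with the cleaned one.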

Pandas 2.2 Copy-on-Write

Pandas 2.2 ships Copy-on-Write as an opt-in mode (pd.options.mode.copy_on_write = True); it becomes the default behaviour in pandas 3.0. With it enabled, SettingWithCopyWarning goes away and chained operations become safer: modifying a slice no longer changes the original DataFrame, and Pandas creates a copy only when a write actually happens.
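
A short sketch of the behaviour, assuming pandas 2.x where the mode has to be switched on explicitly. The salary values are made up for illustration.

```python
import pandas as pd

# Opt in explicitly on pandas 2.x; pandas 3.0 makes this behaviour the default
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"salary": [50000, 60000, 70000]})

salaries = df["salary"]   # shares memory with df until something writes to it
salaries.iloc[0] = 0      # the write triggers a copy; df is left untouched

print(df["salary"].tolist())   # [50000, 60000, 70000]
print(salaries.tolist())       # [0, 60000, 70000]
```

Without Copy-on-Write, the same assignment on pandas 2.x can write through to the parent DataFrame, which is the accidental mutation this mode is designed to rule out.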

GroupBy Aggregation and Feature Engineering with Pandas

GroupBy splits the data into groups, applies a function to each group, and combines the results. This pattern is at the heart of feature engineering for tabular machine learning, whether that means group-level statistics, comparative ratios, or variables derived from aggregation.

python

# pandas_groupby.py
import pandas as pd

# Aggregate candidate stats by department
dept_stats = (
    df_clean
    .groupby("department")
    .agg(
        avg_salary=("salary", "mean"),
        median_experience=("experience_years", "median"),
        headcount=("email", "count"),
        max_salary=("salary", "max"),
    )
    .sort_values("avg_salary", ascending=False)
)
print(dept_stats.head())

# Create features for ML: encode categorical + add aggregated stats
df_features = (
    df_clean
    .assign(
        # Ratio of individual salary to department average
        salary_ratio=lambda x: x["salary"] / x.groupby("department")["salary"].transform("mean"),
        # Time since hire in days
        tenure_days=lambda x: (pd.Timestamp.now() - x["hired_date"]).dt.days,
        # Binary encoding
        is_senior=lambda x: (x["experience_years"] >= 5).astype(int),
    )
)

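The .transform() method returns a Series aligned with the index of the original DataFrame, so it can be used safely with .assign(). This avoids a common problem with manually joining aggregated results back onto the original DataFrame, which often leads to index-mismatch errors. Note that rows whose hired_date failed to parse carry NaT, so tenure_days is NaN for them; impute or drop those rows before modelling, because GradientBoostingClassifier does not accept missing values.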
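
The alignment guarantee can be checked directly. The sketch below builds a small hypothetical DataFrame with a deliberately non-default index and compares .transform() with the manual aggregate-then-merge route.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "department": ["Eng", "Eng", "Sales", "Sales", "Sales"],
        "salary": [100000, 120000, 60000, 70000, 80000],
    },
    index=[10, 11, 12, 13, 14],   # non-default index on purpose
)

# transform: one value per ROW, carrying the same index as df
dept_mean = df.groupby("department")["salary"].transform("mean")
df_ratio = df.assign(salary_ratio=df["salary"] / dept_mean)

# Manual alternative: aggregate to one value per GROUP, then merge back
dept_avg = (
    df.groupby("department", as_index=False)["salary"].mean()
    .rename(columns={"salary": "dept_avg"})
)
merged = df.merge(dept_avg, on="department")   # merge replaces the index with 0..n-1

print(df_ratio["salary_ratio"].round(3).tolist())
print(list(df_ratio.index), list(merged.index))
```

The merge gives the right numbers, but the original index is lost, so any later step that aligns on index labels would silently misbehave.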

Building a Scikit-Learn Pipeline from Scratch

A Scikit-Learn Pipeline chains the preprocessing steps and the model into a single object. Its most important benefit is preventing data leakage: transformers are fit on the training data only, and the same fitted transformations are then applied to the test data. This is the standard pattern for supervised learning.

python

# sklearn_pipeline.py
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
import pandas as pd

# Define column groups
numeric_features = ["salary", "experience_years", "salary_ratio", "tenure_days"]
categorical_features = ["department", "role_level"]
target = "promoted"

# Split before any preprocessing
X = df_features[numeric_features + categorical_features]
y = df_features[target]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Build the preprocessing + model pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),           # Scale numeric columns
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),  # Encode categories
    ]
)

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(
        n_estimators=200,
        learning_rate=0.1,
        max_depth=4,
        random_state=42,
    )),
])

# Train and evaluate
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

ColumnTransformer applies different preprocessing to each column type in a single step. StandardScaler rescales numeric features to zero mean and unit standard deviation, while OneHotEncoder converts categorical strings into binary columns. Setting handle_unknown="ignore" matters because it keeps the Pipeline from crashing when the test data contains a category that never appeared in the training data.
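
To see what handle_unknown="ignore" actually does, the sketch below encodes a category the encoder never saw during fitting. The "Legal" department is a hypothetical value chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"department": ["Engineering", "Sales", "Engineering"]})
test = pd.DataFrame({"department": ["Sales", "Legal"]})   # "Legal" never seen in training

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# transform returns a sparse matrix by default, so convert it for printing
encoded = encoder.transform(test).toarray()
print(encoder.categories_[0])
print(encoded)   # the "Legal" row is all zeros

# With the default handle_unknown="error", the same transform raises
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    strict_failed = False
except ValueError:
    strict_failed = True
```

An all-zero row means the model treats the unseen category as "none of the known ones", which keeps inference running but is worth monitoring if new categories appear often.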

Cross-Validation and Hyperparameter Tuning

A single train/test split can give misleading results depending on which samples land in which set. Cross-validation addresses this by running several rounds of splitting and averaging the scores, which yields a more reliable performance estimate.

python

# sklearn_tuning.py
from sklearn.model_selection import cross_val_score, GridSearchCV
import numpy as np

# 5-fold cross-validation on the full pipeline
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="f1")
print(f"F1 scores: {scores}")
print(f"Mean F1: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

# Grid search over hyperparameters
param_grid = {
    "classifier__n_estimators": [100, 200, 300],
    "classifier__max_depth": [3, 4, 5],
    "classifier__learning_rate": [0.05, 0.1, 0.2],
}

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="f1",
    n_jobs=-1,      # Use all CPU cores
    verbose=1,
)

grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best F1: {grid_search.best_score_:.3f}")

# Evaluate the best model on held-out test set
best_model = grid_search.best_estimator_
print(classification_report(y_test, best_model.predict(X_test)))

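The double-underscore syntax (classifier__max_depth) addresses parameters nested inside a Pipeline. Setting n_jobs=-1 runs the search in parallel on all available CPU cores. The grid above has 3 x 3 x 3 = 27 combinations, so 5-fold cross-validation trains 135 models plus a final refit; because the fits are independent, wall-clock time drops roughly in proportion to the number of cores.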
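
The nested names are not specific to GridSearchCV; they come from the Pipeline's own parameter interface. The sketch below, using a simplified two-step pipeline assumed for illustration, lists them with get_params() and changes two of them with set_params().

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", GradientBoostingClassifier(max_depth=4, random_state=42)),
])

# Every tunable name follows <step name>__<parameter name>
tunable = [name for name in pipeline.get_params() if name.startswith("classifier__")]
print(tunable[:5])

# set_params accepts the same names that GridSearchCV reads from param_grid
pipeline.set_params(classifier__max_depth=3, scaler__with_mean=False)
print(pipeline.named_steps["classifier"].max_depth)
print(pipeline.named_steps["scaler"].with_mean)
```

Listing get_params() first is a quick way to catch typos in a param_grid before launching a long search.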

Watch for Data Leakage in Cross-Validation

Fitting a scaler on the full dataset before splitting into folds leaks information from the validation fold into the training fold. Placing the scaler inside the Pipeline guarantees that scaling is fit only on the training portion of each fold. This is one of the most common mistakes probed in machine learning interview questions.
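
The leak can be made visible by comparing the statistics a scaler learns in each case. This is a minimal sketch on synthetic data, not a full benchmark; the single normally distributed feature is an assumption.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed=42)
X = rng.normal(loc=50, scale=10, size=(100, 1))   # hypothetical single feature

# Take the first of five folds: 80 training rows, 20 validation rows
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
train_idx, val_idx = next(splitter.split(X))

# Leaky: statistics computed on all 100 rows, including the validation fold
leaky_scaler = StandardScaler().fit(X)

# Correct: statistics computed on the 80 training rows only,
# which is what a Pipeline does inside each cross-validation fold
fold_scaler = StandardScaler().fit(X[train_idx])

print(f"mean from all rows:      {leaky_scaler.mean_[0]:.4f}")
print(f"mean from training fold: {fold_scaler.mean_[0]:.4f}")
```

The two means differ only slightly here, but that difference is exactly the information about the validation rows that the leaky version lets into training.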

Saving and Loading Models for Production

A trained Pipeline must be serialized before it can be used in production. The joblib library handles NumPy arrays more efficiently than Python's built-in pickle, especially when the model holds large arrays. The key point is to save the whole Pipeline, not just the model, so that preprocessing is applied correctly at inference.

python

# sklearn_export.py
from pathlib import Path

import joblib
import pandas as pd  # needed for the pd.DataFrame built below

# Save the complete pipeline (preprocessor + model)
model_dir = Path("models")
model_dir.mkdir(exist_ok=True)
joblib.dump(best_model, model_dir / "promotion_model_v1.joblib")

# Load and predict in a different process
loaded_model = joblib.load(model_dir / "promotion_model_v1.joblib")
new_data = pd.DataFrame({
    "salary": [75000],
    "experience_years": [4],
    "salary_ratio": [1.05],
    "tenure_days": [730],
    "department": ["Engineering"],
    "role_level": ["Mid"],
})

prediction = loaded_model.predict(new_data)
probability = loaded_model.predict_proba(new_data)[:, 1]
print(f"Promoted: {bool(prediction[0])}, Confidence: {probability[0]:.2%}")

Saving the whole Pipeline rather than the model alone ensures that the preprocessing steps are applied identically at inference time. Version model files alongside the training script to preserve reproducibility. The official Scikit-Learn model persistence documentation covers the security precautions for loading untrusted model files; joblib files are pickle-based, so loading one can execute arbitrary code.
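
A round-trip check is a cheap way to confirm that the saved file really contains the fitted preprocessing. The sketch below is self-contained and swaps in LogisticRegression on synthetic data to keep it fast; the file name `demo_pipeline.joblib` is hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 3))                 # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # synthetic binary target

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
]).fit(X, y)

# Write to a temporary directory and read the file straight back
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_pipeline.joblib"
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored object carries the fitted scaler as well as the classifier
same_predictions = np.array_equal(pipeline.predict(X), restored.predict(X))
same_scaling = np.allclose(
    pipeline.named_steps["scaler"].mean_,
    restored.named_steps["scaler"].mean_,
)
print(same_predictions, same_scaling)
```

A check like this fits naturally into a test suite, so that a change to the training script that breaks serialization is caught before deployment.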

Summary

NumPy vectorized operations replace Python loops with array computations that are typically 50-100x faster, while boolean indexing and broadcasting cover most filtering and transformation needs.
Pandas method chaining with .assign() and .transform() produces data cleaning pipelines that are readable and reproducible, and Copy-on-Write (opt-in in Pandas 2.2, the default from pandas 3.0) eliminates bugs caused by accidental mutation.
A Scikit-Learn Pipeline combines preprocessing and modelling into a single object and prevents data leakage between the train and test sets.
Cross-validation together with GridSearchCV gives a reliable performance estimate and automates hyperparameter tuning across all CPU cores.
Serializing the whole Pipeline with joblib guarantees that preprocessing is identical at training and inference time.
Practice these patterns with data science interview questions that test both coding fluency and statistical fundamentals.
