Feature engineering largely determines how well a machine learning model performs. Raw data rarely arrives in a form that algorithms can consume directly, and turning it into meaningful features is the bridge between data collection and model accuracy.

Why Feature Engineering Matters

Preprocessing and feature engineering choices frequently affect model accuracy more than hyperparameter tuning does. A well-designed feature set can let a plain logistic regression outperform a gradient boosting model trained on poor features.

Categorical Encoding Strategies for ML Models

Most machine learning algorithms require numeric input. Categorical variables, such as the labels "high", "medium" and "low" or country names, must be converted into numbers the model can process. The encoding strategy directly affects both model performance and interpretability.

Three encoding techniques cover most real-world situations: ordinal (label) encoding for ordered data, one-hot encoding for nominal categories, and target encoding for high-cardinality features.

python

# encoding_strategies.py
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({
    "size": ["small", "medium", "large", "medium", "small"],
    "color": ["red", "blue", "green", "red", "blue"],
    "price": [10, 25, 40, 22, 12]
})

# Ordinal encoding for an ordered feature: pass the category order explicitly.
# LabelEncoder is intended for targets and sorts alphabetically (large=0, medium=1, small=2).
oe = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = oe.fit_transform(df[["size"]]).ravel()  # small=0, medium=1, large=2

# One-hot encoding for nominal feature (color has no order)
ct = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(drop="first", sparse_output=False), ["color"])
    ],
    remainder="passthrough"  # Keep other columns unchanged
)
result = ct.fit_transform(df[["color", "price"]])
# Produces: color_green, color_red columns (blue dropped as reference)

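Label (ordinal) encoding suits ordered features, provided the integer order matches the category order. Note that scikit-learn's LabelEncoder assigns integers alphabetically and is intended for target labels; OrdinalEncoder with an explicit categories list gives control over the order. One-hot encoding prevents the model from inferring a false ordering among nominal categories. The drop="first" parameter avoids the dummy variable trap by removing one redundant column.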

Target Encoding and Data Leakage

Target encoding replaces each category with the mean of the target variable for that group. It is very effective for high-cardinality features (zip codes, product IDs), but it leaks target information into the features. Always apply target encoding inside cross-validation folds, for example with scikit-learn's TargetEncoder class, to avoid overly optimistic evaluation estimates.

Feature Scaling: StandardScaler vs MinMaxScaler vs RobustScaler

Distance-based algorithms (KNN, SVM, K-Means) and gradient descent optimizers put all features on the same numeric footing by default, so features with larger ranges carry more weight. Without scaling, a salary column ranging from 30,000 to 200,000 will dominate a years-of-experience column ranging from 0 to 40.
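
A small sketch of that effect, using three made-up people described by salary and years of experience. The numbers are illustrative assumptions, not data from any real source.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: salary, years of experience (hypothetical values)
people = np.array([
    [50_000.0, 2.0],   # person 0
    [51_000.0, 30.0],  # person 1: similar salary, very different experience
    [90_000.0, 3.0],   # person 2: very different salary, similar experience
])


def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))


# On raw values the salary gap swamps the experience gap
raw_0_1 = euclidean(people[0], people[1])
raw_0_2 = euclidean(people[0], people[2])

# After standardization both features contribute on a comparable scale
scaled = StandardScaler().fit_transform(people)
scaled_0_1 = euclidean(scaled[0], scaled[1])
scaled_0_2 = euclidean(scaled[0], scaled[2])

print(f"raw:    d(0,1)={raw_0_1:.1f}  d(0,2)={raw_0_2:.1f}")
print(f"scaled: d(0,1)={scaled_0_1:.2f}  d(0,2)={scaled_0_2:.2f}")
```

On the raw data a 28-year experience gap barely registers next to a salary gap; after scaling the two distances are of similar magnitude.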

The choice of scaler depends on the data distribution and on how sensitive the scaler is to outliers.

python

# feature_scaling.py
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Simulated dataset: salary with outliers
data = np.array([[35000], [42000], [55000], [67000], [450000]])  # 450k is an outlier

# StandardScaler: mean=0, std=1 (sensitive to outliers)
standard = StandardScaler().fit_transform(data)
# Result: [-0.59, -0.55, -0.47, -0.39, 2.00]; the outlier inflates the std and squeezes the rest

# MinMaxScaler: maps to [0, 1] (very sensitive to outliers)
minmax = MinMaxScaler().fit_transform(data)
# Result: [0.0, 0.017, 0.048, 0.077, 1.0]; most values crushed near zero

# RobustScaler: uses median and IQR (robust to outliers)
robust = RobustScaler().fit_transform(data)
# Result: [-0.8, -0.52, 0.0, 0.48, 15.8]; outlier isolated, core data preserved

StandardScaler fits most use cases: linear models, neural networks and PCA all expect standardized features. MinMaxScaler suits bounded activation functions (sigmoid, tanh) and image pixel normalization. RobustScaler should be the default choice when the dataset contains outliers, as noted in the scikit-learn preprocessing guide.

Mathematical Transformations for Skewed Distributions

Heavily skewed features amplify the influence of extreme values and work against linear models, whose normality assumptions (strictly, on the residuals) are harder to satisfy when inputs are strongly skewed. Log, square-root and Box-Cox transformations compress the tail of a skewed distribution, bringing it closer to normal.

python

# skew_transformations.py
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Right-skewed income data (common in real datasets)
income = np.array([[25000], [32000], [41000], [55000], [72000],
                   [150000], [320000], [890000]])

# Log transform: simple, effective for right-skewed data
log_income = np.log1p(income)  # log1p handles zero values safely

# Box-Cox: finds optimal power parameter automatically
pt = PowerTransformer(method="box-cox")  # Requires strictly positive values
income_boxcox = pt.fit_transform(income)
print(f"Optimal lambda: {pt.lambdas_[0]:.3f}")  # Shows learned parameter

# Yeo-Johnson: works with zero and negative values too
pt_yj = PowerTransformer(method="yeo-johnson")
income_yj = pt_yj.fit_transform(income)

PowerTransformer with Box-Cox or Yeo-Johnson learns the optimal transformation parameter automatically. Yeo-Johnson handles zero and negative values, which makes it the safer default. Always check the resulting distribution with a Q-Q plot or a Shapiro-Wilk test to confirm that the transformation actually improved normality.
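
One way to run that check numerically is sketched below, using sample skewness and the Shapiro-Wilk test from SciPy. The log-normal sample stands in for a real income column and is purely an assumption of this example; Box-Cox is used because the synthetic values are strictly positive.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Synthetic right-skewed "income" sample (hypothetical data)
income = rng.lognormal(mean=10.5, sigma=0.8, size=500).reshape(-1, 1)

transformed = PowerTransformer(method="box-cox").fit_transform(income)

# Skewness near 0 indicates a roughly symmetric distribution
skew_before = stats.skew(income.ravel())
skew_after = stats.skew(transformed.ravel())

# Shapiro-Wilk: a small p-value rejects the hypothesis of normality
_, p_before = stats.shapiro(income.ravel())
_, p_after = stats.shapiro(transformed.ravel())

print(f"skewness: {skew_before:.2f} -> {skew_after:.2f}")
print(f"Shapiro-Wilk p-value: {p_before:.3g} -> {p_after:.3g}")
```

A Q-Q plot (for instance via scipy.stats.probplot) gives the same information visually and is often easier to read than a single p-value on large samples.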

Feature Selection: Removing Noise Before Training

More features do not always mean better predictions. Irrelevant or redundant features add noise, increase training time and cause overfitting, especially in high-dimensional datasets where the number of features approaches or exceeds the number of samples.

Feature selection methods fall into three families: filter methods (statistical tests), wrapper methods (model-based evaluation) and embedded methods (built into the training process).

python

# feature_selection.py
from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    SelectKBest, f_classif,  # Filter method
    SequentialFeatureSelector,  # Wrapper method
    SelectFromModel  # Embedded method
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LassoCV

# Dataset: 20 features, only 5 are informative
X, y = make_classification(
    n_samples=1000, n_features=20,
    n_informative=5, n_redundant=3, random_state=42
)

# Filter: ANOVA F-test selects top k features by statistical significance
selector_filter = SelectKBest(f_classif, k=8)
X_filtered = selector_filter.fit_transform(X, y)
print(f"Selected features: {selector_filter.get_support(indices=True)}")

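# Embedded: L1 regularization (Lasso) zeros out irrelevant feature weights
# Note: LassoCV fits the 0/1 labels as a regression target; a classifier-native
# alternative is LogisticRegression(penalty="l1", solver="liblinear")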
lasso = LassoCV(cv=5, random_state=42).fit(X, y)
selector_embedded = SelectFromModel(lasso, prefit=True)
X_lasso = selector_embedded.transform(X)
print(f"Lasso kept {X_lasso.shape[1]} features out of {X.shape[1]}")

Filter methods (ANOVA, chi-squared, mutual information) are fast but score each feature independently, so they miss feature interactions. Embedded methods such as Lasso regularization perform selection during training, producing sparse models that often generalize better. For a deeper understanding of the algorithms behind these models, the machine learning algorithms guide covers the mathematical foundations.
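
The wrapper family is not exercised in the example above, so here is a minimal sketch of it using SequentialFeatureSelector. The dataset size, the logistic regression estimator and the choice of three features are assumptions made to keep the example fast, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset: 10 features, 3 of them informative
X, y = make_classification(
    n_samples=300, n_features=10,
    n_informative=3, n_redundant=2, random_state=42
)

# Wrapper: greedily add the feature that most improves cross-validated score
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
    cv=3,
)
X_wrapped = sfs.fit_transform(X, y)
selected = sfs.get_support(indices=True)
print(f"Forward selection kept features: {selected}")
```

Wrapper methods evaluate features jointly through the model, which captures interactions that filters miss, at the cost of many more model fits.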

Building Production-Ready Pipelines with ColumnTransformer

Scattering preprocessing steps across notebook cells creates a fragile, non-reproducible workflow. scikit-learn's Pipeline and ColumnTransformer bundle the entire feature engineering process into a single serializable object and prevent data leakage during cross-validation.

python

# production_pipeline.py
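import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Define column groups by type
numeric_features = ["age", "income", "credit_score"]
categorical_features = ["education", "employment_type", "region"]

# Synthetic placeholder data so the example runs end to end (replace with real data)
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "age": rng.integers(18, 70, n).astype(float),
    "income": rng.lognormal(10.5, 0.5, n),
    "credit_score": rng.normal(650, 80, n),
    "education": rng.choice(["high_school", "bachelor", "master"], n),
    "employment_type": rng.choice(["salaried", "self_employed"], n),
    "region": rng.choice(["north", "south", "east", "west"], n),
})
y = (X["credit_score"] + rng.normal(0, 80, n) > 650).astype(int)  # binary target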

# Numeric pipeline: impute missing values, then scale
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # Median resists outliers
    ("scaler", StandardScaler())
])

# Categorical pipeline: impute missing, then one-hot encode
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

# Combine into a single preprocessor
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features)
])

# Full pipeline: preprocessing + model in one object
full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(n_estimators=200, random_state=42))
])

# Cross-validation applies preprocessing correctly within each fold
scores = cross_val_score(full_pipeline, X, y, cv=5, scoring="roc_auc")
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

The Pipeline guarantees that imputation statistics, scaling parameters and encoding mappings are computed only from the training data within each cross-validation fold. Calling full_pipeline.fit(X_train, y_train) followed by full_pipeline.predict(X_test) applies identical transformations without leaking data. The whole Pipeline can be serialized with joblib.dump() for deployment.
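
The fit, predict and serialize steps mentioned above can be sketched end to end as follows. To stay self-contained this uses a deliberately small stand-in pipeline on synthetic data; the file name pipeline.joblib and the temporary directory are arbitrary choices for the example.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scaler statistics are learned from X_train only, then reused on X_test
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# Persist preprocessing and model together, then reload as a single object
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "pipeline.joblib"
    joblib.dump(pipeline, model_path)
    reloaded = joblib.load(model_path)

same_predictions = np.array_equal(
    pipeline.predict(X_test), reloaded.predict(X_test)
)
print(f"Reloaded pipeline reproduces predictions: {same_predictions}")
```

Because the fitted transformers travel with the model, the serving code does not need to re-implement any preprocessing logic.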

Automated Feature Engineering with Featuretools

Manual feature engineering works well for structured datasets with a few dozen columns. Relational datasets spanning multiple tables, such as transactions linked to customers linked to merchants, call for an automated approach that explores feature combinations systematically.

Featuretools implements Deep Feature Synthesis (DFS), which traverses the relationships between entities and applies transformation and aggregation primitives to generate hundreds of candidate features automatically.

python

# automated_feature_engineering.py
import featuretools as ft
import pandas as pd

# Define entities from relational tables
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "signup_date": pd.to_datetime(["2024-01-15", "2024-03-22", "2024-06-01"]),
    "region": ["US", "EU", "APAC"]
})

transactions = pd.DataFrame({
    "txn_id": range(1, 8),
    "customer_id": [1, 1, 1, 2, 2, 3, 3],
    "amount": [50, 120, 30, 200, 85, 340, 15],
    "category": ["food", "tech", "food", "tech", "travel", "food", "tech"]
})

# Create an EntitySet with relationships
es = ft.EntitySet(id="retail")
es = es.add_dataframe(dataframe=customers, dataframe_name="customers",
                      index="customer_id", time_index="signup_date")
es = es.add_dataframe(dataframe=transactions, dataframe_name="transactions",
                      index="txn_id")
es = es.add_relationship("customers", "customer_id",
                         "transactions", "customer_id")

# DFS generates features: COUNT, MEAN, MAX, STD of transactions per customer
feature_matrix, feature_defs = ft.dfs(
    entityset=es, target_dataframe_name="customers",
    max_depth=2,  # Controls complexity of generated features
    trans_primitives=["month", "weekday"],  # Transformation primitives
    agg_primitives=["count", "mean", "std", "max"]  # Aggregation primitives
)
print(f"Generated {len(feature_defs)} features from 2 tables")

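DFS generates features such as MEAN(transactions.amount), STD(transactions.amount) and COUNT(transactions), the same aggregations a data scientist would usually build by hand, but in far less time. The max_depth parameter controls feature complexity by limiting how many primitives can be stacked: a depth of 2 allows an aggregation to be applied on top of a transformed column in a child table, which would apply here if transactions carried its own timestamp. In this example the only datetime column, signup_date, lives on customers, so the transform primitives yield features such as MONTH(signup_date) and WEEKDAY(signup_date).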

Feature Engineering Interview Tips

Interviewers often ask candidates to design features for a specific business problem (churn prediction, fraud detection, recommender systems). Practice framing features in terms of recency, frequency and monetary value (RFM analysis); these three dimensions apply to almost every customer-related ML problem. Review the common feature engineering interview questions to prepare.
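
As a concrete illustration of the RFM framing, the sketch below derives the three features from a toy transaction log with pandas. The table, column names and snapshot date are hypothetical.

```python
import pandas as pd

# Hypothetical transaction log
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "txn_date": pd.to_datetime([
        "2024-01-05", "2024-02-10", "2024-03-01",
        "2024-01-20", "2024-01-25", "2024-03-10",
    ]),
    "amount": [50.0, 120.0, 30.0, 200.0, 85.0, 340.0],
})

# Reference date for recency; in practice this is the prediction cut-off
snapshot = pd.Timestamp("2024-03-31")

rfm = transactions.groupby("customer_id").agg(
    recency_days=("txn_date", lambda d: (snapshot - d.max()).days),  # days since last purchase
    frequency=("amount", "count"),                                   # number of purchases
    monetary=("amount", "sum"),                                      # total spend
)
print(rfm)
```

Fixing the snapshot date explicitly also guards against leakage: only transactions before the cut-off should feed the features.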

Common Feature Engineering Interview Questions

Data science interviews in 2026 test both theoretical understanding and practical application of feature engineering. The following questions come up frequently in interviews at large technology companies and startups.

Q: How would you handle a categorical feature with more than 10,000 unique values?

One-hot encoding is impractical for high-cardinality categorical variables because it would create 10,000 sparse columns. Effective strategies include target encoding (with proper cross-validation to avoid leakage), frequency encoding (replacing each category with its occurrence count), hashing (using scikit-learn's FeatureHasher to map categories into a fixed-size space; HashingVectorizer is the text-oriented counterpart) and embedding layers in neural networks.
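
Two of these strategies, frequency encoding and the hashing trick, fit in a few lines. The sketch below uses a tiny made-up product_id column and an arbitrary hash width of 8; real high-cardinality columns would use a much larger n_features to limit collisions.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Hypothetical categorical column
df = pd.DataFrame({"product_id": ["p1", "p2", "p1", "p3", "p1", "p2"]})

# Frequency encoding: replace each category with how often it occurs
counts = df["product_id"].value_counts()
freq_encoded = df["product_id"].map(counts)

# Hashing trick: map each category into a fixed number of columns
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([[value] for value in df["product_id"]]).toarray()

print(freq_encoded.tolist())
print(hashed.shape)
```

Hashing needs no fitted vocabulary, so unseen categories at prediction time are handled automatically, while frequency encoding keeps a single interpretable column.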

Q: When should you not scale features?

Tree-based models (Random Forest, XGBoost, LightGBM) are unaffected by monotonic transformations of the features, so scaling only adds unnecessary computation without changing the splits. Distance-based models and neural networks, by contrast, should always be trained on scaled features.
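
The invariance claim can be checked empirically. The sketch below trains the same decision tree on raw and on standardized inputs and measures how often the two agree on held-out data; the synthetic dataset and the inflated first feature are assumptions of this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
X[:, 0] *= 10_000  # put one feature on a much larger scale

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same tree configuration, trained on raw and on standardized inputs
tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

scaler = StandardScaler().fit(X_train)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(
    scaler.transform(X_train), y_train
)

# Fraction of test rows where both trees predict the same class
agreement = np.mean(
    tree_raw.predict(X_test) == tree_scaled.predict(scaler.transform(X_test))
)
print(f"Prediction agreement: {agreement:.3f}")
```

Splits depend only on the ordering of values within each feature, which standardization preserves, so the two trees partition the data the same way.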

Q: Explain the difference between feature selection and dimensionality reduction.

Feature selection keeps a subset of the original features, so the selected features remain interpretable. Dimensionality reduction (PCA, t-SNE, UMAP) creates new synthetic features as linear or nonlinear combinations of the original ones. PCA components maximize explained variance but lose direct interpretability. The right choice depends on whether model explainability is a project requirement.

Q: How do you detect and handle multicollinearity?

A Variance Inflation Factor (VIF) above 5 to 10 signals problematic multicollinearity. A correlation matrix catches pairwise relationships but misses multivariate dependencies. Remedies include dropping one of the correlated features, combining them (PCA on the correlated subset), or using regularization (Ridge/Lasso), which handles collinearity internally.
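
VIF for feature j is 1 / (1 - R²_j), where R²_j comes from regressing feature j on all the other features. The sketch below computes it with scikit-learn's LinearRegression; the helper vif_scores and the synthetic columns are hypothetical names introduced for this example (statsmodels offers a ready-made equivalent).

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def vif_scores(X):
    """Return the VIF of each column: 1 / (1 - R^2) from regressing it on the others."""
    scores = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)


rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)  # nearly a linear combination of x1 and x2
x4 = rng.normal(size=n)                       # independent of the rest

X = np.column_stack([x1, x2, x3, x4])
vif = vif_scores(X)
print(dict(zip(["x1", "x2", "x3", "x4"], vif.round(1))))
```

Note that x1 and x2 are uncorrelated with each other, so a pairwise correlation matrix would understate the problem that the VIF values expose.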

For hands-on practice with the Python data-manipulation libraries that feature engineering relies on, the Python for Data Science guide covers NumPy, Pandas and scikit-learn workflows in detail.

Conclusion

Encode categoricals with intent: ordinal (label) encoding for ordered data, one-hot for nominal data, and target encoding (inside CV folds) for high-cardinality features
Scale features according to the model's needs: StandardScaler for linear models and neural networks, no scaling for tree-based models
Use power transformations (PowerTransformer) to reduce skew before feeding data to algorithms that assume normality
Select features systematically, using filter methods for speed and embedded methods (Lasso) for accuracy, and always validate with cross-validation
Wrap all preprocessing in a scikit-learn Pipeline + ColumnTransformer to prevent data leakage and guarantee reproducibility
Use Featuretools DFS for automated feature generation on relational datasets that span multiple tables
Practice explaining feature engineering decisions in business language; interviewers assess both technical skill and clarity of communication

Related articles

Python for Data Science: NumPy, Pandas and Scikit-Learn in 2026

Machine Learning Algorithms Explained: A Technical Interview Guide for 2026

Top 25 Data Science Interview Questions in 2026