Feature Engineering for Machine Learning: Techniques and Interview Questions 2026
Master feature engineering for machine learning with practical Python examples. Covers encoding, scaling, feature selection, scikit-learn pipelines, and common data science interview questions.

Feature engineering determines how well a machine learning model performs. Raw data rarely arrives in a format that algorithms can exploit directly — transforming it into meaningful features bridges the gap between data collection and model accuracy.
In practice, preprocessing and feature engineering choices often influence downstream accuracy more than hyperparameter tuning does. A well-engineered feature set can make a simple logistic regression outperform a poorly-fed gradient boosting model.
Categorical Encoding Strategies for ML Models
Most machine learning algorithms require numerical input. Categorical variables — text labels like "high", "medium", "low" or country names — need conversion into numbers the model can process. The encoding strategy directly affects model performance and interpretability.
Three encoding techniques cover the vast majority of real-world scenarios: ordinal encoding for ordered categories, one-hot encoding for nominal categories, and target encoding for high-cardinality features.
# encoding_strategies.py
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

df = pd.DataFrame({
    "size": ["small", "medium", "large", "medium", "small"],
    "color": ["red", "blue", "green", "red", "blue"],
    "price": [10, 25, 40, 22, 12]
})

# Ordinal encoding for the ordinal feature (size has a natural order).
# OrdinalEncoder takes the order explicitly; LabelEncoder is intended for
# targets and would sort the categories alphabetically instead.
oe = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = oe.fit_transform(df[["size"]]).ravel()  # small=0, medium=1, large=2

# One-hot encoding for the nominal feature (color has no order)
ct = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(drop="first", sparse_output=False), ["color"])
    ],
    remainder="passthrough"  # Keep other columns unchanged
)
result = ct.fit_transform(df[["color", "price"]])
# Produces color_green and color_red columns (blue dropped as reference)

Ordinal encoding works for features whose categories have a meaningful order — pass that order explicitly rather than relying on LabelEncoder, which sorts labels alphabetically and is designed for target columns. One-hot encoding prevents the model from inferring false ordinal relationships between nominal categories. The drop="first" parameter avoids the dummy variable trap by removing one redundant column.
Target encoding replaces each category with the mean target value for that group. While powerful for high-cardinality features (zip codes, product IDs), it leaks target information into features. Always apply target encoding inside cross-validation folds using scikit-learn's TargetEncoder class to prevent overly optimistic evaluation metrics.
Feature Scaling: StandardScaler vs MinMaxScaler vs RobustScaler
Distance-based algorithms (KNN, SVM, K-Means) and gradient descent optimizers treat all features equally by default. A salary column ranging from 30,000 to 200,000 would dominate a years-of-experience column ranging from 0 to 40 without scaling.
The choice of scaler depends on data distribution and outlier sensitivity.
# feature_scaling.py
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Simulated dataset: salary with an outlier
data = np.array([[35000], [42000], [55000], [67000], [450000]])  # 450k is an outlier

# StandardScaler: mean=0, std=1 (sensitive to outliers)
standard = StandardScaler().fit_transform(data)
# Result: approx [-0.59, -0.55, -0.47, -0.39, 2.00] — the outlier inflates the
# std, squeezing the four regular salaries into a narrow band

# MinMaxScaler: maps to [0, 1] (very sensitive to outliers)
minmax = MinMaxScaler().fit_transform(data)
# Result: approx [0.0, 0.017, 0.048, 0.077, 1.0] — most values crushed near zero

# RobustScaler: uses median and IQR (robust to outliers)
robust = RobustScaler().fit_transform(data)
# Result: approx [-0.8, -0.52, 0.0, 0.48, 15.8] — outlier isolated, core data preserved

StandardScaler fits most use cases — linear models, neural networks, and PCA all expect standardized features. MinMaxScaler suits bounded activations (sigmoid, tanh) and image pixel normalization. RobustScaler should be the default choice when outliers exist in the dataset, as documented in the scikit-learn preprocessing guide.
Mathematical Transformations for Skewed Distributions
Skewed features inflate the influence of extreme values and work against linear models, which assume normally distributed residuals. Log, square root, and Box-Cox transformations compress the tail of skewed distributions, bringing them closer to normal.
# skew_transformations.py
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Right-skewed income data (common in real datasets)
income = np.array([[25000], [32000], [41000], [55000], [72000],
                   [150000], [320000], [890000]])

# Log transform: simple, effective for right-skewed data
log_income = np.log1p(income)  # log1p handles zero values safely

# Box-Cox: finds the optimal power parameter automatically
pt = PowerTransformer(method="box-cox")  # Requires strictly positive values
income_boxcox = pt.fit_transform(income)
print(f"Optimal lambda: {pt.lambdas_[0]:.3f}")  # Shows the learned parameter

# Yeo-Johnson: works with zero and negative values too
pt_yj = PowerTransformer(method="yeo-johnson")
income_yj = pt_yj.fit_transform(income)

PowerTransformer with Box-Cox or Yeo-Johnson automatically learns the optimal transformation parameter. Yeo-Johnson handles zero and negative values, making it the safer default. Always check the resulting distribution with a Q-Q plot or Shapiro-Wilk test to confirm the transformation improved normality.
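A quick way to run that check, assuming SciPy is available (the income values mirror the array above):

```python
import numpy as np
from scipy import stats

income = np.array([25000, 32000, 41000, 55000, 72000,
                   150000, 320000, 890000], dtype=float)
log_income = np.log1p(income)

# Shapiro-Wilk: the null hypothesis is normality, so a higher p-value
# means less evidence against it
_, p_raw = stats.shapiro(income)
_, p_log = stats.shapiro(log_income)
print(f"raw p={p_raw:.4f}, log p={p_log:.4f}")
```

If the p-value rises substantially after the transform, the skew correction worked; with only eight points the test is indicative rather than conclusive.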
Feature Selection: Removing Noise Before Training
More features do not always mean better predictions. Irrelevant or redundant features introduce noise, increase training time, and cause overfitting — especially in high-dimensional datasets where the number of features approaches or exceeds the number of samples.
Three categories of feature selection methods exist: filter methods (statistical tests), wrapper methods (model-based evaluation), and embedded methods (built into the training process).
# feature_selection.py
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, SelectFromModel
from sklearn.linear_model import LassoCV

# Dataset: 20 features, only 5 are informative
X, y = make_classification(
    n_samples=1000, n_features=20,
    n_informative=5, n_redundant=3, random_state=42
)

# Filter: ANOVA F-test selects the top k features by statistical significance
selector_filter = SelectKBest(f_classif, k=8)
X_filtered = selector_filter.fit_transform(X, y)
print(f"Selected features: {selector_filter.get_support(indices=True)}")

# Embedded: L1 regularization (Lasso) zeros out irrelevant feature weights.
# LassoCV treats the 0/1 labels as a regression target here, which is
# acceptable for selection purposes.
lasso = LassoCV(cv=5, random_state=42).fit(X, y)
selector_embedded = SelectFromModel(lasso, prefit=True)
X_lasso = selector_embedded.transform(X)
print(f"Lasso kept {X_lasso.shape[1]} features out of {X.shape[1]}")

Filter methods (ANOVA, chi-squared, mutual information) run fast but evaluate features independently — they miss feature interactions. Wrapper methods retrain a model as features are added or removed, capturing interactions at a higher computational cost. Embedded methods like Lasso regularization perform selection during training, producing sparse models that generalize better. For a deeper understanding of the algorithms behind these models, the machine learning algorithms guide covers the mathematical foundations.
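The wrapper category can be sketched with scikit-learn's SequentialFeatureSelector, which greedily retrains an estimator to decide which feature to add next (the synthetic dataset and LogisticRegression base model are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, n_redundant=3, random_state=42)

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=8,  # stop once 8 features are chosen
    direction="forward",     # greedily add features; "backward" removes them
    cv=3                     # each candidate is scored by cross-validation
)
sfs.fit(X, y)
print(f"Wrapper kept columns: {sfs.get_support(indices=True)}")
```

Forward selection fits the model hundreds of times, which is why wrapper methods are usually reserved for modest feature counts.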
Ready to ace your Data Science & ML interviews?
Practice with our interactive simulators, flashcards, and technical tests.
Building Production-Ready Pipelines with ColumnTransformer
Scattering preprocessing steps across notebook cells creates fragile, unreproducible workflows. Scikit-learn's Pipeline and ColumnTransformer encapsulate the entire feature engineering process into a single, serializable object that prevents data leakage during cross-validation.
# production_pipeline.py
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Define column groups by type. Assumes X is a DataFrame containing these
# columns and y is a binary target series.
numeric_features = ["age", "income", "credit_score"]
categorical_features = ["education", "employment_type", "region"]

# Numeric pipeline: impute missing values, then scale
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # Median resists outliers
    ("scaler", StandardScaler())
])

# Categorical pipeline: impute missing values, then one-hot encode
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

# Combine into a single preprocessor
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features)
])

# Full pipeline: preprocessing + model in one object
full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(n_estimators=200, random_state=42))
])

# Cross-validation applies preprocessing correctly within each fold
scores = cross_val_score(full_pipeline, X, y, cv=5, scoring="roc_auc")
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

The pipeline guarantees that imputation statistics, scaling parameters, and encoding mappings are computed only on training data within each cross-validation fold. Calling full_pipeline.fit(X_train, y_train) followed by full_pipeline.predict(X_test) applies identical transformations without information leakage. The entire pipeline can be serialized with joblib.dump() for deployment.
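A minimal sketch of that serialization step — the file name and the tiny synthetic pipeline are placeholders:

```python
import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.random.default_rng(0).normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# Fit once: scaler statistics and model weights live in one object
pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression())]).fit(X, y)

joblib.dump(pipe, "model.joblib")   # one artifact: preprocessing + model
restored = joblib.load("model.joblib")
print((restored.predict(X) == pipe.predict(X)).all())  # True
```

The deployed service only needs joblib.load() and predict() — no re-implementation of the preprocessing logic.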
Automated Feature Engineering with Featuretools
Manual feature engineering works well for structured datasets with a few dozen columns. Relational datasets spanning multiple tables — transactions linked to customers linked to merchants — require automated approaches to systematically explore feature combinations.
Featuretools implements Deep Feature Synthesis (DFS), which traverses entity relationships and applies transformation and aggregation primitives to generate hundreds of candidate features automatically.
# automated_feature_engineering.py
import featuretools as ft
import pandas as pd

# Define entities from relational tables
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "signup_date": pd.to_datetime(["2024-01-15", "2024-03-22", "2024-06-01"]),
    "region": ["US", "EU", "APAC"]
})
transactions = pd.DataFrame({
    "txn_id": range(1, 8),
    "customer_id": [1, 1, 1, 2, 2, 3, 3],
    "amount": [50, 120, 30, 200, 85, 340, 15],
    "category": ["food", "tech", "food", "tech", "travel", "food", "tech"]
})

# Create an EntitySet with relationships
es = ft.EntitySet(id="retail")
es = es.add_dataframe(dataframe=customers, dataframe_name="customers",
                      index="customer_id", time_index="signup_date")
es = es.add_dataframe(dataframe=transactions, dataframe_name="transactions",
                      index="txn_id")
es = es.add_relationship("customers", "customer_id",
                         "transactions", "customer_id")

# DFS generates features: COUNT, MEAN, MAX, STD of transactions per customer
feature_matrix, feature_defs = ft.dfs(
    entityset=es, target_dataframe_name="customers",
    max_depth=2,  # Controls complexity of generated features
    trans_primitives=["month", "weekday"],  # Transformation primitives
    agg_primitives=["count", "mean", "std", "max"]  # Aggregation primitives
)
print(f"Generated {len(feature_defs)} features from 2 tables")

DFS produces features like MEAN(transactions.amount), STD(transactions.amount), and COUNT(transactions) — aggregations a data scientist would otherwise create by hand, generated in a fraction of the time. The max_depth parameter controls how deeply primitives stack: with a timestamp column on the transactions table, depth 2 can yield nested features such as the mode of MONTH(transaction_date) per customer.
Interviewers frequently ask candidates to design features for a specific business problem (churn prediction, fraud detection, recommendation systems). Practice framing features in terms of recency, frequency, and monetary value (RFM analysis) — these three dimensions apply to nearly every customer-facing ML problem.
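A sketch of basic RFM features with pandas — the transactions table and reference date here are made up:

```python
import pandas as pd

txns = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "txn_date": pd.to_datetime(["2026-01-05", "2026-02-10", "2026-01-20",
                                "2026-02-01", "2026-02-15", "2025-12-01"]),
    "amount": [50.0, 120.0, 200.0, 85.0, 60.0, 340.0],
})
now = pd.Timestamp("2026-03-01")

rfm = txns.groupby("customer_id").agg(
    recency_days=("txn_date", lambda d: (now - d.max()).days),  # days since last purchase
    frequency=("txn_date", "count"),                            # number of purchases
    monetary=("amount", "sum"),                                 # total spend
)
print(rfm)
```

The same groupby pattern extends naturally — average basket size, spend trend, days between purchases — which is exactly the kind of elaboration interviewers look for.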
Common Feature Engineering Interview Questions
Data science interviews in 2026 test both theoretical understanding and practical implementation of feature engineering. The following questions appear frequently across FAANG and startup interviews.
Q: How do you handle a categorical feature with 10,000+ unique values?
High-cardinality categoricals cannot use one-hot encoding — it would create 10,000 sparse columns. Effective strategies include: target encoding (with proper cross-validation to avoid leakage), frequency encoding (replacing categories with their occurrence count), feature hashing (using scikit-learn's FeatureHasher to map categories into a fixed-size space), and embedding layers in neural networks.
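Two of those strategies sketched — the merchant IDs are invented, and the hashing side uses scikit-learn's FeatureHasher:

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({"merchant": ["a1", "b2", "a1", "c3", "a1", "b2"]})

# Frequency encoding: replace each category with its occurrence count
df["merchant_freq"] = df["merchant"].map(df["merchant"].value_counts())

# Hashing: project categories into a fixed number of columns with no
# fitted vocabulary, so unseen categories at inference time are handled
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([[m] for m in df["merchant"]])
print(df["merchant_freq"].tolist())  # [3, 2, 3, 1, 3, 2]
print(hashed.shape)                  # (6, 8)
```

Frequency encoding keeps one interpretable column; hashing trades occasional collisions for a fixed memory footprint regardless of cardinality.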
Q: When would you NOT scale your features?
Tree-based models (Random Forest, XGBoost, LightGBM) are invariant to monotonic feature transformations. Scaling adds unnecessary computation without affecting splits. Distance-based models and neural networks always require scaling.
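A quick sanity check of that invariance on synthetic data — a sketch, relying on the fact that standardization is an affine, order-preserving transform, so the learned splits are equivalent:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Same seed, same row order: only the feature scale differs
clf_raw = RandomForestClassifier(random_state=0).fit(X, y)
clf_scaled = RandomForestClassifier(random_state=0).fit(X_scaled, y)

agreement = (clf_raw.predict(X) == clf_scaled.predict(X_scaled)).mean()
print(agreement)  # expected: 1.0 — identical predictions
```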
Q: Explain the difference between feature selection and dimensionality reduction.
Feature selection keeps a subset of original features — the selected features remain interpretable. Dimensionality reduction (PCA, t-SNE, UMAP) creates new synthetic features as linear or nonlinear combinations of originals. PCA components maximize explained variance but lose direct interpretability. The right choice depends on whether model explainability is a project requirement.
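The contrast in one snippet, on synthetic data: both approaches reduce to four columns, but only SelectKBest keeps original, interpretable features:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

X_sel = SelectKBest(f_classif, k=4).fit_transform(X, y)  # subset of original columns
X_pca = PCA(n_components=4).fit_transform(X)             # 4 new synthetic axes
print(X_sel.shape, X_pca.shape)  # (300, 4) (300, 4)
```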
Q: How do you detect and handle multicollinearity?
Variance Inflation Factor (VIF) above 5-10 indicates problematic multicollinearity. Correlation matrices catch pairwise relationships but miss multi-variable dependencies. Solutions include dropping one of the correlated features, combining them (PCA on the correlated subset), or using regularization (Ridge/Lasso) which handles collinearity internally.
For hands-on practice with Python data manipulation libraries essential to feature engineering, the Python for Data Science guide covers NumPy, Pandas, and scikit-learn workflows in depth.
Start practicing!
Test your knowledge with our interview simulators and technical tests.
Conclusion
- Encode categoricals intentionally: ordinal encoding for ordered data, one-hot for nominal, target encoding (inside CV folds) for high-cardinality features
- Scale features based on model requirements: StandardScaler for linear models and neural networks; skip scaling for tree-based models
- Apply power transformations (PowerTransformer) to reduce skewness before feeding data into algorithms that assume normality
- Select features systematically: filter methods for speed, embedded methods (Lasso) for accuracy, and always validate with cross-validation
- Wrap all preprocessing into a scikit-learn Pipeline + ColumnTransformer to prevent data leakage and ensure reproducibility
- Automate feature generation with Featuretools DFS for relational datasets spanning multiple tables
- Practice explaining feature engineering decisions in business terms — interviewers evaluate both technical skill and communication clarity