Top 25 Data Science Interview Questions in 2026

Data science interview questions covering statistics, machine learning, feature engineering, deep learning, SQL, and system design — with Python code examples and detailed answers for 2026.


Data science interview questions in 2026 test far more than textbook definitions. Hiring teams probe candidates on statistical reasoning, feature engineering tradeoffs, model evaluation beyond accuracy, and the ability to translate business problems into analytical pipelines. This guide covers 25 questions that appear repeatedly in data scientist interviews at companies ranging from startups to FAANG.

What Interviewers Actually Evaluate

Data science interviews assess three dimensions: statistical foundations (probability, hypothesis testing, distributions), machine learning mechanics (bias-variance, regularization, tree ensembles), and applied judgment (feature selection, metric choice, communicating results to stakeholders). Strong candidates connect theory to real-world impact.

Probability and Statistics Fundamentals

Q1: A coin is flipped 10 times and lands heads 8 times. Is the coin biased?

The naive answer is "yes," but the rigorous approach applies a binomial test. Under the null hypothesis (fair coin, p=0.5), the probability of 8 or more heads in 10 flips is approximately 5.5%, and the two-sided p-value (which also counts equally extreme outcomes in the other tail) is about 10.9%. Neither falls below the standard 5% significance threshold, so the coin cannot be declared biased at alpha = 0.05.

python
# binomial_test.py
from scipy import stats

# Two-sided binomial test: is the coin fair?
result = stats.binomtest(k=8, n=10, p=0.5, alternative='two-sided')
print(f"p-value: {result.pvalue:.4f}")  # 0.1094 (two-sided)
print(f"Reject H0 at alpha=0.05? {result.pvalue < 0.05}")  # False

This question tests whether candidates default to intuition or apply formal hypothesis testing. Interviewers follow up by asking about Type I vs. Type II errors and what sample size would be needed to detect a bias of p=0.7 with 80% power.

Q2: Explain the difference between Bayesian and frequentist approaches to inference.

Frequentist inference treats parameters as fixed but unknown. A 95% confidence interval means: if the experiment were repeated many times, 95% of computed intervals would contain the true parameter. Bayesian inference treats parameters as random variables with prior distributions, updating beliefs via Bayes' theorem to produce posterior distributions. A 95% credible interval means: given the observed data and prior, there is a 95% probability the parameter lies in this interval.

python
# bayesian_vs_frequentist.py
import numpy as np
from scipy import stats

# Frequentist: confidence interval for a mean
data = np.array([23.1, 25.4, 22.8, 24.9, 23.7, 25.1, 24.3])
ci = stats.t.interval(0.95, df=len(data)-1, loc=np.mean(data), scale=stats.sem(data))
print(f"95% CI (frequentist): [{ci[0]:.2f}, {ci[1]:.2f}]")

# Bayesian: posterior with conjugate prior (Normal-Normal)
prior_mean, prior_var = 24.0, 4.0  # prior belief
data_mean, data_var = np.mean(data), np.var(data, ddof=1) / len(data)
# Posterior parameters (conjugate update)
post_var = 1 / (1/prior_var + 1/data_var)
post_mean = post_var * (prior_mean/prior_var + data_mean/data_var)
post_ci = stats.norm.interval(0.95, loc=post_mean, scale=np.sqrt(post_var))
print(f"95% credible interval (Bayesian): [{post_ci[0]:.2f}, {post_ci[1]:.2f}]")

The practical difference matters in interviews: Bayesian methods incorporate domain knowledge through priors, which is valuable when data is scarce. Frequentist methods make fewer assumptions but require larger samples for reliable inference.

Q3: When does the Central Limit Theorem fail, and why does it matter for data science?

The CLT guarantees that sample means converge to a normal distribution as n grows — but only when the population has finite variance. Heavy-tailed distributions (Cauchy, Pareto with alpha less than or equal to 2) violate this assumption. In practice, this affects financial return modeling, network traffic analysis, and any domain with extreme outliers. The median or trimmed mean becomes a more robust estimator in these cases.
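The failure is easy to see in a simulation. This sketch (sample sizes and seed are arbitrary) compares how much sample means fluctuate for a normal population versus a Cauchy one:

```python
# clt_failure_cauchy.py
import numpy as np

rng = np.random.default_rng(0)

def spread_of_sample_means(draw, n=1000, trials=500):
    """Standard deviation of sample means across repeated experiments."""
    means = [draw(n).mean() for _ in range(trials)]
    return np.std(means)

# Normal: finite variance, so the spread of means shrinks like 1/sqrt(n)
normal_spread = spread_of_sample_means(lambda n: rng.normal(size=n))

# Cauchy: infinite variance; the mean of n draws is itself standard Cauchy,
# so averaging buys nothing and the spread of means stays enormous
cauchy_spread = spread_of_sample_means(lambda n: rng.standard_cauchy(n))

print(f"spread of normal sample means: {normal_spread:.4f}")  # small, ~1/sqrt(n)
print(f"spread of Cauchy sample means: {cauchy_spread:.1f}")  # orders of magnitude larger
```

The Cauchy result is the interview point: no amount of averaging stabilizes the estimator, which is why robust statistics (median, trimmed mean) take over in heavy-tailed domains.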

Feature Engineering and Data Preparation

Q4: How do missing values impact model training, and what strategies exist beyond simple imputation?

Missing data mechanisms determine the right strategy. MCAR (Missing Completely At Random) allows listwise deletion without bias. MAR (Missing At Random) benefits from multiple imputation or model-based approaches. MNAR (Missing Not At Random) requires domain-specific modeling of the missingness mechanism itself.

python
# missing_data_strategies.py
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    'age': [25, 30, None, 45, None, 38],
    'income': [50000, None, 70000, 90000, 60000, None],
    'score': [85, 90, 78, None, 88, 92]
})

# Strategy 1: KNN imputation (preserves local structure)
knn_imp = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(knn_imp.fit_transform(df), columns=df.columns)

# Strategy 2: Iterative imputation (MICE-like, models each feature)
iter_imp = IterativeImputer(max_iter=10, random_state=42)
df_iter = pd.DataFrame(iter_imp.fit_transform(df), columns=df.columns)

# Strategy 3: Add missingness indicator (preserves signal in the pattern)
df['income_missing'] = df['income'].isna().astype(int)
print(df_knn.round(1))

The key insight interviewers look for: the missingness pattern itself can be informative. Adding binary indicator columns for missing features often improves model performance, especially in gradient-boosted trees.

Q5: Explain target encoding and when it outperforms one-hot encoding.

One-hot encoding creates a sparse binary column per category. For high-cardinality features (zip codes, user IDs, product SKUs), this explodes dimensionality. Target encoding replaces each category with the mean of the target variable for that category, producing a single dense numeric column.

python
# target_encoding.py
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

def target_encode_kfold(df, col, target, n_splits=5):
    """Target encoding with K-fold regularization to prevent leakage."""
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

    for train_idx, val_idx in kf.split(df):
        # Compute means only from training fold
        means = df.iloc[train_idx].groupby(col)[target].mean()
        # .to_numpy() makes the assignment purely positional (avoids index alignment surprises)
        encoded.iloc[val_idx] = df.iloc[val_idx][col].map(means).to_numpy()

    # Fill categories unseen in training fold with global mean
    encoded.fillna(global_mean, inplace=True)
    return encoded

df = pd.DataFrame({
    'city': ['Paris', 'Lyon', 'Paris', 'Marseille', 'Lyon', 'Paris',
             'Marseille', 'Lyon', 'Paris', 'Marseille'],
    'hired': [1, 0, 1, 0, 1, 1, 0, 0, 1, 1]
})
df['city_encoded'] = target_encode_kfold(df, 'city', 'hired')
print(df[['city', 'city_encoded', 'hired']])

The K-fold approach prevents data leakage by computing target means on out-of-fold samples only. Without this regularization, target encoding overfits severely on small datasets.

Ready to ace your Data Science & ML interviews?

Practice with our interactive simulators, flashcards, and technical tests.

Machine Learning Core Concepts

Q6: What is the bias-variance tradeoff, and how does it manifest in tree-based models?

Bias measures systematic error — how far predictions are from the true values on average. Variance measures sensitivity to the training set — how much predictions change across different samples. A single deep decision tree has low bias but high variance: it memorizes training data. A random forest reduces variance through bootstrap aggregation (bagging) while maintaining low bias. Gradient-boosted trees reduce bias iteratively by fitting residuals, but risk increasing variance without proper regularization (learning rate, max depth, subsampling).
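A small experiment makes the variance reduction visible. This sketch (synthetic sine data; sample sizes and hyperparameters are illustrative) measures how much each model's predictions at fixed test points change when the training set is resampled:

```python
# tree_variance_demo.py
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

def prediction_variance(model_factory, n_repeats=30):
    """Average variance of predictions at fixed test points across resampled training sets."""
    X_test = np.linspace(0, 1, 50).reshape(-1, 1)
    preds = []
    for _ in range(n_repeats):
        X = rng.uniform(0, 1, (200, 1))
        y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.3, 200)
        preds.append(model_factory().fit(X, y).predict(X_test))
    return np.var(preds, axis=0).mean()

tree_var = prediction_variance(lambda: DecisionTreeRegressor(random_state=0))
forest_var = prediction_variance(lambda: RandomForestRegressor(n_estimators=100, random_state=0))
print(f"single deep tree variance: {tree_var:.4f}")
print(f"random forest variance:    {forest_var:.4f}")  # lower: bagging averages variance away
```

Both models have essentially the same (low) bias on this problem; the forest wins purely by averaging away the single tree's variance.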

Q7: Why does accuracy fail as a metric for imbalanced datasets? What alternatives exist?

On a dataset with 95% negative and 5% positive samples, a model predicting "negative" for every input achieves 95% accuracy while being completely useless. Precision-recall curves, F1 score, and area under the precision-recall curve (AUPRC) better capture performance on the minority class. For ranking tasks, AUC-ROC works well since it evaluates discrimination ability across all thresholds.

python
# imbalanced_metrics.py
from sklearn.metrics import (
    classification_report, precision_recall_curve,
    average_precision_score, roc_auc_score
)
import numpy as np

# Simulated predictions on imbalanced data
np.random.seed(42)
y_true = np.array([0]*950 + [1]*50)  # 95/5 imbalance
y_scores = np.random.beta(2, 5, 1000)  # predicted probabilities
y_scores[y_true == 1] += 0.3  # positive class scores slightly higher
y_scores = np.clip(y_scores, 0, 1)
y_pred = (y_scores > 0.5).astype(int)

print(classification_report(y_true, y_pred, digits=3))
print(f"ROC AUC: {roc_auc_score(y_true, y_scores):.3f}")
print(f"Average Precision (AUPRC): {average_precision_score(y_true, y_scores):.3f}")

The follow-up question is often: "How do you choose between precision and recall?" The answer depends on the cost of false positives vs. false negatives. Fraud detection prioritizes recall (catching fraud matters more than false alarms). Spam filtering prioritizes precision (false positives annoy users more than missed spam).

Q8: Explain L1 vs. L2 regularization and when each is preferred.

L1 regularization (Lasso) adds the sum of absolute weights to the loss function, driving some coefficients exactly to zero. This produces sparse models and acts as embedded feature selection. L2 regularization (Ridge) adds the sum of squared weights, shrinking all coefficients toward zero without eliminating any. Use L1 when many features are irrelevant and sparsity is desired. Use L2 when all features contribute and multicollinearity is present. Elastic Net combines both and is often the default choice when the optimal ratio is unknown.
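The sparsity difference is easy to demonstrate on a synthetic regression problem where only 5 of 50 features are informative (the alpha values below are illustrative, not tuned):

```python
# l1_vs_l2_sparsity.py
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 50 features, only 5 carry signal
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print(f"Lasso zero coefficients: {lasso_zeros} / 50")  # many exact zeros
print(f"Ridge zero coefficients: {ridge_zeros} / 50")  # shrunk, but not eliminated
```

Lasso's exact zeros are what make it usable as embedded feature selection; Ridge only shrinks coefficients toward zero.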

Q9: How do gradient-boosted trees (XGBoost/LightGBM) handle categorical features differently?

XGBoost historically required one-hot or ordinal encoding before training; recent versions add native categorical support via the `enable_categorical` flag. LightGBM handles categorical features natively through an optimal split-finding algorithm that groups categories by their gradient statistics, reducing the exponential subset search to a sort over categories. This native handling often outperforms one-hot encoding, especially for high-cardinality features, and explains why LightGBM typically trains faster on tabular datasets.

Model Evaluation and Validation

Q10: What is data leakage, and what are the three most common sources?

Data leakage occurs when information from outside the training distribution bleeds into the model, inflating validation metrics but degrading production performance. The three most common sources:

  1. Temporal leakage: using future data to predict the past (e.g., using a customer's lifetime value to predict first-month churn)
  2. Target leakage: features derived from the target variable (e.g., "number of refunds" to predict "will request refund")
  3. Preprocessing leakage: fitting scalers, encoders, or imputers on the full dataset before splitting
python
# leakage_prevention.py
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

X = np.random.randn(1000, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# WRONG: fit scaler on all data, then cross-validate
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # leakage: test fold info in scaling

# CORRECT: pipeline ensures preprocessing fits only on training folds
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', GradientBoostingClassifier(n_estimators=100, random_state=42))
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')
print(f"AUC (no leakage): {scores.mean():.3f} +/- {scores.std():.3f}")

Scikit-learn Pipelines are the standard defense against preprocessing leakage. Every transformation step inside the pipeline refits on each training fold independently.

Q11: When should time-series cross-validation replace standard K-fold?

Standard K-fold randomly shuffles data, which breaks temporal ordering. For any dataset where rows have a time component (stock prices, user activity, sensor data), this creates future-to-past leakage. Time-series cross-validation uses expanding or sliding windows: each fold's training set contains only data prior to the validation set. Scikit-learn provides TimeSeriesSplit for this pattern.
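A minimal illustration with scikit-learn's TimeSeriesSplit, where 12 rows stand in for time-ordered data:

```python
# timeseries_cv.py
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # rows ordered by time
tscv = TimeSeriesSplit(n_splits=3)

for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # Every training index precedes every validation index: no future-to-past leakage
    print(f"fold {fold}: train={train_idx.tolist()} val={val_idx.tolist()}")
    assert train_idx.max() < val_idx.min()
```

Each fold's training window expands while the validation window slides forward, mirroring how the model would actually be retrained in production.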

Common Interview Trap

Candidates often claim they "always use K-fold cross-validation." Interviewers will probe with time-series or grouped data scenarios where K-fold is incorrect. Know when to use TimeSeriesSplit, GroupKFold, or StratifiedKFold depending on the data structure.

Deep Learning and Neural Networks

Q12: Why do deep networks use ReLU instead of sigmoid, and when does ReLU fail?

Sigmoid squashes outputs to (0, 1), but its gradient vanishes for extreme inputs — near 0 or 1, the gradient approaches zero, stalling backpropagation in deep networks. ReLU (max(0, x)) has a constant gradient of 1 for positive inputs, enabling faster training. ReLU fails when neurons receive consistently negative inputs ("dying ReLU" problem), producing zero gradients permanently. Leaky ReLU and GELU (used in Transformers) address this by allowing small gradients for negative values.
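The vanishing-gradient contrast can be checked directly with autograd (a minimal sketch; the input values are arbitrary):

```python
# activation_gradients.py
import torch

x = torch.tensor([-10.0, -1.0, 0.5, 10.0], requires_grad=True)

# Sigmoid gradient is sigma(x) * (1 - sigma(x)): it vanishes for large |x|
torch.sigmoid(x).sum().backward()
sigmoid_grads = x.grad.clone()
print(f"sigmoid grads: {sigmoid_grads}")  # near zero at x = -10 and x = 10

x.grad = None
# ReLU gradient is exactly 0 for negative inputs, 1 for positive inputs
torch.relu(x).sum().backward()
relu_grads = x.grad.clone()
print(f"relu grads:    {relu_grads}")
```

The zero gradients on the negative side are precisely the dying-ReLU failure mode: a neuron stuck with negative pre-activations stops learning entirely.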

Q13: Explain the attention mechanism and why Transformers replaced RNNs for sequence tasks.

RNNs process sequences step by step, creating a bottleneck: information from early tokens must survive through every subsequent hidden state. Attention computes direct pairwise relationships between all positions simultaneously, with O(n^2) complexity but full parallelization. Self-attention in Transformers projects the input into Query, Key, and Value matrices, computing attention weights as softmax(QK^T / sqrt(d_k)) and the output as those weights multiplied by V.

python
# self_attention.py
import torch
import torch.nn.functional as F

def self_attention(x, d_k):
    """Scaled dot-product self-attention from scratch."""
    # x shape: (batch_size, seq_len, d_model)
    batch_size, seq_len, d_model = x.shape

    # Linear projections for Q, K, V
    W_q = torch.randn(d_model, d_k) * 0.1
    W_k = torch.randn(d_model, d_k) * 0.1
    W_v = torch.randn(d_model, d_k) * 0.1

    Q = x @ W_q  # (batch, seq_len, d_k)
    K = x @ W_k
    V = x @ W_v

    # Scaled dot-product attention
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)  # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)  # attention weights
    output = weights @ V  # (batch, seq_len, d_k)

    return output, weights

# Example: 1 batch, 4 tokens, 8-dim embeddings
x = torch.randn(1, 4, 8)
out, attn = self_attention(x, d_k=8)
print(f"Output shape: {out.shape}")    # (1, 4, 8)
print(f"Attention weights:\n{attn[0].detach().numpy().round(3)}")

The key advantage: attention allows the model to directly connect any two positions regardless of distance, solving the long-range dependency problem that plagued LSTMs.

Q14: What is the difference between fine-tuning and transfer learning, and when does each apply?

Transfer learning uses a pretrained model's learned representations as a starting point. Fine-tuning is a specific transfer learning strategy that unfreezes some or all pretrained layers and continues training on the target task. Feature extraction (another strategy) freezes all pretrained layers and only trains a new classification head. Fine-tuning works best with moderate target data (thousands of samples). Feature extraction suits small datasets where full fine-tuning would overfit.
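A minimal PyTorch sketch of the feature-extraction setup; the two-layer "backbone" here is a stand-in for real pretrained weights (torchvision, timm, etc.):

```python
# feature_extraction_freeze.py
import torch.nn as nn

# Stand-in "pretrained" backbone (in practice: loaded pretrained weights)
backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
head = nn.Linear(32, 2)  # new task-specific classification head

# Feature extraction: freeze every backbone parameter
for p in backbone.parameters():
    p.requires_grad = False

model = nn.Sequential(backbone, head)
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the head's weight and bias remain trainable

# Fine-tuning would instead unfreeze some backbone layers, e.g.:
# for p in backbone[-1].parameters():
#     p.requires_grad = True
```

Only the parameters with `requires_grad=True` receive gradient updates, so the optimizer trains just the head while the frozen backbone acts as a fixed feature extractor.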

SQL and Data Manipulation

Q15: Write a query to find the second-highest salary in each department without using window functions.

sql
-- second_highest_salary.sql
-- Approach: correlated subquery counting distinct higher salaries
SELECT d.department_name, e.employee_name, e.salary
FROM employees e
JOIN departments d ON e.department_id = d.id
WHERE (
    SELECT COUNT(DISTINCT e2.salary)
    FROM employees e2
    WHERE e2.department_id = e.department_id
      AND e2.salary > e.salary
) = 1
ORDER BY d.department_name;

Interviewers use this to verify SQL fundamentals without the convenience of RANK() or DENSE_RANK(). The follow-up is to solve it with window functions, which is cleaner but tests a different skill.

Q16: Explain the difference between WHERE and HAVING. When is each evaluated?

WHERE filters rows before aggregation. HAVING filters groups after aggregation. In the SQL execution order, WHERE applies during the row scan phase, GROUP BY creates groups, aggregate functions compute, then HAVING filters those groups. A common mistake: using HAVING without GROUP BY or trying to reference aliases defined in SELECT from within WHERE.
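The evaluation order can be verified with an in-memory SQLite table (the orders schema and values here are invented for illustration):

```python
# where_vs_having.py
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('alice', 120), ('alice', 30), ('bob', 500), ('bob', 20), ('carol', 15);
""")

# WHERE filters rows before grouping; HAVING filters the groups after aggregation
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 25          -- drops the 20 and 15 rows before grouping
    GROUP BY customer
    HAVING SUM(amount) > 200   -- then keeps only high-total groups
""").fetchall()
print(rows)  # [('bob', 500.0)]
```

Alice survives the WHERE clause (rows of 120 and 30 become 120 + 30 = 150 after the 30 passes the filter) but her group total of 150 fails the HAVING condition, which is exactly the before-vs-after distinction interviewers probe.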

Applied Data Science and System Design

Q17: How would you design an A/B test for a new recommendation algorithm?

The design requires five components: (1) a clear primary metric (e.g., click-through rate, conversion rate, or revenue per user), (2) a power analysis to determine sample size, (3) randomization at the user level (not session level, to avoid cross-contamination), (4) guardrail metrics that must not degrade (latency, crash rate), and (5) a pre-registered analysis plan to prevent p-hacking.

Sample Size Calculation

For a two-proportion z-test detecting a 2% absolute lift (from 10% to 12% CTR) with 80% power and alpha=0.05, approximately 3,800 users per group are needed. Smaller expected effects require exponentially larger samples. Always run a power analysis before launching.
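The figure above can be reproduced with the standard two-proportion formula; this is a sketch using scipy (production work would typically lean on a power-analysis library):

```python
# ab_test_sample_size.py
from scipy.stats import norm

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-proportion z-test (standard pooled formula)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p1 - p2) ** 2

n = sample_size_two_proportions(0.10, 0.12)
print(f"Required per group: {n:.0f}")  # roughly 3,800 users per group
```

Halving the detectable lift to 1% roughly quadruples the required sample, which is the intuition behind "smaller effects require exponentially larger samples."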

Q18: A model performs well on validation but poorly in production. What are the top three causes?

  1. Distribution shift: training data does not represent production traffic (different user demographics, time periods, or geographies)
  2. Feature skew: features computed differently in training (batch) vs. serving (real-time) pipelines — timestamp parsing, missing value handling, or aggregation windows diverge
  3. Label leakage: the training label was derived from data unavailable at prediction time in production

Debugging starts with comparing feature distributions between training and production using statistical tests (KS test, PSI — Population Stability Index) on each feature independently.
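PSI has no single canonical implementation; below is one common decile-based sketch (the bin count and the 0.1/0.25 thresholds are the usual rules of thumb, not hard standards):

```python
# psi_drift_check.py
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between training and production samples of one feature."""
    # Bin edges come from the expected (training) distribution's quantiles
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    # Floor the proportions to avoid log(0)
    e_pct, a_pct = np.clip(e_pct, 1e-6, None), np.clip(a_pct, 1e-6, None)
    return np.sum((a_pct - e_pct) * np.log(a_pct / e_pct))

rng = np.random.default_rng(0)
train_feature = rng.normal(0, 1, 10_000)
prod_same = rng.normal(0, 1, 10_000)        # no shift
prod_shifted = rng.normal(0.75, 1, 10_000)  # mean shifted by 0.75 sigma

print(f"PSI (no shift):   {psi(train_feature, prod_same):.4f}")     # < 0.1: stable
print(f"PSI (mean shift): {psi(train_feature, prod_shifted):.4f}")  # > 0.25: major drift
```

Running this per feature, every scoring cycle, is a cheap first line of defense before digging into pipeline-level feature skew.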

Q19: Explain the CAP theorem and its implications for ML feature stores.

CAP states that a distributed system can provide at most two of three guarantees: Consistency (every read returns the latest write), Availability (every request receives a response), and Partition tolerance (the system operates despite network failures). Feature stores face this tradeoff directly: online stores (Redis, DynamoDB) prioritize availability and partition tolerance, accepting eventual consistency. Offline stores (BigQuery, Hive) prioritize consistency for batch training. A dual-store architecture serves both needs.

Q20: How do you handle concept drift in a production ML system?

Concept drift occurs when the statistical relationship between features and target changes over time. Detection methods include monitoring prediction distribution shifts (PSI on model outputs), tracking performance metrics on labeled production data, and statistical tests on incoming feature distributions. Mitigation strategies: scheduled retraining on a sliding window, online learning for gradual drift, or triggering retraining when drift metrics exceed thresholds.


Python and Pandas Proficiency

Q21: What is the difference between apply(), map(), and transform() in Pandas?

map() operates element-wise on a Series (pandas 2.1 also added DataFrame.map, replacing applymap()). apply() works on both: on a DataFrame it passes each row or column to a function; on a Series, each element. transform() requires that the output has the same shape as the input — it cannot aggregate. The practical implication: transform() enables group-level computations that broadcast back to the original index, which apply() can do but less efficiently.

python
# pandas_operations.py
import pandas as pd

df = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B', 'A'],
    'score': [10, 20, 30, 40, 50]
})

# map: element-wise on Series
df['team_upper'] = df['team'].map({'A': 'Alpha', 'B': 'Beta'})

# apply: arbitrary function per row
df['score_label'] = df['score'].apply(lambda x: 'high' if x > 25 else 'low')

# transform: group-level, same shape output (broadcasts back)
df['team_mean'] = df.groupby('team')['score'].transform('mean')
df['score_normalized'] = df.groupby('team')['score'].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(df)

Q22: Explain Python's GIL and its impact on data science workflows.

The Global Interpreter Lock prevents multiple threads from executing Python bytecode simultaneously. CPU-bound tasks (numerical computation, model training) do not benefit from threading. This is why NumPy, pandas, and scikit-learn use C extensions that release the GIL, and why multiprocessing (not threading) is the standard parallelism strategy for pure Python code. For I/O-bound tasks (API calls, database queries), threading works because the GIL is released during I/O waits.
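A minimal demonstration of the I/O-bound case, where threading does help (the sleep stands in for a network call; exact timings will vary by machine):

```python
# gil_io_bound.py
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io_call(_):
    """Simulates a blocking API call; the GIL is released while sleeping."""
    time.sleep(0.2)
    return "ok"

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_io_call, range(4)))
elapsed = time.perf_counter() - start

# Four 0.2s waits overlap: wall time is close to 0.2s, not the sequential 0.8s
print(f"4 I/O tasks finished in {elapsed:.2f}s with threads")
```

Swap the sleep for a CPU-bound loop and the speedup disappears, because only one thread can execute Python bytecode at a time; that is the point interviewers want articulated.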

Dimensionality Reduction and Unsupervised Learning

Q23: PCA vs. t-SNE vs. UMAP — when does each apply?

PCA is a linear method that maximizes variance in projected dimensions. It preserves global structure and is deterministic, making it suitable for preprocessing (reducing feature count before modeling) and interpretable visualization. t-SNE is nonlinear, optimized for 2D/3D visualization by preserving local neighborhoods. It distorts global distances and is not suitable for preprocessing. UMAP preserves both local and global structure better than t-SNE, runs faster on large datasets, and produces more consistent results across runs.

| Method | Linear | Preserves Global Structure | Speed (100K points) | Use Case |
|--------|--------|---------------------------|---------------------|----------|
| PCA    | Yes    | Yes                       | Seconds             | Preprocessing, feature reduction |
| t-SNE  | No     | No                        | Minutes             | Cluster visualization (small data) |
| UMAP   | No     | Partially                 | Seconds             | Visualization + preprocessing |
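For the preprocessing use case, a minimal PCA sketch on the classic Iris data shows how much variance two components retain:

```python
# pca_variance.py
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first: PCA is sensitive to feature scale
X = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X)

print(f"Explained variance ratio: {pca.explained_variance_ratio_.round(3)}")
print(f"Total variance kept: {pca.explained_variance_ratio_.sum():.1%}")  # >90%
```

Two components keeping most of the variance is the deterministic, interpretable behavior that makes PCA safe as a modeling preprocessing step, unlike t-SNE or UMAP embeddings.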

Q24: How does K-means fail, and what alternatives handle non-spherical clusters?

K-means assumes spherical, equally-sized clusters and uses Euclidean distance. It fails on elongated, ring-shaped, or density-varying clusters. DBSCAN identifies clusters by density, handling arbitrary shapes and automatically detecting outliers. Gaussian Mixture Models (GMMs) model each cluster as a multivariate Gaussian, allowing elliptical shapes. Spectral clustering uses graph Laplacian eigenvectors and handles complex geometries but scales poorly beyond ~10K points.
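The failure mode shows up immediately on scikit-learn's two-moons data (eps and min_samples below are illustrative, not tuned):

```python
# kmeans_vs_dbscan.py
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clearly non-spherical clusters
X, y = make_moons(n_samples=500, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

km_ari = adjusted_rand_score(y, km_labels)
db_ari = adjusted_rand_score(y, db_labels)
print(f"K-means ARI: {km_ari:.3f}")  # low: Euclidean boundary cuts each moon in half
print(f"DBSCAN ARI:  {db_ari:.3f}")  # much higher: clusters follow density
```

Adjusted Rand Index against the true moon labels quantifies what a scatter plot would show: K-means draws a straight boundary through both moons, while DBSCAN traces each one.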

Communication and Business Impact

Q25: A stakeholder asks: "Which features are most important in the model?" How do you answer?

Feature importance has multiple valid definitions, and the answer depends on the audience. For tree-based models, three common measures exist: split-based importance (how often a feature is used in splits), gain-based importance (total reduction in loss from splits on that feature), and permutation importance (how much performance drops when a feature is randomly shuffled). SHAP values provide the most rigorous answer: they quantify each feature's contribution to individual predictions using game-theoretic principles.

python
# feature_importance_shap.py
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, random_state=42)

# Train XGBoost model
model = xgb.XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss')
model.fit(X, y)

# SHAP explanation
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global importance: mean absolute SHAP value per feature
importance = np.abs(shap_values).mean(axis=0)
for i in np.argsort(importance)[::-1]:
    print(f"Feature {i}: {importance[i]:.4f}")

For stakeholders, translate SHAP values into business language: "Customers with income above $80K are 15% more likely to convert, all else equal" rather than "Feature 3 has a mean SHAP value of 0.23."

Conclusion

  • Hypothesis testing precision matters: know when to apply binomial tests, understand p-values, and distinguish Bayesian from frequentist reasoning
  • Feature engineering often determines model performance more than algorithm choice — master target encoding, missing data strategies, and leakage prevention
  • Evaluation goes beyond accuracy: use AUPRC for imbalanced data, TimeSeriesSplit for temporal data, and pipelines to prevent preprocessing leakage
  • Deep learning fundamentals (attention, activation functions, transfer learning) appear even in data science roles that primarily use tabular methods
  • Production ML questions (drift detection, A/B testing, feature stores) increasingly separate senior candidates from junior ones
  • Communication skills are tested through feature importance explanations — practice translating SHAP values into stakeholder-friendly language

Start practicing!

Test your knowledge with our interview simulators and technical tests.

Tags

#data-science
#interview
#machine-learning
#python
#statistics
