Python for Data Science: NumPy, Pandas and Scikit-Learn in 2026

A hands-on tutorial covering NumPy array operations, Pandas data manipulation, and Scikit-Learn model training. Build a complete data pipeline from raw CSV to trained model with production-ready Python code.

Python data science workflows in 2026 rely on three libraries that handle 90% of the heavy lifting: NumPy for numerical computation, Pandas for tabular data manipulation, and Scikit-Learn for machine learning. This tutorial walks through each library with practical code, building toward a complete pipeline that loads raw data, cleans it, engineers features, and trains a model.

Versions Used in This Tutorial

All code runs on Python 3.12+, NumPy 2.1, Pandas 2.2, and Scikit-Learn 1.6. The API surface is stable across minor versions, but specific performance improvements in NumPy 2.x make the upgrade worthwhile for large-array operations.

NumPy Array Operations for Efficient Computation

NumPy replaces Python lists with n-dimensional arrays backed by contiguous memory blocks. The performance gap matters: a vectorized NumPy operation on 1 million elements runs 50-100x faster than an equivalent Python loop. Every Pandas DataFrame and Scikit-Learn model uses NumPy arrays internally.

python
# numpy_basics.py
import numpy as np

# Create arrays from different sources
prices = np.array([29.99, 49.99, 19.99, 99.99, 39.99])
quantities = np.arange(1, 6)  # [1, 2, 3, 4, 5]

# Vectorized arithmetic — no loops needed
revenue = prices * quantities
print(revenue)  # [29.99, 99.98, 59.97, 399.96, 199.95]

# Statistical aggregations
print(f"Total revenue: ${revenue.sum():.2f}")   # $789.85
print(f"Mean price: ${prices.mean():.2f}")       # $47.99
print(f"Std deviation: ${prices.std():.2f}")     # $27.86

# Boolean indexing — filter without loops
premium_mask = prices > 40
premium_items = prices[premium_mask]  # [49.99, 99.99]

Boolean indexing is the pattern that appears most in real data science code. Instead of writing if statements inside loops, a boolean mask selects elements that satisfy a condition in a single operation.
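Masks also compose. A short sketch, reusing the arrays above, combining two conditions with the element-wise & operator (parentheses around each comparison are required because & binds tighter than >):

```python
import numpy as np

prices = np.array([29.99, 49.99, 19.99, 99.99, 39.99])
quantities = np.arange(1, 6)

# Combine masks with & (and) / | (or) — parentheses are mandatory
mask = (prices > 40) & (quantities >= 2)
print(prices[mask])  # [49.99 99.99]
```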

Reshaping and Broadcasting in NumPy

Two-dimensional operations require understanding how NumPy broadcasts arrays of different shapes. Broadcasting rules allow operations between a matrix and a vector without explicit tiling.

python
# numpy_reshape.py
import numpy as np

# Monthly sales data: 4 products x 3 months
sales = np.array([
    [120, 150, 130],  # Product A
    [200, 180, 220],  # Product B
    [90, 110, 95],    # Product C
    [300, 280, 310],  # Product D
])

# Column-wise mean (average per month)
monthly_avg = sales.mean(axis=0)  # [177.5, 180.0, 188.75]

# Row-wise sum (total per product)
product_totals = sales.sum(axis=1)  # [400, 600, 295, 890]

# Normalize each product relative to its own max
normalized = sales / sales.max(axis=1, keepdims=True)
# keepdims=True preserves the shape for broadcasting
print(normalized[0])  # [0.8, 1.0, 0.867] — Product A relative to its peak

# Reshape for Scikit-Learn (requires 2D input)
flat_sales = sales.flatten()  # 1D array of 12 values
reshaped = flat_sales.reshape(-1, 1)  # 12x1 column vector

The axis parameter controls the direction of aggregation: axis=0 collapses rows (operates column-wise), axis=1 collapses columns (operates row-wise). The keepdims=True argument prevents shape mismatches during broadcasting.
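A minimal broadcasting sketch with illustrative numbers: subtracting a (3,) vector of monthly means from a (2, 3) matrix, no explicit tiling required:

```python
import numpy as np

sales = np.array([
    [120, 150, 130],
    [200, 180, 220],
])

# (2, 3) matrix minus (3,) vector: the vector is broadcast across each row
centered = sales - sales.mean(axis=0)  # monthly means: [160., 165., 175.]
print(centered)
# [[-40. -15. -45.]
#  [ 40.  15.  45.]]
```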

Pandas DataFrame Manipulation and Cleaning

Pandas wraps NumPy arrays with labeled axes and SQL-like operations. The DataFrame is the central data structure — a table where each column can hold a different data type. Pandas 2.x can opt into PyArrow-backed string columns (for example via pd.options.future.infer_string = True), which use significantly less memory than object-dtype strings; they become the default in Pandas 3.0.

python
# pandas_cleaning.py
import pandas as pd
import numpy as np

# Load and inspect raw data
df = pd.read_csv("candidates.csv")
print(df.shape)       # (1500, 8)
print(df.dtypes)      # Check column types
print(df.isna().sum()) # Count missing values per column

# Clean in a reproducible chain
df_clean = (
    df
    .dropna(subset=["salary", "experience_years"])      # Drop rows missing critical fields
    .assign(
        salary=lambda x: x["salary"].clip(lower=20000, upper=500000),  # Cap outliers
        experience_years=lambda x: x["experience_years"].astype(int),
        hired_date=lambda x: pd.to_datetime(x["hired_date"], errors="coerce"),
    )
    .drop_duplicates(subset=["email"])                  # Remove duplicate candidates
    .query("experience_years >= 0")                     # Filter invalid entries
    .reset_index(drop=True)
)

print(f"Cleaned: {len(df)} -> {len(df_clean)} rows")

Method chaining with .assign() keeps transformations readable and reproducible. Each step documents exactly what changes and why. The lambda x: pattern inside .assign() references the DataFrame as it exists at that point in the chain, which avoids referencing stale data.
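A toy illustration of that point, with hypothetical salary figures: the second lambda sees the column the first one just created, because each lambda receives the frame as it exists at that step of the chain:

```python
import pandas as pd

df = pd.DataFrame({"salary": [50000, 60000]})
out = (
    df
    .assign(bonus=lambda x: x["salary"] * 0.10)         # x is the frame at this point in the chain
    .assign(total=lambda x: x["salary"] + x["bonus"])   # sees the "bonus" column just created
)
print(out["total"].tolist())  # [55000.0, 66000.0]
```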

Pandas 2.2 Copy-on-Write

Pandas 2.x ships Copy-on-Write as an opt-in (pd.options.mode.copy_on_write = True); it becomes the default behavior in Pandas 3.0. With it enabled, the SettingWithCopyWarning disappears and chained operations become safer: modifications to a slice no longer silently mutate the original DataFrame, because Pandas creates a copy only when a write actually occurs.
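A minimal sketch of the behavior, enabling the option explicitly so it behaves the same on any Pandas 2.x install:

```python
import pandas as pd

pd.set_option("mode.copy_on_write", True)  # explicit opt-in on Pandas 2.x

df = pd.DataFrame({"a": [1, 2, 3]})
col = df["a"]           # shares memory with df until a write happens
col.iloc[0] = 99        # triggers a copy — df is left untouched
print(df["a"].tolist())  # [1, 2, 3]
```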

GroupBy Aggregations and Feature Engineering with Pandas

GroupBy splits data into subsets, applies a function to each subset, and combines the results. This pattern drives most feature engineering workflows in tabular machine learning.

python
# pandas_groupby.py
import pandas as pd

# Aggregate candidate stats by department
dept_stats = (
    df_clean
    .groupby("department")
    .agg(
        avg_salary=("salary", "mean"),
        median_experience=("experience_years", "median"),
        headcount=("email", "count"),
        max_salary=("salary", "max"),
    )
    .sort_values("avg_salary", ascending=False)
)
print(dept_stats.head())

# Create features for ML: encode categorical + add aggregated stats
df_features = (
    df_clean
    .assign(
        # Ratio of individual salary to department average
        salary_ratio=lambda x: x["salary"] / x.groupby("department")["salary"].transform("mean"),
        # Time since hire in days
        tenure_days=lambda x: (pd.Timestamp.now() - x["hired_date"]).dt.days,
        # Binary encoding
        is_senior=lambda x: (x["experience_years"] >= 5).astype(int),
    )
)

The .transform() method returns a Series aligned with the original index, which makes it safe to use inside .assign(). This avoids the common pitfall of joining aggregated results back to the original DataFrame manually.
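A toy contrast between .agg and .transform, with illustrative numbers: .agg collapses each group to one row, while .transform hands the same statistic back aligned to every original row:

```python
import pandas as pd

df = pd.DataFrame({"department": ["Eng", "Eng", "Sales"],
                   "salary": [100_000, 140_000, 90_000]})

# .agg (or a plain .mean()) collapses each group to a single row
print(df.groupby("department")["salary"].mean())  # Eng: 120000, Sales: 90000

# .transform broadcasts the group statistic back onto every original row
df["dept_avg"] = df.groupby("department")["salary"].transform("mean")
print(df["dept_avg"].tolist())  # [120000.0, 120000.0, 90000.0]
```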

Building a Scikit-Learn Pipeline from Scratch

Scikit-Learn pipelines chain preprocessing steps and a model into a single object. The pipeline prevents data leakage by fitting transformers only on training data, then applying the same transformations to test data. This is the standard pattern for supervised learning tasks.

python
# sklearn_pipeline.py
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
import pandas as pd

# Define column groups
numeric_features = ["salary", "experience_years", "salary_ratio", "tenure_days"]
categorical_features = ["department", "role_level"]
target = "promoted"

# Split before any preprocessing
X = df_features[numeric_features + categorical_features]
y = df_features[target]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Build the preprocessing + model pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),           # Scale numeric columns
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),  # Encode categories
    ]
)

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(
        n_estimators=200,
        learning_rate=0.1,
        max_depth=4,
        random_state=42,
    )),
])

# Train and evaluate
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

The ColumnTransformer applies different preprocessing to different column types in a single step. StandardScaler normalizes numeric features to zero mean and unit variance. OneHotEncoder converts categorical strings into binary columns. The handle_unknown="ignore" parameter ensures the pipeline does not crash if test data contains categories absent from training.

Cross-Validation and Hyperparameter Tuning

A single train/test split can produce misleading results depending on which samples land in each set. Cross-validation runs multiple splits and averages the scores, giving a more reliable performance estimate.

python
# sklearn_tuning.py
from sklearn.model_selection import cross_val_score, GridSearchCV
import numpy as np

# 5-fold cross-validation on the full pipeline
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="f1")
print(f"F1 scores: {scores}")
print(f"Mean F1: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

# Grid search over hyperparameters
param_grid = {
    "classifier__n_estimators": [100, 200, 300],
    "classifier__max_depth": [3, 4, 5],
    "classifier__learning_rate": [0.05, 0.1, 0.2],
}

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="f1",
    n_jobs=-1,      # Use all CPU cores
    verbose=1,
)

grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best F1: {grid_search.best_score_:.3f}")

# Evaluate the best model on held-out test set
best_model = grid_search.best_estimator_
print(classification_report(y_test, best_model.predict(X_test)))

The double-underscore syntax (classifier__max_depth) references nested parameters inside the pipeline. n_jobs=-1 parallelizes the search across all available CPU cores, which can cut tuning time from hours to minutes on multi-core machines.
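The nested parameter names can be read straight off the pipeline rather than guessed. An illustrative two-step pipeline (not the one trained above) showing that every step's parameters appear in get_params() under the step name:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step names ("scaler", "classifier") become the prefixes a grid can target
pipe = Pipeline([("scaler", StandardScaler()),
                 ("classifier", LogisticRegression())])

params = pipe.get_params()
print("classifier__C" in params)        # True
print("scaler__with_mean" in params)    # True
```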

Data Leakage in Cross-Validation

Fitting a scaler on the entire dataset before splitting leaks information from validation folds into training folds. Placing the scaler inside a Pipeline guarantees that scaling is fitted only on each fold's training portion. This pitfall is among the most commonly probed mistakes in machine learning interviews.
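A sketch of the two setups side by side on synthetic data (all names and numbers here are illustrative): the leaky version scales before cross-validation, the correct version lets the pipeline refit the scaler per fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# WRONG: the scaler sees every row, including future validation folds
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_leaky, y, cv=5)

# RIGHT: the scaler is refitted on each fold's training portion only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
clean_scores = cross_val_score(pipe, X, y, cv=5)
```

With StandardScaler the score gap is often small, but for target-dependent transformers (feature selection, target encoding) leakage can inflate validation scores dramatically.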

Saving and Loading Models for Production

A trained pipeline needs to be serialized for deployment. The joblib library handles NumPy arrays more efficiently than Python's built-in pickle.

python
# sklearn_export.py
import joblib
from pathlib import Path

# Save the complete pipeline (preprocessor + model)
model_dir = Path("models")
model_dir.mkdir(exist_ok=True)
joblib.dump(best_model, model_dir / "promotion_model_v1.joblib")

# Load and predict in a different process
loaded_model = joblib.load(model_dir / "promotion_model_v1.joblib")
new_data = pd.DataFrame({
    "salary": [75000],
    "experience_years": [4],
    "salary_ratio": [1.05],
    "tenure_days": [730],
    "department": ["Engineering"],
    "role_level": ["Mid"],
})

prediction = loaded_model.predict(new_data)
probability = loaded_model.predict_proba(new_data)[:, 1]
print(f"Promoted: {bool(prediction[0])}, Confidence: {probability[0]:.2%}")

Saving the entire pipeline rather than just the model ensures preprocessing steps are applied identically at inference time. Version the model file alongside the training script to maintain reproducibility. The official Scikit-Learn persistence documentation covers security considerations for loading untrusted model files.
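One lightweight way to record that versioning is a metadata file written next to the artifact; the file name and keys below are illustrative, not a standard:

```python
import json
from pathlib import Path

import joblib
import sklearn

# Record the library versions the model was trained with, next to the artifact
model_dir = Path("models")
model_dir.mkdir(exist_ok=True)
meta = {"sklearn": sklearn.__version__, "joblib": joblib.__version__}
(model_dir / "promotion_model_v1.meta.json").write_text(json.dumps(meta, indent=2))
```

Loading code can then refuse to unpickle a model trained under a different Scikit-Learn version, which is a frequent source of silent inference bugs.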

Conclusion

  • NumPy vectorized operations replace Python loops with 50-100x faster array computations; boolean indexing and broadcasting handle most filtering and transformation needs
  • Pandas method chaining with .assign() and .transform() produces readable, reproducible data cleaning pipelines; Copy-on-Write in Pandas 2.2 eliminates silent mutation bugs
  • Scikit-Learn Pipelines bundle preprocessing and modeling into a single object, preventing data leakage between train and test sets
  • Cross-validation with GridSearchCV provides reliable performance estimates and automates hyperparameter tuning across all CPU cores
  • Serializing the complete pipeline with joblib guarantees identical preprocessing at training and inference time
  • Practice these patterns with data science interview questions that test both coding fluency and understanding of statistical foundations

Tags

#data-science
#python
#numpy
#pandas
#scikit-learn
#machine-learning
#tutorial
