Question 1

What is the main advantage of using a scikit-learn Pipeline instead of applying transformations manually?

Accepted Answer

A Pipeline guarantees that the transformations fitted on the training data are applied identically to the test data. It bundles all preprocessing and modeling steps into a single object, which simplifies the code, helps prevent data leakage (each transformer is fitted only on the data passed to fit()), and makes the model easier to deploy to production.
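As a rough illustration of the answer above, the sketch below chains a scaler and a classifier into one Pipeline. The synthetic dataset, the step names ("scaler", "clf"), and the choice of StandardScaler and LogisticRegression are assumptions made for the example, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only so the example is self-contained (an assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Preprocessing and model live in one object.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# fit() fits the scaler on the training rows only, then trains the classifier
# on the scaled training data.
pipe.fit(X_train, y_train)

# score() reuses the training-set scaling statistics on the test rows.
accuracy = pipe.score(X_test, y_test)

# The scaler's learned mean comes from X_train alone, never from X_test.
scaler_mean = pipe.named_steps["scaler"].mean_
print(np.allclose(scaler_mean, X_train.mean(axis=0)))
```

Because the whole chain is a single estimator, the same object can be passed to cross-validation helpers or serialized for deployment without re-implementing the preprocessing.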

Question 2

Which method should be called on a Pipeline to train all steps and make a prediction?

Accepted Answer

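Call fit() to train every step of the pipeline, then predict() to get predictions. A Pipeline exposes fit_predict() only when its final estimator implements that method (for example, a clusterer such as KMeans), so it is not available on regression or classification Pipelines. Because fit() returns the pipeline itself, the two calls can be chained or made separately for more control.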
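A minimal sketch of the fit-then-predict pattern follows. The data, step names, and estimators are assumptions chosen for illustration; the hasattr checks simply show when a Pipeline offers fit_predict, which depends on its final estimator.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to keep the example self-contained (an assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Two separate calls: train all steps, then predict on unseen rows.
clf_pipe.fit(X_train, y_train)
preds = clf_pipe.predict(X_test)

# A classifier has no fit_predict, so the Pipeline does not expose one either.
clf_has_fit_predict = hasattr(clf_pipe, "fit_predict")

# With a clusterer as the final step, fit_predict becomes available.
cluster_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("km", KMeans(n_clusters=2, n_init=10, random_state=0)),
])
cluster_has_fit_predict = hasattr(cluster_pipe, "fit_predict")
labels = cluster_pipe.fit_predict(X_train)
```

Keeping fit() and predict() separate also makes it clear which data the model was trained on and which data it is being evaluated on.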

Question 3

What is data leakage in a machine learning context?

Accepted Answer

Data leakage occurs when information from the test set or from future data is accidentally used during training. It can happen during preprocessing (for example, computing the mean over the entire dataset before splitting it) or through features that indirectly contain the target. The result is artificially high performance that does not generalize to new data.
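The preprocessing case mentioned above can be sketched as follows. The random data and the variable names (leaky_scaler, clean_scaler) are hypothetical and exist only to contrast the two orderings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random data, used only for illustration (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Leaky: the scaler is fitted on all rows, so its mean and variance
# already contain information from the test rows.
leaky_scaler = StandardScaler().fit(X)

# Clean: the scaler is fitted on the training rows only, and the test rows
# are transformed with those training statistics.
clean_scaler = StandardScaler().fit(X_train)
X_test_scaled = clean_scaler.transform(X_test)

# The two sets of statistics differ, which is the leaked information.
print(leaky_scaler.mean_ - clean_scaler.mean_)
```

Placing the scaler inside a Pipeline and fitting the Pipeline on the training split gives the clean behaviour without having to manage the ordering by hand.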

ML Pipelines & Validation

What is the role of ColumnTransformer in scikit-learn?

What is K-Fold cross-validation?
