Data Science & ML

ML Pipelines & Validation

Scikit-learn pipelines, cross-validation, GridSearchCV, RandomizedSearchCV, data leakage, stratification

22 interview questions · Mid-Level
1

What is the main advantage of using a scikit-learn Pipeline instead of applying transformations manually?

Answer

A Pipeline chains all preprocessing and modeling steps into a single estimator, so the transformations fitted on the training data are applied identically to the test data. This simplifies the code, helps prevent data leakage (during cross-validation, each transformer is refit on the training folds only), and makes the model easier to deploy to production.
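As a minimal sketch of this idea, the example below chains a scaler and a classifier. The synthetic dataset, the step names ("scaler", "clf"), and the choice of LogisticRegression are assumptions made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing and model live in one object.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# fit() learns the scaling statistics from the training rows only.
pipe.fit(X_train, y_train)

# score() reuses those training statistics to transform the test rows.
test_accuracy = pipe.score(X_test, y_test)

# Check that the scaler never saw the test rows.
scaler_mean_matches_train = np.allclose(
    pipe.named_steps["scaler"].mean_, X_train.mean(axis=0)
)
```

Because the scaler is a step of the pipeline, there is no separate transform call to forget or to apply to the wrong split.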

2

Which method should be called on a Pipeline to train all steps and make a prediction?

Answer

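Call fit() to train all steps, then predict() to get predictions. fit() fits and applies each transformer in order before fitting the final estimator; predict() passes new data through the fitted transformers and then calls the final estimator's predict(). A Pipeline exposes fit_predict() only when its final estimator implements it (clustering estimators, for example), so it is not available for regression or classification Pipelines.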
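The sketch below illustrates the fit-then-predict pattern and checks where fit_predict is available. The data and the estimators (LogisticRegression, KMeans) are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Train every step, then predict with the fitted pipeline.
clf_pipe = make_pipeline(StandardScaler(), LogisticRegression())
clf_pipe.fit(X, y)
predictions = clf_pipe.predict(X[:5])

# fit() returns the pipeline itself, so the two calls can be chained.
chained = (
    make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y).predict(X[:5])
)

# fit_predict is only exposed when the final estimator implements it.
clf_has_fit_predict = hasattr(clf_pipe, "fit_predict")
cluster_pipe = make_pipeline(
    StandardScaler(), KMeans(n_clusters=2, n_init=10, random_state=0)
)
cluster_has_fit_predict = hasattr(cluster_pipe, "fit_predict")
```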

3

What is data leakage in a machine learning context?

Answer

Data leakage occurs when information from the test set or from future data is accidentally used during training. It can happen during preprocessing (for example, computing a mean over the entire dataset before splitting) or through features that indirectly encode the target. The result is artificially high measured performance that does not generalize to new data.
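The preprocessing case can be sketched as follows, assuming random synthetic data and a StandardScaler as the transformation. The variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random data, used only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Leaky: the scaling statistics are computed on all rows,
# including rows that later end up in the test set.
leaky_scaler = StandardScaler().fit(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Safe: fit on the training rows only, then reuse on the test rows.
safe_scaler = StandardScaler().fit(X_train)
X_test_scaled = safe_scaler.transform(X_test)

# The two scalers learned different means: the leaky one used test rows.
statistics_differ = not np.allclose(leaky_scaler.mean_, safe_scaler.mean_)

# Inside cross-validation, a pipeline refits the scaler on each training fold.
scores = cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression()), X, y, cv=5
)
```

On data this small and random the effect on the score is negligible; the point is where the statistics are computed, not the size of the gain.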

4

What is the role of ColumnTransformer in scikit-learn?

5

What is K-Fold cross-validation?
