
ML Pipelines & Validation
Scikit-learn pipelines, cross-validation, GridSearchCV, RandomizedSearchCV, data leakage, stratification
1. What is the main advantage of using a scikit-learn Pipeline instead of applying transformations manually?
Answer
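A Pipeline bundles all preprocessing and modeling steps into a single estimator. Calling fit() learns each transformer's parameters from the training data only, and predict() or score() then applies those same fitted transformations to new data. This keeps training and test data processed consistently, simplifies the code, helps prevent data leakage (especially inside cross-validation), and makes the model easier to deploy to production.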
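As a minimal sketch of this idea, the example below chains a scaler and a classifier. The synthetic dataset, the step names ("scaler", "clf"), and the choice of StandardScaler with LogisticRegression are assumptions made for illustration, not something the question prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step names are arbitrary labels chosen for this sketch.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# fit() learns the scaler's mean and scale from X_train only,
# then trains the classifier on the scaled training data.
pipe.fit(X_train, y_train)

# score() transforms X_test with the training-set statistics before predicting.
test_accuracy = pipe.score(X_test, y_test)
```

The fitted pipeline is a single object, so it can be passed to cross-validation utilities or serialized as one unit.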
2. Which method should be called on a Pipeline to train all steps and make a prediction?
Answer
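Call fit() to train every step of the pipeline, then predict() to get predictions. Pipeline.fit_predict() is only available when the final estimator itself implements fit_predict (for example, a clustering estimator such as KMeans), so it is not the method to use with a typical regression or classification pipeline. Because fit() returns the pipeline itself, the two calls can also be chained as fit(X, y).predict(X_new).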
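The two-step call can be sketched as follows. The synthetic regression data, the 80/20 slice, and the Ridge estimator are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=0)
X_train, y_train = X[:80], y[:80]
X_new = X[80:]

model = make_pipeline(StandardScaler(), Ridge())

# Step 1: fit every transformer and the final estimator on the training data.
model.fit(X_train, y_train)

# Step 2: apply the fitted transformers to new data and predict.
predictions = model.predict(X_new)

# fit() returns the pipeline, so the same two steps can be chained.
chained = make_pipeline(StandardScaler(), Ridge()).fit(X_train, y_train).predict(X_new)
```

Keeping fit() and predict() as separate calls makes it explicit which data the pipeline learned from and which data it is only applied to.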
3. What is data leakage in a machine learning context?
Answer
Data leakage occurs when information from the test set, or from data that would not be available at prediction time, is accidentally used during training. It can happen during preprocessing (for example, computing a mean over the entire dataset before splitting) or through features that indirectly encode the target. The result is an artificially high evaluation score that does not generalize to new data.
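The preprocessing case can be illustrated with a short sketch that contrasts scaling the whole dataset before cross-validation with scaling inside a pipeline. The synthetic data, the fold count, and the variable names (leaky_scores, clean_scores) are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: the scaler computes its mean and scale from every row,
# including rows that later serve as validation data in each fold.
X_scaled_all = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_scaled_all, y, cv=cv)

# Leak-free: the pipeline is cloned and refit per fold, so the scaler
# only ever sees that fold's training rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
clean_scores = cross_val_score(pipe, X, y, cv=cv)
```

With plain standardization the two sets of scores are often close, so this sketch shows the pattern rather than a dramatic gap. The difference tends to grow with steps that use the target or are fitted on small datasets, such as feature selection or target encoding.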
What is the role of ColumnTransformer in scikit-learn?
What is K-Fold cross-validation?