Data Analytics

Data Cleaning

Missing values, duplicates, outliers, business rules, transformation, data quality

20 interview questions · Junior
1

What is a missing value in a dataset?

Answer

A missing value represents absent or unfilled data in a field. It can appear as an empty cell, NULL in a database, or NaN in a DataFrame. Identifying missing values is the first step in data cleaning because they can distort statistical analyses and aggregations.
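To make the distortion concrete, here is a minimal sketch in plain Python. The records, the `amount` field, and the `is_missing` helper are all hypothetical and exist only for illustration. `None` stands in for a database NULL, `float("nan")` for NaN, and `""` for an empty cell.

```python
import math

# Hypothetical records: None ~ NULL, float("nan") ~ NaN, "" ~ empty cell.
rows = [
    {"customer": "a", "amount": 100.0},
    {"customer": "b", "amount": None},
    {"customer": "c", "amount": float("nan")},
    {"customer": "d", "amount": ""},
    {"customer": "e", "amount": 50.0},
]


def is_missing(value):
    """Return True for None, NaN, or an empty/whitespace-only string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False


# Count how many records lack a usable amount.
missing_count = sum(is_missing(r["amount"]) for r in rows)

# Keep only the amounts that are actually present.
present = [r["amount"] for r in rows if not is_missing(r["amount"])]

# Same data, two different "averages" depending on how missing values are treated.
mean_ignoring_missing = sum(present) / len(present)  # divides by 2 present values
mean_missing_as_zero = sum(present) / len(rows)      # divides by all 5 rows

print(missing_count, mean_ignoring_missing, mean_missing_as_zero)
```

With this sample, three of the five amounts are missing, and the average is 75.0 when missing values are skipped but 30.0 when they are silently counted as zero. Which treatment is appropriate depends on what the missing values mean in the business context.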

2

What is the difference between a NULL value and an empty string in a database?

Answer

NULL means the value is unknown or does not exist, while an empty string is a known value that happens to be empty. This distinction is fundamental in SQL because comparing anything to NULL with the = operator evaluates to unknown rather than true, so such a filter matches no rows and IS NULL must be used instead. An empty string, by contrast, can be compared normally with = ''.
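The behavior can be checked with a small sketch using Python's built-in `sqlite3` module. The `users` table, its columns, and the `ids` helper are hypothetical names chosen for this example.

```python
import sqlite3

# In-memory database with a hypothetical users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [
        (1, "a@example.com"),
        (2, None),  # stored as NULL: the email is unknown
        (3, ""),    # stored as an empty string: a known, empty value
    ],
)


def ids(where):
    """Return the ids matching a WHERE clause (illustration only, not for untrusted input)."""
    query = f"SELECT id FROM users WHERE {where} ORDER BY id"
    return [row[0] for row in conn.execute(query)]


eq_null = ids("email = NULL")    # comparison is never true, so no rows match
is_null = ids("email IS NULL")   # matches the NULL row
eq_empty = ids("email = ''")     # matches the empty-string row only

print(eq_null, is_null, eq_empty)
```

Here `email = NULL` returns nothing, `email IS NULL` returns id 2, and `email = ''` returns id 3. This reflects SQLite's behavior; some engines differ (Oracle, for example, treats an empty string as NULL), so it is worth verifying on the database actually in use.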

3

What is a duplicate in a dataset?

Answer

A duplicate is a record that appears more than once in a dataset, either exactly (all columns identical) or partially (certain key columns identical). Duplicates distort counts, sums, and averages. Their detection typically relies on identifying key columns that should be unique.
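One common way to surface both kinds of duplicate is to group rows and keep the groups that occur more than once. The sketch below uses Python's built-in `sqlite3` module; the `orders` table, its columns, and the sample rows are hypothetical, and `order_id` is assumed to be the key that should be unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "alice", 20.0),
        (1, "alice", 20.0),  # exact duplicate: every column identical
        (2, "bob", 35.0),
        (2, "bob", 40.0),    # partial duplicate: same order_id, different amount
        (3, "carol", 15.0),
    ],
)

# Exact duplicates: group by every column and keep groups seen more than once.
exact = conn.execute(
    """
    SELECT order_id, customer, amount, COUNT(*) AS n
    FROM orders
    GROUP BY order_id, customer, amount
    HAVING COUNT(*) > 1
    """
).fetchall()

# Partial duplicates: group only by the column(s) assumed to be unique.
partial = conn.execute(
    """
    SELECT order_id, COUNT(*) AS n
    FROM orders
    GROUP BY order_id
    HAVING COUNT(*) > 1
    ORDER BY order_id
    """
).fetchall()

# The sum over all rows counts order 1 twice and order 2 with two amounts.
total_with_duplicates = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

print(exact, partial, total_with_duplicates)
```

Grouping by all columns flags only order 1, while grouping by `order_id` alone flags orders 1 and 2. The second case needs a business decision about which row to keep, since the two amounts disagree.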

4

Which technique allows detecting exact duplicates in SQL?

5

What is an outlier in a dataset?
