Top 25 Data Analytics Interview Questions in 2026
Master the most asked data analytics interview questions in 2026. Covers SQL, Python, Power BI, statistics and behavioral questions with detailed answers and code examples.

Data analytics interview questions in 2026 test far more than SQL syntax or chart-building. Hiring teams evaluate candidates on their ability to extract meaning from messy datasets, communicate findings to non-technical stakeholders, and tie every metric back to a business outcome. This guide covers the 25 questions that appear most frequently across junior, mid-level and senior data analyst roles.
Each question includes a concise model answer and, where relevant, runnable SQL or Python code. Practice writing the queries from scratch before reading the solution. Interviewers care more about the reasoning process than a memorized answer.
SQL Questions for Data Analyst Interviews
SQL remains the single most tested skill in data analytics interviews. Every company with a relational database expects analysts to write queries without assistance.
1. What is the difference between WHERE and HAVING?
WHERE filters rows before aggregation. HAVING filters groups after aggregation. Mixing them up is one of the most common mistakes in technical screens.
-- monthly_revenue.sql
SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(amount) AS revenue
FROM orders
WHERE status = 'completed'   -- row-level filter
GROUP BY month
HAVING SUM(amount) > 10000;  -- group-level filter

WHERE reduces the dataset before the database engine performs the GROUP BY. HAVING operates on the aggregated result. Using WHERE whenever possible improves query performance because it shrinks the working set earlier in the execution plan.
2. Explain window functions with a practical example
Window functions compute a value across a set of rows related to the current row, without collapsing the result set. They are essential for ranking, running totals and period-over-period comparisons.
-- user_ranking.sql
SELECT
    user_id,
    purchase_date,
    amount,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY purchase_date) AS purchase_rank,
    SUM(amount)  OVER (PARTITION BY user_id ORDER BY purchase_date) AS running_total,
    LAG(amount)  OVER (PARTITION BY user_id ORDER BY purchase_date) AS prev_amount
FROM purchases;

ROW_NUMBER assigns a sequential rank per user. SUM(...) OVER(...) computes a running total without a self-join. LAG pulls the previous row's value for direct comparison. Interviewers commonly ask candidates to calculate month-over-month growth using exactly this pattern.
3. How to identify and remove duplicate records
Duplicate detection is a daily task for analysts working with event logs, CRM exports or third-party data feeds.
-- deduplicate_events.sql
WITH ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY user_id, event_type, DATE_TRUNC('minute', created_at)
            ORDER BY created_at
        ) AS rn
    FROM events
)
SELECT * FROM ranked WHERE rn = 1;

The CTE assigns a row number within each duplicate group. Keeping only rn = 1 returns the first occurrence. This approach preserves the original data while filtering out noise.
4. Write a query to calculate retention by weekly cohort
Cohort retention is one of the most requested analyses in product and growth teams. The SQL pattern follows a consistent structure across all databases.
-- weekly_cohort_retention.sql
WITH cohorts AS (
    SELECT
        user_id,
        DATE_TRUNC('week', MIN(event_date)) AS cohort_week
    FROM user_events
    GROUP BY user_id
),
activity AS (
    SELECT
        c.cohort_week,
        DATE_TRUNC('week', e.event_date) AS active_week,
        COUNT(DISTINCT e.user_id) AS active_users
    FROM user_events e
    JOIN cohorts c ON c.user_id = e.user_id
    GROUP BY c.cohort_week, DATE_TRUNC('week', e.event_date)
)
SELECT
    cohort_week,
    -- both columns are week-truncated, so the day difference is an exact multiple of 7
    EXTRACT(DAY FROM active_week - cohort_week) / 7 AS weeks_since_signup,
    active_users
FROM activity
ORDER BY cohort_week, weeks_since_signup;

The first CTE identifies each user's signup week. The second CTE counts distinct active users per cohort per week. The final query returns the long-format table that a BI tool or pivot step turns into a retention matrix. (Note that EXTRACT(WEEK FROM ...) is not valid on the interval produced by subtracting two timestamps in PostgreSQL, so the day difference is divided by 7 instead.) Interviewers expect candidates to explain the business interpretation: a steep drop at week 1 signals an onboarding problem, while a flat curve from week 4 onward suggests a stable core user base.
5. What is a CTE, and when should it replace a subquery?
A Common Table Expression (CTE) is a named temporary result set defined with WITH. CTEs improve readability and allow recursive queries. Subqueries tend to produce deeply nested SQL that is hard to debug. A CTE is preferred when the same derived table is referenced multiple times in the query, or when the logic has three or more transformation steps.
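As a quick illustration of the "referenced multiple times" case, here is a CTE used twice in one query, run against an in-memory SQLite database (the table and values are invented for the example):

```python
# Hypothetical sketch: one CTE referenced twice -- once for the rows,
# once to compute the average the rows are compared against.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 50), (1, 150), (2, 300), (3, 20);
""")

query = """
WITH customer_totals AS (
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
)
SELECT customer_id, total
FROM customer_totals
WHERE total > (SELECT AVG(total) FROM customer_totals)
ORDER BY total DESC;
"""

for row in conn.execute(query):
    print(row)  # -> (2, 300.0) then (1, 200.0)
```

With a subquery, the GROUP BY logic would have to be written out twice; the CTE names it once and reuses it.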
Python Data Analysis Interview Questions
Python questions focus on Pandas, data cleaning workflows, and the ability to explain code in business terms. Interviewers want to see pragmatic solutions, not academic abstractions.
6. How to handle missing values in a Pandas DataFrame
Missing data corrupts aggregations and model inputs. The approach depends on the column type and the business context.
# handle_missing.py
import pandas as pd
import numpy as np
df = pd.read_csv('sales.csv')
# Inspect the extent of missing data
print(df.isnull().sum())
print(df.isnull().mean().round(3)) # percentage per column
# Strategy 1: drop rows where critical columns are null
df_clean = df.dropna(subset=['customer_id', 'amount'])
# Strategy 2: fill numeric columns with median (robust to outliers)
df['amount'] = df['amount'].fillna(df['amount'].median())
# Strategy 3: fill categorical columns with the mode
df['region'] = df['region'].fillna(df['region'].mode()[0])

Dropping rows works when the missing rate is below 5% and the column is critical. Median imputation is safer than mean for skewed distributions. Mode imputation suits categorical columns with a dominant value. Whatever the choice, document the imputation strategy for the audit trail.
7. Merge, join and concatenate: when to use each
Pandas offers multiple ways to combine DataFrames. The choice depends on whether the operation is row-wise or column-wise, and whether a key-based match is needed.
# combine_dataframes.py
import pandas as pd
orders = pd.read_csv('orders.csv')
customers = pd.read_csv('customers.csv')
# Key-based merge (equivalent to SQL JOIN)
result = orders.merge(customers, on='customer_id', how='left')
# Stack rows from multiple sources
all_events = pd.concat([events_q1, events_q2], ignore_index=True)
# Add columns side-by-side (same row count required)
combined = pd.concat([features, labels], axis=1)

merge is the right choice for key-based joins. concat with default axis=0 stacks rows. concat with axis=1 attaches columns. Using how='left' preserves all rows from the left DataFrame, which mirrors a SQL LEFT JOIN.
8. Group, aggregate and transform sales data
GroupBy operations are the Pandas equivalent of SQL GROUP BY. The pattern appears in almost every data analyst take-home assignment.
# sales_analysis.py
import pandas as pd

df = pd.read_csv('transactions.csv', parse_dates=['date'])

# Monthly revenue by product category ('ME' = month-end; replaces the
# deprecated 'M' alias in pandas 2.2+)
monthly = (
    df.groupby([pd.Grouper(key='date', freq='ME'), 'category'])
    .agg(revenue=('amount', 'sum'), orders=('order_id', 'nunique'))
    .reset_index()
)

# Add a column with each category's share of total monthly revenue
monthly['share'] = (
    monthly.groupby('date')['revenue']
    .transform(lambda x: x / x.sum())
    .round(4)
)
print(monthly.head(10))

agg with named aggregation keeps the output readable. transform broadcasts a group-level computation back to each row, which avoids a separate merge step. This combination covers the majority of ad-hoc reporting tasks.
9. Detect and handle outliers in a dataset
Outlier detection determines whether extreme values are data entry errors, fraud signals, or genuine edge cases.
# detect_outliers.py
import pandas as pd
import numpy as np
df = pd.read_csv('transactions.csv')
# IQR method
Q1 = df['amount'].quantile(0.25)
Q3 = df['amount'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['amount'] < lower) | (df['amount'] > upper)]
print(f'Outliers found: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)')
# Cap instead of remove (winsorization)
df['amount_capped'] = df['amount'].clip(lower=lower, upper=upper)

The IQR method flags values beyond 1.5 times the interquartile range. Capping (winsorization) preserves the row count while limiting the influence of extreme values. Removal is appropriate only when outliers represent clear errors.
Statistics and Probability Questions
Statistical literacy separates data analysts who report numbers from those who interpret them correctly. These questions test foundational understanding.
10. Explain p-value in plain language
A p-value is the probability of observing a result at least as extreme as the measured one, assuming the null hypothesis is true. A p-value of 0.03 does not mean there is a 3% chance the null hypothesis is true. It means that if the null hypothesis were true, results this extreme would appear 3% of the time by chance alone. The threshold (alpha) is typically set at 0.05, but the choice depends on the cost of a false positive in the specific business context.
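The definition becomes concrete with a small simulation (the scenario is invented): a coin lands heads 60 times in 100 flips, and we ask how often a fair coin would produce a result at least that extreme.

```python
# Simulation sketch: estimate a two-sided p-value for 60 heads in 100
# flips under the null hypothesis of a fair coin.
import random

random.seed(42)

observed_heads = 60
n_flips = 100
n_sims = 20_000

extreme = 0
for _ in range(n_sims):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    # "At least as extreme": at least as far from the expected 50
    if abs(heads - 50) >= abs(observed_heads - 50):
        extreme += 1

p_value = extreme / n_sims
print(f"simulated p-value: {p_value:.3f}")  # close to the exact ~0.057
```

The simulated value sits near 0.057, above the conventional 0.05 threshold: evidence against fairness, but not conclusive at alpha = 0.05.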
11. When to use median instead of mean
The mean is sensitive to extreme values. Income data, transaction amounts and session durations are classic examples where the median gives a more representative central tendency. A dataset with a few high-value enterprise deals can push the mean far above what a typical customer experiences. Reporting the median alongside the mean, and including the standard deviation, gives stakeholders a complete picture.
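A toy example (figures invented) shows how a couple of large enterprise deals distort the mean while the median stays representative:

```python
# Two outlier deals pull the mean far above the typical transaction.
from statistics import mean, median, stdev

deals = [120, 150, 90, 130, 110, 95, 140, 9_500, 12_000]

print(f"mean:   {mean(deals):,.1f}")   # dominated by the two big deals
print(f"median: {median(deals):,.1f}")  # 130.0 -- the typical deal
print(f"stdev:  {stdev(deals):,.1f}")   # large spread flags the skew
```

Here the mean lands near 2,500 even though seven of nine deals are under 160, which is exactly the distortion the median avoids.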
12. What is the difference between correlation and causation?
Correlation measures the strength and direction of a linear relationship between two variables. Causation means one variable directly influences the other. Ice cream sales and drowning incidents correlate because both increase in summer, not because one causes the other. Establishing causation requires a controlled experiment (A/B test) or a rigorous causal inference framework such as difference-in-differences or instrumental variables.
13. Explain Type I and Type II errors with a business example
A Type I error (false positive) occurs when a test incorrectly rejects the null hypothesis. Example: concluding that a new checkout flow increases conversion when it actually has no effect, leading to an unnecessary product change. A Type II error (false negative) occurs when a test fails to detect a real effect. Example: keeping the old checkout flow because the test lacked sufficient sample size to detect a genuine 2% lift. Increasing sample size reduces Type II errors. Lowering the alpha threshold reduces Type I errors but increases Type II errors.
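The sample-size point can be demonstrated with a quick simulation (conversion rates are invented): a genuine two-point lift is usually missed at a small sample size, a Type II error, but reliably detected at a larger one.

```python
# Simulation sketch: a real lift from 10% to 12% conversion. At n=1,000
# per group a z-test usually misses it; at n=5,000 it usually detects it.
import math
import random

random.seed(7)

def detects_lift(n, p_control=0.10, p_treatment=0.12, z_crit=1.96):
    """One simulated A/B test: True if a one-sided z-test rejects the null."""
    conv_c = sum(random.random() < p_control for _ in range(n))
    conv_t = sum(random.random() < p_treatment for _ in range(n))
    pooled = (conv_c + conv_t) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return (conv_t / n - conv_c / n) / se > z_crit

power = {}
for n in (1_000, 5_000):
    power[n] = sum(detects_lift(n) for _ in range(400)) / 400
    print(f"n={n:>5} per group -> lift detected in {power[n]:.0%} of runs")
```

The small test misses the real effect in most runs; the larger one catches it the vast majority of the time, which is precisely what "increasing sample size reduces Type II errors" means in practice.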
Power BI and Data Visualization Questions
Visualization questions assess whether a candidate can choose the right chart, build maintainable dashboards, and communicate data stories to non-technical audiences.
14. What is DAX, and how does it differ from SQL?
DAX (Data Analysis Expressions) is the formula language used in Power BI, Analysis Services and Excel Power Pivot. Unlike SQL, which operates on sets of rows, DAX operates within a filter context that changes dynamically based on slicers, filters and row context. A CALCULATE function in DAX modifies the filter context before evaluating an expression, a concept with no direct SQL equivalent.
15. Explain the difference between import mode and DirectQuery
Import mode loads data into the Power BI in-memory engine (VertiPaq), delivering fast query performance but requiring scheduled refreshes. DirectQuery sends queries to the source database in real time, ensuring data freshness but depending on the source system's performance. Microsoft Fabric adds a third option, Direct Lake mode, which reads Parquet files directly from OneLake without importing data or querying a SQL endpoint, combining the speed of import mode with near-real-time freshness.
16. When to use a bar chart versus a line chart
Bar charts compare discrete categories (revenue by region, count by product). Line charts show trends over continuous intervals (daily active users, monthly churn rate). Using a line chart for categorical data implies a trend between unrelated categories, which misleads the audience. A common interview follow-up asks about pie charts: they should be limited to showing parts of a whole when there are no more than five slices, because the human eye struggles to compare angles accurately.
17. How to optimize a slow Power BI report
Slow reports typically stem from excessive columns in the data model, complex DAX measures evaluated row by row, or visuals that query millions of rows. The optimization checklist includes: removing unused columns to shrink the model size, replacing iterator functions (SUMX, FILTER) with their non-iterator equivalents where possible, reducing the number of visuals per page to under 15, and using aggregation tables for high-cardinality dimensions. The Performance Analyzer built into Power BI Desktop pinpoints exactly which visual or DAX query is the bottleneck.
Behavioral and Case Study Questions
Technical skills get candidates into the interview. Behavioral questions determine whether they get the offer. Every answer should follow the STAR framework: Situation, Task, Action, Result.
18. Describe a time data analysis changed a business decision
Strong answers include a specific metric that moved, the analysis method used, and the dollar or percentage impact. Weak answers describe a dashboard without connecting it to an outcome. Example structure: "Churn analysis on a SaaS product revealed that users who did not complete onboarding within 48 hours had a 60% higher churn rate at 30 days. The product team added an automated email sequence triggered at hour 24, reducing 30-day churn by 8 percentage points."
19. How to handle conflicting data requests from stakeholders
This question tests prioritization and communication skills. The recommended approach: clarify the business question behind each request, identify shared metrics, propose a single dashboard that serves both needs, and document trade-offs transparently. Saying "no" to a stakeholder is acceptable when backed by data governance concerns or resource constraints, provided an alternative is offered.
20. Explain a complex analysis to a non-technical audience
Interviewers often ask candidates to explain a concept like regression or cohort analysis as if speaking to a marketing manager. The best responses use analogies, avoid jargon, and anchor the explanation in a business outcome. "Cohort analysis groups customers by the month they signed up and tracks how many are still active three months later. The chart shows that January signups retained better than February signups, which aligns with the onboarding changes shipped in late December."
Every strong behavioral answer maps to the four STAR steps: Situation sets the context, Task defines the objective, Action describes the specific steps taken, and Result quantifies the impact. Interviewers penalize vague answers that skip the Result step.
Data Cleaning and ETL Interview Questions
Data quality issues consume a significant portion of an analyst's time. These questions test real-world data wrangling skills.
21. How to validate data after an ETL pipeline runs
Validation checks should cover row counts (source vs. destination), null rates on critical columns, referential integrity across tables, and value distribution shifts (detecting silent schema changes). Automated checks that run after every pipeline execution prevent bad data from reaching dashboards.
# etl_validation.py
import pandas as pd

def validate_pipeline(source_df: pd.DataFrame, target_df: pd.DataFrame) -> dict:
    checks = {}
    # Row count match
    checks['row_count_match'] = len(source_df) == len(target_df)
    # Null rate on critical columns
    for col in ['user_id', 'event_date', 'amount']:
        null_rate = target_df[col].isnull().mean()
        checks[f'{col}_null_rate'] = round(null_rate, 4)
    # Revenue reconciliation
    source_total = source_df['amount'].sum()
    target_total = target_df['amount'].sum()
    checks['revenue_diff_pct'] = round(
        abs(source_total - target_total) / source_total * 100, 2
    )
    return checks

results = validate_pipeline(source, target)
for check, value in results.items():
    print(f'{check}: {value}')

This function returns a dictionary of validation results that can feed into an alerting system. A revenue difference above 0.01% should trigger an investigation.
22. What is the difference between normalized and denormalized data?
Normalized data minimizes redundancy by splitting information across related tables (3NF). Denormalized data combines tables into wider, flatter structures optimized for read-heavy analytical queries. OLTP systems favor normalization for data integrity. Data warehouses and BI tools favor denormalization (star schema, snowflake schema) for query speed. Understanding this trade-off is fundamental to designing efficient data pipelines.
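A minimal sketch of the trade-off (tables and values invented), using in-memory SQLite: the normalized tables store each fact once, while the denormalized copy repeats the region on every row so reporting queries need no join.

```python
# Normalized OLTP tables vs. a denormalized reporting table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: facts reference dimensions by key (no redundancy)
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers,
                         amount REAL);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'AMER');
    INSERT INTO orders VALUES (10, 1, 250), (11, 1, 100), (12, 2, 400);

    -- Denormalized: one wide table, region repeated on every row
    CREATE TABLE orders_flat AS
    SELECT o.order_id, o.amount, c.region
    FROM orders o JOIN customers c USING (customer_id);
""")

sql = "SELECT region, SUM(amount) FROM orders_flat GROUP BY region ORDER BY region"
for row in conn.execute(sql):
    print(row)  # -> ('AMER', 400.0) then ('EMEA', 350.0)
```

Updating a customer's region now requires touching every matching row in orders_flat, which is the integrity cost that normalization exists to avoid.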
Advanced Analytics Questions
Senior roles require familiarity with experimentation, forecasting and cross-functional data strategy.
23. How to design an A/B test from scratch
An A/B test requires five components: a clear hypothesis ("Changing the CTA button color from grey to green will increase click-through rate by 5%"), a primary metric (click-through rate), a sample size calculation based on the expected effect size and desired power (typically 80%), random assignment of users to control and treatment groups, and a predetermined test duration that accounts for weekly seasonality. Running the test for at least two full business cycles prevents day-of-week bias from skewing results.
Peeking at results before the test reaches statistical significance inflates the false positive rate. Pre-register the analysis plan, sample size and duration before launching the test.
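The sample-size component can be sketched with the standard normal-approximation formula for comparing two proportions (the baseline and lift below are illustrative, not prescriptive):

```python
# Back-of-the-envelope sample size per group for a two-proportion test.
import math

def sample_size_per_group(p_base, lift, z_alpha=1.96, z_beta=0.8416):
    """z_alpha ~ two-sided alpha of 0.05; z_beta ~ 80% power."""
    p1, p2 = p_base, p_base + lift
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / lift ** 2)

# A 5% relative lift on a 4% baseline CTR means detecting 4.0% vs 4.2%
n = sample_size_per_group(0.040, 0.002)
print(f"required sample size per group: {n:,}")  # on the order of 150,000
```

The small absolute lift is why the required n runs into the six figures, and why underpowered tests on low-traffic pages so often end inconclusively.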
24. Explain funnel analysis and where drop-offs matter most
Funnel analysis tracks user progression through a sequence of steps (e.g., homepage visit, product view, add to cart, checkout, purchase). The highest-impact optimization targets the step with the largest absolute drop-off, not the largest percentage drop-off. A 40% drop from 1,000 users at product view to 600 at add-to-cart represents 400 lost users. A 50% drop from 100 users at checkout to 50 at purchase represents only 50 lost users. Prioritizing the first bottleneck delivers more total conversions.
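The arithmetic above can be expressed as a short script (step counts taken from the example; the add-to-cart to checkout gap is implied by the surrounding numbers):

```python
# Rank each funnel transition by absolute users lost, not percentage drop.
funnel = [
    ("product_view", 1_000),
    ("add_to_cart", 600),
    ("checkout", 100),
    ("purchase", 50),
]

drops = []
for (step, users), (next_step, next_users) in zip(funnel, funnel[1:]):
    lost = users - next_users
    drops.append((lost, lost / users * 100, f"{step} -> {next_step}"))

for lost, pct, transition in sorted(drops, reverse=True):
    print(f"{transition:<26} lost {lost:>4,} users ({pct:.0f}%)")
```

Sorting by absolute loss puts the checkout-to-purchase step (50 users) last despite its steep percentage drop, which is exactly the prioritization argument the interviewer is probing for.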
25. How to build a KPI dashboard that stakeholders actually use
Dashboards fail when they display too many metrics without hierarchy. An effective KPI dashboard follows the inverted pyramid: one or two north-star metrics at the top (e.g., monthly recurring revenue, net promoter score), supporting metrics in the middle (conversion rate, churn rate, average order value), and diagnostic details accessible through drill-downs. Limiting the top-level view to seven or fewer metrics reduces cognitive load. Adding automated annotations ("Revenue dropped 12% on March 5 due to a payment gateway outage") provides context that raw numbers lack.
Start practicing!
Test your knowledge with our interview simulators and technical tests.
Conclusion
- SQL window functions, CTEs and cohort queries appear in nearly every data analyst technical screen. Writing them from memory is a baseline expectation.
- Python Pandas proficiency covers 80% of ad-hoc analysis tasks. Focus on merge, groupby, transform and null handling.
- Statistical reasoning (p-values, correlation vs. causation, experiment design) separates analysts who report numbers from those who interpret them correctly.
- Power BI and visualization questions test chart selection, DAX fundamentals and dashboard performance optimization.
- Behavioral answers that follow the STAR framework and include quantified results outperform vague narratives.
- Practice data analytics interview modules on SharpSkill to drill these concepts with timed exercises.
- For related preparation, the Data Science interview guide covers machine learning and statistical modeling questions that overlap with senior data analyst roles.