Top 25 Data Analytics Interview Questions in 2026

Master the most frequently asked data analytics interview questions of 2026. Covers SQL, Python, Power BI, statistics and behavioral questions, with detailed answers and code examples.

Data analytics interview questions covering SQL queries, Python scripts and dashboard visualizations

Data analytics interview questions in 2026 test far more than SQL syntax or chart-building. Hiring teams evaluate candidates on their ability to extract meaning from messy datasets, communicate findings to non-technical stakeholders, and tie every metric back to a business outcome. This guide covers the 25 questions that appear most frequently across junior, mid-level and senior data analyst roles.

How to Use This Guide

Each question includes a concise model answer and, where relevant, runnable SQL or Python code. Practice writing the queries from scratch before reading the solution. Interviewers care more about the reasoning process than a memorized answer.

SQL Questions for Data Analyst Interviews

SQL remains the single most tested skill in data analytics interviews. Every company with a relational database expects analysts to write queries without assistance.

1. What is the difference between WHERE and HAVING?

WHERE filters rows before aggregation. HAVING filters groups after aggregation. Mixing them up is one of the most common mistakes in technical screens.

sql
-- monthly_revenue.sql
SELECT
  DATE_TRUNC('month', order_date) AS month,
  SUM(amount)                     AS revenue
FROM orders
WHERE status = 'completed'          -- row-level filter
GROUP BY month
HAVING SUM(amount) > 10000;         -- group-level filter

WHERE reduces the dataset before the database engine performs the GROUP BY. HAVING operates on the aggregated result. Using WHERE whenever possible improves query performance because it shrinks the working set earlier in the execution plan.
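
The same two-stage filtering shows up in Pandas, and interviewers often ask for both in the same breath: filter rows before grouping (WHERE), aggregate, then filter the groups (HAVING). A minimal sketch with illustrative data:

```python
import pandas as pd

orders = pd.DataFrame({
    'month':  ['2026-01', '2026-01', '2026-02', '2026-02'],
    'status': ['completed', 'refunded', 'completed', 'completed'],
    'amount': [8000, 500, 7000, 6000],
})

# WHERE equivalent: row-level filter before grouping
completed = orders[orders['status'] == 'completed']

# GROUP BY + HAVING equivalent: aggregate, then filter the groups
revenue = completed.groupby('month')['amount'].sum()
high_months = revenue[revenue > 10000]

print(high_months)  # only 2026-02 (7000 + 6000 = 13000) clears the threshold
```

The order matters for the same reason it does in SQL: filtering rows first shrinks the data the aggregation has to process.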

2. Explain window functions with a practical example

Window functions compute a value across a set of rows related to the current row, without collapsing the result set. They are essential for ranking, running totals and period-over-period comparisons.

sql
-- user_ranking.sql
SELECT
  user_id,
  purchase_date,
  amount,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY purchase_date) AS purchase_rank,
  SUM(amount)   OVER (PARTITION BY user_id ORDER BY purchase_date) AS running_total,
  LAG(amount)   OVER (PARTITION BY user_id ORDER BY purchase_date) AS prev_amount
FROM purchases;

ROW_NUMBER assigns a sequential rank per user. SUM(...) OVER(...) computes a running total without a self-join. LAG pulls the previous row's value for direct comparison. Interviewers commonly ask candidates to calculate month-over-month growth using exactly this pattern.
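
The same three window patterns map one-to-one onto Pandas groupby operations, which take-home assignments frequently test side by side. A sketch with illustrative data:

```python
import pandas as pd

purchases = pd.DataFrame({
    'user_id':       [1, 1, 1, 2, 2],
    'purchase_date': pd.to_datetime(
        ['2026-01-05', '2026-01-12', '2026-02-01', '2026-01-08', '2026-01-20']),
    'amount':        [100, 150, 90, 200, 50],
})

# ORDER BY purchase_date within each PARTITION BY user_id
purchases = purchases.sort_values(['user_id', 'purchase_date'])
g = purchases.groupby('user_id')

purchases['purchase_rank'] = g.cumcount() + 1       # ROW_NUMBER()
purchases['running_total'] = g['amount'].cumsum()   # SUM(...) OVER (...)
purchases['prev_amount']   = g['amount'].shift(1)   # LAG(amount)
```

Sorting first is essential: `cumcount`, `cumsum` and `shift` all follow row order, just as the `ORDER BY` inside `OVER (...)` does.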

3. How to identify and remove duplicate records

Duplicate detection is a daily task for analysts working with event logs, CRM exports or third-party data feeds.

sql
-- deduplicate_events.sql
WITH ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY user_id, event_type, DATE_TRUNC('minute', created_at)
      ORDER BY created_at
    ) AS rn
  FROM events
)
SELECT * FROM ranked WHERE rn = 1;

The CTE assigns a row number within each duplicate group. Keeping only rn = 1 returns the first occurrence. This approach preserves the original data while filtering out noise.
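
The Pandas equivalent of this pattern is `drop_duplicates` after sorting, so the earliest record in each group wins. A sketch mirroring the SQL above (column names and values are illustrative):

```python
import pandas as pd

events = pd.DataFrame({
    'user_id':    [1, 1, 2],
    'event_type': ['click', 'click', 'view'],
    'created_at': pd.to_datetime(
        ['2026-01-01 10:00:05', '2026-01-01 10:00:40', '2026-01-01 11:00:00']),
})

# Bucket timestamps to the minute, mirroring DATE_TRUNC('minute', created_at)
events['minute'] = events['created_at'].dt.floor('min')

deduped = (
    events.sort_values('created_at')  # earliest occurrence comes first
    .drop_duplicates(subset=['user_id', 'event_type', 'minute'], keep='first')
    .drop(columns='minute')
)
```

`keep='first'` plays the role of `WHERE rn = 1`: within each duplicate group, only the first row in sort order survives.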

4. Write a query to calculate retention by weekly cohort

Cohort retention is one of the most requested analyses in product and growth teams. The SQL pattern follows a consistent structure across all databases.

sql
-- weekly_cohort_retention.sql
WITH cohorts AS (
  SELECT
    user_id,
    DATE_TRUNC('week', MIN(event_date)) AS cohort_week
  FROM user_events
  GROUP BY user_id
),
activity AS (
  SELECT
    c.cohort_week,
    DATE_TRUNC('week', e.event_date) AS active_week,
    COUNT(DISTINCT e.user_id)        AS active_users
  FROM user_events e
  JOIN cohorts c ON c.user_id = e.user_id
  GROUP BY c.cohort_week, DATE_TRUNC('week', e.event_date)
)
SELECT
  cohort_week,
  FLOOR(EXTRACT(EPOCH FROM active_week - cohort_week) / 604800) AS weeks_since_signup,  -- 604800 s = 1 week
  active_users
FROM activity
ORDER BY cohort_week, weeks_since_signup;

The first CTE identifies each user's signup week. The second CTE counts distinct active users per cohort per week. The final query returns a long-format table that pivots cleanly into a retention matrix. Interviewers expect candidates to explain the business interpretation: a steep drop at week 1 signals an onboarding problem, while a flat curve from week 4 onward suggests a stable core user base.

5. What is a CTE, and when should it replace a subquery?

A Common Table Expression (CTE) is a named temporary result set defined with WITH. CTEs improve readability and allow recursive queries. Subqueries tend to produce deeply nested SQL that is hard to debug. A CTE is preferred when the same derived table is referenced multiple times in the query, or when the logic has three or more transformation steps.

Python Data Analysis Interview Questions

Python questions focus on Pandas, data cleaning workflows, and the ability to explain code in business terms. Interviewers want to see pragmatic solutions, not academic abstractions.

6. How to handle missing values in a Pandas DataFrame

Missing data corrupts aggregations and model inputs. The approach depends on the column type and the business context.

python
# handle_missing.py
import pandas as pd
import numpy as np

df = pd.read_csv('sales.csv')

# Inspect the extent of missing data
print(df.isnull().sum())
print(df.isnull().mean().round(3))  # share of nulls per column (0.0–1.0)

# Strategy 1: drop rows where critical columns are null
df_clean = df.dropna(subset=['customer_id', 'amount'])

# Strategy 2: fill numeric columns with median (robust to outliers)
df['amount'] = df['amount'].fillna(df['amount'].median())

# Strategy 3: fill categorical columns with the mode
df['region'] = df['region'].fillna(df['region'].mode()[0])

Dropping rows works when the missing rate is below 5% and the column is critical. Median imputation is safer than mean for skewed distributions. Mode imputation suits categorical columns with a dominant value. Documenting the imputation strategy is critical for audit trails.

7. Merge, join and concatenate: when to use each

Pandas offers multiple ways to combine DataFrames. The choice depends on whether the operation is row-wise or column-wise, and whether a key-based match is needed.

python
# combine_dataframes.py
import pandas as pd

orders = pd.read_csv('orders.csv')
customers = pd.read_csv('customers.csv')

# Key-based merge (equivalent to SQL JOIN)
result = orders.merge(customers, on='customer_id', how='left')

# Stack rows from multiple sources (events_q1, events_q2 loaded elsewhere)
all_events = pd.concat([events_q1, events_q2], ignore_index=True)

# Add columns side-by-side (features, labels loaded elsewhere; same row count required)
combined = pd.concat([features, labels], axis=1)

merge is the right choice for key-based joins. concat with default axis=0 stacks rows. concat with axis=1 attaches columns. Using how='left' preserves all rows from the left DataFrame, which mirrors a SQL LEFT JOIN.

8. Group, aggregate and transform sales data

GroupBy operations are the Pandas equivalent of SQL GROUP BY. The pattern appears in almost every data analyst take-home assignment.

python
# sales_analysis.py
import pandas as pd

df = pd.read_csv('transactions.csv', parse_dates=['date'])

# Monthly revenue by product category
monthly = (
    df.groupby([pd.Grouper(key='date', freq='ME'), 'category'])  # 'ME' = month-end ('M' is deprecated)
    .agg(revenue=('amount', 'sum'), orders=('order_id', 'nunique'))
    .reset_index()
)

# Add a column with each category's share of total monthly revenue
monthly['share'] = (
    monthly.groupby('date')['revenue']
    .transform(lambda x: x / x.sum())
    .round(4)
)

print(monthly.head(10))

agg with named aggregation keeps the output readable. transform broadcasts a group-level computation back to each row, which avoids a separate merge step. This combination covers the majority of ad-hoc reporting tasks.

9. Detect and handle outliers in a dataset

Outlier detection determines whether extreme values are data entry errors, fraud signals, or genuine edge cases.

python
# detect_outliers.py
import pandas as pd
import numpy as np

df = pd.read_csv('transactions.csv')

# IQR method
Q1 = df['amount'].quantile(0.25)
Q3 = df['amount'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = df[(df['amount'] < lower) | (df['amount'] > upper)]
print(f'Outliers found: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)')

# Cap instead of remove (winsorization)
df['amount_capped'] = df['amount'].clip(lower=lower, upper=upper)

The IQR method flags values beyond 1.5 times the interquartile range. Capping (winsorization) preserves the row count while limiting the influence of extreme values. Removal is appropriate only when outliers represent clear errors.

Ready to ace your Data Analytics interviews?

Practice with our interactive simulators, flashcards, and technical tests.

Statistics and Probability Questions

Statistical literacy separates data analysts who report numbers from those who interpret them correctly. These questions test foundational understanding.

10. Explain p-value in plain language

A p-value is the probability of observing a result at least as extreme as the measured one, assuming the null hypothesis is true. A p-value of 0.03 does not mean there is a 3% chance the null hypothesis is true. It means that if the null hypothesis were true, results this extreme would appear 3% of the time by chance alone. The threshold (alpha) is typically set at 0.05, but the choice depends on the cost of a false positive in the specific business context.
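
A quick simulation makes the definition concrete. Suppose a coin lands heads 60 times in 100 flips; under the null hypothesis of a fair coin, the one-sided p-value is the fraction of simulated fair-coin runs producing 60 or more heads. This sketch uses only the standard library and approximates what an exact binomial test would return:

```python
import random

random.seed(42)

observed_heads = 60
n_flips, n_sims = 100, 20_000

# Count simulated fair-coin runs at least as extreme as the observation
extreme = sum(
    1 for _ in range(n_sims)
    if sum(random.random() < 0.5 for _ in range(n_flips)) >= observed_heads
)
p_value = extreme / n_sims
print(f'simulated one-sided p-value: {p_value:.3f}')  # close to 0.028
```

The result says nothing about how likely the coin is fair; it says results this extreme occur roughly 3% of the time *when* the coin is fair.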

11. When to use median instead of mean

The mean is sensitive to extreme values. Income data, transaction amounts and session durations are classic examples where the median gives a more representative central tendency. A dataset with a few high-value enterprise deals can push the mean far above what a typical customer experiences. Reporting the median alongside the mean, and including the standard deviation, gives stakeholders a complete picture.
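
The effect is easy to demonstrate with Python's statistics module: one enterprise deal drags the mean far above the typical transaction (the values below are illustrative):

```python
import statistics

# Nine typical transactions plus one enterprise outlier
deals = [90, 95, 100, 105, 110, 115, 120, 125, 130, 50_000]

print('mean:  ', statistics.mean(deals))    # 5099  – dominated by a single deal
print('median:', statistics.median(deals))  # 112.5 – what a typical customer pays
print('stdev: ', round(statistics.stdev(deals)))
```

Reporting "average deal size: 5,099" here would badly misrepresent the business; the median tells the truer story.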

12. What is the difference between correlation and causation?

Correlation measures the strength and direction of a linear relationship between two variables. Causation means one variable directly influences the other. Ice cream sales and drowning incidents correlate because both increase in summer, not because one causes the other. Establishing causation requires a controlled experiment (A/B test) or a rigorous causal inference framework such as difference-in-differences or instrumental variables.

13. Explain Type I and Type II errors with a business example

A Type I error (false positive) occurs when a test incorrectly rejects the null hypothesis. Example: concluding that a new checkout flow increases conversion when it actually has no effect, leading to an unnecessary product change. A Type II error (false negative) occurs when a test fails to detect a real effect. Example: keeping the old checkout flow because the test lacked sufficient sample size to detect a genuine 2% lift. Increasing sample size reduces Type II errors. Lowering the alpha threshold reduces Type I errors but increases Type II errors.

Power BI and Data Visualization Questions

Visualization questions assess whether a candidate can choose the right chart, build maintainable dashboards, and communicate data stories to non-technical audiences.

14. What is DAX, and how does it differ from SQL?

DAX (Data Analysis Expressions) is the formula language used in Power BI, Analysis Services and Excel Power Pivot. Unlike SQL, which operates on sets of rows, DAX operates within a filter context that changes dynamically based on slicers, filters and row context. A CALCULATE function in DAX modifies the filter context before evaluating an expression, a concept with no direct SQL equivalent.

15. Explain the difference between import mode and DirectQuery

Import mode loads data into the Power BI in-memory engine (VertiPaq), delivering fast query performance but requiring scheduled refreshes. DirectQuery sends queries to the source database in real time, ensuring data freshness but depending on the source system's performance. In 2026, Microsoft Fabric introduces Direct Lake mode, which reads Parquet files from OneLake without importing or querying a SQL endpoint, combining the speed of import mode with near-real-time freshness.

16. When to use a bar chart versus a line chart

Bar charts compare discrete categories (revenue by region, count by product). Line charts show trends over continuous intervals (daily active users, monthly churn rate). Using a line chart for categorical data implies a trend between unrelated categories, which misleads the audience. A common interview follow-up asks about pie charts: they should be limited to showing parts of a whole when there are no more than five slices, because the human eye struggles to compare angles accurately.

17. How to optimize a slow Power BI report

Slow reports typically stem from excessive columns in the data model, complex DAX measures evaluated row by row, or visuals that query millions of rows. The optimization checklist includes: removing unused columns to shrink the model size, replacing iterator functions (SUMX, FILTER) with their non-iterator equivalents where possible, reducing the number of visuals per page to under 15, and using aggregation tables for high-cardinality dimensions. The Performance Analyzer built into Power BI Desktop pinpoints exactly which visual or DAX query is the bottleneck.

Behavioral and Case Study Questions

Technical skills get candidates into the interview. Behavioral questions determine whether they get the offer. Every answer should follow the STAR framework: Situation, Task, Action, Result.

18. Describe a time data analysis changed a business decision

Strong answers include a specific metric that moved, the analysis method used, and the dollar or percentage impact. Weak answers describe a dashboard without connecting it to an outcome. Example structure: "Churn analysis on a SaaS product revealed that users who did not complete onboarding within 48 hours had a 60% higher churn rate at 30 days. The product team added an automated email sequence triggered at hour 24, reducing 30-day churn by 8 percentage points."

19. How to handle conflicting data requests from stakeholders

This question tests prioritization and communication skills. The recommended approach: clarify the business question behind each request, identify shared metrics, propose a single dashboard that serves both needs, and document trade-offs transparently. Saying "no" to a stakeholder is acceptable when backed by data governance concerns or resource constraints, provided an alternative is offered.

20. Explain a complex analysis to a non-technical audience

Interviewers often ask candidates to explain a concept like regression or cohort analysis as if speaking to a marketing manager. The best responses use analogies, avoid jargon, and anchor the explanation in a business outcome. "Cohort analysis groups customers by the month they signed up and tracks how many are still active three months later. The chart shows that January signups retained better than February signups, which aligns with the onboarding changes shipped in late December."

STAR Framework Reference

Situation: set the context. Task: define the objective. Action: describe the specific steps taken. Result: quantify the impact. Interviewers penalize vague answers that skip the Result step.

Data Cleaning and ETL Interview Questions

Data quality issues consume a significant portion of an analyst's time. These questions test real-world data wrangling skills.

21. How to validate data after an ETL pipeline runs

Validation checks should cover row counts (source vs. destination), null rates on critical columns, referential integrity across tables, and value distribution shifts (detecting silent schema changes). Automated checks that run after every pipeline execution prevent bad data from reaching dashboards.

python
# etl_validation.py
import pandas as pd

def validate_pipeline(source_df: pd.DataFrame, target_df: pd.DataFrame) -> dict:
    checks = {}
    # Row count match
    checks['row_count_match'] = len(source_df) == len(target_df)
    # Null rate on critical columns
    for col in ['user_id', 'event_date', 'amount']:
        null_rate = target_df[col].isnull().mean()
        checks[f'{col}_null_rate'] = round(null_rate, 4)
    # Revenue reconciliation
    source_total = source_df['amount'].sum()
    target_total = target_df['amount'].sum()
    checks['revenue_diff_pct'] = round(
        abs(source_total - target_total) / source_total * 100, 2
    )
    return checks

results = validate_pipeline(source, target)  # source, target loaded upstream
for check, value in results.items():
    print(f'{check}: {value}')

This function returns a dictionary of validation results that can feed into an alerting system. A revenue difference above 0.01% should trigger an investigation.

22. What is the difference between normalized and denormalized data?

Normalized data minimizes redundancy by splitting information across related tables (3NF). Denormalized data combines tables into wider, flatter structures optimized for read-heavy analytical queries. OLTP systems favor normalization for data integrity. Data warehouses and BI tools favor denormalization (star schema, snowflake schema) for query speed. Understanding this trade-off is fundamental to designing efficient data pipelines.
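
The trade-off can be sketched in a few lines of Pandas: two normalized tables joined into one flat, denormalized reporting table (table contents are illustrative):

```python
import pandas as pd

# Normalized: the fact table references the dimension by key
orders = pd.DataFrame({'order_id': [1, 2, 3],
                       'customer_id': [10, 10, 20],
                       'amount': [120, 80, 200]})
customers = pd.DataFrame({'customer_id': [10, 20],
                          'region': ['EMEA', 'APAC']})

# Denormalized: one wide table, region repeated on every order row
flat = orders.merge(customers, on='customer_id', how='left')
print(flat)
```

The repetition is the point: reads no longer need a join, but a region rename now touches three rows instead of one, which is exactly why OLTP systems keep the normalized form.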

Advanced Analytics Questions

Senior roles require familiarity with experimentation, forecasting and cross-functional data strategy.

23. How to design an A/B test from scratch

An A/B test requires five components: a clear hypothesis ("Changing the CTA button color from grey to green will increase click-through rate by 5%"), a primary metric (click-through rate), a sample size calculation based on the expected effect size and desired power (typically 80%), random assignment of users to control and treatment groups, and a predetermined test duration that accounts for weekly seasonality. Running the test for at least two full business cycles prevents day-of-week bias from skewing results.
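
The sample size step can be sketched with the standard two-proportion approximation, using only the standard library (the baseline rate and lift below are illustrative, not from the source):

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per group for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96 for alpha = 0.05
    z_beta  = NormalDist().inv_cdf(power)          # ≈ 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Detecting a small lift (10% → 10.5% CTR) requires a very large sample
n = sample_size_per_group(0.10, 0.105)
print(f'{n:,} users per group')
```

This is why tiny expected effects need tens of thousands of users per arm: the required n scales with the inverse square of the effect size.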

Common A/B Testing Mistakes

Peeking at results before the test reaches statistical significance inflates the false positive rate. Pre-register the analysis plan, sample size and duration before launching the test.

24. Explain funnel analysis and where drop-offs matter most

Funnel analysis tracks user progression through a sequence of steps (e.g., homepage visit, product view, add to cart, checkout, purchase). The highest-impact optimization targets the step with the largest absolute drop-off, not the largest percentage drop-off. A 40% drop from 1,000 users at product view to 600 at add-to-cart represents 400 lost users. A 50% drop from 100 users at checkout to 50 at purchase represents only 50 lost users. Prioritizing the first bottleneck delivers more total conversions.
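
The prioritization rule translates directly into code. A sketch using step counts like those in the example above (the numbers are illustrative):

```python
# (step name, users reaching that step)
funnel = [
    ('product_view', 1000),
    ('add_to_cart',   600),
    ('checkout',      100),
    ('purchase',       50),
]

# Absolute users lost between each consecutive pair of steps
drop_offs = [
    (step, prev_count - count)
    for (_, prev_count), (step, count) in zip(funnel, funnel[1:])
]
biggest_step, lost_users = max(drop_offs, key=lambda d: d[1])
print(f'Fix "{biggest_step}" first: {lost_users:,} users lost')
```

Ranking by absolute loss rather than percentage loss is the whole point: the step that sheds the most users is where recovered conversions are largest.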

25. How to build a KPI dashboard that stakeholders actually use

Dashboards fail when they display too many metrics without hierarchy. An effective KPI dashboard follows the inverted pyramid: one or two north-star metrics at the top (e.g., monthly recurring revenue, net promoter score), supporting metrics in the middle (conversion rate, churn rate, average order value), and diagnostic details accessible through drill-downs. Limiting the top-level view to seven or fewer metrics reduces cognitive load. Adding automated annotations ("Revenue dropped 12% on March 5 due to a payment gateway outage") provides context that raw numbers lack.

Start practicing!

Test your knowledge with our interview simulators and technical tests.

Conclusion

  • SQL window functions, CTEs and cohort queries appear in nearly every data analyst technical screen. Writing them from memory is a baseline expectation.
  • Python Pandas proficiency covers 80% of ad-hoc analysis tasks. Focus on merge, groupby, transform and null handling.
  • Statistical reasoning (p-values, correlation vs. causation, experiment design) separates analysts who report numbers from those who interpret them correctly.
  • Power BI and visualization questions test chart selection, DAX fundamentals and dashboard performance optimization.
  • Behavioral answers that follow the STAR framework and include quantified results outperform vague narratives.
  • Practice data analytics interview modules on SharpSkill to drill these concepts with timed exercises.
  • For related preparation, the Data Science interview guide covers machine learning and statistical modeling questions that overlap with senior data analyst roles.


Tags

#data-analytics
#interview
#sql
#python
#power-bi
#statistics
