Advanced SQL for Data Analyst Interviews: Subqueries, Pivots and Query Optimization

Master advanced SQL techniques tested in data analyst interviews — correlated subqueries, pivot queries with conditional aggregation, EXPLAIN plans, and indexing strategies with real-world examples.


Advanced SQL interview questions for data analysts go far beyond SELECT and GROUP BY. Hiring managers at companies like Meta, Stripe, and Airbnb test candidates on correlated subqueries, pivot transformations, and query optimization — skills that separate junior analysts from senior ones. This guide covers the exact patterns that appear in technical screens, with production-grade examples running on PostgreSQL 17.

What Interviewers Actually Test

SQL interviews for data analyst roles in 2026 focus on three areas: breaking complex logic into readable subqueries or CTEs, transforming row-level data into pivot reports using conditional aggregation, and demonstrating awareness of query performance through EXPLAIN plans and indexing. Most candidates fail on optimization — knowing it exists gives a significant edge.

Correlated Subqueries vs. Regular Subqueries in SQL Interviews

A correlated subquery references a column from the outer query, forcing the database engine to re-evaluate the subquery for every row in the outer result set. A regular (non-correlated) subquery runs once and returns a fixed result. Interviewers specifically test whether candidates understand when each approach is appropriate — and when a JOIN performs better than either.

A regular subquery works well for threshold comparisons. The inner query executes once, and the outer query filters against that single value.

sql
-- regular_subquery_threshold.sql
-- Find all orders above the average order value
SELECT
  order_id,
  customer_id,
  total_amount
FROM orders
WHERE total_amount > (
  -- Executes once, returns a single scalar value
  SELECT AVG(total_amount)
  FROM orders
);

A correlated subquery is necessary when the comparison depends on each row's context. The classic interview question: "Find employees who earn more than the average salary in their department."

sql
-- correlated_subquery_department.sql
-- Employees earning above their department average
SELECT
  e.employee_id,
  e.name,
  e.department,
  e.salary
FROM employees e
WHERE e.salary > (
  -- Re-evaluated for each employee's department
  SELECT AVG(e2.salary)
  FROM employees e2
  WHERE e2.department = e.department
);

The correlated version re-runs the AVG calculation for each row's department. On large tables, this becomes expensive. The optimized approach uses a CTE or a window function to compute department averages once.

sql
-- optimized_with_cte.sql
-- Same result, single pass over the data
WITH dept_avg AS (
  SELECT
    department,
    AVG(salary) AS avg_salary
  FROM employees
  GROUP BY department
)
SELECT
  e.employee_id,
  e.name,
  e.department,
  e.salary
FROM employees e
JOIN dept_avg d ON e.department = d.department
WHERE e.salary > d.avg_salary;

Interviewers look for candidates who can write the correlated version and then refactor it. Demonstrating both approaches — and explaining the performance tradeoff — signals senior-level SQL fluency.
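
The window-function refactor mentioned above is worth having ready as well. A minimal sketch against the same employees table (window functions cannot appear in WHERE, hence the subquery):

```sql
-- window_function_refactor.sql
-- AVG(...) OVER (PARTITION BY department) attaches each department's
-- average to every row in a single pass; the outer query filters on it.
SELECT employee_id, name, department, salary
FROM (
  SELECT
    employee_id,
    name,
    department,
    salary,
    AVG(salary) OVER (PARTITION BY department) AS dept_avg
  FROM employees
) with_avg
WHERE salary > dept_avg;
```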

Subquery Patterns Frequently Asked in Data Analyst Screens

Three subquery patterns appear repeatedly across data analyst interviews: EXISTS for membership checks, scalar subqueries in SELECT clauses, and derived tables in FROM clauses.

EXISTS vs. IN — EXISTS can short-circuit as soon as a match is found, which often makes it faster than IN when the subquery returns many rows. IN conceptually materializes the whole subquery result before comparing, although modern planners frequently rewrite both forms into equivalent semi-joins.

sql
-- exists_vs_in.sql
-- Customers who placed at least one order in 2026 (EXISTS — preferred)
SELECT c.customer_id, c.name
FROM customers c
WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.customer_id
    AND o.order_date >= '2026-01-01'
);

-- Equivalent with IN (slower on large datasets)
SELECT customer_id, name
FROM customers
WHERE customer_id IN (
  SELECT customer_id
  FROM orders
  WHERE order_date >= '2026-01-01'
);

Scalar subquery in SELECT — useful for adding a computed column without a full JOIN.

sql
-- scalar_subquery_select.sql
-- Each product with its category's total revenue
SELECT
  p.product_id,
  p.product_name,
  p.category,
  (
    SELECT SUM(oi.quantity * oi.unit_price)
    FROM order_items oi
    JOIN products p2 ON oi.product_id = p2.product_id
    WHERE p2.category = p.category
  ) AS category_total_revenue
FROM products p;

This pattern is clean for adding one or two columns. For more, a CTE with a JOIN is more readable and often faster.
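
For comparison, a sketch of that CTE rewrite, using the same hypothetical schema as above:

```sql
-- category_revenue_cte.sql
-- Compute each category's revenue once, then join it back to products
WITH category_revenue AS (
  SELECT
    p.category,
    SUM(oi.quantity * oi.unit_price) AS category_total_revenue
  FROM order_items oi
  JOIN products p ON oi.product_id = p.product_id
  GROUP BY p.category
)
SELECT
  p.product_id,
  p.product_name,
  p.category,
  cr.category_total_revenue
FROM products p
JOIN category_revenue cr ON cr.category = p.category;
```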

Ready to ace your Data Analytics interviews?

Practice with our interactive simulators, flashcards, and technical tests.

Pivot Queries with Conditional Aggregation in SQL

Pivot queries transform row-level data into columnar summaries — turning months stored as rows into columns like jan_revenue, feb_revenue. While some databases offer a native PIVOT operator, the portable approach uses CASE statements inside aggregate functions. This is the version interviewers expect.

Each CASE expression filters for a specific category, and the aggregate (SUM, COUNT, AVG) collapses the matching rows into a single value per group.

sql
-- pivot_monthly_revenue.sql
-- Monthly revenue pivot: one row per product, one column per month
SELECT
  p.product_id,
  p.product_name,
  SUM(CASE WHEN EXTRACT(MONTH FROM o.order_date) = 1
    THEN oi.quantity * oi.unit_price ELSE 0 END) AS jan_revenue,
  SUM(CASE WHEN EXTRACT(MONTH FROM o.order_date) = 2
    THEN oi.quantity * oi.unit_price ELSE 0 END) AS feb_revenue,
  SUM(CASE WHEN EXTRACT(MONTH FROM o.order_date) = 3
    THEN oi.quantity * oi.unit_price ELSE 0 END) AS mar_revenue,
  SUM(CASE WHEN EXTRACT(MONTH FROM o.order_date) = 4
    THEN oi.quantity * oi.unit_price ELSE 0 END) AS apr_revenue,
  -- Repeat for remaining months
  SUM(oi.quantity * oi.unit_price) AS total_revenue
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.order_date >= '2026-01-01'
  AND o.order_date < '2027-01-01'  -- Bound the year so other years' months don't mix in
GROUP BY p.product_id, p.product_name
ORDER BY total_revenue DESC;

The same conditional aggregation technique handles multi-dimensional pivots. A common interview question asks candidates to build a report of user sessions broken down by device type, including each segment's share of total activity.

sql
-- pivot_user_activity.sql
-- User activity: sessions by day of week and device type
SELECT
  user_id,
  COUNT(CASE WHEN device_type = 'mobile' THEN 1 END) AS mobile_sessions,
  COUNT(CASE WHEN device_type = 'desktop' THEN 1 END) AS desktop_sessions,
  COUNT(CASE WHEN device_type = 'tablet' THEN 1 END) AS tablet_sessions,
  ROUND(
    COUNT(CASE WHEN device_type = 'mobile' THEN 1 END) * 100.0 / COUNT(*),
    1
  ) AS mobile_pct
FROM user_sessions
WHERE session_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id
HAVING COUNT(*) >= 5  -- Only users with meaningful activity
ORDER BY mobile_pct DESC;

Conditional aggregation with CASE is a fundamental data analyst skill. The PostgreSQL documentation on conditional expressions covers additional patterns like FILTER clauses that simplify this syntax.
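
As a sketch, the device-type pivot above rewritten with the FILTER clause (available in PostgreSQL 9.4 and later):

```sql
-- filter_clause_pivot.sql
-- FILTER replaces CASE-inside-aggregate with a dedicated clause
SELECT
  user_id,
  COUNT(*) FILTER (WHERE device_type = 'mobile')  AS mobile_sessions,
  COUNT(*) FILTER (WHERE device_type = 'desktop') AS desktop_sessions,
  COUNT(*) FILTER (WHERE device_type = 'tablet')  AS tablet_sessions
FROM user_sessions
WHERE session_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id;
```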

Dynamic Pivot with CROSSTAB in PostgreSQL

PostgreSQL's tablefunc extension provides CROSSTAB for pivots where the category columns are declared separately from the source query. The output column list still has to be spelled out, so a fully dynamic pivot requires generated SQL, but CROSSTAB keeps the aggregation query itself clean. While less common in interviews, it appears in take-home assignments.

sql
-- crosstab_dynamic_pivot.sql
-- Enable the extension (once per database)
CREATE EXTENSION IF NOT EXISTS tablefunc;

-- Revenue by product and quarter using CROSSTAB
SELECT *
FROM crosstab(
  $$
    SELECT
      product_name,
      'Q' || EXTRACT(QUARTER FROM order_date)::TEXT AS quarter,
      SUM(quantity * unit_price) AS revenue
    FROM order_items oi
    JOIN orders o ON oi.order_id = o.order_id
    JOIN products p ON oi.product_id = p.product_id
    WHERE EXTRACT(YEAR FROM o.order_date) = 2026
    GROUP BY product_name, quarter
    ORDER BY product_name, quarter
  $$,
  $$ VALUES ('Q1'), ('Q2'), ('Q3'), ('Q4') $$
) AS pivot_table(
  product_name TEXT,
  q1_revenue NUMERIC,
  q2_revenue NUMERIC,
  q3_revenue NUMERIC,
  q4_revenue NUMERIC
);

CROSSTAB requires that the source query returns exactly three columns: row identifier, category, and value. The second argument lists all possible category values. Missing combinations produce NULL rather than 0.

SQL Query Optimization: EXPLAIN Plans and Cost Analysis

Query optimization questions test whether a candidate can diagnose slow queries — not just write correct ones. The primary diagnostic tool is EXPLAIN ANALYZE, which executes the query and reports actual execution times, row counts, and the operations the planner chose.

sql
-- explain_analyze_example.sql
-- Analyze a slow query to identify bottlenecks
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT
  c.customer_id,
  c.name,
  COUNT(o.order_id) AS order_count,
  SUM(o.total_amount) AS lifetime_value
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2025-01-01'
GROUP BY c.customer_id, c.name
HAVING SUM(o.total_amount) > 1000
ORDER BY lifetime_value DESC;

Key elements in the EXPLAIN output that interviewers ask about:

  • Seq Scan — a full table scan. Acceptable on small tables or when most rows match the filter; on millions of rows it usually points to a missing index or a predicate the planner cannot use.
  • Index Scan / Index Only Scan — the query uses an index. Index Only Scan is faster because it answers the query from the index alone, skipping the table (as long as the visibility map is up to date).
  • Nested Loop / Hash Join / Merge Join — the join strategy. Hash Join suits large unsorted datasets. Merge Join is optimal when both sides are pre-sorted. Nested Loop works for small outer tables joined to indexed inner tables.
  • Rows — estimated vs. actual row counts. Large discrepancies suggest stale statistics (run ANALYZE) or poorly selective predicates.
  • Buffers: shared hit / read — cache hit ratio. High read counts mean the data is not in memory.
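
When estimated and actual row counts diverge badly, refreshing statistics is the usual first step. A sketch against the orders table from the running example:

```sql
-- refresh_statistics.sql
-- Recompute planner statistics for the table, then re-check the plan
ANALYZE orders;

-- For a heavily skewed column, a larger sampling target can sharpen
-- the estimates (takes effect on the next ANALYZE)
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;
ANALYZE orders;
```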

The PostgreSQL EXPLAIN documentation covers all available output formats and options.

Indexing Strategies for Interview-Level SQL Optimization

Indexing is the most impactful optimization technique and a frequent interview topic. The question usually takes the form: "Given this slow query, what indexes would improve it?"

Three indexing patterns cover most interview scenarios:

1. Single-column index on high-selectivity WHERE clauses

sql
-- indexing_strategies.sql
-- Index on order_date: filters a small percentage of rows
CREATE INDEX idx_orders_date ON orders (order_date);

-- Composite index for queries filtering on both columns
CREATE INDEX idx_orders_customer_date
  ON orders (customer_id, order_date);

-- Covering index: includes columns needed in SELECT
-- Enables Index Only Scan (no table access)
CREATE INDEX idx_orders_covering
  ON orders (customer_id, order_date)
  INCLUDE (total_amount, status);

2. Composite index column order matters. The leftmost column must match the query's WHERE clause. An index on (customer_id, order_date) accelerates queries filtering by customer_id alone or by customer_id AND order_date, but NOT queries filtering only by order_date.
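
The leftmost-prefix rule can be sketched with hypothetical queries against the composite index above (the customer id 42 is an illustrative value):

```sql
-- leftmost_prefix.sql
-- Given: CREATE INDEX ... ON orders (customer_id, order_date)

-- Can use the index: the leading column is filtered
SELECT * FROM orders WHERE customer_id = 42;

-- Can use the index fully: both prefix columns are filtered
SELECT * FROM orders
WHERE customer_id = 42
  AND order_date >= '2026-01-01';

-- Cannot use it as an access path: the leading column is absent
SELECT * FROM orders WHERE order_date >= '2026-01-01';
```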

3. Partial indexes reduce index size and maintenance cost when only a subset of rows is frequently queried.

sql
-- partial_index.sql
-- Only index active orders (smaller, faster index)
CREATE INDEX idx_orders_active
  ON orders (customer_id, order_date)
  WHERE status = 'active';

-- Only index recent data for dashboard queries
CREATE INDEX idx_orders_recent
  ON orders (order_date)
  WHERE order_date >= '2026-01-01';

Index Anti-Patterns

Avoid indexing every column — each index slows down INSERT, UPDATE, and DELETE operations. A table with 10+ indexes on heavy-write workloads will experience noticeable degradation. Also avoid indexing low-cardinality columns (like boolean flags) unless part of a composite or partial index.

Common Query Anti-Patterns That Kill Performance

Interviewers present intentionally slow queries and ask candidates to identify the problem. These five anti-patterns account for most real-world performance issues.

Functions on indexed columns — applying a function to an indexed column prevents index usage.

sql
-- anti_patterns.sql
-- BAD: function on indexed column disables the index
SELECT * FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2026;

-- GOOD: range comparison uses the index
SELECT * FROM orders
WHERE order_date >= '2026-01-01'
  AND order_date < '2027-01-01';

SELECT * in production queries — fetches all columns including large TEXT or BYTEA fields. Specify only needed columns.

NOT IN with NULLs — if the subquery returns any NULL, NOT IN returns no rows at all. Replace with NOT EXISTS.

sql
-- not_in_null_trap.sql
-- BAD: returns empty if any customer_id is NULL in subquery
SELECT * FROM customers
WHERE customer_id NOT IN (
  SELECT customer_id FROM orders
);

-- GOOD: NOT EXISTS handles NULLs correctly
SELECT * FROM customers c
WHERE NOT EXISTS (
  SELECT 1 FROM orders o
  WHERE o.customer_id = c.customer_id
);

Implicit type casting — comparing a VARCHAR column with an integer forces a per-row cast in engines that allow it (MySQL, for example), bypassing any index on the column; PostgreSQL rejects most mismatched comparisons outright, but an explicit cast on the column has the same index-defeating effect. Always match types explicitly.
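
In PostgreSQL, the index-defeating variant of this trap appears when the column itself is cast. A sketch against the orders table (the customer id is a hypothetical value):

```sql
-- explicit_cast_trap.sql
-- BAD: casting the column defeats any index on customer_id
SELECT * FROM orders WHERE customer_id::TEXT = '42';

-- GOOD: compare in the column's own type; the index stays usable
SELECT * FROM orders WHERE customer_id = 42;
```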

Missing LIMIT on exploratory queries — scanning millions of rows when only the first 10 are needed. Always add LIMIT during development and in paginated API queries.
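
For paginated queries, keyset pagination pairs LIMIT with an indexed ordering instead of deep OFFSETs. A sketch with hypothetical cursor values:

```sql
-- keyset_pagination.sql
-- Assumes an index on (order_date, order_id); the row-value pair
-- ('2026-03-01', 41820) is the last row of the previous page
-- (hypothetical values)
SELECT order_id, order_date, total_amount
FROM orders
WHERE (order_date, order_id) > ('2026-03-01', 41820)
ORDER BY order_date, order_id
LIMIT 10;
```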

Practice These Patterns

To build fluency with SQL subqueries and CTEs, work through real interview scenarios. Combine subquery refactoring with EXPLAIN analysis to develop the diagnostic intuition that interviewers look for.


Conclusion

  • Correlated subqueries re-execute per row — refactor to CTEs or window functions for better performance on large datasets
  • EXISTS outperforms IN for membership checks, especially when the subquery returns many rows or contains NULLs
  • Conditional aggregation with CASE is the portable pivot technique — learn it before database-specific PIVOT syntax
  • EXPLAIN ANALYZE reveals the actual execution plan; focus on Seq Scan, row estimate accuracy, and join strategy
  • Composite index column order must match query filter patterns — leftmost prefix rule applies
  • Partial and covering indexes reduce I/O for targeted query patterns without the overhead of full-table indexes
  • Avoid functions on indexed columns, NOT IN with NULLs, and SELECT * in production queries
  • Practice combining window functions with subquery optimization to handle multi-step analytical questions

Start practicing!

Test your knowledge with our interview simulators and technical tests.

Tags

#sql
#data-analytics
#interview
#query-optimization
#subqueries
