SQL for Data Analysts: Window Functions, CTEs and Advanced Queries

Master SQL window functions (ROW_NUMBER, RANK, LAG/LEAD), Common Table Expressions, and advanced query techniques essential for data analyst interviews and daily work.

SQL window functions and CTEs for data analytics with dashboards and query results

SQL window functions, CTEs (Common Table Expressions), and advanced query patterns form the backbone of analytical SQL. Whether preparing for a data analyst interview or tackling complex reporting queries, these techniques turn verbose, hard-to-read subqueries into clean, performant SQL.

Quick Reference

Window functions perform calculations across a set of rows related to the current row — without collapsing them into a single output row like GROUP BY does. Combined with CTEs, they make complex analytical queries readable and maintainable.

Understanding SQL Window Functions and OVER Clause

A window function applies a calculation across a defined "window" of rows. The OVER clause controls which rows fall within that window and how they are ordered. Unlike aggregate functions with GROUP BY, window functions retain every individual row in the result set.

The general syntax follows a consistent pattern across all major databases — PostgreSQL, MySQL 8+, BigQuery, SQL Server, and Snowflake.

sql
-- window_function_syntax.sql
SELECT
  employee_id,
  department,
  salary,
  -- Ranks employees within each department by salary
  RANK() OVER (
    PARTITION BY department
    ORDER BY salary DESC
  ) AS dept_salary_rank,
  -- Running total of salaries within department
  SUM(salary) OVER (
    PARTITION BY department
    ORDER BY salary DESC
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_salary_total
FROM employees;

PARTITION BY splits the data into groups (similar to GROUP BY but without collapsing rows). ORDER BY defines the sequence within each partition. The optional frame clause (ROWS BETWEEN ...) narrows the window to a specific range of rows.

ROW_NUMBER, RANK, and DENSE_RANK Compared

These three ranking functions look similar but behave differently when ties exist. Choosing the wrong one is a frequent source of bugs in analytical queries.

sql
-- ranking_comparison.sql
SELECT
  product_name,
  category,
  revenue,
  ROW_NUMBER() OVER (ORDER BY revenue DESC) AS row_num,
  RANK()       OVER (ORDER BY revenue DESC) AS rank_val,
  DENSE_RANK() OVER (ORDER BY revenue DESC) AS dense_rank_val
FROM products;

| product_name | revenue | row_num | rank_val | dense_rank_val | |-------------|---------|---------|----------|----------------| | Widget A | 50000 | 1 | 1 | 1 | | Widget B | 50000 | 2 | 1 | 1 | | Widget C | 40000 | 3 | 3 | 2 | | Widget D | 30000 | 4 | 4 | 3 |

ROW_NUMBER always assigns a unique sequential integer — even for ties, one row arbitrarily gets the lower number. RANK assigns the same number to ties but skips subsequent numbers (1, 1, 3). DENSE_RANK also handles ties but never skips numbers (1, 1, 2). For top-N queries where duplicates should share a position, DENSE_RANK is usually the correct choice.

LAG and LEAD for Period-over-Period Analysis

LAG and LEAD access data from previous or subsequent rows without self-joins. These functions are essential for calculating period-over-period growth, detecting trends, and identifying changes.

sql
-- period_over_period.sql
SELECT
  month,
  revenue,
  -- Previous month revenue
  LAG(revenue, 1) OVER (ORDER BY month) AS prev_month_revenue,
  -- Month-over-month growth rate
  ROUND(
    (revenue - LAG(revenue, 1) OVER (ORDER BY month)) * 100.0
    / NULLIF(LAG(revenue, 1) OVER (ORDER BY month), 0),
    2
  ) AS mom_growth_pct,
  -- Next month revenue (forward-looking)
  LEAD(revenue, 1) OVER (ORDER BY month) AS next_month_revenue
FROM monthly_sales
ORDER BY month;

The second argument to LAG/LEAD specifies the offset (default is 1). A third optional argument provides a default value when no row exists at that offset — useful for avoiding NULLs in the first/last rows. The NULLIF in the growth calculation prevents division-by-zero errors.

NTILE and Percentile Bucketing

NTILE distributes rows into a specified number of roughly equal groups. Data analysts use it for quartile analysis, decile scoring, and customer segmentation.

sql
-- customer_segmentation.sql
SELECT
  customer_id,
  total_spend,
  -- Split customers into 4 quartiles by spend
  NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile,
  -- Decile scoring for finer granularity
  NTILE(10) OVER (ORDER BY total_spend DESC) AS spend_decile
FROM (
  SELECT
    customer_id,
    SUM(order_total) AS total_spend
  FROM orders
  WHERE order_date >= '2025-01-01'
  GROUP BY customer_id
) customer_totals;

Quartile 1 contains the highest spenders, quartile 4 the lowest. This pattern directly maps to RFM (Recency, Frequency, Monetary) segmentation used in marketing analytics.

Ready to ace your Data Analytics interviews?

Practice with our interactive simulators, flashcards, and technical tests.

Common Table Expressions: Replacing Nested Subqueries

CTEs (WITH clauses) break complex queries into named, readable steps. Each CTE acts as a temporary named result set that exists only for the duration of the query. Beyond readability, CTEs make debugging easier — each step can be tested independently.

sql
-- cte_sales_analysis.sql
WITH monthly_revenue AS (
  -- Step 1: Aggregate raw orders into monthly totals
  SELECT
    DATE_TRUNC('month', order_date) AS month,
    product_category,
    SUM(amount) AS revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
  FROM orders
  WHERE order_date >= '2025-01-01'
  GROUP BY DATE_TRUNC('month', order_date), product_category
),
ranked_categories AS (
  -- Step 2: Rank categories within each month
  SELECT
    month,
    product_category,
    revenue,
    unique_customers,
    RANK() OVER (
      PARTITION BY month
      ORDER BY revenue DESC
    ) AS category_rank
  FROM monthly_revenue
)
-- Step 3: Final output — top 3 categories per month
SELECT
  month,
  product_category,
  revenue,
  unique_customers,
  category_rank
FROM ranked_categories
WHERE category_rank <= 3
ORDER BY month, category_rank;

This three-step approach replaces what would otherwise be a deeply nested subquery. Each CTE has a clear responsibility: aggregation, ranking, and filtering.

Recursive CTEs for Hierarchical Data

Recursive CTEs solve problems involving hierarchical or graph-like data — org charts, category trees, bill-of-materials, and path-finding queries.

sql
-- recursive_org_chart.sql
WITH RECURSIVE org_hierarchy AS (
  -- Base case: top-level managers (no manager above them)
  SELECT
    employee_id,
    employee_name,
    manager_id,
    1 AS depth,
    employee_name AS management_chain
  FROM employees
  WHERE manager_id IS NULL

  UNION ALL

  -- Recursive case: join employees to their managers
  SELECT
    e.employee_id,
    e.employee_name,
    e.manager_id,
    oh.depth + 1,
    oh.management_chain || ' > ' || e.employee_name
  FROM employees e
  INNER JOIN org_hierarchy oh ON e.manager_id = oh.employee_id
)
SELECT
  employee_id,
  employee_name,
  depth,
  management_chain
FROM org_hierarchy
ORDER BY management_chain;

The base case selects root nodes (employees with no manager). The recursive case joins back to the CTE itself, building the hierarchy level by level. The management_chain column concatenates names to show the full reporting path. Most databases limit recursion depth to prevent infinite loops — PostgreSQL defaults to 100 iterations.

Recursive CTE Performance

Recursive CTEs can be slow on large datasets. Always include a depth limit (WHERE depth < 10) and ensure the join column (manager_id) is indexed. For very deep hierarchies, consider materialized path or nested set patterns instead.

Advanced Analytical Patterns: Gaps, Islands, and Running Totals

The gaps-and-islands pattern identifies consecutive sequences in data — active subscription periods, consecutive login days, or uninterrupted production runs. This technique combines ROW_NUMBER with date arithmetic.

sql
-- consecutive_login_streaks.sql
WITH login_days AS (
  SELECT DISTINCT
    user_id,
    DATE(login_timestamp) AS login_date
  FROM user_sessions
),
streaks AS (
  SELECT
    user_id,
    login_date,
    -- Subtracting row_number from date creates a constant for consecutive days
    login_date - (ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY login_date
    ))::int AS streak_group
  FROM login_days
)
SELECT
  user_id,
  MIN(login_date) AS streak_start,
  MAX(login_date) AS streak_end,
  COUNT(*) AS streak_length
FROM streaks
GROUP BY user_id, streak_group
HAVING COUNT(*) >= 3
ORDER BY streak_length DESC;

The core insight: for consecutive dates, subtracting an incrementing row number from the date always yields the same value. When a gap occurs, the resulting value shifts, creating a new group. This query finds all login streaks of 3+ consecutive days per user.

Combining Window Functions with CASE for Conditional Analytics

Real-world analytical queries frequently combine window functions with CASE expressions to compute conditional metrics within the same query.

sql
-- conditional_analytics.sql
SELECT
  order_date,
  product_category,
  amount,
  -- 7-day moving average
  AVG(amount) OVER (
    PARTITION BY product_category
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS moving_avg_7d,
  -- Cumulative sum resetting each quarter
  SUM(amount) OVER (
    PARTITION BY product_category, DATE_TRUNC('quarter', order_date)
    ORDER BY order_date
  ) AS qtd_cumulative,
  -- Flag if current row exceeds 2x the category average
  CASE
    WHEN amount > 2 * AVG(amount) OVER (PARTITION BY product_category)
    THEN 'OUTLIER'
    ELSE 'NORMAL'
  END AS outlier_flag
FROM daily_sales
ORDER BY product_category, order_date;

This single query computes a rolling 7-day average, a quarter-to-date cumulative sum that resets each quarter, and an outlier flag — all without subqueries or self-joins.

Window Function Execution Order

Window functions execute after WHERE, GROUP BY, and HAVING — but before ORDER BY and LIMIT. This means window functions cannot be used in WHERE clauses directly. To filter on a window function result, wrap the query in a CTE or subquery first.

Performance Optimization for Analytical Queries

Window functions and CTEs are powerful but can become bottlenecks on large tables. Several techniques help maintain performance at scale.

First, index the columns used in PARTITION BY and ORDER BY clauses. A composite index matching the window definition eliminates sort operations.

Second, filter data before applying window functions. Placing conditions in the WHERE clause (before window function evaluation) reduces the dataset the window must process.

Third, avoid redundant window definitions. Named windows reduce repetition and signal to the optimizer that multiple functions share the same frame:

sql
-- named_window.sql
SELECT
  department,
  employee_id,
  salary,
  RANK() OVER dept_window AS dept_rank,
  SUM(salary) OVER dept_window AS dept_total,
  AVG(salary) OVER dept_window AS dept_avg
FROM employees
WINDOW dept_window AS (
  PARTITION BY department
  ORDER BY salary DESC
);

The WINDOW clause (supported in PostgreSQL, MySQL 8+, and BigQuery) defines the window once and reuses it across multiple functions. This improves both readability and potential query plan optimization.

Ready to ace your Data Analytics interviews?

Practice with our interactive simulators, flashcards, and technical tests.

Conclusion

  • Window functions (ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, NTILE) operate across row sets without collapsing them — unlike GROUP BY aggregates
  • CTEs break complex queries into named, testable steps and replace deeply nested subqueries
  • Recursive CTEs handle hierarchical data like org charts and category trees, but require depth limits and indexed join columns
  • The gaps-and-islands technique (ROW_NUMBER + date arithmetic) identifies consecutive sequences in time-series data
  • Named windows (WINDOW clause) reduce code duplication and can improve query plan optimization
  • Filter data in WHERE clauses before window function evaluation to maintain performance on large tables
  • These patterns appear frequently in data analytics interviews and directly apply to daily reporting, segmentation, and trend analysis

Start practicing!

Test your knowledge with our interview simulators and technical tests.

Tags

#sql
#data-analytics
#window-functions
#cte
#interview

Share

Related articles