SQL is the de-facto standard language when we need to deal with data definition, manipulation and refinement. We discussed in the past a few topics on that, but one of most important topics were not there, so let fix that.
TL;DR a window is a way to perform aggregate operations into the tables being consulted by an analytic query.
SQL queries can extract lines (or tuples) from one or several tables from a database. It also can perform aggregation operations, slightly changing the meaning of those lines.
For example, if i want to know the highest salary from employee table, i can either do this:
select salary as max_salary from employees order by salary desc limit 1;
Or this:
select max(salary) as max_salary from employees;
For both queries the result would look like this:
MAX_SALARY|
----------+
205.00|
The information requested is quite the same, but there is important differences in the query and in the result:
In fact, if we need to produce a result which permit us to compare aggregate results with analytic results, we could end up with some strange join clauses.
Let's say that we want to compare employees salaries with the biggest one. We could write something like this:
select name, salary, max_salary
from employees
join (select max(salary) as max_salary from employees)
order by salary desc;
The result would be like this:
NAME |SALARY|MAX_SALARY|
-----+------+----------+
rick |205.00| 205.00|
jane |200.00| 205.00|
bob |100.55| 205.00|
eve |100.00| 205.00|
alice|100.00| 205.00|
That way we combine analytic and aggregate information into a single result, but our query became a little complex. we have a subquery now just to extract info from the same table being worked on the main query. that's no good.
A window function can perform the same with a less verbose syntax:
select name, salary, max(salary) over() as max_salary
from employees
order by salary desc;
The result is the same as the previous query.
The over()
statement is the window to the aggregate information we need
for our analysis.
The window itself is simpler than a subquery, but it's also a more powerful tool.
For example, let's see our sales and rank them by total cost in each category:
SELECT
ID,
PRODUCT_CODE,
PRODUCT_CATEGORY_CODE,
TOTAL_COST,
RANK() over(PARTITION BY PRODUCT_CATEGORY_CODE
ORDER BY TOTAL_COST desc) AS sale_position
FROM sales
ORDER BY PRODUCT_CATEGORY_CODE , TOTAL_COST;
The result would be:
ID|PRODUCT_CODE|PRODUCT_CATEGORY_CODE|TOTAL_COST|SALE_POSITION|
--+------------+---------------------+----------+-------------+
11| 1| 1| 10.00| 5|
6| 1| 1| 10.00| 5|
7| 2| 1| 20.00| 3|
12| 2| 1| 20.00| 3|
13| 1| 1| 30.00| 1|
8| 1| 1| 30.00| 1|
9| 3| 2| 10.00| 2|
14| 3| 2| 10.00| 2|
10| 3| 2| 10.00| 2|
15| 3| 2| 40.00| 1|
17| 4| 3| 40.00| 2|
16| 4| 3| 80.00| 1|
What does it mean? It show how good each sale is compared with our best sale on their product category.
Rank operation jumps numbers when there is a tie between results, you can avoid
that by simply using DENSE_RANK()
instead of RANK()
.
Another good example: get the cumulative distribution of sales by total cost and by product quantity in the sale. useful to understand what sells more and what worths more.
SELECT
ID,
PRODUCT_CODE,
PRODUCT_CATEGORY_CODE,
TOTAL_COST,
CUME_DIST () over(order BY PRODUCT_QUANTITY) AS sold_items,
CUME_DIST () over(order BY TOTAL_COST) AS value_items
FROM sales
ORDER BY PRODUCT_CATEGORY_CODE , TOTAL_COST;
The result would be:
ID|PRODUCT_CODE|PRODUCT_CATEGORY_CODE|TOTAL_COST|SOLD_ITEMS |VALUE_ITEMS |
--+------------+---------------------+----------+------------------+------------------+
6| 1| 1| 10.00|0.5833333333333334|0.4166666666666667|
11| 1| 1| 10.00|0.5833333333333334|0.4166666666666667|
12| 2| 1| 20.00|0.5833333333333334|0.5833333333333334|
7| 2| 1| 20.00|0.5833333333333334|0.5833333333333334|
8| 1| 1| 30.00| 0.75| 0.75|
13| 1| 1| 30.00| 0.75| 0.75|
9| 3| 2| 10.00|0.5833333333333334|0.4166666666666667|
10| 3| 2| 10.00|0.5833333333333334|0.4166666666666667|
14| 3| 2| 10.00|0.5833333333333334|0.4166666666666667|
15| 3| 2| 40.00|0.9166666666666666|0.9166666666666666|
17| 4| 3| 40.00|0.9166666666666666|0.9166666666666666|
16| 4| 3| 80.00| 1.0| 1.0|
Queries start to become more and more funky, but the insight extracted from them become more and more precious to make strategic decisions.
The sample data for this article can be found here;