Let’s Talk about SQL — Part 6

Amandaspotter
Analytics Vidhya
Published in
6 min readOct 30, 2020

--

Window Function Fun!

This is the sixth post in a series on SQL.

My fifth post on SQL was more on joins and query structure. You can read that here.

To recap, in the last post, we discussed INNER and LEFT joins and how to go about building a query. We also rewrote the query from part 4 using a join instead of a subquery.

Written in SQL:

SELECT c.state, c.county, c.cases, c.deaths, 
c.cases/c.deaths as ‘case/death ratio’, m.ALWAYS
FROM counties as c
JOIN mask_use as m on c.fips = m.countyfp
JOIN election as e on c.fips = e.fips
ORDER BY e.clf_unemploy_pct DESC
LIMIT 25

In this post, we are going to talk about window functions and use them to calculate values across rows instead of columns. Sounds fun, right?

We are going to use the same data from the NY Times and MIT that we have been using throughout this series.

So, what exactly is a window function? Basically, a window function allows you to perform calculations across rows instead of columns.

WINDOW FUNCTION — a function that operates on a set of rows to return a single value for the row. The window refers to the set of rows on which the function will operate.

So, given our counties table, we want to see the number of new cases in each county daily. Let’s take a quick look at our table to see what we have.

SELECT *
FROM counties
WHERE date = ‘2020–09–07’
Sample of our counties table

At first glance, it might look like we have that information in the table already, but when we consult the data dictionary we find that the cases and deaths reported each day are cumulative. In order for us to find the daily number of new cases and deaths, we need to do some math. For illustration purposes, we are going to look at just one county — Cook county, IL.

Partial counties table data for Cook County, IL

In order for us to get the number of new cases on Sept 10, we need to subtract the total on Sept 9 (132,401) from the total on Sept 10 (132,966). This gives us 565 new cases in Cook county, IL on Sept 10. So, our window function will need to tell SQL to create a new column (we will call it ‘new_cases’) that subtracts the previous day total from the current day total. In other words, subtract the row above from the current row. How do we write that?

We use LAG!

LAG — LAG is a window function that provides a row at a specified offset from the current row.

It sounds far more complicated than it really is, so let’s go ahead and write this out.

SELECT date, state, county, cases,
cases - LAG (cases,1) OVER (ORDER BY date) as ‘new_cases'
FROM counties
WHERE state = ‘Illinois’
AND county = ‘Cook’
AND date >= ‘2020-09-01’
ORDER BY date
New column ‘new_cases’ created using LAG

Let’s break down my window function using the highlighted cells above.
cases -LAG (cases,1) OVER (ORDER BY date ASC)

The very first thing here tells SQL which field to start with. In this instance, SQL is going to use the ‘cases’ column for the current row (in our highlighted example, this is 9/10).

The next part sets the window — LAG (cases,1) by telling SQL that I want to use the cases column in the row above (the offset of 1) and subtract it from the current row.

We then tell SQL how we want the rows ordered when we do our calculations. In this case we are ordering by date in ascending order OVER (ORDER BY date ASC). And we want to put the result of the calculation in a new column that we will call ‘new_cases’.

What if I wanted to know the total number of new cases over the last 7 days? We would change the offset! Let’s give it a try.

SELECT date, state, county, cases,
cases - LAG (cases,1) OVER (ORDER BY date) as ‘new_cases’,
cases - LAG(cases,7) OVER (ORDER BY date)
as ‘7-day_new_cases’
FROM counties
WHERE state = ‘Illinois’
AND county = ‘Cook’
AND date >= ‘2020-09-01’
ORDER BY date
We changed the offset to calculate a 7-day new case total

There is also a LEAD function, which works like LAG, but it is rows following rather than rows behind the current row. You write it just like a LAG function, using the field for the current row, the LEAD keyword, the column and the offset followed by the order you want the rows. Let’s give it a try.

SELECT date, state, county, cases,
cases - LEAD (cases,1) OVER (ORDER BY date DESC)
as ‘new_cases’,
cases - LEAD (cases,7) OVER (ORDER BY date DESC)
as ‘7-day_new_cases’
FROM counties
WHERE state = ‘Illinois’
AND county = ‘Cook’
AND date >= ‘2020-09-01’
ORDER BY date
We used LEAD instead of LAG, by changing date to DESC order we get the same results

The results are the same, but the order is different, this is because I indicated that I wanted the rows ordered by date in descending order, and that made the calculations the same as with the LAG function. Neat!

Now, what if I wanted to look at all the counties in Illinois, but still wanted the daily new cases? We can group our counties and then apply window functions! To to this, we use the PARTITION BY keyword.

SELECT date, state, county, cases,
cases - LAG (cases,1) OVER (PARTITION BY county
ORDER BY date) as ‘new_cases’,
cases - LAG (cases,7) OVER (PARTITION BY county
ORDER BY date)
as ‘7-day_new_cases’
FROM counties
WHERE state = ‘Illinois’
AND date >= ‘2020-10-15’
ORDER BY date
PARTITION BY allows us to group by county and then calculate the cases

Now, we have the new cases daily for each county in Illinois. You could also use more than one column in the PARTITION BY clause, for example PARTITION BY state, county would group the states and then the counties so you would have all the states and counties.

Now, what if I want to find the AVERAGE new cases on a rolling 7-day basis? We can’t create a column with a window function that uses a window function. So, we need to create a cte. What is a cte? Glad you asked!

CTE — Common Table Expression. A cte is a temporary result set, unlike a temp table, it is only available for the scope of your query, where a temp table can be used for the duration of your SQL session. CTEs are defined using the WITH statement.

Cool, cool. Lot’s of jargon there that isn’t terribly useful. Let’s look at it in action. I am going to find the 7-day rolling average new cases.

WITH cte as (SELECT date, state, county,
cases - LAG (cases,1) OVER
(PARTITION BY fips ORDER BY date)
as ‘new_cases’,
cases as ‘cumulative_cases’
FROM counties)
SELECT date, state, county, new_cases, cumulative)_cases,
AVG(new_cases) OVER (PARTITION BY state,
county ORDER BY date ASC
rows 6 PRECEDING) as ‘7_day_avg_new’
FROM cte
ORDER BY state, county, date
by querying the cte, we can use the calculated values in a new window function!

So, our cte is just a way to create a very temporary table that we can query in the next step. We could also do this using a temp table, but since we only need to query the cte one time, this is a more efficient option. We will talk about temp tables later, but they are very similar to a cte, just available for a whole SQL session.

Nice! But, this is currently calculating the average with less than 7 days of new cases data. If we look at 3–25, we show 3 new cases, 4 total cases, and it shows that we are averaging 3 new over the 7-day rolling window. We need to update this so that the average display until we have 7 days of data. We’ll talk about that in the next post.

--

--