Let’s Talk about SQL — Part 7

Amandaspotter
Analytics Vidhya
Published in
6 min readNov 17, 2020

--

Temp Tables, CTEs, and Subqueries

In my last post, I walked you through some simple window functions. You can read that here. The final query in SQL:

WITH CTE as (SELECT date, state, county,
cases — LAG (cases,1) OVER(PARTITION BY
fips ORDER BY date)as ‘new_cases’,
cases as ‘cumulative_cases’FROM counties)
SELECT date, state, county, new_cases,cumulative_cases,
AVG(new_cases) OVER (PARTITION BY state,county
ORDER BY date ASC
rows 6 PRECEDING) as ‘7_day_avg_new’
FROM CTE
ORDER BY state, county, date

We used a CTE in this query in order to give us a sort of temporary table to query, so this seems like a good time to discuss the differences between CTEs, temp tables, and subqueries. We will be using the same NYT covid and MIT elections data that was used for the last couple of posts.

First, let’s start with some definitions.

CTE or COMMON TABLE EXPRESSION — a type of temporary data source that houses the results of a query. CTEs are only stored for the duration of a query.

SUBQUERY — just like CTEs and temp tables, a subquery is a way to generate a temporary result set to use in a main query.

TEMP TABLES — Like CTEs, temporary tables are a temporary data set stored as a table. The temp table is available for your whole SQL session.

To try to illustrate the differences, we are going to generate the same result set using each of these temporary data storage solutions. If that doesn’t sound like a good time, well then I don’t know what does! Here we go.

We already have a set of results using a CTE, this is what we used in the last post.

Results using CTE, first 20 rows

We can rewrite the query using the CTE as a subquery:

SELECT date, state, county, new_cases, cumulative_cases,
AVG(new_cases) OVER (PARTITION BY state, county
ORDER BY date ASC
rows 6 PRECEDING) as ‘7-day_avg_new’
FROM (SELECT date, state, county,
cases — LAG (cases,1) OVER (PARTITION BY fips
ORDER BY date)as new_cases,
cases as cumulative_cases
FROM counties)
ORDER BY state, county, date;

This will give us the same result:

Results using subquery (only head/tail of the results shown)

Is there any reason that we might choose to use a CTE over a subquery? In terms of performance, they are pretty much the same. Remember from our talk on the order of operations in SQL that a subquery will run before the main query, and that is the same with the CTE, so in either case you are basically querying a query.

If the performance of the two options are the same, why would you choose a CTE over a subquery? For a simple query like my example, it’s probably going to come down to personal preference. However, for more complex queries that require multiple subqueries, using CTEs can make your query easier to understand. This is especially important when you are writing queries that will need to be used or edited by multiple users. With the subquery structure, it isn’t always easy to see what the author intended.

You can create multiple CTEs to use in a query, just as you can create multiple subqueries. You can also name your CTE what ever you like (I used CTE before to make it clear which part was the CTE, however it is better to use names that are descriptive), this will also make it easier to understand what your query is doing. Here’s an example:

WITH mask AS (SELECT DISTINCT state, county, never, rarely,
sometimes, frequently, always
FROM mask_use m
JOIN counties c on c.fips = m.countyfp),
new AS (SELECT date, state, county,
cases — LAG (cases,1) OVER
(PARTITION BY fips ORDER BY date) as new_cases,
cases as cumulative_cases
FROM counties)
SELECT date, n.state, n.county, new_cases, cumulative_cases,
AVG(new_cases) OVER (PARTITION BY n.state, n.county
ORDER BY date ASC rows 6 PRECEDING)
as ‘7-day_avg_new’,
always as ‘pct_always_mask’
FROM new n
JOIN mask m on n.state = m.state AND n.county = m.county
ORDER BY n.state, n.county, date

We could also write this using multiple subqueries:

SELECT date, n.state, n.county, new_cases, cumulative_cases,
AVG(new_cases) OVER (PARTITION BY n.state, n.county
ORDER BY date ASC rows 6 PRECEDING)
as ‘7-day_avg_new’,
always as ‘pct_always_mask’
FROM (SELECT date, state, county,
cases — LAG (cases,1) OVER
(PARTITION BY fips ORDER BY date) as new_cases,
cases as cumulative_cases
FROM counties) n
JOIN (SELECT DISTINCT state, county, never, rarely, sometimes,
frequently, always
FROM mask_use m
JOIN counties c on c.fips = m.countyfp) m
ON n.state = m.state AND n.county = m.county
ORDER BY n.state, n.county, date

The query using multiple CTEs is easier to read, but you get the same results no matter which way you write the query.

Both CTEs and subqueries will give you the same result set!

One item to point out here, when using multiple CTE, you only need to use the WITH keyword once, you separate the individual CTEs with a comma.

So, all this to say that CTE and subqueries will accomplish the same thing. What about temp tables? There is one major difference between CTE/subquery and temp tables. A temp table can be accessed by multiple queries in the same SQL session. A CTE/subquery is only available for a single query.

What does that mean? Let’s say that I had multiple queries that needed to use the same ‘mask’ CTE, I could put that CTE at the beginning of each query, but that will require a lot of extra typing (which is not high on my list of ways to waste time) and the performance would deteriorate — the same temporary result set would be run for each query, and that takes extra time. This is why we temp tables. Let’s take a look at how this would work.

CREATE TABLE temp.mask AS
SELECT distinct state, county, never, rarely, sometimes,
frequently, always
FROM mask_use m
JOIN counties c on c.fips = m.countyfp

Once you create and populate your temp table (the SELECT clause is populating the table), you can query it multiple times until you disconnect your SQL session. Note that this syntax is for SQLite, SQL Server and other instances of SQL support the SELECT INTO syntax.

SELECT distinct state, county, never, rarely, sometimes,
frequently, always
INTO #mask
FROM mask_use m
JOIN counties c on c.fips = m.countyfp

So, now we can query like normal using our temp.mask table.

WITH new AS (SELECT date, state, county,
cases — LAG (cases,1) OVER
(PARTITION BY fips ORDER BY date) as new_cases,
cases as cumulative_cases
FROM counties)
SELECT date, n.state, n.county, new_cases, cumulative_cases,
AVG(new_cases) OVER (PARTITION BY n.state, n.county
ORDER BY date ASC rows 6 PRECEDING) as ‘7-day_avg_new’,
always as ‘pct_always_mask’
FROM mask m
JOIN new n on m.state = n.state AND m.county = n.county
ORDER BY n.state, n.county, n.date
Temp tables are queried just like any other table

Notice that we still used a CTE, your temp table works just like any other table when querying, you just have to run it once for any session. There are some things to keep in mind, your temp table will be stored, since it is not run for each query (like a CTE/subquery) this can significantly improve performance if you are running multiple queries using the same temporary data. However, because the temp table is stored if you are only using it for a single query, the performance will be worse using a temp table.

In summary, we can use CTE/subqueries interchangeably, but the CTE is easier to read and see what is going on, so it is best used when a query will be used/edited by other users. The use case for these is single queries — the information will not need to be accessed by multiple queries in a single SQL session. A temp table is also a temporary data source, but it will be available for multiple queries during the same SQL session.

--

--