Let’s Talk about SQL — Part 4

Amandaspotter
Analytics Vidhya
Published in
7 min readOct 13, 2020

--

Basic subquery and order of operations

This is the fourth post in a series on SQL.

My third post on SQL was an overview of basic aggregate functions. You can read that here.

To recap, in the last post, I walked you through building a simple query that aggregated data in selected columns from 2 tables and filtered for conditions.

Written in SQL:

SELECT state, COUNT (county), MIN (deaths), MAX (deaths), 
AVG (deaths),SUM (cases)/SUM (deaths) as ‘case/death ratio’,
AVG (ALWAYS)
FROM counties
JOIN mask_use ON fips = countyfp
WHERE deaths <= ‘500’
AND date = ‘2020–09–07’
GROUP BY state
HAVING COUNT (county) >= 5

The query above had one join condition using a match on the fips codes to return only the counties that we were interested in. This could also be rewritten as a subquery. Many people find a subquery easier to read, however a subquery usually is slower to execute. The example we are going to walk through today is pretty straight-forward and you are unlikely to notice any performance difference.

Note that you may also see a subquery referred to as a nested query, this is because at the most basic level, a subquery is just a query inside a query. Don’t worry, we are going to break this down into some really simple, easily digestible parts.

In this post, we are again going to use data obtained from the New York Times to explore subqueries. In addition, we are going to add a new table from the MIT Election Data and Science Lab The original code can be viewed here. So, here we go!

First, let’s take a quick look at the new table that has been added to our database — the election table. Remember from our first lesson, we can use the asterisk (*) to return all fields.

SELECT *
FROM election
Election table — lots of new data to play with

There are a few items to take notice of here — we have a fips code, just like in our counties table. The data looks to be raw vote totals, and we have some data that is percentages of the population — basic voter demographic data. We are only going to be concerned with a few of these columns: state, county, fips, and clf_unemploy_pct.

SELECT state, county, fips, elf_unemploy_pct
FROM election

Returns

Now, let’s take a look at the counties where the unemployment rate was the highest. We can do this by returning our data in descending order and limiting the number of rows that are returned using the ORDER BY and LIMIT keywords.

ORDER BY — The ORDER BY keyword defines the order in which the selected rows should be returned by indicating the column that is to be ordered. By default, ORDER BY will return ascending order, unless DESC is specified.

LIMIT — The LIMIT keyword indicates the maximum number of rows to be returned.

Combining ORDER BY and LIMIT is just one way to return an ordered list, like a top-10 or lowest-10 list. For our purposes, we are interested in the 25 counties with the highest unemployment rate.

Our query is written:

SELECT state, county, fips, elf_unemploy_pct
FROM election
ORDER BY elf_unemploy_pct DESC
LIMIT 25

Note that the ORDER BY keyword indicates DESC — we want our data to be returned in descending (highest to lowest) order. Because we want the top 25, we also use the LIMIT keyword.

Our query returns:

Rows are zero-indexed, so the top 25 are rows 0–24

Cool, but I did say we were going to talk about subqueries. I want to use the above query to determine the number of Covid-19 cases and deaths in the counties above. I could run the above query and then use the fips codes returned in my WHERE clause. Previously, we have used WHERE to filter on single values in a column, but I want multiple values in a column — is there a way to do that? YES! SQL has an operator IN that can be used in a WHERE clause.

IN — the IN operator allows you to use multiple values in the same column to filter with a WHERE clause.

So, let’s take a look at what our query would look like if we just added the fips codes of interest to the WHERE clause:

SELECT state, county, cases, deaths, 
cases/deaths as ‘case/death ratio’, ALWAYS
FROM counties
JOIN mask_use ON fips = countyfp
WHERE fips IN (46031,46137,8025,28119,28053,46041,28125,1099,46121,
28027,28021,1035,1131,28151,38085,28133,13061,
28051,51595,21159,13239,4017,48247,47069,1063)
AND date == ‘2020–09–07’

Returns:

The query used the fips codes obtained from the election table as a filter condition.

That’s a really convoluted way to get the data we are interested in. Run a query on one table, then transfer the codes we want into a WHERE clause to filter our final query. This is a really simple query, so sure, you could do it this way, but there are a number of reasons why you don’t want to — there is a better way. What if I could have SQL run my first query and use the results in the WHERE? That is a subquery!

How do we write this? Well, we already have (sort of) — we have all the pieces. I am going to simply replace the fips codes in the WHERE clause with the query of the election table!

SELECT state, county, cases, deaths, 
cases/deaths as ‘case/death ratio’, ALWAYS
FROM counties
JOIN mask_use ON fips = countyfp
WHERE fips IN
(SELECT fips
FROM election
ORDER BY clf_unemploy_pct DESC
LIMIT 25)
AND date == ‘2020–09–07’

Returns:

Same data, less typing! WIN

Pretty sweet (in my opinion, anyway). But, what actually happened here? I think a brief overview of SQL order of operations will help us understand.

It is important to note that SQL is declarative in nature. A query indicates the information we want to retrieve, but doesn’t say how SQL should go about retrieving it. SQL will optimize the query to determine the most efficient way to execute commands.

In general, the first thing SQL executes is the subquery — so our query in the WHERE clause will be executed first. Remember that our subquery instructs SQL to return the fips codes for the 25 counties with the highest unemployment. Once that is executed, our WHERE clause is now the list of fips codes.

Our subquery, once executed, becomes our list of codes to filter on

Next, the FROM and JOIN will be executed, followed by WHERE and finally SELECT. Its not really all that intuitive since you write the last part first. Let’s sketch it out visually just like we did with the above subquery.

Once the subquery is executed, SQL will execute the FROM clause. Remember that FROM identifies the table that we want to retrieve data from. In the example query, we are retrieving data from the counties table. So, SQL will locate that table in the database. I think the easiest way to visualize this is as an Excel workbook — the whole database is the workbook and the tables are the individual sheets.

you can think of a database like an excel workbook, with each tab representing a table

JOIN is really just an extension of a FROM clause that tells SQL that we are going to use more than one table. In our example, we join the mask_use table by matching rows that share a fips code. SQL now has all the columns from both the counties and mask_use table, but only rows where the fips codes matched are evaluated further.

Tables get combined, but only rows that match are evaluated further

The next step is to evaluate the WHERE clause. Remember that our WHERE clause includes a subquery that was executed first to act as a filter. SQL will now evaluate all the rows and only those with the fips codes in our WHERE statement will be evaluated further.

FINALLY, SQL will execute the SELECT clause and only the columns indicated will be returned in our results.

The next post in this series will take this same query, and I will walk you through how we can use a JOIN instead of a subquery.

--

--