Let’s Talk about SQL — Part 5

Amandaspotter
Analytics Vidhya
Published in
12 min readOct 19, 2020

--

Joins and building a query.

My fourth post on SQL was an introduction to subqueries. You can read that here.

To recap, in the last post, I walked you through building a simple subquery that functioned as a filter for our data.

Written in SQL:

SELECT state, county, cases, deaths, 
cases/deaths as ‘case/death ratio’, ALWAYS
FROM counties
JOIN mask_use ON fips = countyfp
WHERE fips IN
(SELECT fips
FROM election
ORDER BY clf_unemploy_pct DESC
LIMIT 25)
AND date = ‘2020–09–07’

If you remember, in the last post the query of the employment table was run, and then we took the results of that query and used them in the WHERE clause as a filter. This was so that we could see how our subquery worked.

SQL will run a subquery first, and those results become your WHERE clause

We also talked a bit about order of operations in SQL noting that the subquery will be run first, followed by the FROM and JOIN clauses. This is important to understand and gives you a blueprint for creating more complex queries. So, how would we build a query that yields the same results as above, but does not use a subquery?

Let’s break this down — these are the same basic steps I use when I start to write any query.

1. Define the problem, what are you being asked to do?

It would be very rare for someone to tell you that they need 4 specific columns from 3 tables, and if you could please create an additional column by combining these 2 fields. Because, let’s be honest, if they know that much they can write the query themselves. You are more likely to encounter the following:

“Hey, do we have any data that could show me what the rates of death from Covid-19 are in counties with high unemployment?”

You think on this for a brief second and respond, ‘I think so, but when you say ‘high unemployment’ is there a specific threshold you are looking for (e.g., above 15%) or are you looking for X counties with the highest unemployment?’ Another brief moment of thinking and you realize that you also need to define what they mean by death rate — there are many ways that this can be measured (cases/deaths, deaths/total population, etc.). Once these questions are clarified, you can start to build your query.

2. Determine tables needed to find the information.

This one is really straight-forward. After asking a few clarifying questions, we know that we are being asked to find the name of the county, the number of covid-19 cases, the number of covid-19 deaths, a ratio of cases/deaths, and the percentage of people who responded that they always wear a mask in the 25 counties with the highest unemployment rates. We also are going to limit this to a specific date, since we know that the death and case numbers are cumulative and, since we know that county names are unique by state, we are going to throw in the state as well.

If we look at the tables available to us, we see that the counties table has columns for date, county, state, fips, cases, and deaths.

Sample of our counties table

We next look at the mask_use table, and see that it has columns for countyfp, never, rarely, sometimes, frequently, and always.

Sample of our mask_use table

Finally, we take a look at our elections table and it has state, county, fips, total_population, and clf_unemploy_pct.

Sample of our election table

Great! Now we know what tables we need to complete our query — the counties, mask_use, and election tables. Note that there are some repeated fields between these tables, and there are some data elements that are the same, though the column is labeled differently.

The bulk of the data we are looking for will come from the counties table — from this table we are going to pull out the state, county, cases, and deaths. In addition, we need to have the mask_use table to find the percentage of people who report that they always wear a mask. We were not asked to report the unemployment number, but we are going to use this to order and limit our results.

3. Determine how the tables are related.

This can be really easy sometimes. You might actually have a diagram that shows you which columns should be used to join the tables — but, we don’t have that here. So, how do I know how to join the tables? This is where you need to understand your data, or consult a data dictionary.

When we look at tables, there doesn’t appear to be a column that all three share. The counties and election tables both have state, county, and fips columns, but the mask_use table doesn’t have any of these. Or does it? There is a column labeled ‘countyfp’ and when the data dictionary is consulted, we see that in all three tables there is a column with a code that uniquely identifies a county, it is just labeled as countyfp in the mask_use table. This looks like it will be the most straight forward way to connect the tables.

Congratulations! You are ready to write the FROM and JOIN clauses (which are going to run first in our query)! I like to write the table with the most data first, but any order will be fine since we are using an INNER JOIN. Order of the tables is important with other types of joins, and we will get into that later.

INNER JOIN — this is a join where both tables must have the same value in a given field to be evaluated further. In most cases, SQL uses INNER JOIN as a default.

FROM counties as c
JOIN mask_use as m on c.fips = m.countyfp
JOIN election as e on c.fips = e.fips

This part is executed first by SQL — we will now have all the columns from all three tables with rows that have a match on the fips code in all three tables.

Sample of rows to be evaluated further

At this point, SQL has pulled in 496,085 rows, and 17 columns. This represents all rows that have the same fips code across the three tables.

4. Filters.

The next step for SQL is to evaluate the filters we need for the WHERE clause. In last week’s example, we used a subquery and the date to filter our results. Because we are rewriting that query using a join, we only need to add in the date.

FROM counties as c
JOIN mask_use as m on c.fips = m.countyfp
JOIN election as e on c.fips = e.fips
WHERE c.date == ‘2020–09–07’

Date only shows up in the counties table. However, to be consistent I am going to use the table alias. At this point, SQL will filter to find the date that we are interested and only rows that meet that criteria will be evaluated further.

Sample of filtered rows to be evaluated further

After filtering by date, SQL has 3,093 rows and 17 columns that matched on the fips column across all three tables, but now we have filtered to only those rows that have the date of September 7, 2020 in the counties table.

5. Ordering and limiting your results

Now we are starting to get somewhere and have cut down our rows significantly, but remember that we are looking for the counties with the highest unemployment rate, and we only want the top 25. Note that we were not asked to return the actual unemployment rate, we are only using that to order our results. We use the ORDER BY clause to do this, and we can order by any column across the three tables — it does not have to be one that we return in our results.

So, the ORDER BY will be our unemployment rate in descending order.

FROM counties as c
JOIN mask_use as m on c.fips = m.countyfp
JOIN election as e on c.fips = e.fips
WHERE date == ‘2020–09–07’
ORDER BY e.clf_unemploy_pct DESC
Same data, just ordered by unemployment rate

We still have 3,093 rows and 17 columns, but the order is now different with the highest unemployment percentage at the top (Corson county, SD at 29.93%)

Because we only want the top 25, we will add in a LIMIT clause.

FROM counties as c
JOIN mask_use as m on c.fips = m.countyfp
JOIN election as e on c.fips = e.fips
WHERE date == ‘2020–09–07’
ORDER BY e.clf_unemploy_pct DESC
LIMIT 25

Returns:

Final set of rows to be evaluated further

Now, our results are only the top 25!

6. Select the columns to be returned.

Finally, we are ready to write the SELECT clause. We already know we want the state, county, cases, deaths, a new column that shows us the number of cases for each death, and the percentage of people who report they ALWAYS wear a mask. So, this is pretty straight forward. From the counties table we want state, county, cases, and deaths. We used a table alias of ‘c’ in our from and join clauses, so we will use that here as well. Remember that the alias is a shorthand of the table name to save you some typing. That gives us c.state, c.county, c.cases, and c.deaths. so far. Let’s put them in our SELECT.

SELECT c.state, c.county, c.cases, c.deaths
FROM counties as c
JOIN mask_use as m on c.fips = m.countyfp
JOIN election as e on c.fips = e.fips
ORDER BY e.clf_unemploy_pct DESC
LIMIT 25

Next, we have a calculated field that will divide the number of cases by the number of deaths. SQL uses normal math operators, so this new column can be written as c.cases/c.deaths as ‘case/death ratio’ — we are using ‘as’ to give this column a name in our results. So, we now have:

SELECT c.state, c.county, c.cases, c.deaths, 
c.cases/c.deaths as ‘case/death ratio’
FROM counties as c
JOIN mask_use as m on c.fips = m.countyfp
JOIN election as e on c.fips = e.fips
ORDER BY e.clf_unemploy_pct DESC
LIMIT 25

Finally, we will add in the ALWAYS column from the mask_use table, giving us a final finished query:

SELECT c.state, c.county, c.cases, c.deaths, 
c.cases/c.deaths as ‘case/death ratio’,
m.ALWAYS
FROM counties as c
JOIN mask_use as m on c.fips = m.countyfp
JOIN election as e on c.fips = e.fips
ORDER BY e.clf_unemploy_pct DESC
LIMIT 25
Final Query Results

Our query from part 4 will give us the same results, but they won’t be ordered by unemployment rate. This is because we did our ORDER BY and LIMIT in the subquery. In addition, because the unemployment rate was in the subquery and not a joined table, we are unable to use that column to order our results. If you add the ORDER BY statement to the query, you will get an error that no such column exists.

Results from Part 4 — same data, different order.

I promised more on joins, so let’s keep going with our current example. Let’s pretend that the election table doesn’t have the fips code column. Is there a way we could join it to the counties table to get our desired results?

First, we need to know what the fips code represents. It appears to be a unique identifier for each county.

The Federal Information Processing Standard Publication 6–4 (FIPS 6–4) was a five-digit Federal Information Processing Standards code which uniquely identified counties and county equivalents in the United States, certain U.S. possessions, and certain freely associated states.”

How does that help us if our table doesn’t have a fips code? Both tables have the state and county. We can use that information to join the tables. In effect, we are creating a compound key for our tables by identifying 2 columns that must match. Let’s try it.

SELECT c.state, c.county, c.fips, e.state, 
e.county,e.total_polulation, e.clf_unemploy_pct
FROM counties as c
JOIN election as e ON c.state = e.state AND c.county = e.county
WHERE date == ‘2020–09–07’
ORDER BY clf_unemploy_pct DESC
LIMIT 25

Notice that the JOIN now has an AND keyword. This is how we create a compound key, we instructed SQL to only return rows where the combination of state and county match in both tables.

Results from using state/county as the join condition

So, this looks mostly the same, but you will notice that Emporia city County, Virginia is missing from our results and we now have Hardin County, IL. What happened? Let’s look at the tables separately to see what is going on.

SELECT state, county, fips
FROM counties
WHERE state = ‘Virginia’ and county = ‘Emporia city’
AND date = ‘2020–09–07’
In the counties table we have Emporia city for the county name
SELECT state, county, fips
FROM election
WHERE fips = ‘51595.0’
The election table has Emporia as the county name

This makes sense now — our tables are from different sources, the counties table is from the NYT covid data and the election table is from the MIT elections data, and it looks like the county name was entered differently between the two. If this were a real scenario and we didn’t actually have the fips code, we wouldn’t know that the data was different.

I suppose this is as good a time as any to mention Primary Keys and Foreign Keys.

PRIMARY KEY — a unique identifier for records in a table. A table will only have one Primary Key, and it should not repeat. You can think of your social security or driver’s license numbers as Primary Keys — they identify you, and only you.

FOREIGN KEY — a column that references a Primary Key from another table.

In the tables we are using as examples, the mask_use and election table use the fips code as a primary key and the counties table references that code in the countyfp column where it is a foreign key.

What is important to note here is that SQL can join tables on any column, you do not have to use a primary/foreign key relationship.

Now, what if I wanted to take all the state county combinations in the counties table and see where there was a match in the elections table? This is where we use a LEFT JOIN.

LEFT JOIN — A left join will take every row in the left table and include matching records from the right table while leaving cells empty that do not match.

An important note here, LEFT JOIN requires that the tables be written in the correct order. The table that you want all the records from must be written first. There is also a RIGHT JOIN in some instances of SQL, but really you only need LEFT since you would just reverse the table order to get a RIGHT join. That’s not super intuitive when written, so here’s a fun diagram.

A LEFT JOIN — all rows from Table A are included

Time to see a LEFT JOIN in action! We are again going to use the state/county combo as our join condition.

SELECT c.state as ‘c.state’, c.county as ‘c.county’, 
e.state as ‘e.state’, e.county as ‘e.county’
FROM counties as c
LEFT JOIN election as e on c.state = e.state AND c.county = e.county
WHERE c.state = ‘Virginia’
AND date = ‘2020–09–07’

I limited our query to the state of Virginia so that we can see what is happening.

The counties table has counties named a bit differently from the election table

From this, we can see that Virginia has counties named X city. Note that ‘None’ appears in the entries from the election table, this is because there is no matching state/county combination in our election table. The results you see here are rendered in a Jupyter Notebook, in SQL server, they would say NULL.

In general, if you see a lot of NULL results from one table, this is a good indication that a query used a LEFT or RIGHT join. If the results look to have the NULL from the wrong table, you may have written your join with the tables in the wrong order.

I hope that gave you a better understanding of JOINS and how to go about writing basic queries. The next post will dive into window functions, and we will do some calculations across dates to get rolling averages and other fun stuff.

--

--