Let’s Talk about SQL — Part 2

Amandaspotter
Analytics Vidhya
Published in
8 min readOct 7, 2020

--

SELECT, FROM, JOIN, and WHERE statements

My first post on SQL was a very high-level overview of relational databases, queries, and what SQL is. You can read that here.

To recap, in the last post, I included an example where we asked the database to find the name ‘Mutt Barkely’ in the customer table and then match the CustomerID from that row to the CustomerID in the order table and return any orderID where there was a match.

Written in SQL:

SELECT OrderId
FROM Orders
JOIN Customers on Customers.CustomerID = Orders.CustomerID
WHERE Customers.Name = ‘Mutt Barkley’

In this post, we are going to use data obtained from the New York Times to explore the keywords SELECT, FROM, JOIN, and WHERE. The original code can be viewed here. So, let’s get started!

SELECT — This is the first keyword in a query, and indicates that we are asking the database to retrieve some information. This first keyword identifies the fields to be retrieved. Note that column names are listed, but the fields are the actual cells that hold the data.

FROM — This is usually the second keyword in a query, and it indicates the table we are asking the database to retrieve the information from. This identifies the tables used to retrieve the fields identified with SELECT.

Using only SELECT and FROM, you can query a database! So, before going into JOIN and WHERE clauses, let’s take a look at some very simple queries. I have created a SQLite database from the New York Times data. This database has 5 tables:

  1. Deaths
  2. Mask_use
  3. Counties
  4. States
  5. US

If I want to see all the data in the counties table, I can use the asterisk (*) to indicate that I want all fields to be returned. This will be written as:

SELECT *
FROM counties

This query will return:

Query results!

We see that the counties table has 6 columns:

  1. Date
  2. County
  3. State
  4. fips
  5. Cases
  6. Deaths

But, there is a seventh column that isn’t labeled — this is a row index. At the bottom, we get the shape of the result table that was returned. In our example, we have 515,062 rows and 6 columns (3,090,372 fields). Pretty neat! But, it’s unlikely that you really want to see the whole data table.

We can indicate which columns we want returned using the SELECT clause. For example, what if we are only concerned with the state and the cases? The SQL would look like this:

SELECT state, cases
FROM counties

The query will return:

Still have index, but now only the state and case columns are returned

We now only have the state name and the number of cases. Note that the number of rows (515,062) did not change, only the number of columns. This is because our query only reduced the columns by using the SELECT keyword. Again, neat, but probably not all that helpful.

What if I only want to know which states are in the table? How would I query that? Glad you asked! To return the unique values in a column, we use the keyword DISTINCT. Let’s give it a try.

SELECT DISTINCT state
FROM counties

The query will return:

Partial results, the query will give us each unique entry in the state column

The query returns 55 rows (the data includes US territories and the District of Columbia). You can get any combination of columns using distinct, but note that this will give you the rows where all the columns combined are distinct. In other words, if there is a Yuma County in more than one state, Yuma will show up in the county column for each state that has a Yuma County.

SELECT DISTINCT state, county
FROM counties

Returns:

Partial results, each unique combination is returned

The new query has 3257 rows — and we see duplicate state names (e.g., Texas) and duplicate county names (e.g., Yuma), but no duplicate combinations of state and county. Both Colorado and Arizona have a Yuma county, and we see that only Idaho has an Ada county. But, what if you are only interested in a specific state? That brings us to the WHERE keyword.

WHERE — This keyword acts as a filter, identifying a specific condition that must be met for a row to be returned in the query.

Back to our counties table. What if I want to return only the unique counties in Texas? I will still need my SELECT, DISTINCT, and FROM keywords, but now, I will need to add in my filter, WHERE. Our query will look like this:

SELECT DISTINCT state, county
FROM counties
WHERE state = ‘Texas’

Returns:

Returns each unique county name in Texas

Our query results show that our data has 252 distinct counties in the state of Texas! We can filter on any column, even ones that are not being returned from our SELECT. Let’s give this a try.

SELECT DISTINCT state, county
FROM county
WHERE deaths = ’10’

Returns:

We did not ask to retrieve the deaths column, but we filtered it so that our results are only showing counties with exactly 10 deaths

What if we want to filter on more than one criteria? We can do that too! Let’s see how many counties in Texas had exactly 10 reported deaths.

SELECT DISTINCT state, county, deaths
FROM counties
WHERE state = ‘Texas’ AND deaths = ’10’

Returns:

Partial result — the query returns 58 rows

The full results show that 58 counties in Texas have reported exactly 10 deaths. What if I want to see how many times counties reported less than 5 deaths, or more than 500? We can combine operators to get different filters.

SELECT DISTINCT state, county, deaths
FROM counties
WHERE state = ‘Texas’ AND (deaths < 5 OR deaths > 500)

Here we have to put the second condition (number of deaths) in parentheses because we want to combine Texas and 2 conditions in the deaths column.

Result:

Texas counties with less than 5 or greater than 500 deaths

What’s going on with Harris county? Let’s take a look with all the columns included (remember in our first query, we used the asterisk to return ALL the columns in a table).

SELECT *
FROM counties
WHERE state = ‘Texas’ and (deaths < 5 or deaths < 50)

Returns

Ah, so we see that this is a cumulative total of deaths reported each day. That makes sense. What if I am only interested in the number of deaths on a specific date? We can filter on that as well.

SELECT *
FROM counties
WHERE state = ‘Texas’
AND deaths > ‘500’
AND date = ‘2020–09–07’

Returns:

So, on Sept 7, exactly 6 counties in Texas (Bexar, Cameron, Dallas, Harris, Hidalgo, and Tarrant) reported over 500 deaths.

With just one table and three keywords, you can query to find some interesting stuff. But, let’s be real, the whole reason we use relational databases is so that we only have to record information one time. In the real world, it is pretty unlikely that we would query a single table. This is where JOINs come into play. I will do future posts on JOINs, because there is quite a bit to cover, but for our purposes here, we will only talk about one type of JOIN, the INNER JOIN.

JOIN — The JOIN keyword is used to connect 2 tables. In most SQL programs, the default for a JOIN is the INNER JOIN. An INNER JOIN simply returns requested rows where a specific field matches in BOTH tables.

If you’re anything like me, you need a visual here to understand what we are talking about. Usually we see this represented as a VENN diagram and we are able to understand that the INNER JOIN returns records in the overlapping section, like this

I prefer to look at it with the actual data tables. My query is going to ask for the ‘state’, ‘county’, ‘countyfp’ columns from the counties table and the ‘countyfp’, ‘FREQUENTLY’, ‘ALWAYS’ columns from the mask_use table. We want the rows returned to match on the ‘countyfp’ column in both tables.

And similar to filters (WHERE clause), I can join on columns that are not selected.

SELECT state, county, FREQUENTLY, ALWAYS
FROM counties
JOIN mask_use on counties.fips = mask_use.countyfp

Note: FREQUENTLY and ALWAYS are capitalized to make it easier to see they are from the mask_use table. Also note that in the original data tables, the fips codes are labeled differently. The counties table uses fips, while mask_use uses countyfp.

Returns:

FUN! Let’s put it all together, and see what percentage of people frequently or always use masks in counties that reported over 500 deaths on September 7, 2020.

SELECT date, state, county, deaths, FREQUENTLY, ALWAYS
FROM counties
JOIN mask_use on fips = countyfp
WHERE state = ‘Texas’
AND deaths > ‘500’
AND date = ‘2020–09–07’

Returns:

combined county and mask_use table to find that Texans are pretty good about masking!

We successfully pulled in the information on mask use from one table and matched that to a second table by matching the fips codes (which indicate the county), filtered for the state of Texas, the number of deaths, and a specific date! Well done!

Using only four keywords (SELECT, FROM, JOIN, WHERE) you are ready to get querying! The next post in this series will examine aggregate functions — items like MIN, MAX, AVG — that evaluate groups of items before returning a value.

--

--