Let’s Talk about SQL — Part 3

Amandaspotter
Analytics Vidhya
Published in
6 min readOct 9, 2020

--

Basic Aggregate Functions

This is the third post in a series on SQL.

My second post on SQL was an overview of SELECT, FROM, WHERE, and JOIN. You can read that here.

To recap, in the last post, I walked you through building a simple query that selected columns from 2 tables and filtered for conditions

Written in SQL:

SELECT date, state, county, deaths, FREQUENTLY, ALWAYS
FROM counties
JOIN mask_use on fips = countyfp
WHERE state = ‘Texas’
AND deaths > ‘500’
AND date = ‘2020–09–07’

In this post, we are again going to use data obtained from the New York Times to explore the some basic aggregate functions. The original code can be viewed here. So, here we go!

Your first question is likely ‘what is an aggregate function?’ It’s pretty simple, an aggregate function returns a single value from a set of values. The simplest of all the aggregate functions is COUNT.

COUNT — the COUNT keyword will return the number of rows that meet your criteria.

Let’s go back to our NYT data from the last post. What if I just want to know the number of counties that reported 500 or more deaths on September 7? The COUNT keyword will return just that.

SELECT COUNT (county)
FROM counties
WHERE deaths >= ‘500’
AND date = ‘2020–09–07’

This query returns:

Simple. A count of counties.

On September 7, 73 counties reported 500 or more deaths. What if I want to know the minimum number of deaths reported on September 7, or the maximum or average, maybe I want to know the total number? SQL has those built in!

MIN — the keyword MIN will return the lowest value from the rows that meet your criteria.

MAX — the keyword MAX will return the largest value from the rows that meet your criteria.

AVG — the keyword AVG will return the average value from the rows that meet your criteria.

SUM — the keyword SUM will return the sum of all values from the rows that meet your criteria.

Let’s take a look at these in action. I am going to query to find the number of counties (COUNT), the lowest number of deaths reported (MIN), the highest number of deaths (MAX) and the average number of deaths reported in counties that had 500 or more reported deaths on September 7.

SELECT COUNT (county), MIN (deaths), MAX (deaths), AVG (deaths), SUM (deaths)
FROM counties
WHERE deaths >= ‘500’
AND date = ‘2020–09–07’

Returns:

Look at all that aggregate data!

Nice! We see that of the 73 counties that reported 500 or more deaths on September 7, the lowest number of deaths in a single county was 502, the highest number of deaths was 23,736, and the average number of deaths in those 73 counties was 1519.685.

Let’s take this a step further, because while this is interesting, it isn’t really all that useful in any real-world situation. What if I want to see all the same data (count, min, max, avg, and sum) but I want to see how that breaks out by state? That is where the GROUP BY clause comes into play.

GROUP BY — the GROUP BY keyword states how aggregate calculations should be performed.

Using GROUP BY, we are going to aggregate our data by state. That sounds fancy, but all we are doing is telling SQL that we want to perform the same calculations on rows that have the same state.

SELECT state, COUNT (county), MIN (deaths), MAX(deaths), 
AVG (deaths), SUM (deaths)
FROM counties
WHERE deaths >= ‘500’
AND date == ‘2020–09–07’
GROUP BY state

Returns:

Using GROUP BY we see our data by individual state.

Now we’re getting to some much more useful stuff — with this query, we can see that New Jersey had the most counties reporting 500 or more deaths (12) and New York had the county with the highest number of reported deaths (23,736).

We can also use aggregate functions to filter data using the HAVING clause. Remember, in the last post, WHERE was used as a filter that identified a specific condition in a field to determine if a row would be returned. For example, the above query has a WHERE clause to specify the number of deaths we are interested in (500 or more) and the specific date we want to look at (September 7, 2020). This differs from the HAVING clause in one significant way — the data will be filtered on an aggregate function.

HAVING — the HAVING keyword filters on aggregated data.

That’s a lot of words, let’s just take a look at HAVING in action. I am going to use the same query as above, but will add a HAVING clause to only return states with more than five counties reporting 500 or more deaths.

SELECT state ,COUNT (county) ,MIN (deaths) ,MAX (deaths) ,AVG (deaths)
FROM counties
WHERE deaths >= ‘500’
AND date = ‘2020–09–07’
GROUP BY state
HAVING COUNT (county) >= 5

Returns:

Sweet! Data is still broken out by state, but now only states with 5 or more counties are returned.

Exactly 6 states had 5 or more counties that reported 500 or more deaths on September 7, 2020. What if we want to add mask_use to this query, like we did in our second lesson? We can again use a JOIN to get the mask_use table, and we can use the same aggregators! So, we are going to add in columns from the mask_use table to find the average percentage of people reporting that they always use a mask.

SELECT state, COUNT (county), MIN (deaths), MAX (deaths), AVG (deaths), AVG (ALWAYS)
FROM counties
JOIN mask_use ON fips = countyfp
WHERE deaths >= ‘500’
AND date = ‘2020–09–07’
GROUP BY state
HAVING COUNT (county) >= 5

Returns:

Look at us, aggregating data across multiple tables!

And just like WHERE, you can use the HAVING clause on columns that are not in your SELECT clause.

SELECT state, MIN (deaths), MAX (deaths), AVG (deaths), AVG (ALWAYS)
FROM counties
JOIN mask_use ON fips = countyfp
WHERE deaths >= ‘500’
AND date = ‘2020–09–07’
GROUP BY state
HAVING COUNT (county) >= 5

Returns:

You don’t have to filter on items in your SELECT clause!

The query still filtered on the count of counties, but we do not have that listed in our results.

In addition to using the keywords (COUNT, MIN, MAX, AVG, SUM) you can also create calculated fields using normal math operators. For example, what if we want to see the ratio of cases to deaths, how would we write that? We would create a new column that has the value deaths/cases.

SELECT state, COUNT (county), MIN (deaths), MAX (deaths), 
AVG (deaths), SUM (cases)/SUM (deaths) as ‘case/death ratio’,
AVG (ALWAYS)
FROM counties
JOIN mask_use ON fips = countyfp
WHERE deaths <= ‘500’
AND date = ‘2020–09–07’
GROUP BY state
HAVING COUNT (county) >= 5

Returns:

You can get some real insights into your dataset with just a few aggregate functions!

As you can see, aggregate functions can really help us see what is going on in a dataset! You are now armed with some basic, but powerful querying tools. In the next installment, we will be talking a bit about subqueries! I hope you are just as excited as I am!

--

--