More SQL Please: GROUP BY

The GROUP BY statement is used when you want to group results together so you can find things out about each group.

GROUP BY is most useful when paired with an aggregate function like sum(), which in this case will operate once for each group instead of on all the rows together.

SELECT category, sum(remaining_stock)
    AS total_remaining
FROM products
GROUP BY category;

| category | total_remaining | |------------|-----------------| | Food | 66 | | Pet Care | 4 | | Toiletries | 74 |

You can use multiple aggregate functions with a GROUP BY to find more information about each group side by side.

SELECT category,
    min(price) AS min_price,
    max(price) AS max_price,
    round(avg(price), 2) AS avg_price
FROM products
GROUP BY category;

| category | min_price | max_price | avg_price | |------------|-----------|-----------|-----------| | Food | 3.99 | 8.99 | 6.66 | | Pet Care | 30 | 30 | 30 | | Toiletries | 1.99 | 49.99 | 25.99 |

The HAVING clause can be applied to a GROUP BY, which will only include rows in the group that match the clause. It is similar to a WHERE clause but with key differences:

WHERE filters rows before they are grouped and HAVING filters rows after groups are formed. Because of this, only the HAVING clause can accept aggeregated values.

SELECT category, sum(remaining_stock)
    AS total_remaining
FROM products
GROUP BY category
HAVING total_remaining < 5;

| category | total_remaining | |----------|-----------------| | Pet Care | 4 |

GROUP BY can be used alongside other clauses like JOIN.

In logistics, lead time is how long it takes to receive something after ordering it. In this example we group by vendor then use the avg() function with the lead time of each order to find the average lead time for each vendor.

We group by a vendor's id instead of name because vendors might have the same name as one another but they always have a unique id. This ensures that our groups contain the correct order data for each unique vendor.

As an aside, this example uses the built-in SQLite function julianday(...) which is based on the Julian day and helps perform arithmetic on dates. It takes a date as its input and returns the fractional number of days since noon in Greenwich on November 24, 4714 B.C. The difference between two Julian days gives us the elapsed amount of time between two dates. This method of date arithmetic is popular with astronomers.

Another similar function you could use here is unixepoch(...) which is based on Unix time and returns the number of seconds since 1970-01-01 00:00:00 UTC (known as the "Unix epoch"). This system is more commonly used for date arithmetic in software engineering.

SELECT
    v.name AS vendor,
    avg(
        round(
            julianday(o.date_arrived) - 
            julianday(o.date_ordered), 
            0
        )
    ) AS average_lead_time_days
FROM vendors v
INNER JOIN orders o
ON v.id = o.vendor_id
GROUP BY v.id;

| vendor | average_lead_time_days | |-------------------------|------------------------| | Quicky Express Food Co | 5 | | Good Times Distributors | 12 | | Natural Goodness | 7 | | Hyper Logistics | 6 |

If you supply multiple columns to a GROUP BY, groups are formed based on all the unique combinations of data from those columns.

Here we group by both the product category and the month we made an order for that category. count(*) will then count how many rows are in each group, telling us how many orders there are in each group.

SELECT strftime('%m', o.date_ordered) AS month,
    p.category,
    count(*) AS total_orders
FROM orders o
INNER JOIN products p
    ON o.product_id = p.id
GROUP BY month, p.category;

| month | category | total_orders | |-------|------------|--------------| | 06 | Food | 1 | | 06 | Toiletries | 1 | | 07 | Food | 2 | | 07 | Toiletries | 1 | | 08 | Food | 1 | | 09 | Pet Care | 2 |

Now lets say we want to see the most recent order totals first. ORDER BY can be applied at the end of the query to sort the months in descending order.

SELECT strftime('%m', o.date_ordered) AS month,
    p.category,
    count(*) AS total_orders
FROM orders o
INNER JOIN products p
    ON o.product_id = p.id
GROUP BY month, p.category
ORDER BY month DESC;

| month | category | total_orders | |-------|------------|--------------| | 09 | Pet Care | 2 | | 08 | Food | 1 | | 07 | Food | 2 | | 07 | Toiletries | 1 | | 06 | Food | 1 | | 06 | Toiletries | 1 |

The DISTINCT clause can be added to an aggregate function like count(). In this case we want to see how many different vendors we have for each product category to make sure we can negotiate the best prices.

Without the DISTINCT clause in this query it would return a row count of all the products in each category even if the same vendor appears multiple times. We would have thought we had two Pet Care vendors when really we only have one.

SELECT category, count(DISTINCT vendor_id)
    AS vendors
FROM products
GROUP BY category;

| category | vendors | |------------|---------| | Food | 2 | | Pet Care | 1 | | Toiletries | 2 |

NULL values form their own group when they are found by a GROUP BY.

In this schema an order has a NULL arrival date if it hasn't arrived yet. This query will show us unfulfilled orders as a NULL arrival month alongside fulfilled orders.

If we wanted something more informative to display like "Not Arrived" or "Pending", we could use the coalesce() function.

SELECT strftime('%m', date_arrived) AS month_arrived,
    count(*) AS orders
FROM orders
GROUP BY month_arrived
ORDER BY month_arrived ASC;

| month_arrived | orders | |---------------|--------| | NULL | 2 | | 06 | 2 | | 07 | 3 | | 08 | 1 |

As usual, a WHERE clause can be applied to any of these queries. This time, we filter out unfulfilled orders before they are grouped by month.

This example also shows WHERE and HAVING being used together.

SELECT count(*) AS fulfilled_orders,
    strftime('%m', date_arrived) AS month
FROM orders
WHERE month IS NOT NULL
GROUP BY month
HAVING fulfilled_orders > 2
ORDER BY month ASC;

| fulfilled_orders | month | |------------------|-------| | 3 | 07 |

Any selected columns that are not part of an aggregate function or included in the GROUP BY clause are called "bare" columns. Many databases will treat bare columns as an error because they create ambiguity in which value to return.

In this example, "name" is the bare column. It returns a single arbitrary product name from the associated category, which isn't very useful to us.

SQLite does allow bare columns but it's generally recommended not to include them in your queries.

SELECT name, category
FROM products
GROUP BY category;

| name | category | |-------------------------|------------| | More Spice Please Chips | Food | | Big Bark Dog Food | Pet Care | | Cool Splash Cologne | Toiletries |