SQL provides the ability to perform a SELECT
statement inside of another statement. This is what is known as a subquery. Subqueries can help you structure each part of your query into its own isolated statement.
Think of a subquery as a way to create an ad hoc table of data on the fly. Each subquery must be surrounded by parentheses.
Like any query, subqueries can return a scalar (a single value), a column, a row, or an entire table (rows of multiple columns).
The outputs of a subquery must match the inputs being requested from the outer query. In this case, a scalar is returned and our WHERE
clause performs a comparison against that value as if it was a constant number.
Subqueries are often used when you want to filter your data using criteria that requires computation on the entire data set, for example the result of an aggregate function.
In this example the subquery calculates the average price of a product and returns it as a scalar, then the outer query filters for products that are more expensive than the average. This is an example of a comparison that can't be done with a join.
SELECT name, price
FROM products
WHERE price > (
SELECT avg(price)
FROM products
);
When a subquery returns a single column of values the results can be used with the IN
or NOT IN
operator. This operator checks whether the value from the outer query is present in the results of the subquery and returns TRUE
or FALSE
.
Note however that in the case where your subquery results contain NULL
values, IN
and NOT IN
will also return NULL
!
Note also that a subquery can be used to retrieve data from the same table as its outer query or a different table altogether, as in this example.
In this example we find products that have never been ordered. The subquery finds any products that are associated with an order and then the outer query ensures we filter those products out.
Note that this example could be rewritten to use a LEFT JOIN
instead. Go back to JOIN to see the technique for an "anti-JOIN".
The choice of whether to use a subquery or a JOIN
depends on a number of factors and there is usually not an absolute reason to pick one over the other. However for tasks where you need to combine and return data from multiple tables, JOIN
s generally perform better most of the time.
On the other hand, you may find subqueries to be more readable if they logically align better with your specific task.
SELECT name
FROM products
WHERE id NOT IN (
SELECT product_id
FROM orders
);
Subqueries allow you to reference tables selected outside of the scope of the subquery.
This is known as a correlated subquery and it can cause the inner query to be executed for each row processed by the outer query instead of just once. This can have performance considerations so be mindful if you need to process a lot of data.
This example finds the most expensive product in each category. The subquery references the table of the outer query to correlate the category of each set of products. This effectively "groups" the data by category in the subquery allowing it to find the maximum price in each category.
Notice how this example could be rewritten using a GROUP BY
. Try to see if you can replicate the results using this method. Again, determining when to use which strategy has a number of factors and you may find cases where only one strategy may work.
SELECT name, category, price
FROM products p1
WHERE price = (
SELECT max(price)
FROM products p2
WHERE p1.category = p2.category
)
ORDER BY price DESC;
You can check to see if a subquery returned any results at all by using the EXISTS (...)
or NOT EXISTS (...)
operators.
This is similar to IN
except instead of checking whether a value is present in a result, it checks whether any results exist at all and only returns either TRUE
or FALSE
. This can be useful when the result you are checking may contain NULL
values.
In this example we find the name of any products that were ordered in the past month. The subquery finds any orders from the past month that match the product id supplied in the outer query.
Since we only care about the product names and not about any information about the orders themselves, EXISTS
is a good choice because it discards the results of the subquery and only returns a boolean value. Most database engines optimize the subquery so it doesn't matter if you use SELECT *
, SELECT 1
, etc.
When deciding whether to use EXISTS
, IN
or other operators to solve your problem, there can be differences in performance. Be mindful of performance but ensure that your query is returning the correct data first. See Performance and EXPLAIN for more guidance on how to analyze your queries for performance.
SELECT name as product_ordered_past_month
FROM products p
WHERE EXISTS (
SELECT *
FROM orders o
WHERE p.id = o.product_id
AND o.date_ordered
BETWEEN date('now', '-1 month')
AND date('now')
);
You can perform comparisons using subqueries that return more than one column of information.
In this example we find repeat orders of a particular order we made.
Subqueries that return multiple values can utilize operators like IN
, BETWEEN ... AND
and even =
or >
.
Comparison operators can accept these multi-column values which are known as "row values" since they represent a row of data. They are also sometimes called "tuples".
Database engines may have their own rules for how to compare row values with each other. See Row Valuesfor SQLite's documentation on how it deals with using and comparing row values.
SELECT p.name AS product, o.quantity, o.date_ordered
FROM orders o
INNER JOIN products p
ON o.product_id = p.id
WHERE (p.id, o.quantity) IN (
SELECT product_id, quantity
FROM orders
WHERE id = 1
);
So far we've seen subqueries used as part of filtering a query. Subqueries can return entire tables of data so they can be used anywhere a table would normally be used.
This example uses a technique called a "derived table" and it can be used to logically separate parts of your query, or perform pre-computations on data before incorporating it into the outer query.
This example shows how we can track products that are low in stock alongside how much new stock we are expecting to receive from our pending orders.
SELECT products.name AS product,
products.remaining_stock,
pending_orders.q AS incoming_stock
FROM (
SELECT product_id AS pid, quantity AS q
FROM orders
WHERE date_arrived IS NULL
) AS pending_orders
INNER JOIN products
ON pending_orders.pid = products.id
WHERE products.remaining_stock < 5;
Subqueries can be nested multiple times. There is no standard limit for the number of subqueries you can nest but be mindful that the query can become harder to read with each level of nesting.
If you find yourself nesting queries deeply, try to spot the pattern you are using that requires nesting and see if there is an alternative query possible.
An example of this is seen here with multiple correlated queries that could be rewritten with multiple joins instead.
In cases where you want to arbitrarily nest a query whose input depends on its own output, SQL offers the ability to create recursive queries using the WITH RECURSIVE
expression.
SELECT
vendor_sales.name AS vendor,
round(vendor_sales.total_spend, 2) AS total_2mo_spend
FROM (
SELECT
v.name,
SUM(product_orders.quantity * product_orders.price) AS total_spend
FROM vendors v
INNER JOIN (
SELECT
p.vendor_id,
p.id,
p.price,
o.quantity
FROM products p
INNER JOIN orders o
ON p.id = o.product_id
WHERE o.date_ordered BETWEEN
date('now', '-2 months')
AND date('now')
) product_orders
ON v.id = product_orders.vendor_id
GROUP BY v.name
) AS vendor_sales
ORDER BY total_2mo_spend DESC;
-- Equivalent:
/*
SELECT v.name AS vendor,
round(sum(o.quantity * p.price), 2) AS total_2mo_spend
FROM vendors v
INNER JOIN products p
ON v.id = p.vendor_id
INNER JOIN orders o
ON p.id = o.product_id
WHERE o.date_ordered BETWEEN
date('now', '-2 month')
AND date('now')
GROUP BY v.name
ORDER BY total_2mo_spend DESC;
*/