Every SQL query we had written till now was only finding data from 1 table. Most of the queries we had written in the previous classes were on the film
table where we applied multiple filters etc. But do you think being able to query data from a single table is enough? Let’s take a scenario of school. Let’s say we have 2 tables as follows in the school database:
Suppose, someone asks you to print the name of every student, along with the name of their batch. The output should be something like:
Will you be able to get all of this data by querying over a single table? No. The student_name
is there in the students table, while the batch_name
is in the batches table! We somehow need a way to combine the data from both the tables. This is where joins come in. What does the word join
mean to you?
Joins, as the name suggests, are a way to combine data from multiple tables. For example, if I want to combine the data from the students
and batches
table, I can use joins for that. Think of joins as a way to stitch rows of 2 tables together, based on the condition you specify. Example: In our case, we would want to stitch a row of students table with a row of batches table based on what? Imagine that every row of students
I try to match with every row of batches
. Based on what condition to be true between those will I stitch them?
We would want to stitch a row of students table with a row of batches table based on the batch_id
column. This is what we call a join condition
. A join condition is a condition that must be true between the rows of 2 tables for them to be stitched together. Let’s see how we can write a join query for our example.
SELECT students.first_name, batches.batch_name
FROM students
JOIN batches
ON students.batch_id = batches.batch_id;
Let’s break down this query. The first line is the same as what we have been writing till now. We are selecting the first_name
column from the students
table and the batch_name
column from the batches
table. The next line is where the magic happens. We are using the JOIN
keyword to tell SQL that we want to join the students
table with the batches
table. The next line is the join condition. We are saying that we want to join the rows of students
table with the rows of batches
table where the batch_id
column of students
table is equal to the batch_id
column of batches
table. This is how we write a join query.
Let’s take an example of this on the Sakila database. Let’s say for every film, we want to print its name and the language. How can we do that?
SELECT film.title, language.name
FROM film
JOIN language
ON film.language_id = language.language_id;
Now, sometimes typing name of tables in the query can become difficult. For example, in the above query, we have to type film
and language
multiple times. To make this easier, we can give aliases to the tables. For example, we can give the alias f
to the film
table and l
to the language
table. We can then use these aliases in our query. Let’s see how we can do that:
SELECT f.title, l.name
FROM film AS f
JOIN language AS l
ON f.language_id = l.language_id;
Self Join
Let’s say at school, for every student we assign a Buddy. For this we have a students
table, which looks as follows:
id | name | buddy_id
This buddy_id
will be an id of what?
Correct. Now, let’s say we have to print for every student, their name and their buddy’s name. How will we do that? Here 2 rows of which tables would we want to stitch together to get this data?
Correct, an SQL query for the same shall look like:
SELECT s1.name, s2.name
FROM students s1
JOIN students s2
ON s1.buddy_id = s2.id;
This is an example of SELF join. A self join is a join where we are joining a table with itself. In the above query, we are joining the students
table with itself. In a self joining, aliasing tables is very important. If we don’t alias the tables, then SQL will not know which row of the table to match with which row of the same table (because both of them have same names as they are the same table only).
SQL query as pseudocode
As we have been doing since the CRUD class, let’s also see how Joins can be represented in terms of pseudocode.
SELECT s1.name, s2.name
FROM students s1
JOIN students s2
ON s1.buddy_id = s2.id;
In pseudocode, it shall look like:
ans = []
for row1 in students:
for row2 in students:
if row1.buddy_id == row2.id:
ans.add(row1 + row2)
for row in ans:
print(row.name, row.name)
More problems on JOIN
Joining multiple tables
Till now, we had only joined 2 tables. But what if we want to join more than 2 tables? Let’s say we want to print the name of every film, along with the name of the language and the name of the original language. How can we do that? If you have to add 3 numbers, how do you do that?
To get the name of the language, we would first want to combine film and language table over the language_id
column. Then, we would want to combine the result of that with the language table again over the original_language_id
column. This is how we can do that:
SELECT f.title, l1.name, l2.name
FROM film f
JOIN language l1
ON f.language_id = l1.language_id
JOIN language l2
ON f.original_language_id = l2.language_id;
Let’s see how this might work in terms of pseudocode:
ans = []
for row1 in film:
for row2 in language:
if row1.language_id == row2.id:
ans.add(row1 + row2)
for row in ans:
for row3 in language:
if row.language_id == row3.language_id:
ans.add(row + row3)
for row in ans:
print(row.name, row.language_name, row.original_language_name)
Joins with multiple conditions in ON clause
Till now, whenever we did a join, we joined based on only 1 condition. Like in where clause we can combine multiple conditions, in Joins as well, we can have multiple conditions.
Let’s see an example. For every film, name all the films that were released in the range of 2 years before or after that film and there rental rate was more than the rate of the movie.
SELECT f1.name, f2.name
FROM film f1
JOIN film f2
ON (f2.year BETWEEN f1.year - 2 AND f1.year + 2) AND f2.rental > f1.rental;
- Join does not need to happen on equality of columns always.
- Join can also have multiple conditions.
A Compound Join is one where Join has multiple conditions on different columns.
Inner vs Outer Joins
While we have pretty much discussed everything that is mostly important to know about joins, there are a few nitty gritties that we should know about.
Let’s take the join query we had written a bit earlier:
SELECT s1.name, s2.name
FROM students s1
JOIN students s2
ON s1.buddy_id = s2.id;
Let’s say there is a student that does not have a buddy, i.e., their buddy_id
is null. What will happen in this case? Will the student be printed?
If you remember what we discussed about CRUD , is NULL equal to anything? Nope. Thus, the row will never match with anything and not get printed. The join that we discussed earlier is also called inner join. You could have also written that as:
SELECT s1.name, s2.name
FROM students s1
INNER JOIN students s2
ON s1.buddy_id = s2.id
The keyword INNER is optional. By default a join is INNER join.
As you see, an INNER JOIN doesn’t include a row that didn’t match the condition for any combination.
Opposite of INNER JOIN is OUTER JOIN. Outer Join will include all rows, even if they don’t match the condition. There are 3 types of outer joins:
- Left Join
- Right Join
- Full Join
As the names convey, left join will include all rows from the left table, right join will include all rows from the right table and full join will include all rows from both the tables.
Let’s take an example to understand these well:
Assume we have 2 tables: students and batches with following data:
Now let’s write queries to do each of these joins:
SELECT s.name, b.batch_name
FROM students s
LEFT JOIN batches b
ON s.batch_id = b.batch_id;
SELECT s.name, b.batch_name
FROM students s
RIGHT JOIN batches b
ON s.batch_id = b.batch_id;
SELECT s.name, b.batch_name
FROM students s
FULL OUTER JOIN batches b
ON s.batch_id = b.batch_id;
Now let’s use different types of joins and tell me which row do you think will not be a part of the join.
Output of LEFT JOIN (Go row by row in left table – which is students and then look for match/matches):
John batchA
Jane batchA
Jim NULL
Ram NULL
Sita batchB
Output of RIGHT JOIN (Go row by row in right table – which is batches table and then look for match/matches): batchA has 2 matches – John and Jane batchB has 1 match – Sita batchC has 0 match – NULL
John batchA
Jane batchA
Sita batchB
NULL batchC
Output of FULL JOIN (Do the left join. Then look at every row of right table which is batches
and figure out rows which were not printed yet – print them with null match)
John batchA
Jane batchA
Jim NULL
Ram NULL
Sita batchB
NULL batchC
Join with WHERE v/s ON
Let’s take an example to discuss this. If we consider a simple query:
SELECT *
FROM A
JOIN B
ON A.id = B.id;
In pseudocode, it will look like:
ans = []
for row1 in A:
for row2 in B:
if (ON condition matches):
ans.add(row1 + row2)
for row in ans:
print(row.id, row.id)
Here, the size of intermediary table (ans
) will be less than n*m
because some rows are filtered.
We can also write the above query in this way:
SELECT *
FROM A, B
WHERE A.id = B.id;
The above query is nothing but a CROSS JOIN behind the scenes which can be written as:
SELECT *
FROM A
CROSS JOIN B
WHERE A.id = B.id;
Here, the intermediary table A CROSS JOIN B
is formed before going to WHERE condition.
In pseudocode, it will look like:
ans = []
for row1 in A:
for row2 in B:
ans.add(row1 + row2)
for row in ans:
if (WHERE condition matches):
print(row.id, row.id)
The size of ans
is always n*m
because table has cross join of A and B. The filtering (WHERE condition) happens after the table is formed.
From this example, we can see that:
- The size of the intermediary table (
ans
) is always greater or equal when using WHERE compared to using the ON condition. Therefore, joining with ON uses less internal space. - The number of iterations on
ans
is higher when using WHERE compared to using ON. Therefore, joining with ON is more time efficient.
In conclusion,
- The ON condition is applied during the creation of the intermediary table, resulting in lower memory usage and better performance.
- The WHERE condition is applied during the final printing stage, requiring additional memory and resulting in slower performance.
- Unless you want to create all possible pairs, avoid using CROSS JOINS.
UNION and UNION ALL
Sometimes, we want to print the combination of results of multiple queries. Let’s take an example of the following tables:
You are asked to print the names of everyone associated with school. So, in the result we will have one column with all the names.
We can’t have 3 SELECT name queries because it will not produce this singular column. We basically need SUM of such 3 queries. Join is used to stitch or combine rows, here we need to add the rows of one query after the other to create final result.
UNION allows you to combine the output of multiple queries one after the other.
SELECT name FROM students
UNION
SELECT name FROM employees
UNION
SELECT name FROM investors;
Now, as the output is added one after the other, there is a constraint: Each of these individual queries should output the same number of columns.
Note that, you can’t use ORDER BY for the combined result because each of these queries are executed independently.
UNION outputs distinct values of the combined result. It stores the output of individual queries in a set and then outputs those values in final result. Hence, we get distinct values. But if we want to keep all the values, we can use UNION ALL. It stores the output of individual queries in a list and gives the output, so we get all the duplicate values.
If you want to perform any operation on the combined result, you put them in braces and give it an alias. For example,
SELECT
first_name, last_name
FROM
(SELECT first_name, last_name
FROM customer
UNION
SELECT first_name, last_name
FROM actor) AS some_alias
ORDER BY first_name, last_name
LIMIT 10