SQL left join vs multiple tables on FROM line?
The JOIN syntax keeps conditions near the table they apply to. This is especially useful when you join a large amount of tables.
By the way, you can do an outer join with the first syntax too:
WHERE a.x = b.x(+)
Or
WHERE a.x *= b.x
Or
WHERE a.x = b.x or a.x not in (select x from b)
The old syntax, with just listing the tables, and using the WHERE
clause to specify the join criteria, is being deprecated in most modern databases.
It's not just for show, the old syntax has the possibility of being ambiguous when you use both INNER and OUTER joins in the same query.
Let me give you an example.
Let's suppose you have 3 tables in your system:
Company
Department
Employee
Each table contain numerous rows, linked together. You got multiple companies, and each company can have multiple departments, and each department can have multiple employees.
Ok, so now you want to do the following:
List all the companies, and include all their departments, and all their employees. Note that some companies don't have any departments yet, but make sure you include them as well. Make sure you only retrieve departments that have employees, but always list all companies.
So you do this:
SELECT * -- for simplicity
FROM Company, Department, Employee
WHERE Company.ID *= Department.CompanyID
AND Department.ID = Employee.DepartmentID
Note that the last one there is an inner join, in order to fulfill the criteria that you only want departments with people.
Ok, so what happens now. Well, the problem is, it depends on the database engine, the query optimizer, indexes, and table statistics. Let me explain.
If the query optimizer determines that the way to do this is to first take a company, then find the departments, and then do an inner join with employees, you're not going to get any companies that don't have departments.
The reason for this is that the WHERE
clause determines which rows end up in the final result, not individual parts of the rows.
And in this case, due to the left join, the Department.ID column will be NULL, and thus when it comes to the INNER JOIN to Employee, there's no way to fulfill that constraint for the Employee row, and so it won't appear.
On the other hand, if the query optimizer decides to tackle the department-employee join first, and then do a left join with the companies, you will see them.
So the old syntax is ambiguous. There's no way to specify what you want, without dealing with query hints, and some databases have no way at all.
Enter the new syntax, with this you can choose.
For instance, if you want all companies, as the problem description stated, this is what you would write:
SELECT *
FROM Company
LEFT JOIN (
Department INNER JOIN Employee ON Department.ID = Employee.DepartmentID
) ON Company.ID = Department.CompanyID
Here you specify that you want the department-employee join to be done as one join, and then left join the results of that with the companies.
Additionally, let's say you only want departments that contains the letter X in their name. Again, with old style joins, you risk losing the company as well, if it doesn't have any departments with an X in its name, but with the new syntax, you can do this:
SELECT *
FROM Company
LEFT JOIN (
Department INNER JOIN Employee ON Department.ID = Employee.DepartmentID
) ON Company.ID = Department.CompanyID AND Department.Name LIKE '%X%'
This extra clause is used for the joining, but is not a filter for the entire row. So the row might appear with company information, but might have NULLs in all the department and employee columns for that row, because there is no department with an X in its name for that company. This is hard with the old syntax.
This is why, amongst other vendors, Microsoft has deprecated the old outer join syntax, but not the old inner join syntax, since SQL Server 2005 and upwards. The only way to talk to a database running on Microsoft SQL Server 2005 or 2008, using the old style outer join syntax, is to set that database in 8.0 compatibility mode (aka SQL Server 2000).
Additionally, the old way, by throwing a bunch of tables at the query optimizer, with a bunch of WHERE clauses, was akin to saying "here you are, do the best you can". With the new syntax, the query optimizer has less work to do in order to figure out what parts goes together.
So there you have it.
LEFT and INNER JOIN is the wave of the future.
Basically, when your FROM clause lists tables like so:
SELECT * FROM
tableA, tableB, tableC
the result is a cross product of all the rows in tables A, B, C. Then you apply the restriction WHERE tableA.id = tableB.a_id
which will throw away a huge number of rows, then further ... AND tableB.id = tableC.b_id
and you should then get only those rows you are really interested in.
DBMSs know how to optimise this SQL so that the performance difference to writing this using JOINs is negligible (if any). Using the JOIN notation makes the SQL statement more readable (IMHO, not using joins turns the statement into a mess). Using the cross product, you need to provide join criteria in the WHERE clause, and that's the problem with the notation. You are crowding your WHERE clause with stuff like
tableA.id = tableB.a_id
AND tableB.id = tableC.b_id
which is only used to restrict the cross product. WHERE clause should only contain RESTRICTIONS to the resultset. If you mix table join criteria with resultset restrictions, you (and others) will find your query harder to read. You should definitely use JOINs and keep the FROM clause a FROM clause, and the WHERE clause a WHERE clause.