Why are joins bad when considering scalability?
Scalability is all about pre-computing (caching), spreading out, or paring down the repeated work to the bare essentials, in order to minimize resource use per work unit. To scale well, you don't do anything you don't need to in volume, and the things you actually do need to do, you make sure are done as efficiently as possible.
In that context, of course joining two separate data sources is relatively slow, at least compared to not joining them, because it's work you need to do live at the point where the user requests it.
But remember the alternative is no longer having two separate pieces of data at all; you have to put the two disparate data points in the same record. You can't combine two different pieces of data without a consequence somewhere, so make sure you understand the trade-off.
The good news is that modern relational databases are good at joins. With a good database used well, you shouldn't really think of joins as slow. There are a number of scalability-friendly ways to take raw joins and make them much faster (minimal sketches of several of these follow the list):
- Join on a surrogate key (autonumber/identity column) rather than a natural key. This means smaller (and therefore faster) comparisons during the join operation
- Indexes on the columns you join on (foreign keys in particular)
- Materialized/indexed views (think of this as a pre-computed join or managed de-normalization)
- Computed columns. You can use these to hash or otherwise pre-compute the key columns of a join, such that what would be a complicated comparison for a join is now much smaller and potentially pre-indexed.
- Table partitions (helps with large data sets by spreading the load out to multiple disks, or limiting what might have been a table scan down to a partition scan)
- OLAP (pre-computes the results of certain kinds of queries/joins. It's not strictly accurate, but you can think of this as generic denormalization)
- Replication, Availability Groups, Log shipping, or other mechanisms that let multiple servers answer read queries for the same database, and thus scale your read workload out among several servers.
- Use of a caching layer like Redis to avoid re-running queries which need complex joins.
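As a concrete illustration of the first two points, here is a minimal T-SQL sketch using a hypothetical Customers/Orders schema (the table and column names are just for illustration): the join compares small integer surrogate keys, and the foreign key column is indexed so the optimizer can seek rather than scan.

```sql
-- Hypothetical schema: Customers keyed by an identity column rather than a
-- natural key (e.g. email address), so the join compares small integers.
CREATE TABLE dbo.Customers
(
    CustomerId  INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate key
    Email       NVARCHAR(320) NOT NULL,
    FullName    NVARCHAR(200) NOT NULL
);

CREATE TABLE dbo.Orders
(
    OrderId     INT IDENTITY(1,1) PRIMARY KEY,
    CustomerId  INT NOT NULL REFERENCES dbo.Customers (CustomerId),
    OrderDate   DATE NOT NULL,
    Total       DECIMAL(12,2) NOT NULL
);

-- An index on the foreign key lets the optimizer seek instead of scan
-- when joining Orders back to Customers.
CREATE INDEX IX_Orders_CustomerId ON dbo.Orders (CustomerId);

-- The join itself now compares 4-byte integers via indexes.
SELECT c.FullName, o.OrderDate, o.Total
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId;
```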
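An indexed view, continuing with the same hypothetical schema, is essentially a pre-computed join: the unique clustered index materializes the result and the engine maintains it as the base tables change (this is the "managed de-normalization" mentioned above). A rough sketch:

```sql
-- A pre-computed join: the indexed view stores the joined, aggregated result
-- and SQL Server keeps it up to date as Orders and Customers change.
CREATE VIEW dbo.vCustomerOrderTotals
WITH SCHEMABINDING
AS
SELECT c.CustomerId,
       c.FullName,
       COUNT_BIG(*) AS OrderCount,
       SUM(o.Total) AS TotalSpent
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId
GROUP BY c.CustomerId, c.FullName;
GO

-- The unique clustered index is what actually materializes the view.
CREATE UNIQUE CLUSTERED INDEX IX_vCustomerOrderTotals
ON dbo.vCustomerOrderTotals (CustomerId);
```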
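For the computed-column idea, a hedged sketch (again with made-up names): instead of joining on a wide natural key such as an email address, both sides carry a persisted, indexed hash of that key, so the join compares fixed-size binary values.

```sql
-- Pre-compute a hash of the wide key on both sides of the join and index it.
ALTER TABLE dbo.Customers
ADD EmailHash AS CONVERT(BINARY(32), HASHBYTES('SHA2_256', Email)) PERSISTED;

CREATE INDEX IX_Customers_EmailHash ON dbo.Customers (EmailHash);

-- Hypothetical staging table that arrives keyed by customer email.
CREATE TABLE dbo.StagedOrders
(
    StagedOrderId INT IDENTITY(1,1) PRIMARY KEY,
    CustomerEmail NVARCHAR(320) NOT NULL,
    EmailHash AS CONVERT(BINARY(32), HASHBYTES('SHA2_256', CustomerEmail)) PERSISTED,
    Total         DECIMAL(12,2) NOT NULL
);

CREATE INDEX IX_StagedOrders_EmailHash ON dbo.StagedOrders (EmailHash);

-- Join on the compact hash; keep the original columns in the predicate to
-- guard against hash collisions.
SELECT c.CustomerId, s.Total
FROM dbo.StagedOrders AS s
JOIN dbo.Customers AS c
  ON c.EmailHash = s.EmailHash
 AND c.Email     = s.CustomerEmail;
```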
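Table partitioning (SQL Server syntax, hypothetical names again): splitting a large table by date range means a query filtered to one range touches only that partition, and the partitions can be mapped to separate filegroups/disks to spread I/O.

```sql
-- Split orders by year; a query filtered to one year scans one partition,
-- not the whole table.
CREATE PARTITION FUNCTION pfOrdersByYear (DATE)
AS RANGE RIGHT FOR VALUES ('2023-01-01', '2024-01-01', '2025-01-01');

CREATE PARTITION SCHEME psOrdersByYear
AS PARTITION pfOrdersByYear ALL TO ([PRIMARY]);  -- or map each range to its own filegroup/disk

CREATE TABLE dbo.OrdersPartitioned
(
    OrderId    INT NOT NULL,
    CustomerId INT NOT NULL,
    OrderDate  DATE NOT NULL,
    Total      DECIMAL(12,2) NOT NULL,
    -- The partitioning column must be part of the unique/clustered key.
    CONSTRAINT PK_OrdersPartitioned PRIMARY KEY (OrderDate, OrderId)
) ON psOrdersByYear (OrderDate);
```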
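OLAP itself lives in a separate engine (cubes, columnstores, etc.), but as a rough relational stand-in for the pre-computation it does, a scheduled summary table gives the same "the answer is already computed" effect. Purely illustrative:

```sql
-- Pre-aggregated answers, refreshed from a scheduled job rather than computed
-- per query. (A stand-in for what an OLAP engine does, not OLAP itself.)
CREATE TABLE dbo.DailySalesSummary
(
    OrderDate   DATE NOT NULL PRIMARY KEY,
    OrderCount  INT NOT NULL,
    TotalSales  DECIMAL(18,2) NOT NULL
);

-- Run nightly: rebuild the pre-computed results.
TRUNCATE TABLE dbo.DailySalesSummary;
INSERT INTO dbo.DailySalesSummary (OrderDate, OrderCount, TotalSales)
SELECT o.OrderDate, COUNT(*), SUM(o.Total)
FROM dbo.Orders AS o
GROUP BY o.OrderDate;
```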
I would go as far as saying the main reason relational databases exist at all is to allow you to do joins efficiently*. It's certainly not just to store structured data (you could do that with flat-file constructs like CSV or XML). A few of the options I listed will even let you completely build your join in advance, so the results are already done before you issue the query, just as if you had denormalized the data (admittedly at the cost of slower write operations).
If you have a slow join, you're probably not using your database correctly.
De-normalization should be done only after these other techniques have failed. And the only way you can truly judge "failure" is to set meaningful performance goals and measure against those goals. If you haven't measured, it's too soon to even think about de-normalization.
* That is, exist as entities distinct from mere collections of tables. An additional reason for a real RDBMS is safe concurrent access.
Joins can be slower than avoiding them through de-normalisation, but if used correctly (joining on columns with appropriate indexes and so on) they are not inherently slow.
De-normalisation is one of many optimisation techniques you can consider if your well designed database schema exhibits performance problems.