Should a multi tenant system with SQL Server 2016, Shard or have Tenant isolation via separate database per tenant?
The gotcha with sharding is that the application has to know which shard to query. Generally, this is done by sharding on something like client. I'll adapt one of my old blog posts to use as my answer.
When you’re building an application for lots of clients, there’s two common ways to design the database(s):
- Option A: Put all clients in the same database
- Option 2: Build one database per client
Putting All the Clients in the Same Database
It’s simple: just add a Client table at the top of the schema, add a ClientUsers table to make sure people only see their own data, and away we go.
Benefits of this approach:
Easier schema management. When developers deploy a new version of the application, they only have to make schema changes in one database. There’s no worries about different customers being out of sync or on the wrong version.
Easier performance tuning. We can check index usage and statistics in just one place, implement improvements easily, and see the effects immediately across all our clients. With hundreds or thousands of databases, even the smallest change can be difficult to coordinate. We can check our procedure cache contents and know for certain which queries or stored procedures are the most intensive across our entire application, whereas if we’re using separate databases per client, we may have a tougher time aggregating query use across different execution plans.
Easier to build an external API. If we need to grant access to our entire database for outsiders to build products, we can do that easier if all of the data is in a single database. If the API has to deal with grouping data from multiple databases on multiple servers, it adds development and testing time. (On the other hand, that “multiple servers” thing starts to hint at a restriction for the one-database-to-rule-them-all scenario: one database usually means all our load impacts just one database server.) In your case, with PowerBI, having everyone in one database will make managing connections much easier.
Easier high availability & disaster recovery. It’s really, really simple to manage database mirroring, log shipping, replication, and clustering if all we have to worry about is just one database. We can build a heck of an infrastructure quickly.
Putting Each Client in its Own Database or Shard
You still need a client listing, but now it becomes a directory - for each client, you also track the shard it lives in. On startup, your app queries this table, and caches it in RAM. When it needs data for a client, it connects directly to that shard (database & server).
Benefits of this approach:
Easier single-client restores. Clients are unreliable meatbags. (Except mine – they’re reliable meatbags.) They have all kinds of “oops” moments where they want to retrieve all of their data back to a point in time, and that’s a huge pain in the rear if their data is intermingled with other client data in the same tables. Restores in a single-client-database scenario are brain-dead easy: just restore the client’s database. No one else is affected.
Easier data exports. Clients love getting their hands on their data. They want the security of knowing they can get their data out anytime they want, avoiding the dreaded vendor lock-in scenario, and they want to do their own reporting. With each client’s data isolated into their own database, we can simply give them a copy of their own database backup. We don’t have to build data export APIs.
Easier multi-server scalability. When our application needs more power than we can get from a single server, we can divide up the databases between multiple servers. We can also spread out the load geographically, putting servers in Asia or Europe to be closer to clients.
Easier per-client performance tuning. If some clients use different features or reports, we can build a specialized set of indexes or indexed views just for those clients without growing everyone’s data size. Granted, there’s some risk here – by allowing schema differences between clients, we’ve just made our code deployments a little riskier and our performance management more difficult.
Easier security management. As long as we’ve properly locked down security with one user per database, we don’t have to worry about Client X accessing Client Y’s data. However, if we just use a single login for everyone, then we haven’t really addressed this concern.
Easier maintenance windows. In a global environment where customers are scattered around the globe, it’s easier to take customers offline for maintenance if we can do it in groups or zones.
Which one is right for you?
There’s no one right choice: you have to know your own company’s strengths and weaknesses. Let’s take two of my clients as examples.
Company A excels at hardware performance tuning. They’re really, really good at wringing the very last bit of performance out of hardware, and they don’t mind replacing their SQL Server hardware on a 12-18 month cycle. (They refresh web servers every 4-6 months!) Their Achilles’ heel is extreme compliance and security requirements. They have incredible auditing needs, and it’s just easier for them to implement bulletproof controls on a single server, single database than it is to manage those requirements across thousands of databases on dozens of servers. They chose one database, one server, many clients.
Company 2 excels at development practices. Managing schema changes and code deployments across thousands of databases just isn’t an issue for them. They have clients around the world, and they’re processing credit card transactions for those clients around the clock. They need the ability to spread load geographically, and they don’t want to replace servers around the world every 12-18 months. They chose one database for each client, and it’s paying off as they start to put SQL Servers in Asia and Europe for their offshore clients.
One further consideration I haven't yet seen in other answers.
Having a design that allows for many tenants in a single database will give flexibility later. Should load/ scale out/ security/ geo location demands later suggest a tenant should have a separate database it can be created by restoring the currect DB on the new instance. The other tenants' data is still protected by whatever mechanisms were in place. The now-obsolete data can be removed piecemeal from both old and new databases as time permits.
The reverse is not true. Consolidating many one-tenant databases will require considerably more work.
One practice that makes multi-tenant models much easier, even though it breaks normalization*, is to include a column on every table for the tenant. You could call it TenantID. That way every query run against the database can filter on TenantID on every table, and you can use database partitioning to isolate the data for each tenant and speed up queries by having aligned partitions. Much easier to have all tenants in one database this way.
* It doesn't always break normalization, but it can. For example, if you have a Person
and a PersonAddress
table. The Person
table will have TenantID, PersonID
as the Primary Key. The PersonAddress
table will have TenantID, PersonID, AddressTypeID
as the Primary Key with what I am suggesting.
Normally just PersonID
would be enough, because you could join that back to the Person
table to find the Tenant
. I am suggesting you carry TenantID
forward to every subsequent table, even when a thinner key would work.
It was my understanding that carrying forward any information to a table that could be derived from other data was considered breaking normalization. But perhaps using thin keys is just a best practice.