Find uncompressed size of all tables in a database

I need the actual size the uncompressed data will have.
...
I would prefer to have the size as correct as possible.

While the desire for this information is certainly understandable, getting it, especially in the context of "as correct as possible", is trickier than it looks due to faulty assumptions. Whether using the uncompressed shadow-table idea mentioned in the question, or @sp_BlitzErik's suggestion in a comment to restore the DB and uncompress it there to check, it should not be assumed that the size of the uncompressed table == the size of that data in memory on the app server:

  1. Are all rows in the table being cached, or just rows within a certain range? The assumption here is that it is all rows, and that might be correct, but I figured it should at least be mentioned that this might not be the case (unless documentation states otherwise; it is a minor point, but I didn't want it to go unmentioned).

    Question was updated to state: yes, all rows are being cached.

  2. Structure overhead

    1. On the DB side:
      How many rows fit on a page is determined by many factors that can throw off estimates. Even with a FILLFACTOR of 100 (or 0), there is still likely to be some unused space left over on the page simply because it is not enough for an entire row, and that is in addition to the page header. Also, if any Snapshot Isolation functionality is enabled, there will be, I believe, an extra 14 bytes per row taken up by the row-versioning info, and that will throw off estimates. There is other minutia related to the actual size of the row (the NULL bitmap, variable-length columns, etc.), but the items mentioned thus far should alone make the point. (The sketch just after this list shows one way to see the actual per-row and per-page overhead of an existing table.)
    2. On the app server side:
      What type of collection is being used to store the cached results? I assume this is a .NET app, so is it a DataTable? A generic list? A SortedDictionary? Each type of collection has a different amount of overhead. I would not expect any of the options to necessarily mirror the page and row overheads on the DB side, especially at scale (a small number of rows might not have enough variance to matter, but you aren't looking for differences of hundreds of bytes or just a few kB).
  3. Datatypes
    1. On the DB side:
      CHAR / VARCHAR data is stored at 1 byte per character (ignoring double-byte characters for the moment). XML is optimized to not take up nearly as much space as the text representation would imply: this datatype creates a dictionary of element and attribute names and replaces the actual references to them in the document with their respective IDs (kinda nice, actually). Otherwise, the string values are all UTF-16 (2 or 4 bytes per "character"), just like NCHAR / NVARCHAR. DATETIME2 is between 6 and 8 bytes. DECIMAL is between 5 and 17 bytes (depending on the precision). (The sketch just after this list uses DATALENGTH to show some of these per-type sizes directly.)
    2. On the app server side:
      Strings (again, assuming .NET) are always UTF-16. There is no optimization for 8-bit strings such as what VARCHAR holds. BUT, strings can also be "interned" which is a shared copy that can be referenced many times (but I don't know if this works for strings in collections, or if so, if it works for all types of collections). XML may or may not be stored the same way in memory (I will have to look that up). DateTime is always 8 bytes (like T-SQL DATETIME, but not like DATE, TIME, or DATETIME2). Decimal is always 16 bytes.
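
To make the DB-side points above (items 2.1 and 3.1) concrete, here is a small T-SQL sketch; [SchemaName].[TableName] is a placeholder, and this only illustrates what SQL Server stores, not what the app server will hold in memory. The first query reports the actual average row size and page fullness for an existing table, and the second uses DATALENGTH to show per-type byte sizes:

-- Average row size and page fullness for an existing table (placeholder name);
-- 'SAMPLED' (or 'DETAILED') mode populates the size columns; 'LIMITED' leaves them NULL.
SELECT ips.index_id,
       ips.page_count,
       ips.record_count,
       ips.avg_record_size_in_bytes,
       ips.avg_page_space_used_in_percent
FROM   sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'[SchemaName].[TableName]'),
                                      NULL, NULL, 'SAMPLED') ips
WHERE  ips.index_id IN (0, 1);   -- heap or clustered index only

-- Per-type byte sizes: VARCHAR is 1 byte per character, NVARCHAR is 2,
-- and the XML type stores an internal (binary) form rather than the raw text.
DECLARE @v VARCHAR(100)  = 'Hello',
        @n NVARCHAR(100) = N'Hello',
        @x XML           = N'<root><item id="1">Hello</item></root>';
SELECT DATALENGTH(@v)                          AS VarcharBytes,    -- 5
       DATALENGTH(@n)                          AS NvarcharBytes,   -- 10
       DATALENGTH(@x)                          AS XmlInternalBytes,
       DATALENGTH(CONVERT(NVARCHAR(MAX), @x))  AS XmlAsUtf16TextBytes;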

All of that to say: there is pretty much nothing you can do on the DB side to get even a fairly accurate memory-footprint size on the app server side. You need to find a way to interrogate the app server itself, after it has been loaded with a particular table, to know how big it is. And I am not sure if a debugger would let you see the runtime size of a filled collection. If not, then the only way to get close would be to go through all rows of a table, multiplying each column by the appropriate .NET size (e.g. INT = * 4, VARCHAR = DATALENGTH() * 2, NVARCHAR = DATALENGTH(), XML = ?, etc.), but that still leaves the question of the overhead of the collection plus each element of the collection.

Given the new details in the question, one could probably do the following query to get rather close. And it doesn't matter whether the table is compressed or not, though it's up to each person to determine if scanning all rows is appropriate on Production (maybe do it from a restore, or during off-peak hours):

SELECT
   -- NVARCHAR is already UTF-16, same as a .NET string: byte count carries over as-is
   SUM( DATALENGTH([NVarcharColumn_1]) + DATALENGTH([NVarcharColumn_N]) ) + 
   -- VARCHAR is 1 byte per character on disk but becomes UTF-16 (2 bytes per character) in .NET
   SUM( (DATALENGTH([VarcharColumn_1]) + DATALENGTH([VarcharColumn_N])) * 2 ) + 
   -- fixed-width types: .NET size in bytes, once per column of that type, per row
   SUM(4 * [number_of_INT_columns]) +
   SUM(8 * [number_of_BIGINT_and_DATETIME_columns]) +
   SUM(16 * [number_of_DECIMAL/NUMERIC_and_UNIQUEIDENTIFIER_columns]) +
   etc..
FROM [SchemaName].[TableName] WITH (NOLOCK) -- assuming no Snapshot Isolation

But remember, this doesn't account for collection or collection-element overhead. And I am not sure if we can get that value without a debugger (or possibly something like ILSpy, but I am not recommending that as it might violate the EULA depending on local laws).


From your question it seems as if you have a maximum cache size S and you don't want to load tables into the cache that exceed that size. If that's true then you don't need to know the exact size of each table. You just need to know if a table is bigger or smaller than the maximum cache size S. That is a significantly easier problem depending on your tables' column definitions and row counts.

I agree with Solomon Rutzky's great answer in that looking at uncompressed data isn't the way to go and it might be difficult to come up with a good approximation for the true size of a table in cache. However, I'm going to work within the framework of the question and assume that you can develop a formula that's close enough based on column definitions for static data types and the actual length of your dynamic columns.

If you have that mapping of data types to cache size then you should be able to evaluate some tables without even looking at the data in them:

  1. If a table only has static data types (no strings or blobs) then you could get the number of rows by looking at sys.partitions and calculate the size of the table using the column definitions (see the sketch after this list).
  2. If a table with lots of rows has enough static data type columns then you may be able to eliminate it as too large without looking at its data. For example, a table with 10 million rows and 5 BIGINT columns holds at least 10000000 * (8+8+8+8+8) = 400000000 bytes of data, which could be larger than your cache size limit S. It doesn't matter if it has a bunch of string columns as well.
  3. If a table has few enough rows then you may be able to confirm that it's below the limit simply by assuming that each dynamic data type has its maximum possible size. For example, a 100 row table with a BIGINT column and an NVARCHAR(20) column cannot exceed 100 * (8 + 2 * 20) = 4800 bytes of column data.
  4. It might be true that if a table's compressed size in SQL Server is already larger than S by some factor then it's extremely unlikely to fit in the cache. You'd have to do testing to figure out if such a factor exists.
  5. You could get lucky in that all of the dynamic columns happen to have statistics on them. Statistics contain information about the average length, and that may be accurate enough for your purposes (the end of the sketch below shows where to find it).
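
As a rough sketch of ideas 1, 2, and 5 above, the following T-SQL pulls row counts from sys.partitions and declared fixed-width byte sizes from sys.columns, so tables can be sized, or at least given a lower bound, without reading any rows. The DB-declared widths are only a stand-in for whatever data-type-to-cache-size mapping you end up with (for example, CHAR/NCHAR columns would need to be doubled if they become UTF-16 .NET strings), and the object and statistic names at the end are placeholders:

-- Estimate per-table size from metadata only. For a table whose columns are all
-- fixed-width, EstimatedFixedBytes is the full (DB-declared) data size; for tables
-- that also have variable-width columns it is a lower bound (idea 2).
WITH FixedWidth AS
(
    SELECT c.[object_id],
           SUM(CONVERT(BIGINT, c.max_length)) AS FixedBytesPerRow
    FROM   sys.columns c
    JOIN   sys.types   t ON t.user_type_id = c.user_type_id
    WHERE  t.[name] NOT IN (N'varchar', N'nvarchar', N'varbinary', N'xml',
                            N'text', N'ntext', N'image', N'sql_variant')
    GROUP BY c.[object_id]
),
RowCounts AS
(
    SELECT p.[object_id],
           SUM(p.[rows]) AS ApproxRows
    FROM   sys.partitions p
    WHERE  p.index_id IN (0, 1)   -- heap or clustered index only (avoid double counting)
    GROUP BY p.[object_id]
)
SELECT OBJECT_SCHEMA_NAME(r.[object_id])  AS SchemaName,
       OBJECT_NAME(r.[object_id])         AS TableName,
       r.ApproxRows,
       f.FixedBytesPerRow,
       r.ApproxRows * f.FixedBytesPerRow  AS EstimatedFixedBytes
FROM   RowCounts  r
JOIN   FixedWidth f ON f.[object_id] = r.[object_id]
WHERE  OBJECTPROPERTY(r.[object_id], 'IsUserTable') = 1
ORDER BY EstimatedFixedBytes DESC;

-- Idea 5: the statistics header reports an average key length (in bytes), which can
-- stand in for the average length of a dynamic column if a statistic exists on it.
-- Table and statistic names are placeholders.
DBCC SHOW_STATISTICS (N'SchemaName.TableName', N'StatisticName') WITH STAT_HEADER;

Anything whose EstimatedFixedBytes already exceeds S can be excluded without ever scanning it, which is idea 2 in action.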

You may have to query the data of tables that don't fit any of the above criteria. There are some tricks that you can use to minimize the performance impact of this. I would say that you have two competing priorities here: you value accuracy but also don't want to scan all of the data in your database. It may be possible to add some kind of buffer to your calculations. I don't know if it's more acceptable to exclude a table that's slightly under the maximum cache size S or to include a table that's slightly above it.

Here are some ideas for making the queries that look at table data faster:

  1. For large tables you may be able to use TABLESAMPLE as long as your sample size is large enough (see the sketch after this list).
  2. For large tables with a clustered key it may be useful to process them in batches on the clustered key. Unfortunately I don't know of a way to compute a SUM() that quits early based on the value of that aggregate. I've only ever seen that work for ROW_NUMBER(). But you could scan the first 10% of the table, save off the calculated data size, scan the next 10%, and so on. For tables that are too large for the cache you may be able to save a significant amount of work with this approach by quitting early (also sketched after this list).
  3. For some tables you may be lucky enough to have covering indexes on all of the dynamic columns. Depending on row size or other factors, scanning one index at a time could be faster than doing a table scan. You could also quit this process early if the table size is too large after reading an index on a single column.
  4. The average lengths of your dynamic columns might not be changing very much over time. It might be practical to save off the average lengths that you calculate and to use those values in your calculations for a while. You can reset these values based on DML activity in the tables or based on some other metric.
  5. If it's possible to run tests over all tables to develop an algorithm then you may be able to take advantage of patterns in the data. For example, if you process tables starting with the smallest first you might find that once you process 10 (I made this number up) tables in a row that are too large for the cache then it's very unlikely that any larger tables would fit in the cache. This might be acceptable if it's okay to exclude a few tables that could have possibly fit in the cache.
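
Here is a rough sketch of ideas 1 and 2 from this list, against a hypothetical [dbo].[BigTable] with a BIGINT clustered key [Id], one NVARCHAR column, and one VARCHAR column, using the same per-type .NET sizing assumptions as the query in the first answer. The TABLESAMPLE query estimates the total from a page sample; the loop walks the clustered key in batches and quits as soon as the running total proves the table is too large for a cache limit of @MaxCacheBytes:

-- Idea 1: page-based sample, scaled up (results vary from run to run; use a
-- sample that is large enough for your accuracy needs).
SELECT SUM(CONVERT(BIGINT,
              8                                 -- BIGINT key as a .NET long
            + DATALENGTH([NVarcharColumn])      -- already UTF-16
            + DATALENGTH([VarcharColumn]) * 2   -- 1 byte/char -> UTF-16
          )) * 100 AS EstimatedTotalBytes
FROM [dbo].[BigTable] TABLESAMPLE (1 PERCENT);

-- Idea 2: batch over the clustered key and stop early once the table is proven
-- too large for the cache.
DECLARE @MaxCacheBytes BIGINT = 400000000,   -- the cache limit S, in bytes
        @BatchSize     INT    = 100000,
        @RunningBytes  BIGINT = 0,
        @BatchBytes    BIGINT,
        @LastId        BIGINT = 0,
        @RowsRead      INT    = 1;

WHILE @RowsRead > 0 AND @RunningBytes <= @MaxCacheBytes
BEGIN
    ;WITH Batch AS
    (
        SELECT TOP (@BatchSize)
               [Id],
               DATALENGTH([NVarcharColumn]) AS NvBytes,
               DATALENGTH([VarcharColumn])  AS VBytes
        FROM   [dbo].[BigTable]
        WHERE  [Id] > @LastId
        ORDER BY [Id]
    )
    SELECT @RowsRead   = COUNT(*),
           @LastId     = MAX([Id]),
           @BatchBytes = SUM(CONVERT(BIGINT, 8 + ISNULL(NvBytes, 0) + ISNULL(VBytes, 0) * 2))
    FROM Batch;

    SET @RunningBytes = @RunningBytes + ISNULL(@BatchBytes, 0);
END;

SELECT @RunningBytes AS EstimatedBytesSoFar,
       CASE WHEN @RunningBytes > @MaxCacheBytes
            THEN 'too large for the cache (stopped early)'
            ELSE 'fits within the limit' END AS Verdict;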

The sketches above are only rough outlines of these ideas. Let me know if it would be helpful to write up more complete demo code for any of the ideas that I discussed here.