SQL Server 2012 compress column for whole table
Is it possible to have column dictionary compression for the whole table, and not just 1 page?
For the situation you are describing, SQL Server's built-in compression probably won't do anything at all. I assume the documents are stored in `NVARCHAR(MAX)` columns to allow for non-ASCII characters and for values longer than 4,000 characters. Off-page values are not compressed by any of SQL Server's methods (row compression, page compression, or the Unicode compression that comes with them), so most likely all of your documents won't be touched.
So in short: no, there is nothing SQL Server can do automatically to help you here.
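If you want to confirm this for your own table, `sp_estimate_data_compression_savings` will estimate the effect of row or page compression without actually applying it (the schema and table names below are placeholders, not from the question). For a table whose bulk is off-page `NVARCHAR(MAX)` data, the requested-compression size comes back essentially unchanged from the current size:

```sql
-- Placeholder names: estimate what PAGE compression would save.
-- When most of the table is off-page LOB data, the
-- size_with_requested_compression_setting column will be barely
-- smaller than size_with_current_compression_setting.
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'Documents',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';
```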
Possible options (a bit more manual and hacky):
If you are using 2016 or later, or Azure SQL[1][2], then you could use the `COMPRESS`/`DECOMPRESS` functions to store the data as smaller blobs in `VARBINARY` columns instead. To do this in a way that is transparent to your application(s), you could create a backing table that stores the data, replacing the real table with a view that selects from it and has `INSTEAD OF` triggers to compress the incoming data. If you already have a lot of data, this could take a long time to roll out. To remove the need for a many-hour maintenance window to compress existing content, have both `NVARCHAR(MAX)` and `VARBINARY(MAX)` columns in the backing table and have the view decide with `DocumentText = ISNULL(CAST(DECOMPRESS(CompressedText) AS NVARCHAR(MAX)), UncompressedText)` (note that `DECOMPRESS` returns `VARBINARY(MAX)`, so the result needs casting back to `NVARCHAR(MAX)`). You can then have a process slowly work through the table compressing a batch of rows at a time, keeping the batches small enough to avoid locking issues so you can let the process trundle along in production. A sketch of the whole arrangement follows below.
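A minimal sketch of that arrangement on 2016+, with all object and column names invented for illustration (your schema will differ):

```sql
-- Backing table: keeps both forms so existing rows can migrate slowly.
CREATE TABLE dbo.DocumentStore
(
    DocumentId       INT IDENTITY(1, 1) PRIMARY KEY,
    UncompressedText NVARCHAR(MAX) NULL,
    CompressedText   VARBINARY(MAX) NULL
);
GO

-- View with the old table's name and shape. DECOMPRESS returns
-- VARBINARY(MAX), hence the CAST back to NVARCHAR(MAX).
CREATE VIEW dbo.Documents
AS
SELECT DocumentId,
       DocumentText = ISNULL(CAST(DECOMPRESS(CompressedText) AS NVARCHAR(MAX)),
                             UncompressedText)
FROM   dbo.DocumentStore;
GO

-- Compress new rows on the way in. Inserts through a view with an
-- INSTEAD OF trigger must supply every non-nullable view column, so
-- callers pass a placeholder DocumentId that the trigger ignores.
CREATE TRIGGER dbo.Documents_InsteadOfInsert
ON dbo.Documents
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;
    INSERT dbo.DocumentStore (CompressedText)
    SELECT COMPRESS(DocumentText)
    FROM   inserted;
END;
GO

-- Background migration: compress existing content in small batches so
-- locks stay short and the job can trundle along in production.
WHILE 1 = 1
BEGIN
    UPDATE TOP (500) dbo.DocumentStore
    SET    CompressedText   = COMPRESS(UncompressedText),
           UncompressedText = NULL
    WHERE  UncompressedText IS NOT NULL;

    IF @@ROWCOUNT = 0 BREAK;    -- all rows migrated

    WAITFOR DELAY '00:00:01';   -- give other work a chance
END;
```

This is only a sketch: a real rollout would also need `INSTEAD OF UPDATE`/`DELETE` triggers, and you'd want to check how the placeholder key behaves with your data access layer.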
Of course, if you can change the application(s) and not just the data layer, then it would be more efficient to implement the compression of the documents there instead, because you save network transfer as well as storage and memory on the database servers.
If a lot of your documents are truly identical then you may see greater gains from deduplication than from compressing each individually, which you can do using similar techniques: store the documents in their own table with a hash of their content as the key/index, and store a reference to them in the main table, either using the backing-table-plus-view-and-trigger method or elsewhere in the application's data-manipulation layer (a sketch follows below). In 2016/Azure (or if using CLR is an option) you can compress the stored documents too for extra space saving.
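A minimal sketch of the deduplication idea, again with invented names. One caveat for 2012 specifically: before SQL Server 2016, `HASHBYTES` only accepts up to 8,000 bytes of input, so documents longer than that would need to be hashed in the application or via CLR rather than as shown here.

```sql
-- Deduplicated storage sketch (names invented for illustration).
-- Identical documents are stored once, keyed by a hash of the content.
CREATE TABLE dbo.DocumentContent
(
    ContentHash  BINARY(32) PRIMARY KEY,     -- SHA2_256 digest is 32 bytes
    DocumentText NVARCHAR(MAX) NOT NULL
);

CREATE TABLE dbo.DocumentHeader
(
    DocumentId  INT IDENTITY(1, 1) PRIMARY KEY,
    ContentHash BINARY(32) NOT NULL
        REFERENCES dbo.DocumentContent (ContentHash)
);
GO

-- Store the content once per distinct document, then reference it.
CREATE PROCEDURE dbo.AddDocument
    @DocumentText NVARCHAR(MAX)
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @hash BINARY(32) = HASHBYTES('SHA2_256', @DocumentText);

    IF NOT EXISTS (SELECT 1 FROM dbo.DocumentContent
                   WHERE ContentHash = @hash)
        INSERT dbo.DocumentContent (ContentHash, DocumentText)
        VALUES (@hash, @DocumentText);

    INSERT dbo.DocumentHeader (ContentHash)
    VALUES (@hash);
END;
```

Under concurrent inserts the `IF NOT EXISTS`/`INSERT` pair can race, so in production you'd want a retry or `MERGE`-style upsert around the content insert.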
Another option, if many of your documents are almost identical because they are constructed from templates, is to store templates and diffs, though this would likely need to be implemented in the application layer or via CLR, as I can't imagine doing string diffs and patches in T-SQL would be anywhere near efficient enough. This will only work if the documents are stored in an uncompressed form (e.g. RTF files, old Office formats, markdown text; not recent MS Office formats or PDFs), because with compressed formats everything is likely to be different after the first change.
These ideas add complexity and reduce write performance, so you'd need to judge them with those negatives in mind. I'm playing with the ideas myself for a similar situation where we have a large collection of pickled data (several tens of GB in total, a few thousand characters average per item) stored as text in a legacy application. I'll see if I can get permission to share the results when I'm done.
--
[1] The question was later updated to note that in this case SQL Server 2012 is being used. The deduplication method will still work, and I've left the compression method in the answer as it could be helpful for others in a similar situation. In fact it would be possible in SQL Server 2012 to implement the compression with a custom CLR module; http://aboutsqlserver.com/2015/04/07/compressing-lob-xml-data-in-the-database/ is the first good-looking example from a quick search, but there are many more out there.
[2] As noted by Paul in the comments below, with SQL Server 2017 LOB support is coming to compressed columnstore indexes, so that would be another avenue to research for people using a recent enough version of SQL Server.