Are duplicated NVARCHAR values stored as copies in SQL Server?
Under "normal" conditions, no, data in VARCHAR
and NVARCHAR
columns is not de-duped (although duplicate attribute and/or element names in a single XML
value are reduced to a unique instance).
Using one of the Data Compression options is probably your best bet. Here are some things to consider:
- Unicode Compression (part of Row Compression) only works on
NVARCHAR(1 - 4000)
, notNVARCHAR(MAX)
(please vote for / support: Unicode compression NVARCHAR(MAX)). - Page Compression can work with
NVARCHAR(MAX)
, but only for in-row data. Off-row data (LOB pages) is not compressed.
Since the data won't really be changing, you should look into the Columnstore Index options (also available in Azure SQL Database):
- Columnstore indexes: Overview
- Columnstore indexes - Design guidance
Columnstore compression should be better than Page compression.
Also, you should probably avoid using the SPARSE
option, due to:
- The
SPARSE
option offers no benefit compared to Data Compression. - It is mainly intended for columns sets / wide tables (i.e. up to 30,000 columns).
- It mostly helps with fixed-length datatypes (e.g.
INT
,DATETIME
). So, prior to Data Compression being available, it was useful forCHAR
andNCHAR
, but not forVARCHAR
orNVARCHAR
as they don't take up space whenNULL
. - It only benefits columns set to
NULL
. - It slightly hurts non-
NULL
values by adding 2 bytes to each one. - You should probably have
NULL
for 50% - 60% of the rows in order to get enough savings for it to be worth using this option.
For more details on working with character data, please see the following post of mine:
How Many Bytes Per Character in SQL Server: a Completely Complete Guide
Yes duplicated data is stored as copies in SQL Server
To change this behavior, you would need to implement PAGE COMPRESSION feature - create or rebuild indexes with (data_compression = on) option
It is a great feature that helps to save space. Once you enable it, SQL Server is pointing to the same reference of the string behind the scenes
Beware that PAGE COMPRESSION is not available in every SQL Server Edition, and it can have some CPU overhead
So you might want to make a lookup table if your edition does not allow for page compression
Compression is fine and probably useful in your case, but you should normalize your tables anyway and try to achieve a structure where you're only storing one copy of the same address. This will lead to less redundancy, and lighten the primary table you're currently asking about.
Another two things to consider too are:
NVARCHAR can use twice as much space as a VARCHAR of equally used length (Solomon Rutzky's answer links a good article regarding this), so it can be more data heavy. If you can use VARCHAR instead, you can probably drop the length down to something much more reasonable for an address field and save a lot more space. Here's an article comparing the two: SQL varchar data type deep dive
You should also look into sparse columns which can save you a significant amount of space too when used correctly. Here's the Microsoft docs: Use Sparse Columns
(Important to note that a minimum amount of NULL values need to be present for sparse columns to be worth looking into.)