When can I save JSON or XML data in an SQL Table

I'll wave my magic wand. Poof! Golden Rules on use of JSON:

  • If MySQL does not need to look inside the JSON, and the application simply needs a collection of stuff, then JSON is fine, possibly even better.

  • If you will be searching on data that is inside and you have MariaDB 10.0.1 or MySQL 5.7 (with a JSON datatype and functions), then JSON might be practical. MariaDB 5.3's "Dynamic" columns is a variant on this.

  • If you are doing "Entity-Attribute-Value" stuff, then JSON is not good, but it is the least of several evils. http://mysql.rjweb.org/doc.php/eav

  • For searching by an indexed column, not having the value buried inside JSON is a big plus.

  • For searching by a range on an indexed column, or a FULLTEXT search or SPATIAL, JSON is not possible.

  • For WHERE a=1 AND b=2 the "composite" index INDEX(a,b) is great; probably can't come close with JSON.

  • JSON works well with "sparse" data; INDEXing works, but not as well, with such. (I am referring to values that are 'missing' or NULL for many of the rows.)

  • JSON can give you "arrays" and "trees" without resorting to extra table(s). But dig into such arrays/trees only in the app, not in SQL.

  • JSON is worlds better than XML. (My opinion)

  • If you do not want to get into the JSON string except from the app, then I recommend compressing (in the client) it an storing into a BLOB. Think of it like a .jpg -- there's stuff in there, but SQL does not care.

State your application; maybe we can be more specific.


The main questions are

  • What are you going to do with this data? and
  • How are you filtering/sorting/joining/manipulating this data?

JSON (like XML) is great for data exchange, small storage and generically defined structures, but it cannot participate in typical actions you run within your RDBMS. In most cases it will be better to transfer your JSON data into normal tables and re-create the JSON when you need it.

XML / JSON and 1.NF

The first rule of normalisation dictates, never to store more than one bit of information into one column. You see a column "PersonName" with a value like "Mickey Mouse"? You point to this and cry: Change that immediately!

What about XML or JSON? Are these types breaking 1.NF? Well, yes and no... 

It is perfectly okay to store a complete structure as one bit of information if it is one bit of information actually. You get a SOAP response and want to store it because you might need this for future reference (but you will not use this data for your own processes)? Just store it as is!

Now imagine a complex structure (XML or JSON) representing a person (with its address, further details...). Now you put this into one column as PersonInCharge. Is this wrong? Shouldn't this rather live in properly designed related tables with a foreign key reference instead of the XML/JSON? Especially if the same person might occur in many different rows it is definitely wrong to use an XML/JSON approach.

But now imagine the need to store historical data. You want to persist the person's data for a given moment in time. Some days later the person tells you a new address? No problem! The old address lives in an XML/JSON if you ever need it...

Conclusion: If you store the data just to keep it, it's okay. If this data is a unique portion, it's okay...
But if you need the internal parts regularly or if this would mean redundant duplicate storage it's not okay...

Physical storage

The following is for SQL Server and might be different on other RDBMs.

XML is not stored as the text you see, but as a hierarchy tree. Querying this is astonishingly well performing! This structure is not parsed on string level!
JSON in SQL Server (2016+) lives in a string and must be parsed. There is no real native JSON type (like there is a native XML type). This might come later, but for now I'd assume, that JSON will not be as performant as XML on SQL Server (see section UPDATE 2). Any need to read a value out of JSON will need a hell of lot of hidden string method calls...

What does this mean for you?

your lovable DB artist :-D knows, that storing JSON as is, is against common principles of RDBMs. He knows,

  • that a JSON is quite probably breaking 1.NF
  • that a JSON might change in time (same column, differing content).
  • that a JSON is not easy to read, and it is very hard to filter/search/join or sort by it.
  • that such operations will shift quite some extra load onto poor little DB server

There are some workarounds (depending on the RDBMS you are using), but most of them don't work the way you'd like it...

The answer to your question in short

YES

  • If you do not want to use data, which is stored within your JSON for expensive operations (filter/join/sort).
    You can store this just as any other exists only content. We are storing many pictures as BLOBs, but we would not try to filter for all images with a flower...
  • If you do not bother at all what's inside (just store it and read it as one bit of information)
  • If the structures are variable, which would make it harder to create physical tables then to work with JSON data.
  • If the structure is deeply nested, that the storage in physical tables is to much overhead

NO

  • If you want to use the internal data like you'd use a relational table's data (filter, indexes, joins...)
  • If you would store duplicates (create redundancy)
  • In general: If you face performance problems (for sure you will face them in many typical scenarios!)

You might start with the JSON within a string column or as BLOB and change this to physical tables when you need it. My magic crystal ball tells me, this might be tomorrow :-D

UPDATE

Find some ideas about performance and disc space here: https://stackoverflow.com/a/47408528/5089204

UPDATE 2: More about performance...

The following addresses JSON and XML support in SQL-Server 2016

User @mike123 pointed to an article on an official microsoft blog which seems to proof in an experiment, that querying a JSON is 10 x faster then querying an XML in SQL-Server.

Some thoughts about that:

Some cross-checks with the "experiment":

  • the "experiment" measures a lot, but not the performance of XML vs. JSON. Doing the same action agaist the same (unchanged) string repeatedly is not a realistic scenario
  • The tested examples are far to simple for a general statement!
  • The value read is always the same and not even used. The optimizer will see this...
  • Not a single word about the mighty XQuery support! Find a product with a given ID within an array? JSON needs to read the whole lot and use a filter afterwards using WHERE, while XML would allow an internal XQuery predicate. Not to speak about FLWOR...
  • the "experiments" code as is on my system brings up: JSON seems to be 3x faster (but not 10x).
  • Adding /text() to the XPath reduces this to less than 2x. In the related article user "Mister Magoo" pointed this out already, but the click-bait title is still unchanged...
  • With such an easy JSON as given in the "experiment" the fastest pure T-SQL approach was a combination of SUBSTRING and CHARINDEX :-D

The following code will show a more realistic experiment

  • Using a JSON and an identical XML with more than one Product (a JSON array vs. sibling nodes)
  • JSON and XML are slightly changing (10000 running numbers) and inserted into tables.
  • There is an initial call agaist both tables to avoid first-call-bias
  • All 10000 entries are read and the values retrieved are inserted to another table.
  • Using GO 10 will run through this block ten times to avoid first-call-bias

The final result shows clearly, that JSON is slower than XML (not that much, about 1.5x on a still very simple example).

The final statement:

  • With an overly simplified example under undue circumstances JSON can be faster than XML
  • Dealing with JSON is pure string action, while XML is parsed and transformed. This is rather expensive in the first action, but will speed up everything, once this is done.
  • JSON might be better in a one-time action (avoids the overhead of creating an internal hierarchical representation of an XML)
  • With a still very simple but more realistic example XML will be faster in simple reading
  • Whenever there is any need to read a specific element out of an array, to filter all entries where a given ProductID is included in the array, or to navigate up and down the path, JSON cannot hold up. It must be parsed out of a string completely - each time you have to grab into it...

The test code

USE master;
GO
--create a clean database
CREATE DATABASE TestJsonXml;
GO
USE TestJsonXml;
GO
--create tables
CREATE TABLE TestTbl1(ID INT IDENTITY,SomeXml XML);
CREATE TABLE TestTbl2(ID INT IDENTITY,SomeJson NVARCHAR(MAX));
CREATE TABLE Target1(SomeString NVARCHAR(MAX));
CREATE TABLE Target2(SomeString NVARCHAR(MAX));
CREATE TABLE Times(Test VARCHAR(10),Diff INT)
GO
--insert 10000 XMLs into TestTbl1
WITH Tally AS(SELECT TOP 10000 ROW_NUMBER() OVER(ORDER BY (SELECT NULL))*2 AS Nmbr FROM master..spt_values AS v1 CROSS APPLY master..spt_values AS v2)
INSERT INTO TestTbl1(SomeXml)
SELECT 
N'<Root>
    <Products>
    <ProductDescription>
        <Features>
            <Maintenance>' + CAST(Nmbr AS NVARCHAR(10)) + ' year parts and labor extended maintenance is available</Maintenance>
            <Warranty>1 year parts and labor</Warranty>
        </Features>
        <ProductID>' + CAST(Nmbr AS NVARCHAR(10)) + '</ProductID>
        <ProductName>Road Bike</ProductName>
    </ProductDescription>
    <ProductDescription>
        <Features>
            <Maintenance>' + CAST(Nmbr + 1 AS NVARCHAR(10)) + ' blah</Maintenance>
            <Warranty>1 year parts and labor</Warranty>
        </Features>
        <ProductID>' + CAST(Nmbr + 1 AS NVARCHAR(10)) + '</ProductID>
        <ProductName>Cross Bike</ProductName>
    </ProductDescription>
    </Products>
</Root>'
FROM Tally;

--insert 10000 JSONs into TestTbl2
WITH Tally AS(SELECT TOP 10000 ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nmbr FROM master..spt_values AS v1 CROSS APPLY master..spt_values AS v2)
INSERT INTO TestTbl2(SomeJson)
SELECT 
N'{
    "Root": {
        "Products": {
            "ProductDescription": [
                {
                    "Features": {
                        "Maintenance": "' + CAST(Nmbr AS NVARCHAR(10)) + ' year parts and labor extended maintenance is available",
                        "Warranty": "1 year parts and labor"
                    },
                    "ProductID": "' + CAST(Nmbr AS NVARCHAR(10)) + '",
                    "ProductName": "Road Bike"
                },
                {
                    "Features": {
                        "Maintenance": "' + CAST(Nmbr + 1 AS NVARCHAR(10)) + ' blah",
                        "Warranty": "1 year parts and labor"
                    },
                    "ProductID": "' + CAST(Nmbr + 1 AS NVARCHAR(10)) + '",
                    "ProductName": "Cross Bike"
                }
            ]
        }
    }
}'
FROM Tally;
GO

--Do some initial action to avoid first-call-bias
INSERT INTO Target1(SomeString)
SELECT SomeXml.value('(/Root/Products/ProductDescription/Features/Maintenance/text())[1]', 'nvarchar(4000)')
FROM TestTbl1;
INSERT INTO Target2(SomeString)
SELECT JSON_VALUE(SomeJson, N'$.Root.Products.ProductDescription[0].Features.Maintenance')
FROM TestTbl2;
GO

--Start the test
DECLARE @StartDt DATETIME2(7), @EndXml DATETIME2(7), @EndJson DATETIME2(7);

--Read all ProductNames of the second product and insert them to Target1
SET @StartDt = SYSDATETIME();
INSERT INTO Target1(SomeString)
SELECT SomeXml.value('(/Root/Products/ProductDescription/ProductName/text())[2]', 'nvarchar(4000)')
FROM TestTbl1
ORDER BY NEWID();
--remember the time spent
INSERT INTO Times(Test,Diff)
SELECT 'xml',DATEDIFF(millisecond,@StartDt,SYSDATETIME());

--Same with JSON into Target2
SET @StartDt = SYSDATETIME();
INSERT INTO Target2(SomeString)
SELECT JSON_VALUE(SomeJson, N'$.Root.Products.ProductDescription[1].ProductName')
FROM TestTbl2
ORDER BY NEWID();
--remember the time spent
INSERT INTO Times(Test,Diff)
SELECT 'json',DATEDIFF(millisecond,@StartDt,SYSDATETIME());

GO 10 --do the block above 10 times

--Show the result
SELECT Test,SUM(Diff) AS SumTime, COUNT(Diff) AS CountTime
FROM Times
GROUP BY Test;
GO
--clean up
USE master;
GO
DROP DATABASE TestJsonXml;
GO

The result (SQL Server 2016 Express on an Acer Aspire v17 Nitro Intel i7, 8GB Ram)

Test    SumTime 
------------------
json    2706    
xml     1604    

If it were "absolutely wrong", then most databases would not support it. Okay, most databases support commas in the FROM clause and I view that as "absolutely wrong". But support for JSON is new development, not a backward-compatible "feature".

One obvious case is when the JSON struct is simply a BLOB that is passed back to the application. Then there is no debate -- other then the overhead of storing JSON, which is unnecessarily verbose for structured data with common fields in every record.

Another case is the "sparse" columns case. You have rows with many possible columns, but these vary from row to row.

Another case is when you want to store "nested" records in a record. JSON is powerful.

If the JSON has common fields across records that you want to query on, then you are usually better off putting these in proper database columns. However, data is complicated and there is a place for formats such as JSON.