How would you structure your entity model for storing arbitrary key/value data with different data types?

First, relational databases were not designed to store arbitrary data. The fundamentals of the relational model revolve around specifying, up front, the nature of the data that will be stored.

Second, what you are suggesting is a variant of an Entity-Attribute-Value (EAV) model. The problems with EAVs come in data integrity, reporting, performance and maintenance. They have their place, but they are analogous to drugs: used in limited quantities and narrow circumstances they can be beneficial; too much will kill you.

Writing queries against an EAV is a bear. Thus, if you are going to use an EAV, the only circumstance under which I've seen them be successful is to restrict their use such that no one is permitted to write a query that filters for a specific attribute. I.e., no one is ever permitted to write a query akin to WHERE AttributeName = 'Foo'. That means you can never filter, sort, or calculate on a specific attribute, nor place it in a specific spot on a report. The EAV data is just a bag of categorized data which can be spewed out en masse on a report, but that's it. I've even seen people implement EAVs as XML blobs.
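To make the distinction concrete, here is a sketch against a hypothetical EntityAttribute table (the table itself is sketched further down; all names here are illustrative):

    -- permissible: spew out the whole bag of attributes for one entity
    SELECT AttributeName, IntValue, StringValue, DateValue
    FROM EntityAttribute
    WHERE EntityId = @EntityId;  -- @EntityId: whichever entity the report covers

    -- forbidden: filtering on a specific attribute
    SELECT EntityId, StringValue
    FROM EntityAttribute
    WHERE AttributeName = 'Foo';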

Now, if you use the EAV in this light, and since it is just a blob of data, I would use the single-table approach. A significant benefit of the single-table approach is that you can add a check constraint that ensures you have one and only one value across the IntValue, StringValue and DateValue columns. The nulls will not cost you much, and if this is just a wad of data, they will not make any difference in performance. Furthermore, it will make your queries simpler in that you can use a simple CASE expression to return the string, integer or date value.
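A T-SQL-flavored sketch of what that might look like (table and column names are illustrative):

    -- single-table EAV; the check constraint guarantees exactly one value column is populated
    CREATE TABLE EntityAttribute (
        EntityId      INT          NOT NULL,
        AttributeName VARCHAR(100) NOT NULL,
        IntValue      INT          NULL,
        StringValue   VARCHAR(500) NULL,
        DateValue     DATETIME     NULL,
        CONSTRAINT PK_EntityAttribute PRIMARY KEY (EntityId, AttributeName),
        CONSTRAINT CK_ExactlyOneValue CHECK (
            CASE WHEN IntValue    IS NOT NULL THEN 1 ELSE 0 END
          + CASE WHEN StringValue IS NOT NULL THEN 1 ELSE 0 END
          + CASE WHEN DateValue   IS NOT NULL THEN 1 ELSE 0 END = 1
        )
    );

    -- one CASE expression returns the value regardless of its type
    SELECT EntityId,
           AttributeName,
           CASE
               WHEN IntValue  IS NOT NULL THEN CAST(IntValue AS VARCHAR(500))
               WHEN DateValue IS NOT NULL THEN CONVERT(VARCHAR(500), DateValue, 120)
               ELSE StringValue
           END AS AttributeValue
    FROM EntityAttribute;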

I can see many problems with the multi-table approach not the least of which is that there is nothing to prevent the same attribute from having multiple types of values (e.g. a row in IntegerValues and a row in StringValues). In addition, to get the data, you will always have to use three left joins which will make your queries more cumbersome to write.
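For comparison, retrieving values from the hypothetical multi-table layout would always look something like the following (again, names are illustrative), and note that nothing in the schema stops an attribute from appearing in more than one of the value tables:

    -- three left joins just to get one value per attribute
    SELECT  a.EntityId,
            a.AttributeName,
            COALESCE(CAST(i.Value AS VARCHAR(500)),
                     CONVERT(VARCHAR(500), d.Value, 120),
                     s.Value) AS AttributeValue
    FROM    Attributes a
    LEFT JOIN IntegerValues i ON i.EntityId = a.EntityId AND i.AttributeName = a.AttributeName
    LEFT JOIN StringValues  s ON s.EntityId = a.EntityId AND s.AttributeName = a.AttributeName
    LEFT JOIN DateValues    d ON d.EntityId = a.EntityId AND d.AttributeName = a.AttributeName;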

The cost of EAVs is discipline and vigilance. It requires discipline in your development team to never, ever, under any circumstances write a report or query against a specific attribute. Developers will get a lot of pressure from management to "just this one time" write something that filters for a specific attribute. Once you go down the dark path, forever will it dominate your development and maintenance. The EAVs must remain a wad of data and nothing more.

If you cannot maintain that kind of discipline in your development team, then I would not implement an EAV; I would require specifications for any new column in the interest of avoiding a maintenance nightmare later. Once users do want to filter, sort, calculate on, or put an attribute in a special place on a report, that attribute must become a first-class column. If you are able to maintain discipline on their use, EAVs can work well for letting users store whatever information they want, deferring the time when you need to get specifications on the data elements until users want to use an attribute in one of the ways mentioned earlier.


i prefer to keep keys and values all together in a single table. the database i am building right now collects data points about chinese characters in simple subject / predicate / object phrases; both subject and predicate are strings, but objects can have any type. there is some additional structural information in the table (such as the type of the predicate), but not much.

a special feature of my db structure is that the predicate is actually split into several partial keys. to see how that can be useful, let’s consider some data points for the character 人:

人 / reading / chinese / rén
人 / reading / japanese / on / jin
人 / reading / japanese / on / nin
人 / reading / japanese / kun / hito
人 / used-in / taiwan
人 / reading / prc
人 / reading / japan
人 / meaning / chinese / english / man; person; human
人 / meaning / japanese / english / man; person; human; quantifier for people
人 / form / strokecount / 2
人 / form / strokeorder / 34

each line represents one data point. the first element is the subject, the last is the object, and in between are the predicate parts. there is a fixed number of columns for the predicate parts (3 to 5 will most probably suffice; flat is better than nested); unused parts receive a NULL value. with this schema, it is easy to formulate sql statements that return all the facts about a given character, or all the japanese readings (both on and kun) for a number of given characters, or all the characters with at least 13 and at most 24 strokes, and so on (see the example queries after the table below):

subject  predicate1  predicate2  predicate3 ob_type  ob_int ob_text       ob_bool
人       reading     chinese                text            rén
人       reading     japanese    on         text            jin
人       reading     japanese    on         text            nin
人       reading     japanese    kun        text            hito
人       used-in     taiwan                 bool                          true
人       reading     prc                    bool                          true
人       reading     japan                  bool                          true
人       meaning     chinese     english    text            man; perso...
人       meaning     japanese    english    text            man; perso...
人       form        strokecount            int       2
人       form        strokeorder            int       34
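here is a minimal sketch of that table and the queries just mentioned, in sqlite-flavored sql (the names mirror the table above; this is an illustration, not my production schema):

    -- one big table of subject / predicate parts / typed object
    CREATE TABLE facts (
        subject    TEXT NOT NULL,
        predicate1 TEXT NOT NULL,
        predicate2 TEXT,
        predicate3 TEXT,
        ob_type    TEXT NOT NULL,   -- 'int', 'text' or 'bool'
        ob_int     INTEGER,
        ob_text    TEXT,
        ob_bool    BOOLEAN
    );

    -- all the facts about a given character
    SELECT * FROM facts WHERE subject = '人';

    -- all the japanese readings (both on and kun) for a set of characters
    SELECT subject, predicate3, ob_text
    FROM facts
    WHERE predicate1 = 'reading'
      AND predicate2 = 'japanese'
      AND subject IN ('人', '日', '本');

    -- all the characters with at least 13 and at most 24 strokes
    SELECT subject
    FROM facts
    WHERE predicate1 = 'form'
      AND predicate2 = 'strokecount'
      AND ob_int BETWEEN 13 AND 24;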

the beauty of this approach is that without too much thinking and upfront planning, you can soon start to feed data into the table. when new facts appear, they will fit into this very general structure most of the time; when you discover that some predicates are awkward, it is not too difficult to collect the offending records and update them to carry your new favorite wording. no more schema migration. yay!
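for example, re-wording an awkward predicate is a plain update, no schema change involved (a hypothetical renaming, to illustrate):

    -- re-word a predicate in place across all affected records
    UPDATE facts
    SET predicate2 = 'stroke-count'
    WHERE predicate1 = 'form'
      AND predicate2 = 'strokecount';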

more specifically to answer your question, i’ve thought much about whether to put the values in a separate table and whether to actually represent the predicates in yet another table.

it is perfectly possible, but for my initial version, i found it more important to keep it simple; if at some point it turns out that storing all those repetitive strings hurts storage and performance (i mean, i have strokecounts for ca. 70000 characters in my db, so that alone is on the order of (len('form') + len('strokecount')) * 7e4 == 1e6 bytes just to spell out the predicates), i believe it will be relatively easy to migrate to a more sophisticated approach. alas, that also means you have to modify your queries.
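should it ever come to that, the migration target could look something like this (a hypothetical sketch; i have not actually built this):

    -- intern the repeated predicate strings in their own table
    CREATE TABLE predicates (
        predicate_id INTEGER PRIMARY KEY,
        predicate1   TEXT NOT NULL,
        predicate2   TEXT,
        predicate3   TEXT
    );

    CREATE TABLE facts2 (
        subject      TEXT NOT NULL,
        predicate_id INTEGER NOT NULL REFERENCES predicates (predicate_id),
        ob_type      TEXT NOT NULL,
        ob_int       INTEGER,
        ob_text      TEXT,
        ob_bool      BOOLEAN
    );

    -- queries now need a join to spell out the predicate
    SELECT f.subject, f.ob_int
    FROM facts2 f
    JOIN predicates p ON p.predicate_id = f.predicate_id
    WHERE p.predicate1 = 'form'
      AND p.predicate2 = 'strokecount';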

when i hear people claim that of course you absolutely must keep those repetitive predicates and the disparate value types in separate tables, i just smile politely. databases have been optimized for decades to cope with huge amounts of data and to organize sparse tables efficiently, dammit. i mean, this entire approach goes against the very grain of how everyone tells you to do it, so why not be bold.

in the end, i believe there are three main factors to help you decide how to structure the db:

1) can you come up with reasonable SQL statements that give you the answers you expect? (if not, you still have to decide whether you’ve hit upon one of SQL’s inherent limitations, one that may or may not be solvable with a different db schema.)

2) do those queries perform well? i know from experience that ‘putting it the other way’ (in a few-MB sqlite db) can make a truly huge difference in performance, so even if you choose the one-big-table approach and get unsatisfactory query timings, it might be the schema that is at fault, but it might just as well be that another way of querying the same data gives you a tenfold speed gain.

3) scalability. scalability. scalability. this is a hard one, but maybe you know for sure that all you want to do is collect data about your personal marble collection. in that case, it’s hard to do it very wrong. if you promised to deliver data on any book ever published in your country to every desktop in the world in under one second, then it’s hard to do anything right. most real-world scenarios are somewhere in between, so scalability means asking: if this or that tool should turn out to be a performance bottleneck, will i be able to upgrade it, or, failing that, migrate to another product? i’d say the one-big-table approach is so simple that alternatives should be abundant.

ah, and maybe of interest to you: i am currently looking into redis, which is one of those NoSQLish db thingies. it looks quite interesting and easy to use. the one-big-table approach should also be quite compatible with one of those CouchDB / MongoDB / what-have-you ‘document-oriented, schema-free’ databases that have become so popular.