Implementing subtype of a subtype in type/subtype design pattern with mutually exclusive subclasses
According to Martin Fowler, there are 3 approaches to the problem of table inheritance:
- Single Table Inheritance: One table represents all types. Unused attributes are NULLed.
- Concrete Table Inheritance: One table per concrete type; each table has columns for every attribute of that type. No relation between tables.
- Class Table Inheritance: One table per type, each table has attributes only for new, not-inherited attributes. Tables are related reflecting the actual type inheritance hierarchy.
Use these as a starting point to research the pros and cons of each approach. The gist of it is that all approaches have major disadvantages, and none has any overwhelming advantage. Better known as the object-relational impedance mismatch, this problem has yet to find a solution.
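As a rough sketch of what the three mappings look like on the database side (using a hypothetical Person / Employee pair of types, not anything from the question itself):

-- 1) Single Table Inheritance: one table for all types; attributes that do not
--    apply to a given type are left NULL.
CREATE TABLE Person_Single
(
    PersonId   INT           NOT NULL PRIMARY KEY,
    PersonType CHAR(1)       NOT NULL,   -- discriminator
    Name       VARCHAR(50)   NOT NULL,
    Salary     DECIMAL(10,2) NULL        -- only meaningful for the 'Employee' type
);

-- 2) Concrete Table Inheritance: one table per concrete type; inherited
--    attributes are repeated and there is no relation between the tables.
CREATE TABLE Employee_Concrete
(
    EmployeeId INT           NOT NULL PRIMARY KEY,
    Name       VARCHAR(50)   NOT NULL,   -- repeated "inherited" attribute
    Salary     DECIMAL(10,2) NOT NULL
);

-- 3) Class Table Inheritance: one table per type; the sub-type table holds only
--    the new attributes and references the base table, mirroring the hierarchy.
CREATE TABLE Person_Class
(
    PersonId INT         NOT NULL PRIMARY KEY,
    Name     VARCHAR(50) NOT NULL
);
CREATE TABLE Employee_Class
(
    PersonId INT           NOT NULL PRIMARY KEY REFERENCES Person_Class (PersonId),
    Salary   DECIMAL(10,2) NOT NULL
);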
Personally I find that the type of problems a bad relational design can lead to are orders of magnitude more serious than the kind of problems arising from a bad type design. Bad database design leads to slow queries, update anomalies, data size explosion, deadlocks and unresponsive apps, and tens to hundreds of gigabytes of data sunk in the wrong format. Bad type design leads to code that is difficult to maintain and update, but has no runtime impact. Therefore, in my book, correct relational design trumps any OO type purity over and over and over.
Insofar as this Question is a continuation of Is my implementation of type/subtype design pattern (for mutually exclusive subclasses) correct?, which is itself a continuation of Don't know how to transform variable entity into relational table, I would ask: what exactly are you trying to optimize? Storage? The object model? Query complexity? Query performance? There are trade-offs when optimizing one aspect vs another since you can't optimize all aspects at the same time.
I completely agree with Remus's points regarding:
- There are pros and cons to each approach (i.e. the ever-present "it depends" factor), and
- The first priority is the efficiency of the data model (an inefficient data model cannot be corrected for by clean and/or efficient app code)
That said, the choice you face is between the following, arranged in order of least normalization to most normalization:
- promoting property E to the base-type Table
- keeping it in multiple sub-type Tables
- fully normalizing E out to a new, intermediary sub-class table on the same level as C, that A and B will directly be sub-classes of (@MDCCL's answer)
Let's look at each option:
Move property E to base-type Table
PROs
- Reduced query complexity for queries that need E but not X, Y, or Z.
- Potentially more efficient for queries that need E but not X, Y, or Z (especially aggregate queries) due to no JOIN.
- Potential to create an index on (D, E) (and, if so, potentially a Filtered Index on (D, E) where EntityType <> C, if such a condition is allowed); both are sketched just after this list.
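As a rough T-SQL sketch of those two indexes, assuming (hypothetically) that E has been moved into a base-type table named Foo whose discriminator column is EntityType, with 'C' as the value identifying entity type C:

-- Plain composite index on (D, E):
CREATE NONCLUSTERED INDEX IX_Foo_D_E
    ON Foo (D, E);

-- Filtered Index variant (SQL Server 2008 and later), skipping the rows that never use E:
CREATE NONCLUSTERED INDEX IX_Foo_D_E_NotTypeC
    ON Foo (D, E)
    WHERE EntityType <> 'C';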
CONs
- Cannot mark E as NOT NULL
- Need an extra CHECK CONSTRAINT on the base-type table to ensure that E IS NULL when EntityType = C (though this is not a huge issue); a sketch of such a constraint follows this list.
- Need to educate users of the data model as to why E must be NULL, and should even be ignored entirely, when EntityType = C.
- Slightly less efficient when E is a fixed-length type, a large portion of the rows are for EntityType of C (i.e. not using E, hence it is NULL), and you are not using either the SPARSE option on the column or Data Compression on the Clustered Index.
- Potentially less efficient for queries that do not need E, since the presence of E in the base-type table will increase the size of each row, which in turn decreases the number of rows that can fit on a data page. But this is highly dependent on the exact datatype of E, the FILLFACTOR, how many rows are in the base-type table, etc.
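A minimal sketch of that CHECK CONSTRAINT, under the same hypothetical names (base-type table Foo, discriminator column EntityType, 'C' identifying entity type C):

ALTER TABLE Foo
    ADD CONSTRAINT CK_Foo_E_OnlyForNonC
        CHECK (EntityType <> 'C' OR E IS NULL);  -- E must be NULL whenever the row is a C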
Keep property E in each sub-type Table
PROs
- Cleaner data model (i.e. don't have to worry about educating others as to why column E in the base-type table shouldn't be used because "it really isn't there")
- Probably more closely resembles the object-model
- Can mark the column as NOT NULL if this is a required property of the entity
- No need for an extra CHECK CONSTRAINT on the base-type table to ensure that E IS NULL when EntityType = C (though this is not a huge gain)
CONs
- Requires a JOIN to the sub-type Table(s) to get this property
- Potentially slightly less efficient when needing E, due to the JOIN, depending on how many rows of A + B you have as opposed to how many rows of C there are.
- Slightly more difficult / complex for operations that deal solely with entities A and B (and not C) as being the same "type". Of course, you could abstract this via a View that does a UNION ALL between a SELECT of the JOINed tables for A and another SELECT of the JOINed tables for B (a sketch of such a View follows this list). That will reduce the complexity of SELECT queries, but is not so helpful for INSERT and UPDATE queries.
- Depending on the specific queries and how frequently they are executed, this could be a potential inefficiency in cases where having an index on (D, E) would really help one or more frequently used queries, since D and E cannot be indexed together.
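A possible sketch of that View, assuming (hypothetically) sub-type tables A and B that each carry their own E column and join directly to a base-type table Foo holding D:

CREATE VIEW FooAB
AS
SELECT  Foo.FooId, Foo.D, A.E, 'A' AS EntityType
FROM    Foo
        INNER JOIN A ON A.AId = Foo.FooId
UNION ALL
SELECT  Foo.FooId, Foo.D, B.E, 'B' AS EntityType
FROM    Foo
        INNER JOIN B ON B.BId = Foo.FooId;

SELECT queries can then treat A and B as one "type" by reading FROM FooAB, though, as noted above, INSERT and UPDATE statements still have to target the individual tables.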
Normalize E out to intermediary Table between base-class and A & B
(Please note that I do like @MDCCL's answer as a viable alternative, depending on circumstances. The following is not meant as a strict criticism of that approach, but as a means of adding some perspective -- mine, of course -- by evaluating it in the same context as the two options I had already proposed. This will make it easier to clarify what I see as the relative difference between full normalization and the current approach of partial normalization.)
PROs
- data model is fully normalized (there can't be anything inherently wrong with this, given that it is what RDBMSs are designed to do)
- reduced query complexity for queries needing A and B, but not C (i.e. no need for two queries joined via UNION ALL)
CONs
- slightly more space taken up (the Bar table duplicates the ID, and there is a new column, BarTypeCode) [negligible, but something to be aware of]
- slight increase in query complexity as an additional JOIN is needed to get to either A or B
- increased surface area for locking, mostly on INSERT (DELETE can be handled implicitly via marking Foreign Keys as ON DELETE CASCADE) since the Transaction will be held open slightly longer on the base-class table (i.e. Foo) [negligible, but something to be aware of]
- no direct knowledge of actual type -- A or B -- within the base-class Table, Foo; it only knows of type Bar, which can be either A or B:
Meaning, if you need to do queries over the general base info but need to either categorize by the entity type or filter out one or more entity types, then the base-class table doesn't have enough information, in which case you need to LEFT JOIN the Bar table (a sketch of that query follows this list). This will also reduce the effectiveness of indexing the FooTypeCode column.
- no consistent approach to interacting with A & B vs C:
Meaning, if each entity relates directly to the base-class table such that there is only ever that one JOIN to get the full entity, then everyone can more quickly and easily build up familiarity in terms of working with the data model. There will be a common approach to queries / Stored Procedures which makes them quicker to develop and less likely to have bugs. A consistent approach also makes it quicker and easier to add new sub-types in the future.
- potentially less adaptable to business rules that change over time:
Meaning, things always change, and it is fairly easy to move E up to the base-class Table if it becomes common to all sub-types. It is also easy enough to move a common property out to the sub-types if changes in the nature of the entities make that a worth-while change. It is easy enough to either break a sub-type into two sub-types (just create another SubTypeID value) or to combine two or more sub-types into one. Conversely, what if E later on became a common property of all sub-types? Then the intermediary layer of the Bar table would be meaningless, and the added complexity would not be worth it. Of course, it is impossible to know if such a change would happen in 5 or even 10 years, so the Bar table is not necessarily, nor even highly likely to be, a bad idea (which is why I said "potentially less adaptable"). These are just points to consider; it's a gamble in either direction.
- potentially inappropriate grouping:
Meaning, just because the E property is shared between entity types A and B does not mean that A and B should be grouped together. Just because things "look" the same (i.e. same properties) does not mean that they are the same.
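A sketch of that LEFT JOIN, borrowing the Foo / Bar names from the DDL further down in this thread (and assuming 'C' is the FooTypeCode value used for entity type C):

SELECT  Foo.FooId,
        Foo.D,
        CASE
            WHEN Foo.FooTypeCode = 'C' THEN 'C'
            ELSE Bar.BarTypeCode             -- 'A' or 'B', only knowable via the extra JOIN
        END AS EntityType
FROM    Foo
        LEFT JOIN Bar ON Bar.BarId = Foo.FooId;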
Summary
Just like deciding if/when to denormalize, how to best approach this particular situation depends on considering the following aspects of the usage of the data model and making sure that the benefits outweigh the costs:
- how many rows you will have for each EntityType (look at least 5 years down the road, assuming above-average growth)
- how many GB each of these tables (base-type and sub-types) will be in 5 years
- what specific datatype property E is
- whether it is only one property or a few, or even several, properties
- what queries you will need that require E, and how often they will be executed
- what queries you will need that do not need E, and how often they will be executed
I think I tend to default to keeping E in the separate sub-type tables because it is, at the very least, "cleaner". I would consider moving E to the base-type table IF: most of the rows were not for EntityType of C; and the number of rows was at least in the millions; and I more-often-than-not executed queries that needed E and/or the queries that would benefit from an index on (D, E) either execute very frequently and/or require enough system resources such that having the index reduces overall resource utilization, or at least prevents surges in resource consumption that go above acceptable levels or last long enough to cause excessive blocking and/or increases in deadlocks.
UPDATE
O.P. commented on this answer that:
My employers changed the business logic, removing E altogether!
This change is particularly important because it is exactly what I predicted might happen in the "CONs" subsection of the "Normalize E out to intermediary Table between base-class and A & B" section above (6th bullet point). The specific issue is how easy / difficult it is to refactor the data model when such changes happen (and they always do). Some will argue that any data model can be refactored / changed, so start with the ideal. But while it is true on a technical level that anything can be refactored, the reality of the situation is a matter of scale.
Resources are not infinite, not just CPU / Disk / RAM, but also development resources: time and money. Businesses are constantly setting priorities on projects because those resources are very limited. And quite often (at least in my experience), projects to gain efficiency (both system performance and faster development / fewer bugs) are prioritized below projects that increase functionality. While it is frustrating for us technical folks because we understand what the long-term benefits of refactoring projects are, it is just the nature of business that the less-technical, business folks have an easier time seeing the direct relationship between new functionality and new revenue. What this boils down to is: "we will come back to fix that later" == "that problem will probably be there for the next 5 - 10 years because we will nearly always have more important things to work on (ironically, such as the support cases that keep coming up because we have not fixed it yet)".
With that in mind, if the size of the data is small enough such that changes can be made very quickly, and/or you have a maintenance window that is long enough to not only make the changes but to also roll back if something goes wrong, then normalizing E out to an intermediary Table between the base-class table and the A & B sub-class tables could work (though that still leaves you with no direct knowledge of the specific type (A or B) in the base-class table). BUT, if you have hundreds of millions of rows in these tables, and an incredible amount of code referencing the tables (code that has to be tested when changes are made), then it usually pays to be more pragmatic than idealistic. And this is the environment that I had to deal with for years: 987 million rows & 615 GB in the base-class table, spread across 18 servers. And so much code hit these tables (base-class and sub-class tables) that there was a lot of resistance -- mostly from management but sometimes from the rest of the team -- to making any changes due to the amount of development and QA resources that would need to be allocated.
So, once again, the "best" approach can only be determined situation-by-situation: you need to know your system (i.e. how much data and how the tables and code all relate), how to accomplish the refactoring, and the people that you work with (your team and possibly management -- can you get their buy-in for such a project?). There are some changes that I had been mentioning and planning for 1 - 2 years, and took multiple sprints / releases to get maybe 85% of them implemented. But if you only have < 1 million rows and not a lot of code tied to these tables, then you are probably able to start out on the more ideal / "pure" side of things.
Just remember, whichever way you choose to go, pay attention to how that model works over the next 2 years at least (if possible). Pay attention to what worked and what caused pain, even if it seemed like the greatest idea at the time (which means you also need to allow yourself to accept screwing up -- we all do -- so that you can honestly evaluate pain-points). And pay attention to why certain decisions worked or didn't so that you can make decisions that are more likely to be "better" next time :-).
According to my interpretation of your specifications, you want to find a method to implement two different (but connected) supertype-subtype structures.
- Dr. E. F. Codd —the originator of the Relational Paradigm— introduces and discusses relationships of this kind in his 1979 paper entitled Extending the Database Relational Model to Capture More Meaning.
In order to expose an approach to achieve the aforementioned task, I am going to add to the scenario at issue the two classic hypothetical entity types called Foo and Bar, which I will detail below.
Business rules
Here are a few statements that will help me to create a logical model:
A Foo is either one Bar or one C
A Foo is categorized by one FooType
A Bar is either one A or one B
A Bar is classified by one BarType
Logical model
And then, the resulting IDEF1X[1] logical model is shown in Figure 1 (and you can download it from Dropbox as a PDF, as well):
The Foo and Bar addition
I did not add Foo and Bar to make the model look better, but to make it more expressive. I deem they are important due to the following:
- As A and B share the attribute named E, this feature suggests that they are subentity types of a distinct (but related) sort of concept, event, person, measurement, etc., which I represented by means of the Bar superentity type that, in turn, is a subentity type of Foo, which holds the D attribute at the top of the hierarchy.
- Since C only shares one attribute with the rest of the entity types under discussion, i.e., D, this aspect insinuates that it is a subentity type of another kind of concept, event, person, measurement, etc., so I depicted this circumstance by virtue of the Foo superentity type.
However, these are just assumptions, and since a relational database is meant to reflect the semantics of a certain business context accurately, you have to identify and classify all the things of interest in your specific domain so that you can, precisely, capture more meaning.
Important factors at the design phase
It is quite useful to be aware of the fact that, putting all the terminology aside, an exclusive supertype-subtype cluster is an ordinary relationship. Let us describe the situation in the following way:
- Each exclusive superentity type occurrence is related to only one subentity type complement.
Thus, there is a correspondence (or cardinality) of one-to-one (1:1) in these cases.
As you know from your preceding posts, the discriminator attribute (column, when implemented) plays a paramount role when creating an association of this nature, because it indicates the correct subtype instance with which the supertype is connected. The migration of the PRIMARY KEY from (i) the supertype to (ii) the subtypes is also of prime significance.
Concrete DDL structure
And then I wrote a DDL structure that is based on the logical model presented above:
CREATE TABLE FooType -- Look-up table.
(
FooTypeCode CHAR(2) NOT NULL,
Description CHAR(90) NOT NULL,
CreatedDateTime DATETIME NOT NULL,
CONSTRAINT PK_FooType PRIMARY KEY (FooTypeCode),
CONSTRAINT AK_FooType_Description UNIQUE (Description)
);
CREATE TABLE Foo -- Supertype
(
FooId INT NOT NULL, -- This PK migrates (1) to ‘Bar’ as ‘BarId’, (2) to ‘A’ as ‘AId’, (3) to ‘B’ as ‘BId’, and (4) to ‘C’ as ‘CId’.
FooTypeCode CHAR(2) NOT NULL, -- Discriminator column.
D INT NOT NULL, -- Column that applies to ‘Bar’ (and therefore to ‘A’ and ‘B’) and ‘C’.
CreatedDateTime DATETIME NOT NULL,
CONSTRAINT PK_Foo PRIMARY KEY (FooId),
CONSTRAINT FK_from_Foo_to_FooType FOREIGN KEY (FooTypeCode)
REFERENCES FooType (FooTypeCode)
);
CREATE TABLE BarType -- Look-up table.
(
BarTypeCode CHAR(1) NOT NULL,
Description CHAR(90) NOT NULL,
CONSTRAINT PK_BarType PRIMARY KEY (BarTypeCode),
CONSTRAINT AK_BarType_Description UNIQUE (Description)
);
CREATE TABLE Bar -- Subtype of ‘Foo’.
(
BarId INT NOT NULL, -- PK and FK.
BarTypeCode CHAR(1) NOT NULL, -- Discriminator column.
E INT NOT NULL, -- Column that applies to ‘A’ and ‘B’.
CONSTRAINT PK_Bar PRIMARY KEY (BarId),
CONSTRAINT FK_from_Bar_to_Foo FOREIGN KEY (BarId)
REFERENCES Foo (FooId),
CONSTRAINT FK_from_Bar_to_BarType FOREIGN KEY (BarTypeCode)
REFERENCES BarType (BarTypeCode)
);
CREATE TABLE A -- Subtype of ‘Bar’.
(
AId INT NOT NULL, -- PK and FK.
X INT NOT NULL, -- Particular column.
CONSTRAINT PK_A PRIMARY KEY (AId),
CONSTRAINT FK_from_A_to_Bar FOREIGN KEY (AId)
REFERENCES Bar (BarId)
);
CREATE TABLE B -- Subtype of ‘Bar’.
(
BId INT NOT NULL, -- PK and FK.
Y INT NOT NULL, -- Particular column.
CONSTRAINT PK_B PRIMARY KEY (BId),
CONSTRAINT FK_from_B_to_Bar FOREIGN KEY (BId)
REFERENCES Bar (BarId)
);
CREATE TABLE C -- Subtype of ‘Foo’.
(
CId INT NOT NULL, -- PK and FK.
Z INT NOT NULL, -- Particular column.
CONSTRAINT PK_C PRIMARY KEY (CId),
CONSTRAINT FK_from_C_to_Foo FOREIGN KEY (CId)
REFERENCES Foo (FooId)
);
With this structure you avoid the storage of NULL marks in your base tables (or relations), which would introduce ambiguity to your database.
Integrity, consistency and other considerations
Once you are implementing your database, you must ensure that (a) each exclusive supertype row is always complemented by its corresponding subtype counterpart and, in turn, guarantee that (b) such a subtype row is compatible with the value contained in the supertype discriminator column. Therefore, it is quite convenient to employ ACID TRANSACTIONS in order to make sure that these conditions are met in your database.
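For instance, a minimal sketch (with hypothetical key and discriminator values, and assuming the corresponding FooType and BarType look-up rows already exist) of inserting a complete A entity so that the Foo, Bar and A rows either all appear together or not at all:

BEGIN TRANSACTION;

    INSERT INTO Foo (FooId, FooTypeCode, D, CreatedDateTime)
    VALUES (1, 'BR', 10, CURRENT_TIMESTAMP);  -- 'BR' assumed to be the FooTypeCode for Bar.

    INSERT INTO Bar (BarId, BarTypeCode, E)
    VALUES (1, 'A', 20);                      -- 'A' assumed to be the BarTypeCode for A.

    INSERT INTO A (AId, X)
    VALUES (1, 30);

COMMIT TRANSACTION;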
You should not give up the logical soundness, self-expressivity and accuracy of your database, these are aspects that decidedly make your database more solid.
The two previously posted answers already include pertinent points that are certainly worth taking into account when designing, creating and managing your database and its application program(s).
Retrieving data by way of VIEW definitions
You can set up some views that combine columns of the different supertype-subtype groups, so that you can retrieve the data at hand without, e.g., writing the necessary JOIN clauses every time. In this way, you can SELECT directly FROM the VIEW (a derived relation or table) of interest with ease.
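By way of illustration, one possible (hypothetical) VIEW that assembles the complete A entities from the tables defined above:

CREATE VIEW AComplete
AS
SELECT  Foo.FooId,
        Foo.D,
        Bar.E,
        A.X,
        Foo.CreatedDateTime
FROM    Foo
        INNER JOIN Bar ON Bar.BarId = Foo.FooId
        INNER JOIN A   ON A.AId     = Bar.BarId;

Retrieving a full A entity is then simply a matter of querying AComplete, with the JOINs written once and kept in a single place.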
As you can see, “Ted” Codd was, undoubtedly, a genius. The tools he bequeathed are quite strong and elegant, and, of course, are well integrated with each other.
Related resources
If you want to analyze some extensive database which involves supertype-subtype relationships, you would find of value the extraordinary answers proposed by @PerformanceDBA to the following Stack Overflow questions:
Historical / auditable database.
[O]ne table or many for many different but interacting events?, which comprises nested subtypes.
Note
1. Integration Definition for Information Modeling (IDEF1X) is a highly recommendable data modeling technique that was established as a standard in December 1993 by the United States National Institute of Standards and Technology (NIST). It is solidly based on (a) the early theoretical material authored by Dr. E. F. Codd; on (b) the Entity-Relationship view of data, developed by Dr. P. P. Chen; and also on (c) the Logical Database Design Technique, created by Robert G. Brown. It is worth noting that IDEF1X was formalized by way of first-order logic.