Implementing versioning system with MySQL
Analyzing the scenario —which presents characteristics associated with the subject known as temporal databases— from a conceptual perspective, one can determine that: (a) a “present” Blog Story Version and (b) a “past” Blog Story Version, although very resembling, are entities of different types.
In addition to that, when working at the logical level of abstraction, facts (represented by rows) of distinct kinds must be retained in distinct tables. In the case under consideration, even when quite similar, (i) facts about “present” Versions are different from (ii) facts about “past” Versions.
Therefore I recommend managing the situation by means of two tables:
one dedicated exclusively for the “current” or “present” Versions of the Blog Stories, and
one that is separate, but also linked with the other, for all the “previous” or “past” Versions;
each with (1) a slightly distinct number of columns and (2) a different group of constraints.
Back to the conceptual layer, I consider that —in your business environment— Author and Editor are notions that can be delineated as Roles that can be played by a User, and these important aspects depend on data derivation (via logical-level manipulation operations) and interpretation (carried out by the Blog Stories readers and writers, at the external level of the computerized information system, with the assistance of one or more application programs).
I will detail all these factors and other relevant points as follows.
Business rules
According to my understanding of your requirements, the following business rules formulations (put together in terms of the relevant entity types and their kinds of interrelationships) are specially helpful in establishing the corresponding conceptual schema:
- A User writes zero-one-or-many BlogStories
- A BlogStory holds zero-one-or-many BlogStoryVersions
- A User wrote zero-one-or-many BlogStoryVersions
Expository IDEF1X diagram
Consequently, in order to expound my suggestion by virtue of a graphical device, I have created a sample IDEF1Xa diagram that is derived from the business rules formulated above and other features that seem pertinent. It is shown in Figure 1:
Why are BlogStory and BlogStoryVersion conceptualized as two different entity types?
Because:
A BlogStoryVersion instance (i.e., a “past” one) always holds a value for an UpdatedDateTime property, while a BlogStory occurrence (i.e., a “present” one) never holds it.
Besides, the entities of those types are uniquely identified by the values of two distinct sets of properties: BlogStoryNumber (in the case of the BlogStory occurrences), and BlogStoryNumber plus CreatedDateTime (in the case of the BlogStoryVersion instances).
a Integration Definition for Information Modeling (IDEF1X) is a highly recommendable data modeling technique that was established as a standard in December 1993 by the United States National Institute of Standards and Technology (NIST). It is based on the early theoretical material authored by the sole originator of the relational model, i.e., Dr. E. F. Codd; on the Entity-Relationship view of data, developed by Dr. P. P. Chen; and also on the Logical Database Design Technique, created by Robert G. Brown.
Illustrative logical SQL-DDL layout
Then, based on the conceptual analysis previously presented, I declared the logical-level design below:
-- You should determine which are the most fitting
-- data types and sizes for all your table columns
-- depending on your business context characteristics.
-- Also you should make accurate tests to define the most
-- convenient index strategies at the physical level.
-- As one would expect, you are free to make use of
-- your preferred (or required) naming conventions.
CREATE TABLE UserProfile (
UserId INT NOT NULL,
FirstName CHAR(30) NOT NULL,
LastName CHAR(30) NOT NULL,
BirthDate DATETIME NOT NULL,
GenderCode CHAR(3) NOT NULL,
UserName CHAR(20) NOT NULL,
CreatedDateTime DATETIME NOT NULL,
--
CONSTRAINT UserProfile_PK PRIMARY KEY (UserId),
CONSTRAINT UserProfile_AK1 UNIQUE ( -- Composite ALTERNATE KEY.
FirstName,
LastName,
BirthDate,
GenderCode
),
CONSTRAINT UserProfile_AK2 UNIQUE (UserName) -- ALTERNATE KEY.
);
CREATE TABLE BlogStory (
BlogStoryNumber INT NOT NULL,
Title CHAR(60) NOT NULL,
Content TEXT NOT NULL,
CoverImageName CHAR(30) NOT NULL,
IsActive BIT(1) NOT NULL,
AuthorId INT NOT NULL,
CreatedDateTime DATETIME NOT NULL,
--
CONSTRAINT BlogStory_PK PRIMARY KEY (BlogStoryNumber),
CONSTRAINT BlogStory_AK UNIQUE (Title), -- ALTERNATE KEY.
CONSTRAINT BlogStoryToUserProfile_FK FOREIGN KEY (AuthorId)
REFERENCES UserProfile (UserId)
);
CREATE TABLE BlogStoryVersion (
BlogStoryNumber INT NOT NULL,
CreatedDateTime DATETIME NOT NULL,
Title CHAR(60) NOT NULL,
Content TEXT NOT NULL,
CoverImageName CHAR(30) NOT NULL,
IsActive BIT(1) NOT NULL,
AuthorId INT NOT NULL,
UpdatedDateTime DATETIME NOT NULL,
--
CONSTRAINT BlogStoryVersion_PK PRIMARY KEY (BlogStoryNumber, CreatedDateTime), -- Composite PK.
CONSTRAINT BlogStoryVersionToBlogStory_FK FOREIGN KEY (BlogStoryNumber)
REFERENCES BlogStory (BlogStoryNumber),
CONSTRAINT BlogStoryVersionToUserProfile_FK FOREIGN KEY (AuthorId)
REFERENCES UserProfile (UserId),
CONSTRAINT DatesSuccession_CK CHECK (UpdatedDateTime > CreatedDateTime) --Let us hope that MySQL will finally enforce CHECK constraints in a near future version.
);
Tested in this SQL Fiddle that runs on MySQL 5.6.
The BlogStory
table
As you can see in the demo design, I have defined the BlogStory
PRIMARY KEY (PK for brevity) column with the INT datatype. In this regard, you may like to fix a built-in automatic process that generates and assigns a numeric value for such a column in every row insertion. If you do not mind leaving gaps occasionally in this set of values, then you can employ the AUTO_INCREMENT attribute, commonly used in MySQL environments.
When entering all your individual BlogStory.CreatedDateTime
data points, you can utilize the NOW() function, which returns the Date and Time values that are current in the database server at the exact INSERT operation instant. To me, this practice is decidedly more suitable and less prone to errors than the use of external routines.
Provided that, as discussed in (now-removed) comments, you want to avoid the possibility of maintaining BlogStory.Title
duplicate values, you have to set up a UNIQUE constraint for this column. Due to the fact that a given Title may be shared by several (or even all of the) “past” BlogStoryVersions, then a UNIQUE constraint should not be established for the BlogStoryVersion.Title
column.
I included the BlogStory.IsActive
column of type BIT(1) (though a TINYINT may as well be used) in case you need to provide “soft” or “logical” DELETE functionality.
Details about the BlogStoryVersion
table
On the other hand, the PK of the BlogStoryVersion
table is composed of (a) BlogStoryNumber
and (b) a column named CreatedDateTime
that, of course, marks the precise instant in which a BlogStory
row underwent an INSERT.
BlogStoryVersion.BlogStoryNumber
, besides being part of the PK, is also constrained as a FOREIGN KEY (FK) that references BlogStory.BlogStoryNumber
, a configuration that enforces referential integrity between the rows of these two tables. In this respect, implementing an automatic generation of a BlogStoryVersion.BlogStoryNumber
is not necessary because, being set as a FK, the values INSERTed into this column must be “drawn from” the ones already enclosed in the related BlogStory.BlogStoryNumber
counterpart.
The BlogStoryVersion.UpdatedDateTime
column should retain, as expected, the point in time when a BlogStory
row was modified and, as a consequence, added to the BlogStoryVersion
table. Hence, you can use the NOW() function in this situation too.
The Interval comprehended between BlogStoryVersion.CreatedDateTime
and BlogStoryVersion.UpdatedDateTime
expresses the entire Period during which a BlogStory
row was “present” or “current”.
Considerations for a Version
column
It can be useful to think of BlogStoryVersion.CreatedDateTime
as the column that holds the value that represents a particular “past” Version of a BlogStory. I deem this much more beneficial than a VersionId
or VersionCode
, since it is user-friendlier in the sense that people tend to be more familiar with time concepts. For instance, the blog authors or readers could refer to a BlogStoryVersion in a fashion similar to the following:
- “I want to see the specific Version of the BlogStory identified by Number
1750
that was Created on26 August 2015
at9:30
”.
The Author and Editor Roles: Data derivation and interpretation
With this approach, you can easily distinguish who holds the “original” AuthorId
of a concrete BlogStory SELECTing the “earliest” Version of a certain BlogStoryId
FROM the BlogStoryVersion
table by virtue of applying the MIN() function to BlogStoryVersion.CreatedDateTime
.
In this way, each BlogStoryVersion.AuthorId
value contained in all the “later” or “succeeding” Versions rows indicate, naturally, the Author identifier of the respective Version at hand, but one can also say that such a value is, at the same time, denoting the Role played by the involved User as Editor of the “original” Version of a BlogStory.
Yes, a given AuthorId
value may be shared by multiple BlogStoryVersion
rows, but this is actually a piece of information that tells something very significant about each Version, so the repetition of said datum is not a problem.
The format of DATETIME columns
As for the DATETIME data type, yes, you are right, “MySQL retrieves and displays DATETIME values in 'YYYY-MM-DD HH:MM:SS
' format”, but you can confidently enter the pertinent data in this manner, and when you have to perform a query you just have to make use of the built-in DATE and TIME functions in order to, among other things, show the concerning values in the appropriate format for your users. Or you could certainly carry out this kind of data formatting via your application programms(s) code.
Implications of BlogStory
UPDATE operations
Every time that a BlogStory
row suffers an UPDATE, you must ensure that the corresponding values that were “present” until the modification took place are then INSERTed into the BlogStoryVersion
table. Thus, I highly suggest fulfilling these operations within a single ACID TRANSACTION to guarantee that they are treated as an indivisible Unit of Work. You may as well employ TRIGGERS, but they tend to make things untidy, so to speak.
Introducing a VersionId
or VersionCode
column
If you opt (because of business circumstances or personal preference) to incorporate a BlogStory.VersionId
or BlogStory.VersionCode
column to distinguish the BlogStoryVersions, you should ponder the following possibilities:
A
VersionCode
could be required to be UNIQUE in (i) the wholeBlogStory
table and also in (ii)BlogStoryVersion
.Therefore, you have to implement a carefully tested and totally reliable method in order to generate and assign each
Code
value.Maybe, the
VersionCode
values could be repeated in differentBlogStory
rows, but never duplicated along with the sameBlogStoryNumber
. E.g., you could have:- a BlogStoryNumber
3
- Version83o7c5c
and, simultaneously, - a BlogStoryNumber
86
- Version83o7c5c
and - a BlogStoryNumber
958
- Version83o7c5c
.
- a BlogStoryNumber
The later possibility opens another alternative:
Keeping a
VersionNumber
for theBlogStories
, so there could be:- BlogStoryNumber
23
- Versions1, 2, 3…
; - BlogStoryNumber
650
- Versions1, 2, 3…
; - BlogStoryNumber
2254
- Versions1, 2, 3…
; - etc.
- BlogStoryNumber
Holding “original” and “subsequent” versions in a single table
Although maintaining all the BlogStoryVersions in the same individual base table is possible, I suggest not to do it because you would be mixing two distinct (conceptual) types of facts, which thus has undesirable side-effects on
- data constraints and manipulation (at the logical level), along with
- the related processing and storage (at the physical tier).
But, on condition that you choose to follow that course of action, you can still take advantage of many of the ideas detailed above, e.g.:
- a composite PK consisting of an INT column (
BlogStoryNumber
) and a DATETIME column (CreatedDateTime
); - the usage of server functions in order to optimize the pertinent processes, and
- the Author and Editor derivable Roles.
Seeing that, by proceeding with such an approach, a BlogStoryNumber
value will be duplicated as soon as “newer” Versions are added, an option that and that you could evaluate (which is very alike to those mentioned in the previous section) is establishing a BlogStory
PK composed of the columns BlogStoryNumber
and VersionCode
, in this manner you would be able to uniquely identify each Version of a BlogStory. And you can try with a combination of BlogStoryNumber
and VersionNumber
too.
Similar scenario
You may find my answer to this question of help, since I as well propose enabling temporal capabilities in the concerning database to deal with a comparable scenario.
One option is to use Version Normal Form (vnf). The advantages include:
- The current data and all past data reside in the same table.
- The same query is used to retrieve current data or data that was current as of any particular date.
- Foreign key references to versioned data work the same as for unversioned data.
An additional benefit in your case, as versioned data is uniquely identified by making the effective date (the date the change was made) part of the key, a separate version_id field is not required.
Here is an explanation for a very similar type of entity.
More details can be found in a slide presentation here and a not-quite-completed document here