Versioning in SQL Tables - how to handle it?

An approach that I've designed for a recent database is to use revisions as follows:

  • Keep your entity in two tables:

    1. "employee" stores a primary key ID and any data that you do not want to be versioned (if there is any).

    2. "employee_revision" stores all the salient data about the employee, with a foreign key to the employee table and a foreign key, "RevisionID" to a table called "revision".

  • Make a new table called "revision". This can be used by all the entities in your database, not just employee. It contains an identity column for the primary key (or AutoNumber, or whatever your database calls such a thing). It also contains EffectiveFrom and EffectiveTo columns. I also have a text column on the table - entity_type - for human readability reasons which contain the name of the primary revision table (in this case "employee"). The revision table contains no foreign keys. The default value for EffectiveFrom is 1-Jan-1900 and the default value for EffectiveTo is 31-Dec-9999. This allows me to not simplify the date querying.

I make sure that the revision table is well indexed on (EffectiveFrom, EffectiveTo, RevisionID) and also on (RevisionID, EffectiveFrom, EffectiveTo).

I can then use joins and simple <> comparisons to select an appropriate record for any date. This also means that relations between entities are also fully versioned. In fact, I find it useful to use SQL Server table-valued functions to allow very simply querying of any date.

Here's an example (assuming that you don't want to version employee names so that if they change their name, the change is effective historically).

--------
employee
--------
employee_id  |  employee_name
-----------  |  -------------
12351        |  John Smith

-----------------
employee_revision
-----------------
employee_id  |  revision_id  |  department_id  |  position_id  |  pay
-----------  |  -----------  |  -------------  |  -----------  |  ----------
12351        |  657442       |  72             |  23           |  22000.00
12351        |  657512       |  72             |  27           |  22000.00
12351        |  657983       |  72             |  27           |  28000.00

--------
revision
--------
revision_id  |  effective_from  |  effective_to  |  entity_type
-----------  |  --------------  |  ------------  |  -----------
657442       |  01-Jan-1900     |  03-Mar-2007   |  EMPLOYEE
657512       |  04-Mar-2007     |  22-Jun-2009   |  EMPLOYEE
657983       |  23-Jun-2009     |  31-Dec-9999   |  EMPLOYEE

One advantage of storing your revision metadata in a separate table is that it's easy to apply it consistently to all your entities. Another is that it's easier to expand it to include other things, such as branches or scenarios, without having to modify every table. My principal reason is that it keeps your main entity tables clear and uncluttered.

(The data and example above are fictional - my database does not model employees).


What you have here is called a Slowly Changing Dimension (SCD). There are some proven methods for dealing with it:

http://en.wikipedia.org/wiki/Slowly_changing_dimension

Thought I'd add that since no one seems to call it by name.


I think you've started down the wrong path.

Typically, for versioning or storing historical data you do one of two (or both) things.

  1. You have a separate table that mimics the original table + a date/time column for the date it was changed. Whenever a record is updated, you insert the existing contents into the history table just prior to the update.

  2. You have a separate warehouse database. In this case you can either version it just like in #1 above OR you simply snapshot it once every so often (hourly, daily, weekly..)

Keeping your version number in the same table as your normal one has several problems. First, the table size is going to grow like crazy. This will put constant pressure on normal production queries.

Second, it's going to radically increase your query complexity for joins etc in order to make sure the latest version of each record is being used.