Database design for point in time "snapshot" of data?
This is NOT easy.
You're essentially asking for a Temporal Database (what C. J. Date calls Sixth Normal Form, or 6NF).
To be in 6NF, a schema must also be in 5NF, and, basically, for each datum you attach the time range over which that value of the datum applies. Joins must then include only the rows whose time ranges overlap the range being considered.
Temporal modeling is hard -- it's what 6th Normal Form addresses -- and not well supported in current RDBMSes.
The problem is the granularity. 6th Normal Form (as I understand it) supports temporal modeling by making every non-key attribute (non-key, i.e., anything "on" the entity that can change without the entity losing its identity) a separate relation. To this you add a timestamp, time range, or version number. Making everything a join solves the granularity problem, but it also means your queries will be more complicated and slower. It also requires working out every key and non-key attribute, which tends to be a large effort.
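As a sketch of that decomposition (table and column names here are illustrative, not prescriptive), each changeable attribute becomes its own relation carrying a validity range:

CREATE TABLE person (
    id INT PRIMARY KEY
)

-- one relation per changeable attribute, each with its own date range
CREATE TABLE person_name (
    person_id INT NOT NULL REFERENCES person(id),
    name VARCHAR(100) NOT NULL,
    start_date DATE NOT NULL,
    end_date DATE NULL,  -- NULL = still current
    PRIMARY KEY (person_id, start_date)
)

CREATE TABLE person_address (
    person_id INT NOT NULL REFERENCES person(id),
    state CHAR(2) NOT NULL,
    start_date DATE NOT NULL,
    end_date DATE NULL,
    PRIMARY KEY (person_id, start_date)
)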
Basically, everywhere you have a relation ("ted owns the GM stock certificate with id 789") you add a time: "ted owns the GM stock certificate with id 789 now", so that you can simultaneously say, "fred owned the GM stock certificate with id 789 from 3 Feb 2000 until yesterday". Obviously these relations are many-to-many (ted can own more than one certificate now, and more than one over his lifetime, too; and fred can have previously owned the certificate jack owns now).
So we have a table of owners, and a table of stock certificates, and a many-to-many table that relates owners and certificates by id. To the many-to-many table, we add a start_date and an end_date.
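Continuing the sketch above, the many-to-many table might look something like this (again, names are illustrative):

CREATE TABLE stock_certificate (
    id INT PRIMARY KEY,
    ticker VARCHAR(10) NOT NULL
)

-- ownership is a many-to-many relation in time
CREATE TABLE person_stockcertificate (
    person_id INT NOT NULL REFERENCES person(id),
    certificate_id INT NOT NULL REFERENCES stock_certificate(id),
    start_date DATE NOT NULL,
    end_date DATE NULL,  -- NULL = still owned
    PRIMARY KEY (person_id, certificate_id, start_date)
)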
Now, imagine that each state/province/land taxes the dividends on stock certificates, so for tax purposes we need to record the stock certificate's owner's state of residency.
Where the owner resides can obviously change independently of stock ownership; ted can live in Nebraska, buy 10 shares, get a dividend that Nebraska taxes, move to Nevada, sell 5 shares to fred, and buy 10 more shares.
But for us it's: ted moves to Nebraska at some time, buys 10 shares at some time, gets a dividend at some time (which Nebraska taxes), moves to Nevada at some time, sells 5 shares to fred at some time, buys 10 more shares at some time.
We need all of that if we want to calculate what taxes ted owes in Nebraska and in Nevada, joining on the matching/overlapping date ranges in person_stockcertificate and person_address. A person's address is no longer one-to-one; it's one-to-many, because it's now an address during a time range.
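Assuming a hypothetical dividend table (certificate_id, amount, paid_date), the tax calculation might join on overlapping ranges something like this (@person_id is a placeholder):

SELECT a.state, SUM(d.amount) AS taxable_dividends
FROM dividend d
INNER JOIN person_stockcertificate ps
    ON ps.certificate_id = d.certificate_id
    AND d.paid_date >= ps.start_date
    AND (ps.end_date IS NULL OR d.paid_date < ps.end_date)
INNER JOIN person_address a
    ON a.person_id = ps.person_id
    AND d.paid_date >= a.start_date
    AND (a.end_date IS NULL OR d.paid_date < a.end_date)
WHERE ps.person_id = @person_id
GROUP BY a.state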
If ted buys ten shares, do we model a buy event with a single purchase date, or do we add a date_bought to each share? Depends on the question we need the model to answer.
We did this once by creating separate database tables that contained the data we wanted to snapshot, but denormalized: every record contained all the data required to make sense on its own, not references to ids that might no longer exist. Each row also got a date.
Then we wrote triggers for specific inserts or updates that joined all affected tables and inserted the result into the snapshot tables.
This way it would be trivial to write something that restored the users' data to a point in time.
If you have a table:
user:
id, firstname, lastname, department_id
department:
id, name, departmenthead_id
your snapshot of the user table could look like this:
user_id, user_firstname, user_lastname, department_id, department_name, departmenthead_id, departmenthead_firstname, departmenthead_lastname, snapshot_date
and a query something like:
INSERT INTO usersnapshot
SELECT user.id AS user_id, user.firstname AS user_firstname, user.lastname AS user_lastname,
    department.id AS department_id, department.name AS department_name,
    departmenthead.id AS departmenthead_id, departmenthead.firstname AS departmenthead_firstname, departmenthead.lastname AS departmenthead_lastname,
    GETDATE() AS snapshot_date
FROM user
INNER JOIN department ON user.department_id = department.id
INNER JOIN user departmenthead ON department.departmenthead_id = departmenthead.id
This ensures each row in the snapshot is true for that moment in time, even if department or department head has changed in the meantime.
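Reading the data back as of a given moment is then a matter of picking, per user, the latest snapshot row at or before that moment. A sketch (@as_of is a placeholder for the point in time):

SELECT s.*
FROM usersnapshot s
WHERE s.snapshot_date = (
    SELECT MAX(s2.snapshot_date)
    FROM usersnapshot s2
    WHERE s2.user_id = s.user_id
    AND s2.snapshot_date <= @as_of
)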
Having snapshots and/or an audit trail is a common database requirement. For many applications, creating 'shadow' or audit tables is an easy and straightforward task. While database-level backups and transaction logs are good to have, they are not a version control system.
Basically, you create a shadow table with all the same columns as the base table, then set up triggers on the base table to place a copy of the row in the shadow table whenever it is updated or deleted.
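For the user table above, a minimal T-SQL-style sketch might look like this (column types are assumptions, and 'user' may need quoting since it's a reserved word in some dialects; the Sybase link below covers that dialect's specifics):

CREATE TABLE user_shadow (
    id INT,
    firstname VARCHAR(50),
    lastname VARCHAR(50),
    department_id INT,
    changed_at DATETIME,
    operation CHAR(1)  -- 'U' = updated, 'D' = deleted
)

CREATE TRIGGER trg_user_shadow ON user
FOR UPDATE, DELETE
AS
    -- "deleted" holds the before-image of every affected row
    INSERT INTO user_shadow (id, firstname, lastname, department_id, changed_at, operation)
    SELECT d.id, d.firstname, d.lastname, d.department_id, GETDATE(),
        CASE WHEN EXISTS (SELECT 1 FROM inserted) THEN 'U' ELSE 'D' END
    FROM deleted d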
Through some logic you can recreate what the data looked like at a given point in time. For an easy way to set this up in Sybase see: http://www.theeggeadventure.com/wikimedia/index.php/Sybase_Tips#create_.27audit.27_columns
If you need to do lots of historical snapshots, you can instead keep the history in the same table. Basically, you create two columns, 'added' and 'deleted'. The downside is that every query must then include a where clause on those columns. Of course, you can create a view which shows just the active records. This gets a bit more complicated if you have a normalized database with multiple tables, all with history.
However, it does work. You simply have the 'added' and 'deleted' columns on each table, and your queries carry the point in time of interest. Whenever data is modified, you copy the current row as a new version and mark the old one as deleted.
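For example, a point-in-time query might look like this (a sketch, assuming 'added' and 'deleted' are datetime columns, with 'deleted' NULL while a row is still current, and @as_of as a placeholder):

SELECT *
FROM user
WHERE added <= @as_of
    AND (deleted IS NULL OR deleted > @as_of)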
Use Log Triggers
All data changes are captured, giving you the ability to query the data as it existed at any point in time.
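The general shape (a hypothetical sketch, with illustrative names, not the linked technique verbatim) is a log table holding every version of a row with its validity range, maintained by a trigger:

CREATE TABLE user_log (
    id INT,
    firstname VARCHAR(50),
    lastname VARCHAR(50),
    department_id INT,
    valid_from DATETIME NOT NULL,
    valid_to DATETIME NULL  -- NULL = current version
)

CREATE TRIGGER trg_user_log ON user
FOR INSERT, UPDATE
AS
    -- close out the previous version of each changed row
    UPDATE l
    SET valid_to = GETDATE()
    FROM user_log l
    INNER JOIN inserted i ON l.id = i.id
    WHERE l.valid_to IS NULL

    -- record the new version, open-ended
    INSERT INTO user_log (id, firstname, lastname, department_id, valid_from, valid_to)
    SELECT id, firstname, lastname, department_id, GETDATE(), NULL
    FROM inserted

An "as of" query then selects the version in effect at the moment of interest:

SELECT *
FROM user_log
WHERE valid_from <= @as_of
    AND (valid_to IS NULL OR valid_to > @as_of)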