Selecting most recent and specific version in each group of records, for multiple groups
To get only latest revisions:
SELECT * from t t1
WHERE t1.rev =
(SELECT max(rev) FROM t t2 WHERE t2.id = t1.id)
To get a specific revision, in this case 1 (and if an item doesn't have the revision yet the next smallest revision):
SELECT * from foo t1
WHERE t1.rev =
(SELECT max(rev)
FROM foo t2
WHERE t2.id = t1.id
AND t2.rev <= 1)
It might not be the most efficient way to do this, but right now I cannot figure a better way to do this.
Here's an alternative solution that incurs an update cost but is much more efficient for reading the latest data rows as it avoids computing MAX(rev)
. It also works when you're doing bulk updates of subsets of the table. I needed this pattern to ensure I could efficiently switch to a new data set that was updated via a long running batch update without any windows of time where we had partially updated data visible.
Aging
- Replace the
rev
column with anage
column - Create a view of the current latest data with filter:
age = 0
- To create a new version of your data ...
- INSERT: new rows with
age = -1
- This was my slow long running batch process. - UPDATE:
UPDATE table-name SET age = age + 1
for all rows in the subset. This switches the view to the new latest data (age = 0) and also ages older data in a single transaction. - DELETE: rows having
age > N
in the subset - Optionally purge old data
- INSERT: new rows with
Indexing
- Create a composite index with
age
and thenid
so the view will be nice and fast and can also be used to look up by id. Although this key is effectively unique, its temporarily non-unique when you're ageing the rows (duringUPDATE SET age=age+1
) so you'll need to make it non-unique and ideally the clustered index. If you need to find all versions of a givenid
ordered byage
, you may need an additional non-unique index onid
thenage
.
Rollback
Finally ... Lets say you're having a bad day and the batch processing breaks. You can quickly revert to a previous data set version by running:
UPDATE table-name SET age = age - 1
-- Roll back a versionDELETE table-name WHERE age < 0
-- Clean up bad stuff
Existing Table
Suppose you have an existing table that now needs to support aging. You can use this pattern by first renaming the existing table, then add the age
column and indexing and then create the view that includes the age = 0
condition with the same name as the original table name.
This strategy may or may not work depending on the nature of technology layers that depended on the original table but in many cases swapping a view for a table should drop in just fine.
Notes
I recommend naming the age
column to RowAge
in order to indicate this pattern is being used, since it's clearer that its a database related value and it complements SQL Server's RowVersion
naming convention. It also won't conflict with a column or view that needs to return a person's age.
Unlike other solutions, this pattern works for non SQL Server databases.
This is how I would do it. ROW_NUMBER()
requires SQL Server 2005 or later
Sample data:
DECLARE @foo TABLE (
id int,
rev int,
field nvarchar(10)
)
INSERT @foo VALUES
( 1, 1, 'test1' ),
( 2, 1, 'fdsfs' ),
( 3, 1, 'jfds' ),
( 1, 2, 'test2' )
The query:
DECLARE @desiredRev int
SET @desiredRev = 2
SELECT * FROM (
SELECT
id,
rev,
field,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY rev DESC) rn
FROM @foo WHERE rev <= @desiredRev
) numbered
WHERE rn = 1
The inner SELECT
returns all relevant records, and within each id
group (that's the PARTITION BY
), computes the row number when ordered by descending rev
.
The outer SELECT
just selects the first member (so, the one with highest rev
) from each id
group.
Output when @desiredRev = 2
:
id rev field rn
----------- ----------- ---------- --------------------
1 2 test2 1
2 1 fdsfs 1
3 1 jfds 1
Output when @desiredRev = 1
:
id rev field rn
----------- ----------- ---------- --------------------
1 1 test1 1
2 1 fdsfs 1
3 1 jfds 1