Data Warehouse design for reporting against data for many time zones
I've solved this by having a very simple calendar table - each year has one row per supported time zone, with the standard offset and the start datetime / end datetime of DST and its offset (if that time zone supports it). Then an inline, schema-bound, table-valued function that takes the source time (in UTC of course) and adds/subtracts the offset.
This will obviously never perform extremely well if you are reporting against a large portion of data; partitioning might seem to help, but you will still have cases where the last few hours in one year or the first few hours in the next year actually belong to a different year when converted to a specific time zone - so you can never get true partition isolation, except when your reporting range does not include December 31 or January 1.
There are a couple of weird edge cases you need to consider:
2014-11-02 05:30 UTC and 2014-11-02 06:30 UTC both convert to 01:30 AM in the Eastern time zone, for example (one for the first time 01:30 was hit locally, and then one for the second time when the clocks rolled back from 2:00 AM to 1:00 AM, and another half hour elapsed). So you need to decide how to handle that hour of reporting - according to UTC, you should see double the traffic or volume of whatever you're measuring once those two hours get mapped to a single hour in a time zone that observes DST. This can also play fun games with sequencing of events, since something that logically had to happen after something else could appear to happen before it once the timing is adjusted to a single hour instead of two. An extreme example is a page view that happened at 05:59 UTC, then a click that happened at 06:00 UTC. In UTC time these happened a minute apart, but when converted to Eastern time, the view happened at 1:59 AM, and the click happened an hour earlier.
2014-03-09 02:30 never happens in the USA. This is because at 2:00 AM we roll the clocks forward to 3:00 AM. So likely you will want to raise an error if the user enters such a time and asks you to convert that to UTC, or design your form so that users can't pick such a time.
Even with those edge cases in mind, I still think you have the right approach: store the data in UTC. Much easier to map data to other time zones from UTC than from some time zone to some other time zone, especially when different time zones start / end DST on different dates, and even the same time zone can switch using different rules in different years (for example the U.S. changed the rules 6 years ago or so).
You will want to use a calendar table for all of this, not some gargantuan CASE
expression (not statement). I just wrote a three-part series for MSSQLTips.com on this; I think the 3rd part will be the most useful for you:
- http://www.mssqltips.com/sqlservertip/3173/handle-conversion-between-time-zones-in-sql-server--part-1/
- http://www.mssqltips.com/sqlservertip/3174/handle-conversion-between-time-zones-in-sql-server--part-2/
- http://www.mssqltips.com/sqlservertip/3175/handle-conversion-between-time-zones-in-sql-server--part-3/
A real live example, in the meantime
Let's say you have a very simple fact table. The only fact I care about in this case is the event time, but I'll add a meaningless GUID just to make the table wide enough to care about. Again, to be explicit, the fact table stores events in UTC time and UTC time only. I've even suffixed the column with _UTC
so there is no confusion.
CREATE TABLE dbo.Fact
(
EventTime_UTC DATETIME NOT NULL,
Filler UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID()
);
GO
CREATE CLUSTERED INDEX x ON dbo.Fact(EventTime_UTC);
GO
Now, let's load our fact table with 10,000,000 rows - representing every 3 seconds (1,200 rows per hour) from 2013-12-30 at midnight UTC until sometime after 5 AM UTC on 2014-12-12. This ensures that the data straddles a year boundary, as well as DST forward and back for multiple time zones. This looks really scary, but took ~9 seconds on my system. Table should end up being about 325 MB.
;WITH x(c) AS
(
SELECT TOP (10000000) DATEADD(SECOND,
3*(ROW_NUMBER() OVER (ORDER BY s1.[object_id])-1),
'20131230')
FROM sys.all_columns AS s1
CROSS JOIN sys.all_columns AS s2
ORDER BY s1.[object_id]
)
INSERT dbo.Fact WITH (TABLOCKX) (EventTime_UTC)
SELECT c FROM x;
And just to show what a typical seek query will look like against this 10MM row table, if I run this query:
SELECT DATEADD(HOUR, DATEDIFF(HOUR, 0, EventTime_UTC), 0),
COUNT(*)
FROM dbo.Fact
WHERE EventTime_UTC >= '20140308'
AND EventTime_UTC < '20140311'
GROUP BY DATEADD(HOUR, DATEDIFF(HOUR, 0, EventTime_UTC), 0);
I get this plan, and it returns in 25 milliseconds*, doing 358 reads, to return 72 hourly totals:
* Duration as measured by the free SentryOne Plan Explorer, which discards results, so this does not include network transfer time of the data, rendering, etc.
It takes a little longer, obviously, if I make my range too large - a month of data takes 258ms, two months takes over 500ms, and so on. Parallelism may kick in:
This is where you start thinking about other, better solutions to satisfy reporting queries, and it has nothing to do with what time zone your output will display. I won't get into that, I just want to demonstrate that time zone conversion is not really going to make your reporting queries suck all that much more, and they may already suck if you are getting large ranges that aren't supported by proper indexes. I'm going to stick to small date ranges to show that the logic is correct, and let you worry about making sure your range-based reporting queries perform adequately, with or without time zone conversions.
Okay, now we need tables to store our time zones (with offsets, in minutes, since not everybody is even hours off UTC) and DST change dates for each supported year. For simplicity, I'm only going to enter a few time zones and a single year to match the data above.
CREATE TABLE dbo.TimeZones
(
TimeZoneID TINYINT NOT NULL PRIMARY KEY,
Name VARCHAR(9) NOT NULL,
Offset SMALLINT NOT NULL, -- minutes
DSTName VARCHAR(9) NOT NULL,
DSTOffset SMALLINT NOT NULL -- minutes
);
Included a few time zones for variety, some with half hour offsets, some that don't observe DST. Note that Australia, in southern hemisphere observes DST during our winter, so their clocks go back in April and forward in October. (The above table flips the names, but I'm not sure how to make this any less confusing for southern hemisphere time zones.)
INSERT dbo.TimeZones VALUES
(1, 'UTC', 0, 'UTC', 0),
(2, 'GMT', 0, 'BST', 60),
-- London = UTC in winter, +1 in summer
(3, 'EST', -300, 'EDT', -240),
-- East coast US (-5 h in winter, -4 in summer)
(4, 'ACDT', 630, 'ACST', 570),
-- Adelaide (Australia) +10.5 h Oct - Apr, +9.5 Apr - Oct
(5, 'ACST', 570, 'ACST', 570);
-- Darwin (Australia) +9.5 h year round
Now, a calendar table to know when TZs change. I'm only going to insert rows of interest (each time zone above, and only DST changes for 2014). For ease of calculations back and forth, I store both the moment in UTC where a time zone changes, and the same moment in local time. For time zones that don't observe DST, it's standard all year long, and DST "starts" on January 1.
CREATE TABLE dbo.Calendar
(
TimeZoneID TINYINT NOT NULL FOREIGN KEY
REFERENCES dbo.TimeZones(TimeZoneID),
[Year] SMALLDATETIME NOT NULL,
UTCDSTStart SMALLDATETIME NOT NULL,
UTCDSTEnd SMALLDATETIME NOT NULL,
LocalDSTStart SMALLDATETIME NOT NULL,
LocalDSTEnd SMALLDATETIME NOT NULL,
PRIMARY KEY (TimeZoneID, [Year])
);
You can definitely populate this with algorithms (and the upcoming tip series uses some clever set-based techniques, if I do say so myself), rather than loop, populate manually, what have you. For this answer I decided to just manually populate one year for the five time zones, and I'm not going to bother any fancy tricks.
INSERT dbo.Calendar VALUES
(1, '20140101', '20140101 00:00','20150101 00:00','20140101 00:00','20150101 00:00'),
(2, '20140101', '20140330 01:00','20141026 00:00','20140330 02:00','20141026 01:00'),
(3, '20140101', '20140309 07:00','20141102 06:00','20140309 03:00','20141102 01:00'),
(4, '20140101', '20140405 16:30','20141004 16:30','20140406 03:00','20141005 02:00'),
(5, '20140101', '20140101 00:00','20150101 00:00','20140101 00:00','20150101 00:00');
Okay, so we have our fact data, and our "dimension" tables (I cringe when I say that), so what's the logic? Well, I presume you're going to have users select their time zone and enter the date range for the query. I will also assume that the date range will be full days in their own timezone; no partial days, never mind partial hours. So they will pass in a start date, an end date, and a TimeZoneID. From there we will use a scalar function to convert the start/end date from that time zone to UTC, which will allow us to filter the data based on the UTC range. Once we've done that, and performed our aggregations on it, we can then apply the conversion of the grouped times back to the source time zone, before displaying to the user.
The scalar UDF:
CREATE FUNCTION dbo.ConvertToUTC
(
@Source SMALLDATETIME,
@SourceTZ TINYINT
)
RETURNS SMALLDATETIME
WITH SCHEMABINDING
AS
BEGIN
RETURN
(
SELECT DATEADD(MINUTE, -CASE
WHEN @Source >= src.LocalDSTStart
AND @Source < src.LocalDSTEnd THEN t.DSTOffset
WHEN @Source >= DATEADD(HOUR,-1,src.LocalDSTStart)
AND @Source < src.LocalDSTStart THEN NULL
ELSE t.Offset END, @Source)
FROM dbo.Calendar AS src
INNER JOIN dbo.TimeZones AS t
ON src.TimeZoneID = t.TimeZoneID
WHERE src.TimeZoneID = @SourceTZ
AND t.TimeZoneID = @SourceTZ
AND DATEADD(MINUTE,t.Offset,@Source) >= src.[Year]
AND DATEADD(MINUTE,t.Offset,@Source) < DATEADD(YEAR, 1, src.[Year])
);
END
GO
And the table-valued function:
CREATE FUNCTION dbo.ConvertFromUTC
(
@Source SMALLDATETIME,
@SourceTZ TINYINT
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
SELECT
[Target] = DATEADD(MINUTE, CASE
WHEN @Source >= trg.UTCDSTStart
AND @Source < trg.UTCDSTEnd THEN tz.DSTOffset
ELSE tz.Offset END, @Source)
FROM dbo.Calendar AS trg
INNER JOIN dbo.TimeZones AS tz
ON trg.TimeZoneID = tz.TimeZoneID
WHERE trg.TimeZoneID = @SourceTZ
AND tz.TimeZoneID = @SourceTZ
AND @Source >= trg.[Year]
AND @Source < DATEADD(YEAR, 1, trg.[Year])
);
And a procedure that uses it (edit: updated to handle 30-minute offset grouping):
CREATE PROCEDURE dbo.ReportOnDateRange
@Start SMALLDATETIME, -- whole dates only please!
@End SMALLDATETIME, -- whole dates only please!
@TimeZoneID TINYINT
AS
BEGIN
SET NOCOUNT ON;
SELECT @Start = dbo.ConvertToUTC(@Start, @TimeZoneID),
@End = dbo.ConvertToUTC(@End, @TimeZoneID);
;WITH x(t,c) AS
(
SELECT DATEDIFF(MINUTE, @Start, EventTime_UTC)/60,
COUNT(*)
FROM dbo.Fact
WHERE EventTime_UTC >= @Start
AND EventTime_UTC < DATEADD(DAY, 1, @End)
GROUP BY DATEDIFF(MINUTE, @Start, EventTime_UTC)/60
)
SELECT
UTC = DATEADD(MINUTE, x.t*60, @Start),
[Local] = y.[Target],
[RowCount] = x.c
FROM x OUTER APPLY
dbo.ConvertFromUTC(DATEADD(MINUTE, x.t*60, @Start), @TimeZoneID) AS y
ORDER BY UTC;
END
GO
(You may want to have a go at short circuiting there, or a separate stored procedure, in the event that the user wants reporting in UTC - obviously translating to and from UTC is going to be wasteful busy work.)
Sample call:
EXEC dbo.ReportOnDateRange
@Start = '20140308',
@End = '20140311',
@TimeZoneID = 3;
Returns in 41ms*, and generates this plan:
* Again, with discarded results.
For 2 months, it returns in 507ms, and the plan is identical other than rowcounts:
While slightly more complex and increasing run time a little bit, I am fairly confident that this type of approach will work out much, much better than the bridge table approach. And this is an off-the cuff example for a dba.se answer; I'm sure my logic and efficiency could be improved by folks much smarter than me.
You can peruse the data to see the edge cases I talk about - no row of output for the hour where clocks roll forward, two rows for the hour where they rolled back (and that hour happened twice). You can also play with bad values; if you pass in 20140309 02:30 Eastern time, for example, it's not going to work too well.
I might not have all of the assumptions right about how your reporting will work, so you may have to make some adjustments. But I think this covers the basics.