Separate month and year columns, or date with day always set to 1?
Personally if it's a date, or can be a date, I suggest always storing it as one. It's just easier to work with as a rule of thumb.
- A date is 4 bytes.
- A smallint is 2 bytes (we need two)
- ... 2 bytes: one smallint for year
- ... 2 bytes: one smallint for month
You can have one date which will support day if you ever need it, or one smallint
for year and month which will never support the extra precision.
Sample data
Let's look at an example now.. Let's create 1 million dates for our sample. This is approximately 5,000 rows for 200 years between 1901, and 2100. Every year should have something for every month.
CREATE TABLE foo
AS
SELECT
x,
make_date(year,month,1)::date AS date,
year::smallint,
month::smallint
FROM generate_series(1,1e6) AS gs(x)
CROSS JOIN LATERAL CAST(trunc(random()*12+1+x-x) AS int) AS month
CROSS JOIN LATERAL CAST(trunc(random()*200+1901+x-x) AS int) AS year
;
CREATE INDEX ON foo(date);
CREATE INDEX ON foo (year,month);
VACUUM FULL ANALYZE foo;
Testing
Simple WHERE
Now we can test these theories of not using date.. I ran each of these a few times so as to warm things up.
EXPLAIN ANALYZE SELECT * FROM foo WHERE date = '2014-1-1'
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=11.56..1265.16 rows=405 width=14) (actual time=0.164..0.751 rows=454 loops=1)
Recheck Cond: (date = '2014-04-01'::date)
Heap Blocks: exact=439
-> Bitmap Index Scan on foo_date_idx (cost=0.00..11.46 rows=405 width=0) (actual time=0.090..0.090 rows=454 loops=1)
Index Cond: (date = '2014-04-01'::date)
Planning time: 0.090 ms
Execution time: 0.795 ms
Now, let's try the other method with them separate
EXPLAIN ANALYZE SELECT * FROM foo WHERE year = 2014 AND month = 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=12.75..1312.06 rows=422 width=14) (actual time=0.139..0.707 rows=379 loops=1)
Recheck Cond: ((year = 2014) AND (month = 1))
Heap Blocks: exact=362
-> Bitmap Index Scan on foo_year_month_idx (cost=0.00..12.64 rows=422 width=0) (actual time=0.079..0.079 rows=379 loops=1)
Index Cond: ((year = 2014) AND (month = 1))
Planning time: 0.086 ms
Execution time: 0.749 ms
(7 rows)
In fairness, they're not all 0.749.. some are a bit more or less, but it doesn't matter. They're all relatively the same. It simply isn't needed.
Within one month
Now, let's have fun with it.. Let's say you want to find all intervals within 1 month of Jan 2014 (the same month we used above).
EXPLAIN ANALYZE
SELECT *
FROM foo
WHERE date
BETWEEN
('2014-1-1'::date - '1 month'::interval)::date
AND ('2014-1-1'::date + '1 month'::interval)::date;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=21.27..2310.97 rows=863 width=14) (actual time=0.384..1.644 rows=1226 loops=1)
Recheck Cond: ((date >= '2013-12-01'::date) AND (date <= '2014-02-01'::date))
Heap Blocks: exact=1083
-> Bitmap Index Scan on foo_date_idx (cost=0.00..21.06 rows=863 width=0) (actual time=0.208..0.208 rows=1226 loops=1)
Index Cond: ((date >= '2013-12-01'::date) AND (date <= '2014-02-01'::date))
Planning time: 0.104 ms
Execution time: 1.727 ms
(7 rows)
Compare that to the combined method
EXPLAIN ANALYZE
SELECT *
FROM foo
WHERE year = 2013 AND month = 12
OR ( year = 2014 AND ( month = 1 OR month = 2) );
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=38.79..2999.66 rows=1203 width=14) (actual time=0.664..2.291 rows=1226 loops=1)
Recheck Cond: (((year = 2013) AND (month = 12)) OR (((year = 2014) AND (month = 1)) OR ((year = 2014) AND (month = 2))))
Heap Blocks: exact=1083
-> BitmapOr (cost=38.79..38.79 rows=1237 width=0) (actual time=0.479..0.479 rows=0 loops=1)
-> Bitmap Index Scan on foo_year_month_idx (cost=0.00..12.64 rows=421 width=0) (actual time=0.112..0.112 rows=402 loops=1)
Index Cond: ((year = 2013) AND (month = 12))
-> BitmapOr (cost=25.60..25.60 rows=816 width=0) (actual time=0.218..0.218 rows=0 loops=1)
-> Bitmap Index Scan on foo_year_month_idx (cost=0.00..12.62 rows=420 width=0) (actual time=0.108..0.108 rows=423 loops=1)
Index Cond: ((year = 2014) AND (month = 1))
-> Bitmap Index Scan on foo_year_month_idx (cost=0.00..12.38 rows=395 width=0) (actual time=0.108..0.108 rows=401 loops=1)
Index Cond: ((year = 2014) AND (month = 2))
Planning time: 0.256 ms
Execution time: 2.421 ms
(13 rows)
It's both slower, and uglier.
GROUP BY
/ORDER BY
Combined method,
EXPLAIN ANALYZE
SELECT date, count(*)
FROM foo
GROUP BY date
ORDER BY date;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Sort (cost=20564.75..20570.75 rows=2400 width=4) (actual time=286.749..286.841 rows=2400 loops=1)
Sort Key: date
Sort Method: quicksort Memory: 209kB
-> HashAggregate (cost=20406.00..20430.00 rows=2400 width=4) (actual time=285.978..286.301 rows=2400 loops=1)
Group Key: date
-> Seq Scan on foo (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.012..70.582 rows=1000000 loops=1)
Planning time: 0.094 ms
Execution time: 286.971 ms
(8 rows)
And again with the composite method
EXPLAIN ANALYZE
SELECT year, month, count(*)
FROM foo
GROUP BY year, month
ORDER BY year, month;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Sort (cost=23064.75..23070.75 rows=2400 width=4) (actual time=336.826..336.908 rows=2400 loops=1)
Sort Key: year, month
Sort Method: quicksort Memory: 209kB
-> HashAggregate (cost=22906.00..22930.00 rows=2400 width=4) (actual time=335.757..336.060 rows=2400 loops=1)
Group Key: year, month
-> Seq Scan on foo (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.010..70.468 rows=1000000 loops=1)
Planning time: 0.098 ms
Execution time: 337.027 ms
(8 rows)
Conclusion
Generally, let the smart people do the hard work. Datemath is hard, my clients don't pay me enough. I used to do these tests. I was hard pressed to ever conclude that I could get better results than date
. I stopped trying.
UPDATES
@a_horse_with_no_name suggested for my within one month test WHERE (year, month) between (2013, 12) and (2014,2)
. In my opinion, while cool that's a more complex query and I'd rather avoid it unless there was a gain. Alas, it was still slower though it's close -- which is more of the take away from this test. It simply doesn't matter much.
EXPLAIN ANALYZE
SELECT *
FROM foo
WHERE (year, month) between (2013, 12) and (2014,2);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo (cost=5287.16..15670.20 rows=248852 width=14) (actual time=0.753..2.157 rows=1226 loops=1)
Recheck Cond: ((ROW(year, month) >= ROW(2013, 12)) AND (ROW(year, month) <= ROW(2014, 2)))
Heap Blocks: exact=1083
-> Bitmap Index Scan on foo_year_month_idx (cost=0.00..5224.95 rows=248852 width=0) (actual time=0.550..0.550 rows=1226 loops=1)
Index Cond: ((ROW(year, month) >= ROW(2013, 12)) AND (ROW(year, month) <= ROW(2014, 2)))
Planning time: 0.099 ms
Execution time: 2.249 ms
(7 rows)
As an alternative to Evan Carroll proposed method, which I consider probably the best option, I've used in some occasions (and not specially when using PostgreSQL) just a year_month
column, of type INTEGER
(4 bytes), computed as
year_month = year * 100 + month
That is, you encode the month on the two rightmost decimal digits (digit 0, and digit 1) of the integer number, and the year on the digits 2 through 5 (or more, if needed).
This is, to some extent, a poor man's alternative to building your own year_month
type and operators. It's got some advantages, mostly "clarity of intent", and some space savings (not in PostgreSQL, I think), and also some inconveniences, over having two separate columns.
You can guarantee that values are valid by just adding a
CHECK ((year_date % 100) BETWEEN 1 AND 12) /* % = modulus operator */
You can have a WHERE
clause looking like:
year_month BETWEEN 201610 and 201702
and it works efficiently (if the year_month
column is properly indexed, of course).
You can group by year_month
the same way you could do it with a date, and with the same efficiency (at least).
If you need to separate year
and month
, the computation is straightforward:
month = year_month % 100 -- % is modulus operator
year = year_month / 100 -- / is integer division
What's inconvenient: if you want to add 15 months to a year_month
you have to compute (if I've not made a mistake or oversight):
year_month + delta (months) = ...
/* intermediate calculations */
year = year_month/100 + delta/12 /* years we had + new years */
+ (year_month % 100 + delta%12) / 12 /* extra months make 1 more year? */
month = ((year_month%10) + (delta%12) - 1) % 12 + 1
/* final result */
... = year * 100 + month
If you're not careful, this can be error prone.
If you want to get the number of months between two year_months, you need to do some similar computations. That's (with a lot of simplifications) what really happens under the hood with date arithmetic, that is luckily hidden from us through already defined functions and operators.
If you need lots of these operations, using year_month
is not too practical. If you don't, it is a very clear way of making your intent clear.
As an alternative, you could define a year_month
type, and define an operator year_month
+ interval
, and also another year_month
- year_month
... and hide the calculations. I've actually never made such a heavy use as to feel the need in practice. A date
-date
is actually hiding you something similar.
As an alternative to joanolo's method =) (sorry I was busy but wanted to write this)
BIT JOY
We're going to do the same thing, but with bits. One int4
in PostgreSQL is a signed integer, ranging from -2147483648 to +2147483647
Here is an overview, of our structure.
bit
----------------------------------
YYYYYYYYYYYYYYYYYYYYYYYYYYYYMMMM
Storing month.
- A month requires 12 options
pow(2,4)
is 4 bits. - The rest we devote to the year, 32-4 = 28 bits.
Here is our bit map of where months are stored.
bit
----------------------------------
00000000000000000000000000001111
Months, 1-Jan - 12 Dec
bit
----------------------------------
00000000000000000000000000000001
bit
----------------------------------
00000000000000000000000000001100
Years. The remaining 28 bits allow us to store our year-information
SELECT (pow(2,28)-1)::int;
int4
-----------
268435455
(1 row)
At this point we need to decide how we want to do this. For our purposes, we could use a static offset, if we only need to cover 5,000 AD, we could go back to 268,430,455 BC
which pretty much covers the entirety of Mesozoic and everything useful moving forward.
SELECT (pow(2,28)-1)::int4::bit(32) << 4;
year
----------------------------------
11111111111111111111111111110000
And, now we have the rudiments of our type, set to expire in 2,700 years.
So let's get to work on making some functions.
CREATE DOMAIN year_month AS int4;
CREATE OR REPLACE FUNCTION to_year_month (cstring text)
RETURNS year_month
AS $$
SELECT (
( ((date[1]::int4 - 5000) * -1)::bit(32) << 4 )
| date[2]::int4::bit(32)
)::year_month
FROM regexp_split_to_array(cstring,'-(?=\d{1,2}$)')
AS t(date)
$$
LANGUAGE sql
IMMUTABLE;
CREATE OR REPLACE FUNCTION year_month_to_text (ym year_month)
RETURNS text
AS $$
SELECT ((ym::bit(32) >>4)::int4 * -1 + 5000)::text ||
'-' ||
(ym::bit(32) <<28 >>28)::int4::text
$$ LANGUAGE sql
IMMUTABLE;
A quick test shows this working..
SELECT year_month_to_text( to_year_month('2014-12') );
SELECT year_month_to_text( to_year_month('-5000-10') );
SELECT year_month_to_text( to_year_month('-8000-10') );
SELECT year_month_to_text( to_year_month('-84398-10') );
Now we have functions that we can use on our binary types..
We could have cut off one more bit from the signed portion, stored the year as positive, and then had it sort naturally as a signed int. If speed was a higher priority than storage space, that would have been the route we go down. But for now, we have a date that works with the Mesozoic.
I may update later with that, just for fun.