Best database and table design for billions of rows of data
This is exactly what I do every day, except instead of using the hourly data, I use the 5-minute data. I download about 200 million records every day, so the volume you are talking about here is not a problem. The 5-minute data is about 2 TB in size, and I have weather data going back 50 years at an hourly level by location. So let me answer your questions based on my experience:
- Don't use NoSQL for this. The data is highly structured and fits a relational database perfectly.
- I personally use SQL Server 2016 and I have no problems applying computations across that volume of data. It was originally on a PostgreSQL instance when I started my job, and that couldn't handle the volume of data because it was running on a small AWS instance.
- I would highly recommend extracting the hour portion of the date and storing it separately from the date itself. Believe me, learn from my mistakes!
- I store the majority of data list-wise (DATE,TIME,DATAPOINT_ID,VALUE) but that is not how people will want to interpret the data. Be prepared for some horrendous queries against the data and vast amounts of pivoting. Don't be afraid to create a de-normalized table for result sets that are just too large to compute on the fly.
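To make that concrete, a list-wise layout could look roughly like the sketch below. The table and column names are purely illustrative (not my actual schema), and the hour is already split out into its own column, as I recommend further down.
-- Illustrative sketch only: a list-wise layout with the hour stored
-- separately from the date, as recommended below.
CREATE TABLE hourly_value (
    obs_date     date          NOT NULL,  -- calendar date
    obs_hour     smallint      NOT NULL,  -- hour ending, 1-24
    datapoint_id integer       NOT NULL,  -- which series the value belongs to
    value        numeric(12,4) NOT NULL,
    PRIMARY KEY (datapoint_id, obs_date, obs_hour)
);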
General tip: I store most of the data between two databases, the first is straight-up time series data and is normalized. My second database is very de-normalized and contains pre-aggregated data. As fast as my system is, I am not blind to the fact that users don't even want to wait 30 seconds for a report to load – even if I personally think 30 seconds to crunch 2 TB of data is extremely fast.
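For example, the pre-aggregated side might hold daily rollups along these lines (again just a sketch, shown in PostgreSQL syntax and using the illustrative table above; the idea is that it gets rebuilt periodically rather than computed per request):
-- Hypothetical pre-aggregated daily rollup for the de-normalized
-- reporting database; names are illustrative.
CREATE TABLE daily_summary AS
SELECT datapoint_id,
       obs_date,
       count(*)   AS n_values,
       avg(value) AS avg_value,
       min(value) AS min_value,
       max(value) AS max_value
FROM hourly_value
GROUP BY datapoint_id, obs_date;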
To elaborate on why I recommend storing the hour separately from the date, here are a few of the reasons I do it that way:
- The way that the electrical data is presented is by Hour Ending: 01:00 is actually the average of the electrical power for the previous hour, and 00:00 is Hour Ending 24. (This is important because you actually have to search for two dates to include the Hour Ending 24 value: the day you are looking for plus the first mark of the following day.) The weather data, however, is presented in a forward manner (actual and forecast for the next hour). In my experience with this data, consumers wish to analyze the effect that the weather has on the power price/demand. If you were to use a straight date comparison, you would actually be comparing the average price for the previous hour against the average temperature for the following hour, even though the time stamps are the same. Storing the hour separately from the date allows you to apply transformations to the time with less performance impact than applying a calculation to a DATETIME column.
- Performance. At least 90% of the reports that I generate are graphs, normally plotting the price against the hour, either for a single date or for a range of dates. Having to split the time out of the date can bog down the speed of the query used to generate the report, depending on the date range you want to see. It is not uncommon for consumers to want to see a single date, year-on-year for the past 30 years (in fact, for weather this is required to generate the 30-year normals), and this can be slow. Of course you can optimize your query and add indexes; trust me, I have some insane indexes that I would rather not have, but they keep the system running fast. (A sketch of this kind of report query follows after this list.)
- Productivity. I hate having to write the same piece of code more than once. I used to store the date and time in the same column, until I found myself writing the same query over and over again to extract the time portion. After a while I got sick of it and moved the time into its own column. The less code you have to write, the less chance there is of an error in it. Writing less code also means you can get your reports out faster; nobody wants to wait all day for reports.
- End users. Not all end users are power users (i.e. know how to write SQL). Having the data already stored in a format that they can bring into Excel (or a similar tool) with minimal effort will make you a hero in the office. If the users cannot access or manipulate the data easily, they will not use your system. Believe me, I designed the perfect system a couple of years ago and nobody used it for exactly this reason. Database design is not just about adhering to a predefined set of rules/guidelines; it is about making the system usable.
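As an illustration of the kind of report query I mean (still using the illustrative hourly_value table from earlier): pulling one hour of the day year-on-year becomes a trivial filter when the hour lives in its own column.
-- Hypothetical year-on-year report: the 17:00 hour-ending value for
-- 15 January across all years, using the illustrative table above.
SELECT extract(year FROM obs_date) AS yr,
       avg(value)                  AS avg_value
FROM hourly_value
WHERE obs_hour = 17
  AND extract(month FROM obs_date) = 1
  AND extract(day FROM obs_date) = 15
GROUP BY yr
ORDER BY yr;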
As I said above, this is all based on my personal experience, and let me tell you, it has been a hard few years and a lot of redesigns to get to where I am now. Don't do what I did: learn from my mistakes, and make sure you involve the end users of your system (or developers, report authors, etc.) when making decisions about your database.
PostgreSQL and BRIN indexes
Test it for yourself. This isn't a problem on a five-year-old laptop with an SSD.
EXPLAIN ANALYZE
CREATE TABLE electrothingy
AS
SELECT
x::int AS id,
(x::int % 20000)::int AS locid, -- fake location ids in the range of 0-19999
now() AS tsin, -- static timestamp
97.5::numeric(5,2) AS temp, -- static temp
x::int AS usage -- usage mirrors id; not sure what we want here
FROM generate_series(1,1728000000) -- for 1.7 billion rows
AS gs(x);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series gs (cost=0.00..15.00 rows=1000 width=4) (actual time=173119.796..750391.668 rows=1728000000 loops=1)
Planning time: 0.099 ms
Execution time: 1343954.446 ms
(3 rows)
So it took 22 minutes to create the table, largely because the table is a modest 97 GB. Next we create the indexes:
CREATE INDEX ON electrothingy USING brin (tsin);
CREATE INDEX ON electrothingy USING brin (id);
VACUUM ANALYZE electrothingy;
It took a good long while to create the indexes too, though because they're BRIN they're only 2-3 MB and they fit easily in RAM. Reading 96 GB isn't instantaneous, but it's not a real problem for my laptop at your workload.
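If you want to check the sizes for yourself, a query along these lines works:
-- On-disk size of the table and its BRIN indexes.
SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname LIKE 'electrothingy%';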
Now we query it.
explain analyze
SELECT max(temp)
FROM electrothingy
WHERE id BETWEEN 1000000 AND 1001000;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=5245.22..5245.23 rows=1 width=7) (actual time=42.317..42.317 rows=1 loops=1)
-> Bitmap Heap Scan on electrothingy (cost=1282.17..5242.73 rows=993 width=7) (actual time=40.619..42.158 rows=1001 loops=1)
Recheck Cond: ((id >= 1000000) AND (id <= 1001000))
Rows Removed by Index Recheck: 16407
Heap Blocks: lossy=128
-> Bitmap Index Scan on electrothingy_id_idx (cost=0.00..1281.93 rows=993 width=0) (actual time=39.769..39.769 rows=1280 loops=1)
Index Cond: ((id >= 1000000) AND (id <= 1001000))
Planning time: 0.238 ms
Execution time: 42.373 ms
(9 rows)
Update with timestamps
Here we generate a table with distinct timestamps in order to satisfy the request to index and search on a timestamp column. Creation takes a bit longer because to_timestamp(int) is substantially slower than now() (which is cached for the transaction).
EXPLAIN ANALYZE
CREATE TABLE electrothingy
AS
SELECT
x::int AS id,
(x::int % 20000)::int AS locid,
-- here we use to_timestamp rather than now();
-- it treats gs(x) as seconds since the epoch
to_timestamp(x::int) AS tsin,
97.5::numeric(5,2) AS temp,
x::int AS usage
FROM generate_series(1,1728000000)
AS gs(x);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series gs (cost=0.00..17.50 rows=1000 width=4) (actual time=176163.107..5891430.759 rows=1728000000 loops=1)
Planning time: 0.607 ms
Execution time: 7147449.908 ms
(3 rows)
Now we can run a query on a timestamp value instead.
explain analyze
SELECT count(*), min(temp), max(temp)
FROM electrothingy WHERE tsin BETWEEN '1974-01-01' AND '1974-01-02';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=296073.83..296073.84 rows=1 width=7) (actual time=83.243..83.243 rows=1 loops=1)
-> Bitmap Heap Scan on electrothingy (cost=2460.86..295490.76 rows=77743 width=7) (actual time=41.466..59.442 rows=86401 loops=1)
Recheck Cond: ((tsin >= '1974-01-01 00:00:00-06'::timestamp with time zone) AND (tsin <= '1974-01-02 00:00:00-06'::timestamp with time zone))
Rows Removed by Index Recheck: 18047
Heap Blocks: lossy=768
-> Bitmap Index Scan on electrothingy_tsin_idx (cost=0.00..2441.43 rows=77743 width=0) (actual time=40.217..40.217 rows=7680 loops=1)
Index Cond: ((tsin >= '1974-01-01 00:00:00-06'::timestamp with time zone) AND (tsin <= '1974-01-02 00:00:00-06'::timestamp with time zone))
Planning time: 0.140 ms
Execution time: 83.321 ms
(9 rows)
Result:
count | min | max
-------+-------+-------
86401 | 97.50 | 97.50
(1 row)
So in 83.321 ms we can aggregate 86,401 records in a table with 1.7 billion rows. That should be reasonable.
Hour ending
Calculating the hour ending is pretty easy too: truncate the timestamp down to the hour and then simply add an hour.
SELECT date_trunc('hour', tsin) + '1 hour' AS tsin,
count(*),
min(temp),
max(temp)
FROM electrothingy
WHERE tsin >= '1974-01-01'
AND tsin < '1974-01-02'
GROUP BY date_trunc('hour', tsin)
ORDER BY 1;
tsin | count | min | max
------------------------+-------+-------+-------
1974-01-01 01:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 02:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 03:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 04:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 05:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 06:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 07:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 08:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 09:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 10:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 11:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 12:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 13:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 14:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 15:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 16:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 17:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 18:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 19:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 20:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 21:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 22:00:00-06 | 3600 | 97.50 | 97.50
1974-01-01 23:00:00-06 | 3600 | 97.50 | 97.50
1974-01-02 00:00:00-06 | 3600 | 97.50 | 97.50
(24 rows)
Time: 116.695 ms
It's important to note that it's not using an index for the aggregation, though it could. If that's your typical query, you probably want a BRIN index on date_trunc('hour', tsin); the small problem there is that date_trunc is not immutable, so you'd first have to wrap it to make it so.
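One way to do that (a sketch only; the wrapper function name here is made up) is an IMMUTABLE SQL wrapper, which can then be used in a BRIN expression index:
-- Sketch: wrap date_trunc in an IMMUTABLE function so it can be used in
-- an expression index. date_trunc on timestamptz is only STABLE because
-- it depends on the session's TimeZone setting, so this is only safe if
-- that setting stays fixed for this data.
CREATE FUNCTION trunc_hour_immutable(timestamptz)
RETURNS timestamptz
LANGUAGE sql IMMUTABLE
AS $$ SELECT date_trunc('hour', $1) $$;

CREATE INDEX ON electrothingy USING brin (trunc_hour_immutable(tsin));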
Partitioning
Another important point about PostgreSQL is that PG 10 brings partitioning DDL, so you can, for instance, easily create partitions for every year, breaking your modest database down into minor ones that are tiny. In doing so, you should be able to use and maintain btree indexes rather than BRIN, which would be even faster.
CREATE TABLE electrothingy_y2016 PARTITION OF electrothingy
FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');
Or whatever.
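For completeness, the parent has to be declared as a partitioned table before partitions like the one above can be created; roughly:
-- PG 10+ sketch: the parent is created with PARTITION BY RANGE, and the
-- yearly partitions hang off it.
CREATE TABLE electrothingy (
    id    int,
    locid int,
    tsin  timestamptz,
    temp  numeric(5,2),
    usage int
) PARTITION BY RANGE (tsin);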
It amazes me that nobody here has mentioned benchmarking; that is, until @EvanCarroll came along with his excellent contribution!
If I were you, I would spend some time (and yes, I know it's a precious commodity!) setting up systems and running what you think will be, say, your 10 most common queries (get end-user input here!).
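Even something as simple as psql's \timing on those candidate queries gives you a first-pass number; for example, reusing one of EC's queries from above:
-- Quick first-pass timing in psql.
\timing on
SELECT count(*), min(temp), max(temp)
FROM electrothingy
WHERE tsin BETWEEN '1974-01-01' AND '1974-01-02';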
My own thoughts:
NoSQL solutions can work very well for particular use cases but are frequently inflexible for ad-hoc queries. For an amusing take on NoSQL by Brian Aker, former chief architect of MySQL, see here!
I agree with @Mr.Brownstone that your data is eminently suited to a relational solution (and this opinion has been confirmed by Evan Carroll)!
If I were to commit to any expenditure, it would be to my disk technology! I would be spending any money I had at my disposal on NAS or SAN or maybe some SSD disks to hold my rarely written aggregate data!
First I would look at what I have available now. Run some tests and show the results to the decision makers. You already have a proxy in the form of EC's work! But, a quick test or two whipped together on your own hardware would be more convincing!
Then think about spending money! If you are going to spend money, look at hardware first rather than software. AFAIK, you can rent disk technology for a trial period, or better yet, spin up a couple of proofs-of-concept on the cloud.
My own personal first port of call for a project like this would be PostgreSQL. That is not to say that I would rule out a proprietary solution, but the laws of physics and disks are the same for everyone! "Yae cannae beet the laws o' physics Jim" :-)