How do explicit table partitions in Databricks affect write performance?

You should partition your data by date, because it sounds like you are continually appending data over time. This is the generally accepted approach to partitioning time-series data. It means you write to one date partition each day, and previous date partitions are never updated again (a good thing).

You can of course add a secondary partition key if your use case benefits from it (e.g. PARTITIONED BY (date, entity_id)).

Partitioning by date means that, to get the best performance, your reads should also filter by date, so that only the matching partitions are scanned. If that is not your use case, you will need to clarify your question.
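To see why date-filtered reads are cheap, here is a minimal Python sketch of the Hive-style directory layout that partitioned tables typically use (one directory per partition value, such as date=2024-01-15/entity_id=1). The helper names partition_path and prune are hypothetical and are not a Spark or Databricks API; they only mimic the idea that a filter on date lets the reader skip every other directory.

```python
from datetime import date
from typing import List, Optional


def partition_path(day: date, entity_id: Optional[int] = None) -> str:
    """Build a Hive-style partition directory, e.g. 'date=2024-01-15/entity_id=1'."""
    parts = [f"date={day.isoformat()}"]
    if entity_id is not None:
        # Secondary partition key nests one level below the date directory.
        parts.append(f"entity_id={entity_id}")
    return "/".join(parts)


def prune(paths: List[str], day: date) -> List[str]:
    """Keep only the directories a reader has to scan for a filter on `date`."""
    wanted = f"date={day.isoformat()}"
    return [p for p in paths if p.split("/")[0] == wanted]


# Hypothetical table with two days and two entities per day.
paths = [partition_path(date(2024, 1, d), e) for d in (14, 15) for e in (1, 2)]
selected = prune(paths, date(2024, 1, 15))
print(selected)
```

A filter on date narrows the scan to that day's directories, while a filter on entity_id alone would still have to visit every date directory, which is why the leading partition key should match how you read.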

How many partitions?

No one can give you an answer on how many partitions you should use, because every data set (and processing cluster) is different. What you do want to avoid is "data skew", where one worker has to process a huge amount of data while other workers sit idle. In your case that would happen if, for example, one clientid made up 20% of your data set. Partitioning by date assumes that each day holds roughly the same amount of data, so that each worker is kept equally busy.
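One rough way to check a candidate partition key for skew before committing to it is to compare the largest group with the average group. The sketch below is plain Python on made-up sample data; skew_ratio is a hypothetical helper, and what counts as "too skewed" is a judgment call for your own cluster rather than a fixed rule.

```python
from collections import Counter
from statistics import mean


def skew_ratio(keys):
    """Largest group size divided by mean group size (1.0 means perfectly even)."""
    counts = Counter(keys)
    return max(counts.values()) / mean(counts.values())


# Hypothetical sample: one clientid owns 20% of the rows,
# the remaining 80 rows belong to 80 different clients.
client_ids = ["big"] * 20 + [f"c{i}" for i in range(80)]

# Hypothetical sample: 10 days with 10 rows each.
dates = [f"2024-01-{d:02d}" for d in range(1, 11) for _ in range(10)]

print("clientid skew:", skew_ratio(client_ids))
print("date skew:", skew_ratio(dates))
```

In this made-up sample the clientid key is heavily lopsided while the date key is perfectly even, which is the situation the date-partitioning advice relies on.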

I don't know specifically how Databricks writes to disk, but on Hadoop I would want to see each worker node writing its own file part, so that your write performance is parallelized at this level.
