What is the meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?
It is simple:
partitionColumn
is a column which should be used to determine partitions.lowerBound
andupperBound
determine range of values to be fetched. Complete dataset will use rows corresponding to the following query:SELECT * FROM table WHERE partitionColumn BETWEEN lowerBound AND upperBound
numPartitions
determines number of partitions to be created. Range betweenlowerBound
andupperBound
is divided intonumPartitions
each with stride equal to:upperBound / numPartitions - lowerBound / numPartitions
For example if:
lowerBound
: 0upperBound
: 1000numPartitions
: 10
Stride is equal to 100 and partitions correspond to following queries:
SELECT * FROM table WHERE partitionColumn BETWEEN 0 AND 100
SELECT * FROM table WHERE partitionColumn BETWEEN 100 AND 200
...
SELECT * FROM table WHERE partitionColumn BETWEEN 900 AND 1000
Actually the list above misses a couple of things, specifically the first and the last query.
Without them you would loose some data (the data before the lowerBound
and that after upperBound
). From the example is not clear because the lower bound is 0.
The complete list should be:
SELECT * FROM table WHERE partitionColumn < 100
SELECT * FROM table WHERE partitionColumn BETWEEN 0 AND 100
SELECT * FROM table WHERE partitionColumn BETWEEN 100 AND 200
...
SELECT * FROM table WHERE partitionColumn > 9000
Creating partitions doesn't result in loss of data due to filtering.
The upperBound
, lowerbound
along with numPartitions
just defines how the partitions are to be created. The upperBound
and lowerbound
don't define the range (filter) for the values of the partitionColumn to be fetched.
For a given input of lowerBound (l), upperBound (u) and numPartitions (n)
The partitions are created as follows:
stride, s= (u-l)/n
**SELECT * FROM table WHERE partitionColumn < l+s or partitionColumn is null**
SELECT * FROM table WHERE partitionColumn >= l+s AND <2s
SELECT * FROM table WHERE partitionColumn >= l+2s AND <3s
...
**SELECT * FROM table WHERE partitionColumn >= l+(n-1)s**
For instance, for upperBound = 500
, lowerBound = 0
and numPartitions = 5
. The partitions will be as per the following queries:
SELECT * FROM table WHERE partitionColumn < 100 or partitionColumn is null
SELECT * FROM table WHERE partitionColumn >= 100 AND <200
SELECT * FROM table WHERE partitionColumn >= 200 AND <300
SELECT * FROM table WHERE partitionColumn >= 300 AND <400
...
SELECT * FROM table WHERE partitionColumn >= 400
Depending on the actual range of values of the partitionColumn
, the result size of each partition will vary.