DISTRIBUTE BY clause in HIVE
In addition to @Dudu's answer: DISTRIBUTE BY only distributes the rows among the reducers. The number of reducers itself is determined from the input size.
The number of reducers used for a Hive job is determined by the property hive.exec.reducers.bytes.per.reducer, so it depends on the size of the input.
As of Hive 0.14, one reducer is used per 256 MB of input, so an input smaller than 256 MB gets only one reducer, unless the number of reducers is set explicitly with the mapred.reduce.tasks property. The hive.exec.reducers.max property only caps the estimated number.
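As a rough illustration of that arithmetic, here is a small Python sketch (not Hive code). It assumes the estimate is simply the input size divided by bytes-per-reducer, rounded up and clamped to the maximum. The default values and the function name are my assumptions, so check your own cluster's settings.

```python
import math

# Assumed defaults for Hive 0.14+; verify against your own configuration.
BYTES_PER_REDUCER = 256_000_000  # hive.exec.reducers.bytes.per.reducer
MAX_REDUCERS = 1009              # hive.exec.reducers.max


def estimate_reducers(input_bytes,
                      bytes_per_reducer=BYTES_PER_REDUCER,
                      max_reducers=MAX_REDUCERS,
                      reduce_tasks=-1):
    """Simplified model of how the reducer count is picked (hypothetical helper)."""
    # A positive mapred.reduce.tasks bypasses the estimate entirely.
    if reduce_tasks > 0:
        return reduce_tasks
    # Otherwise: one reducer per bytes_per_reducer of input, rounded up,
    # at least 1 and at most max_reducers.
    estimate = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, estimate))
```

Under these assumptions a 100 MB input gets one reducer no matter how many distinct cities it contains, while a 1 GB input gets four.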
The only thing DISTRIBUTE BY (city) says is that records with the same city will go to the same reducer. Nothing else.
Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
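To make the "nothing else" part concrete, here is a small Python simulation (again, not Hive code). It assumes rows are routed by hashing the DISTRIBUTE BY column modulo the number of reducers. zlib.crc32 is only a stand-in for whatever hash Hive uses internally, and the function names are made up for the example.

```python
import zlib
from collections import defaultdict


def assign_reducer(key, num_reducers):
    # crc32 is a deterministic stand-in for Hive's own hash (an assumption).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def distribute(rows, num_reducers):
    """Group (city, value) rows by the reducer their city hashes to."""
    reducers = defaultdict(list)
    for city, value in rows:
        reducers[assign_reducer(city, num_reducers)].append((city, value))
    return dict(reducers)


rows = [("NYC", 1), ("SF", 2), ("NYC", 3), ("LA", 4),
        ("SF", 5), ("Boston", 6), ("Austin", 7)]

# Every row of a given city lands in the same reducer, but with 5 cities
# and 2 reducers, some reducer necessarily receives more than one city.
for reducer_id, assigned in sorted(distribute(rows, 2).items()):
    print(reducer_id, assigned)
```

With a single reducer everything lands in reducer 0, which is exactly the small-input case described above.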
A question by the OP:
Then what is the point of this DISTRIBUTE BY ? There's no guarantee that each (city) would go to a different reducer then why use it ?
For 2 reasons:
1. In the early days of Hive, DISTRIBUTE BY, SORT BY and CLUSTER BY were used to process data in ways that are handled automatically today (e.g. analytic functions, https://oren.lederman.name/?p=32).
2. You might want to stream your data through a script (Hive "Transform") and have the script process the data in certain groups and in a certain order. For that you can use DISTRIBUTE BY + SORT BY, or CLUSTER BY. With DISTRIBUTE BY it is guaranteed that the whole group ends up in the same reducer. With SORT BY, all the records of a group arrive contiguously.
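A minimal sketch of such a script, in Python. It assumes Hive feeds the script tab-separated rows on stdin (the default for TRANSFORM) and that each row is (city, amount). The table, columns, and script name in the comment are hypothetical.

```python
import sys
from itertools import groupby

# Hypothetical query that would feed this script (names are made up):
#   ADD FILE per_city.py;
#   SELECT TRANSFORM (t.city, t.amount)
#          USING 'python per_city.py'
#          AS (city, total)
#   FROM (SELECT city, amount FROM sales
#         DISTRIBUTE BY city SORT BY city) t;


def summarize(lines):
    """Yield 'city<TAB>total' for each run of consecutive rows with the same city."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    # groupby only merges adjacent rows, so this relies on SORT BY for
    # contiguity and on DISTRIBUTE BY for the group not being split
    # across reducers.
    for city, group in groupby(rows, key=lambda r: r[0]):
        total = sum(float(r[1]) for r in group)
        yield f"{city}\t{total}"


def main():
    for line in summarize(sys.stdin):
        print(line)

# When saved as the TRANSFORM script, end the file with a call to main().
```

Without the SORT BY, the same city could show up in several separate runs and the script would emit several partial totals for it. Without the DISTRIBUTE BY, those partial totals could even come from different reducers.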