Spark Dataset unique id performance - row_number vs monotonically_increasing_id
`monotonically_increasing_id()` is distributed: it generates IDs per partition, so each executor works only on its own portion of the data. By contrast, `row_number()` over a `Window` without `partitionBy` (as in your case) is not distributed. When no `partitionBy` is defined, all the data is sent to a single executor to generate the row numbers. Thus it is certain that `monotonically_increasing_id()` will perform better than `row_number()` without a `partitionBy` defined.
TL;DR It is not even a competition.
Never use `row_number().over(Window.orderBy("a column"))` for anything other than summarizing results that already fit in a single machine's memory.
To apply a window function without `PARTITION BY`, Spark has to shuffle all the data into a single partition. On any large dataset this will simply crash the application, so the fact that it is also sequential rather than distributed won't even matter.