Spark Dataset unique id performance - row_number vs monotonically_increasing_id

monotonically_increasing_id() is distributed: each partition generates its IDs independently, so the work scales with the partitioning of the data.
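To see why no coordination between partitions is needed, here is a minimal pure-Python sketch of the ID layout described in the Spark documentation (partition index in the upper 31 bits, record number within the partition in the lower 33 bits). The helper names `monotonic_id` and `assign_ids` are made up for illustration and are not part of Spark; this only models the arithmetic, not Spark's actual implementation.

```python
def monotonic_id(partition_index: int, row_offset: int) -> int:
    # Documented layout: partition index in the upper bits,
    # position of the row inside its partition in the lower 33 bits.
    return (partition_index << 33) + row_offset


def assign_ids(partitions):
    # Each partition only needs its own index and a local counter,
    # so every inner list could be computed on a different executor.
    return [
        [monotonic_id(p, i) for i, _ in enumerate(rows)]
        for p, rows in enumerate(partitions)
    ]


# Two hypothetical partitions holding 3 and 2 rows.
ids = assign_ids([["a", "b", "c"], ["d", "e"]])
print(ids)  # [[0, 1, 2], [8589934592, 8589934593]]
```

Note that the resulting IDs are unique and increasing but not consecutive: the second partition starts at 2^33 regardless of how many rows the first one holds.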


row_number() over a Window without partitionBy (as in your case) is not distributed. When partitionBy is not defined, all the data is sent to a single executor to generate the row numbers.

Thus, it is certain that monotonically_increasing_id() will perform better than row_number() without partitionBy defined.

TL;DR It is not even a competition.

Never use:

row_number().over(Window.orderBy("a column"))

for anything other than summarizing results that already fit in a single machine's memory.

To apply a window function without PARTITION BY, Spark has to shuffle all data into a single partition. On any large dataset this will just crash the application, so the fact that the computation is sequential rather than distributed won't even matter.
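As a rough illustration of that point, the following pure-Python toy model groups rows the way a window specification would and reports how many rows end up in each group. It is only a sketch: the function `row_number` and its parameters are hypothetical stand-ins, and nothing here calls Spark.

```python
from collections import defaultdict


def row_number(rows, order_by, partition_by=None):
    # Group rows by the window's partition key. Without a key,
    # every row falls into the same single group.
    groups = defaultdict(list)
    for row in rows:
        key = row[partition_by] if partition_by is not None else None
        groups[key].append(row)

    # Number the rows inside each group after sorting by the order column.
    numbered = []
    for group in groups.values():
        ordered = sorted(group, key=lambda r: r[order_by])
        for n, row in enumerate(ordered, start=1):
            numbered.append({**row, "row_number": n})

    # Group sizes stand in for the amount of data one task must hold.
    sizes = {key: len(group) for key, group in groups.items()}
    return numbered, sizes


rows = [{"k": "x", "v": 3}, {"k": "y", "v": 1}, {"k": "x", "v": 2}]

_, sizes_global = row_number(rows, order_by="v")
_, sizes_by_key = row_number(rows, order_by="v", partition_by="k")
print(sizes_global)  # {None: 3}
print(sizes_by_key)  # {'x': 2, 'y': 1}
```

In the unpartitioned case the single group contains the whole dataset, which mirrors the single-partition shuffle described above.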