Is dataframe.show() an action in spark?
show
is indeed an action, but it is smart enough to know when it doesn't have to run everything. If you had an orderBy
it would take very long too, but in this case all your operations are map operations and so there's no need to calculate the whole final table. However, count
needs to physically go through the whole table in order to count it and that's why it's taking so long. You could test what I'm saying by adding an orderBy
to df1
's definition - then it should take long.
EDIT: Also, the 40k tasks are likely due to the amount of partitions your DF is partitioned into. Try using df1.repartition(<a sensible number here, depending on cluster and DF size>)
and trying out count again.
show()
by default shows only 20 rows. If the 1st partition returned more than 20 rows, then the rest partitions will not be executed.
Note show
has a lot of variations. If you run show(false)
which means show all results, all partitions will be executed and may take more time. So, show()
equals show(20)
which is a partial action.