How to force Spark to evaluate DataFrame operations inline
I agree with you that at some point you want to run the action exactly when YOU need it. For example, if you are streaming data with Spark Streaming, you may want to evaluate the transformations on every RDD as it arrives, rather than accumulating transformations across RDDs and then suddenly running an action on a large set of data.
Now, let's say you have a DataFrame and you have applied all your transformations to it. Once it is registered as a table, you can use sqlContext.sql("CACHE TABLE <table-name>").
This cache is eager: it triggers an action on the DataFrame and evaluates all the transformations on it.
No.
You have to call an action to force Spark to do actual work. Transformations won't trigger that effect, and that's one of the reasons to love Spark.
By the way, I am pretty sure that Spark knows very well when something must be done "right here and now", so you are probably focusing on the wrong point.
Can you just confirm that count() and show() are considered "actions"?
You can see some of Spark's action functions in the documentation, where count() is listed. show() is not, and I haven't used it before, but it feels like an action: how can you show the result without doing actual work? :)
Are you insinuating that Spark would automatically pick up on that, and do the union (just in time)?
Yes! :)
Spark remembers the transformations you have called, and when an action appears, it will run them, just in -the right- time!
Something to remember: because of this policy of doing actual work only when an action appears, you will not see a logical error in your transformation(s) until the action takes place!
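To make that last point concrete, here is a toy model in plain Python. It is not the Spark API: LazyDataset, map and collect are hypothetical names written only for this illustration, and they mimic the "record transformations now, run them when an action appears" behaviour described above.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (NOT the Spark API)."""

    def __init__(self, rows, pending=()):
        self._rows = list(rows)
        self._pending = tuple(pending)

    def map(self, fn):
        # "Transformation": only records fn, does not call it.
        return LazyDataset(self._rows, self._pending + (fn,))

    def collect(self):
        # "Action": now every recorded function is actually applied.
        out = []
        for row in self._rows:
            for fn in self._pending:
                row = fn(row)
            out.append(row)
        return out


data = LazyDataset([4, 2, 0])

# Logical bug: dividing by the last row (0) will fail,
# but this line runs without complaint because nothing is evaluated yet.
inverted = data.map(lambda x: 10 // x)

try:
    inverted.collect()
    error_seen_at = None
except ZeroDivisionError:
    # The bug only shows up here, when the action runs.
    error_seen_at = "action"
```

The same pattern applies to a real Spark job: the line that defines a faulty transformation succeeds, and the failure is reported later, at the count(), show() or other action that forces evaluation.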