Spark access first n rows - take vs limit

Although this question has already been answered, I want to share what I learned.

myDataFrame.take(10)

-> returns an Array of Rows. This is an action: it runs a job and collects the requested rows to the driver (as collect does).

myDataFrame.limit(10)

-> returns a new DataFrame. This is a transformation: it is lazy and does not collect any data.
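To make the difference concrete, here is a minimal sketch in Scala. It assumes a local SparkSession and uses spark.range as a stand-in for myDataFrame; the object name, app name, and column name are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object TakeVsLimit {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("take-vs-limit")
      .master("local[*]")
      .getOrCreate()

    // Stand-in for myDataFrame: 1000 rows with a single "id" column.
    val myDataFrame: DataFrame = spark.range(1000).toDF("id")

    // Action: runs a job and returns the rows to the driver.
    val firstRows: Array[Row] = myDataFrame.take(10)

    // Transformation: only describes a new DataFrame; nothing runs yet.
    val limited: DataFrame = myDataFrame.limit(10)

    // The limited DataFrame is evaluated once an action is called on it.
    val limitedRows: Array[Row] = limited.collect()

    println(s"take returned ${firstRows.length} rows")
    println(s"limit + collect returned ${limitedRows.length} rows")

    spark.stop()
  }
}
```

The type annotations are the point here: take hands back Array[Row] immediately, while limit hands back another DataFrame that still needs an action such as collect, show, or write.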

I do not have an explanation for why limit then takes longer, but that may already have been answered above. This is just a basic answer about the difference between take and limit.

This is because predicate pushdown is currently not supported in Spark; see this very good answer.

Actually, take(n) should take a really long time as well. I just tested it, however, and got the same results as you did: take is almost instantaneous regardless of database size, while limit takes a lot of time.
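One way to look into a timing difference like this is to print the physical plan and time both calls yourself. The following is only a sketch under assumptions: it uses a synthetic spark.range DataFrame in local mode, and timeMs is a hypothetical helper, not part of Spark. Timings on synthetic local data may well not reproduce the gap seen against a real data source.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object InspectLimitPlan {
  // Hypothetical helper (not a Spark API): runs a block and returns its
  // result together with the elapsed wall-clock time in milliseconds.
  def timeMs[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("inspect-limit-plan")
      .master("local[*]")
      .getOrCreate()

    // Synthetic stand-in for the real data source.
    val myDataFrame: DataFrame = spark.range(1000000).toDF("id")

    // Print the physical plan Spark would execute for the limited DataFrame.
    myDataFrame.limit(10).explain()

    // Time the action directly, and the transformation followed by an action.
    val (_, takeMs) = timeMs(myDataFrame.take(10))
    val (_, limitMs) = timeMs(myDataFrame.limit(10).collect())

    println(s"take(10): $takeMs ms")
    println(s"limit(10).collect(): $limitMs ms")

    spark.stop()
  }
}
```

Running the same comparison against the actual source, and with whatever operations follow limit in the real job, shows where the extra time is spent.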
