Sampling a large distributed data set using pyspark / spark

Using sample instead of takeSample appears to make things reasonably fast:

textFile.sample(False, .0001, 12345)

The problem with this is that it's hard to choose the right fraction unless you have a rough idea of the number of rows in your data set.


Try using textFile.sample(False, fraction, seed) instead. takeSample will generally be very slow because it calls count() on the RDD. It needs that count because otherwise it couldn't sample evenly from each partition: it uses the count together with the sample size you asked for to compute a fraction, then calls sample internally. sample is fast because it just makes a random draw per row that comes up true with probability fraction, so it never needs to call count.
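
To make that concrete, here is a rough sketch of the logic takeSample has to go through; this is illustrative only, not Spark's actual implementation (which also handles retries and sampling with replacement):

def take_sample_sketch(rdd, k, seed=12345):
    total = rdd.count()                    # full pass over the data: this is the slow part
    if total == 0:
        return []
    fraction = min(1.0, float(k) / total)  # turn the desired size into a fraction
    # delegate to the fast per-row sampler, then trim any extra rows
    return rdd.sample(False, fraction, seed).take(k)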

In addition (I don't think this is happening to you), if the returned sample is not big enough, takeSample calls sample again, which can obviously slow it down further. Since you should have some idea of the size of your data, I would recommend calling sample and then cutting the sample down to size yourself, since you know more about your data than Spark does.
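
For example, if you want roughly 10,000 rows and you know your data set has on the order of 50 million rows, you can oversample slightly and trim on the driver. The path, target size, and row-count estimate below are made-up placeholders:

from pyspark import SparkContext

sc = SparkContext(appName="sampling-sketch")
textFile = sc.textFile("hdfs:///path/to/data")   # placeholder path

target_size = 10000       # rows you actually want
approx_rows = 50000000    # your rough estimate of the data set size

# Oversample a bit so random variation doesn't leave you short, then trim.
fraction = min(1.0, 1.5 * target_size / approx_rows)
sample = textFile.sample(False, fraction, seed=12345).take(target_size)

The 1.5 oversampling factor is just a safety margin; anything comfortably above 1 works, and take(target_size) guarantees you never bring back more rows than you asked for.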