Sampling a large distributed data set using pyspark / spark
Using sample instead of takeSample appears to make things reasonably fast:
textFile.sample(False, .0001, 12345)
The problem with this is that it's hard to know the right fraction to choose unless you have a rough idea of the number of rows in your data set.
Try using textFile.sample(False, fraction, seed) instead. takeSample will generally be very slow because it calls count() on the RDD. It needs to do this because otherwise it wouldn't sample evenly from each partition; basically it uses the count, along with the sample size you asked for, to compute a fraction and then calls sample internally. sample is fast because it just uses a random boolean generator that returns True with probability fraction, so it never needs to call count.
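To make that concrete, the work takeSample has to do looks roughly like this (a simplified, illustrative sketch, not Spark's actual implementation; textFile is the RDD from the question and desired_size stands in for the num you would have passed to takeSample):

desired_size = 10000
total = textFile.count()                    # full pass over the data, just to get a count
fraction = float(desired_size) / total      # turn the requested size into a fraction
approx = textFile.sample(False, fraction, 12345)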
In addition (I don't think this is happening to you, but) if the returned sample is not big enough, takeSample calls sample again, which can obviously slow it down further. Since you should have some idea of the size of your data, I would recommend calling sample yourself and then cutting the result down to the size you want, since you know more about your data than Spark does.
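For example, a minimal PySpark sketch of that approach, assuming you want roughly 10,000 rows and only have a ballpark figure for the total row count (all the numbers below are made up for illustration):

approx_total = 50000000                                   # your own rough estimate, not computed by Spark
target = 10000
fraction = min(1.0, 1.5 * target / float(approx_total))   # oversample a bit so you don't come up short

sampled = textFile.sample(False, fraction, 12345)
rows = sampled.take(target)                               # cut the oversampled result down to size

The 1.5 oversampling factor is just a safety margin; since sample only returns approximately fraction of the rows, sampling slightly more than you need and trimming with take is cheaper than risking a second pass over the data.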