How to get a sample with an exact sample size in Spark RDD?
Another way is to takeSample first and then build an RDD from the result. This can be slow with large datasets, since the sample is collected to the driver.
sc.makeRDD(a.takeSample(false, 1000, 1234))
If you want an exact sample, try doing
a.takeSample(false, 1000)
But note that this returns an Array, not an RDD.
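The guarantee takeSample gives can be illustrated outside Spark too. A plain-Python sketch (not the Spark API) of exact-size sampling without replacement, analogous to what takeSample(false, n) does:

```python
import random

population = list(range(100_000))

# Exact-size sampling without replacement: the result always has
# exactly n distinct elements, like takeSample(false, n) in Spark.
def exact_sample(pop, n, seed=1234):
    rng = random.Random(seed)
    return rng.sample(pop, n)

s = exact_sample(population, 1000)
print(len(s))  # always 1000
```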
As for why a.sample(false, 0.1) doesn't return a consistent sample size: Spark internally uses something called Bernoulli sampling. The fraction argument doesn't represent a fraction of the actual size of the RDD; it represents the probability of each element of the population being selected for the sample, and as Wikipedia says:
Because each element of the population is considered separately for the sample, the sample size is not fixed but rather follows a binomial distribution.
That essentially means the resulting sample size is not fixed.
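A quick plain-Python simulation (not Spark code) makes this concrete: each element is kept independently with probability fraction, so the sample size varies from run to run around n * fraction rather than hitting it exactly:

```python
import random

def bernoulli_sample(population, fraction, rng):
    # Each element is considered independently: it is kept with
    # probability `fraction`, like RDD.sample(false, fraction).
    return [x for x in population if rng.random() < fraction]

rng = random.Random(42)
population = list(range(10_000))
sizes = [len(bernoulli_sample(population, 0.1, rng)) for _ in range(5)]
print(sizes)  # sizes hover around 1000 but are rarely exactly 1000
```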
If you set the first argument (withReplacement) to true, Spark instead uses Poisson sampling, which also yields a non-deterministic sample size.
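Sampling with replacement can be sketched in plain Python as well (again an analogy, not Spark's actual implementation): each element contributes a Poisson(fraction)-distributed number of copies, so the total size is once more random:

```python
import math
import random

def poisson_count(lam, rng):
    # Knuth's algorithm for drawing a Poisson-distributed integer.
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def poisson_sample(population, fraction, rng):
    # Each element appears poisson_count(fraction) times in the
    # sample, analogous to RDD.sample(true, fraction).
    out = []
    for x in population:
        out.extend([x] * poisson_count(fraction, rng))
    return out

rng = random.Random(7)
population = list(range(10_000))
sizes = [len(poisson_sample(population, 0.1, rng)) for _ in range(5)]
print(sizes)  # random sizes, centered near 1000
```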
Update
If you want to stick with the sample method, you can specify a larger probability for the fraction param and then call take, as in:
a.sample(false, 0.2).take(1000)
Most of the time, though not necessarily always, this will yield a sample of exactly 1000. It works only when the population is large enough that the oversampled fraction is very likely to contain at least 1000 elements.
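The "large enough population" caveat can be checked with a plain-Python sketch of the same oversample-then-take idea (illustrative numbers, not Spark code): with n = 10,000 and fraction = 0.2 the Bernoulli sample has about 2,000 elements, so truncating to 1,000 virtually always succeeds:

```python
import random

def oversample_then_take(pop, fraction, n, rng):
    # Mimics a.sample(false, fraction).take(n): Bernoulli-sample with
    # a generous fraction, then truncate to the first n elements.
    sampled = [x for x in pop if rng.random() < fraction]
    return sampled[:n]

rng = random.Random(0)
population = list(range(10_000))
sizes = [len(oversample_then_take(population, 0.2, 1000, rng))
         for _ in range(5)]
print(sizes)  # exactly 1000 each time, with overwhelming probability
```

If the population were small relative to n / fraction, the truncation step could come up short, which is the failure mode the answer warns about.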