Python random sample with a generator / iterable / iterator
You can't.
You have two options: read the whole generator into a list, then sample from that list, or use a method that reads the generator one by one and picks the sample from that:
import random
def iterSample(iterable, samplesize):
results = []
for i, v in enumerate(iterable):
r = random.randint(0, i)
if r < samplesize:
if i < samplesize:
results.insert(r, v) # add first samplesize items in random order
else:
results[r] = v # at a decreasing rate, replace random items
if len(results) < samplesize:
raise ValueError("Sample larger than population.")
return results
This method adjusts the chance that the next item is part of the sample based on the number of items in the iterable so far. It doesn't need to hold more than samplesize
items in memory.
The solution isn't mine; it was provided as part of another answer here on SO.
While the answer of Martijn Pieters is correct, it does slow down when samplesize
becomes large, because using list.insert
in a loop may have quadratic complexity.
Here's an alternative that, in my opinion, preserves the uniformity while increasing performance:
def iter_sample_fast(iterable, samplesize):
results = []
iterator = iter(iterable)
# Fill in the first samplesize elements:
try:
for _ in xrange(samplesize):
results.append(iterator.next())
except StopIteration:
raise ValueError("Sample larger than population.")
random.shuffle(results) # Randomize their positions
for i, v in enumerate(iterator, samplesize):
r = random.randint(0, i)
if r < samplesize:
results[r] = v # at a decreasing rate, replace random items
return results
The difference slowly starts to show for samplesize
values above 10000
. Times for calling with (1000000, 100000)
:
- iterSample: 5.05s
- iter_sample_fast: 2.64s
Just for the heck of it, here's a one-liner that samples k elements without replacement from the n items generated in O(n lg k) time:
from heapq import nlargest
def sample_from_iterable(it, k):
return (x for _, x in nlargest(k, ((random.random(), x) for x in it)))