python -- callable iterator size?

EDIT 3: The answer by @hynekcer is much much better than this.

EDIT 2: This will not work if you have an infinite iterator, or one that consumes too many gigabytes of RAM/disk space (in 2010, one gigabyte is still a large amount of RAM/disk space).

You have already seen a good answer, but here is an expensive hack that you can use if you want to have your cake and eat it too :) The trick is that we have to clone the cake, and when you are done eating, we put it back into the same box. Remember that when you iterate over an iterator, it usually becomes empty, or at least loses the values it has already returned.

>>> def getIterLength(iterator):
...     temp = list(iterator)   # materializes the whole iterator in memory
...     result = len(temp)
...     iterator = iter(temp)   # note: this rebinding is local to the function
...     return result
...

>>>
>>> f = xrange(20)
>>> f
xrange(20)
>>> 
>>> x = getIterLength(f)
>>> x
20
>>> f
xrange(20)
>>> 

EDIT: Here is a safer version, but using it still requires some discipline. It does not feel quite Pythonic. You would get the best solution if you posted the whole relevant code sample that you are trying to implement.

>>> def getIterLenAndIter(iterator):
...     temp = list(iterator)
...     return len(temp), iter(temp)
...

>>> f = iter([1,2,3,7,8,9])
>>> f
<listiterator object at 0x02782890>
>>> l, f = getIterLenAndIter(f)
>>> 
>>> l
6
>>> f
<listiterator object at 0x02782610>
>>> 
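A related sketch, for comparison: `itertools.tee` can split one iterator into two branches, so you can count one branch and hand back the other. Note that `tee` still buffers every item internally while the counting branch runs ahead, so its memory cost is comparable to building a list; `getIterLenAndIterTee` here is a hypothetical name, not a standard function.

```python
import itertools

def getIterLenAndIterTee(iterator):
    # Split the iterator into two independent branches.
    counted, kept = itertools.tee(iterator)
    # Exhaust one branch to count; the other still yields all items
    # (tee buffers them while the branches are out of sync).
    return sum(1 for _ in counted), kept

n, it = getIterLenAndIterTee(iter([1, 2, 3, 7, 8, 9]))
items = list(it)
print(n, items)   # 6 [1, 2, 3, 7, 8, 9]
```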

No, sorry: iterators are not meant to know their length. They only know what comes next, which makes them very efficient for stepping through collections. Although they are faster, they do not allow indexing, which includes knowing the length of the collection.
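A quick sketch of that limitation: `len()` raises `TypeError` on a plain iterator, and counting it by hand consumes it.

```python
it = iter([10, 20, 30])

# A plain iterator does not support len():
try:
    len(it)
except TypeError:
    no_len = True

# Counting by exhausting the iterator destroys its contents:
count = sum(1 for _ in it)
print(count)      # 3
print(list(it))   # [] -- the iterator is now empty
```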


You can get the number of elements in an iterator by doing:

len([m for m in re.finditer(pattern, text)])

Iterators are iterators because they have not generated their sequence yet. The code above extracts every item from the iterator into a list, then takes the length of that list. Something more memory-efficient would be:

count = 0
for item in re.finditer(pattern, text):
    count += 1

A terser alternative to the for loop is to use reduce to count the items in the iterator one by one (in Python 3, reduce lives in functools). This is effectively the same thing as the for loop:

reduce(lambda x, y: x + 1, myiterator, 0)

This ignores each item y yielded by the iterator and just adds one to the accumulator x, which is initialized to 0.
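A self-contained version of that one-liner, as a sketch (the sample text and pattern are made up for illustration):

```python
import re
from functools import reduce  # builtin in Python 2, in functools in Python 3

text = "the cat sat on the mat"
matches = re.finditer(r"at", text)

# x is the running count; each match object y is ignored.
count = reduce(lambda x, y: x + 1, matches, 0)
print(count)  # 3
```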


This solution uses less memory because it does not save intermediate results, as the other solutions that build a list do:

sum(1 for _ in re.finditer(pattern, text))

All of the older solutions have the disadvantage of consuming a lot of memory if the pattern matches very frequently in the text, e.g. the pattern '[a-z]'.

Test case:

pattern = 'a'
text = 10240000 * 'a'

This solution with sum(1 for ...) uses approximately only the memory for the text itself, i.e. len(text) bytes. The previous solutions that build a list can use approximately 58 or 110 times more memory than necessary: about 580 MB on 32-bit and 1.1 GB on 64-bit Python 2.7.
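The test case above can be reproduced at a smaller scale to check that both approaches agree on the count; only their memory behavior differs (the list version materializes every match object at once):

```python
import re

pattern = 'a'
text = 10000 * 'a'   # smaller than the original 10240000-char text, same idea

# Memory-friendly: matches are counted one at a time and discarded.
n_sum = sum(1 for _ in re.finditer(pattern, text))

# Memory-hungry: every match object is held in a list simultaneously.
n_list = len(list(re.finditer(pattern, text)))

print(n_sum, n_list)  # 10000 10000
```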