Counting consecutive positive values in a Python array

>>> import pandas
>>> y = pandas.Series([0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1])

The following may seem a little magical, but actually uses some common idioms: since pandas doesn't yet have nice native support for a contiguous groupby, you often find yourself needing something like this.

>>> y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
0     0
1     0
2     1
3     2
4     3
5     0
6     0
7     1
8     0
9     1
10    2
dtype: int64

Some explanation: first, we compare y against a shifted version of itself to find when the contiguous groups begin:

>>> y != y.shift()
0      True
1     False
2      True
3     False
4     False
5      True
6     False
7      True
8      True
9      True
10    False
dtype: bool

Then (since False == 0 and True == 1) we can apply a cumulative sum to get a number for the groups:

>>> (y != y.shift()).cumsum()
0     1
1     1
2     2
3     2
4     2
5     3
6     3
7     4
8     5
9     6
10    6
dtype: int32

We can use groupby and cumcount to obtain an integer that counts up within each group:

>>> y.groupby((y != y.shift()).cumsum()).cumcount()
0     0
1     1
2     0
3     1
4     2
5     0
6     1
7     0
8     0
9     0
10    1
dtype: int64

Add one:

>>> y.groupby((y != y.shift()).cumsum()).cumcount() + 1
0     1
1     2
2     1
3     2
4     3
5     1
6     2
7     1
8     1
9     1
10    2
dtype: int64

And finally zero the values where we had zero to begin with:

>>> y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
0     0
1     0
2     1
3     2
4     3
5     0
6     0
7     1
8     0
9     1
10    2
dtype: int64
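
If you need this in more than one place, the idiom wraps naturally into a small helper (count_consecutive is just a name I picked, not a pandas built-in):

>>> def count_consecutive(s):
...     # label each contiguous run, then count upwards within each run
...     groups = (s != s.shift()).cumsum()
...     return s * (s.groupby(groups).cumcount() + 1)

Calling count_consecutive(y) reproduces the series shown above.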

If something is clear, it is pythonic. Frankly, I cannot even make your original solution work. And if it does work, I am curious whether it is faster than a loop. Did you compare?
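
For reference, one way to compare would be something like this (a sketch; I have not timed it here, and results will vary with machine and pandas version):

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: y = pd.Series(np.random.randint(2, size=1000000))

In [4]: def loop_count(vals):
   ...:     # plain-Python loop version of the same run-length count
   ...:     out, run = [], 0
   ...:     for v in vals:
   ...:         run = run + 1 if v else 0
   ...:         out.append(run)
   ...:     return out

In [5]: %timeit y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
In [6]: %timeit loop_count(y.tolist())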

Now, since we've started discussing efficiency, here are some insights.

Loops in Python are inherently slow, no matter what you do. If you are using pandas, you are also using numpy underneath, with all of its performance advantages; just don't throw them away by looping. On top of that, Python lists take far more memory than you may expect: potentially much more than 8 bytes * length, since every integer may be wrapped in a separate object, placed in a separate area of memory, and referenced by a pointer from the list.
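
You can measure that overhead yourself; a sketch along these lines (note that sys.getsizeof counts only the list object and its pointer array, not the integer objects it points to):

In [1]: import sys
In [2]: import numpy as np
In [3]: a = np.random.randint(2, size=1000000)
In [4]: a.nbytes                                     # raw numpy buffer: itemsize * length
In [5]: sys.getsizeof(a.tolist())                    # the list's pointer array alone
In [6]: sum(sys.getsizeof(x) for x in a.tolist())    # the int objects come on top of that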

Vectorization provided by numpy should be sufficient IF you can find some way to express this function without looping. In fact, I wondered whether there is some way to represent it using expressions such as A + B*C. If you can construct this function out of LAPACK routines, then you can even potentially beat ordinary C++ code compiled with optimization.
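
For this particular function there does turn out to be such a representation: a cumulative sum minus its value at the most recent zero. A sketch on a plain numpy array (the names are mine, and it relies on the values being 0 or 1, so that the cumulative sum never decreases):

def vectorized(z):
    c = np.cumsum(z)
    # at every zero, record how far the cumulative sum has climbed
    checkpoints = np.where(z == 0, c, 0)
    # the running maximum is the cumsum value at the most recent zero;
    # subtracting it restarts the count after each zero
    return c - np.maximum.accumulate(checkpoints)

Applied to the example series [0,0,1,1,1,0,0,1,0,1,1], this returns [0,0,1,2,3,0,0,1,0,1,2], matching the pandas result above.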

You can also use one of the compiled approaches to speed up your loops; see a solution with Numba on numpy arrays below. Another option is PyPy, though you probably can't combine it properly with pandas.

In [140]: import pandas as pd
In [141]: import numpy as np
In [143]: a = np.random.randint(2, size=1000000)
In [144]: L = list(a)   # a plain-list copy for the pure-Python version

# Try the simple approach
In [147]: def simple(L):
              # start at 1 so that L[i-1] never wraps around to the last element
              for i in range(1, len(L)):
                  if L[i] == 1:
                      L[i] += L[i-1]

In [148]: %time simple(L)
CPU times: user 255 ms, sys: 20.8 ms, total: 275 ms
Wall time: 248 ms


# Just-In-Time compilation
In [149]: from numba import jit

In [150]: @jit
          def faster(z):
              prev = 0
              for i in range(len(z)):
                  cur = z[i]
                  if cur == 0:
                      prev = 0          # reset; z[i] is already 0
                  else:
                      prev = prev + cur
                      z[i] = prev

In [151]: %time faster(a)
CPU times: user 51.9 ms, sys: 1.12 ms, total: 53 ms
Wall time: 51.9 ms


In [159]: list(L) == list(a)   # the two versions agree
Out[159]: True

In fact, most of the time in the Numba example above was spent on just-in-time compilation, which happens on the first call. Let's time it again on fresh copies (remember to copy, as the function modifies the array in place):

b = a.copy()
c = a.copy()

In [38]: %time faster(b)
CPU times: user 55.1 ms, sys: 1.56 ms, total: 56.7 ms
Wall time: 56.3 ms

In [39]: %time faster(c)
CPU times: user 10.8 ms, sys: 42 µs, total: 10.9 ms
Wall time: 10.9 ms

So for subsequent calls we get a roughly 25x speedup compared to the simple version. I suggest reading High Performance Python if you want to know more.