Pandas merge dataframes on ip-address by range

assume you have the following data frames:

In [5]: df1
Out[5]:
  company           ip actions
0   comp1    10.10.1.2    act1
1   comp2   10.10.2.20    act2
2   comp3   10.10.3.50    act3
3   comp4  10.10.4.100    act4

In [6]: df2
Out[6]:
  ip_range_start ip_range_end   country   state   city
0      10.10.2.1  10.10.2.254  country2  state2  city2
1      10.10.3.1  10.10.3.254  country3  state3  city3
2      10.10.4.1  10.10.4.254  country4  state4  city4

we can create a vectorized function which will calculate numerical IP representation similar to int(netaddr.IPAddress('192.0.2.1')):

def ip_to_int(ip_ser):
    ips = ip_ser.str.split('.', expand=True).astype(np.int16).values
    mults = np.tile(np.array([24, 16, 8, 0]), len(ip_ser)).reshape(ips.shape)
    return np.sum(np.left_shift(ips, mults), axis=1)

let's convert all IPs to their numerical representations:

df1['_ip'] = ip_to_int(df1.ip)
df2[['_ip_range_start','_ip_range_end']] = df2.filter(like='ip_range').apply(lambda x: ip_to_int(x))

In [10]: df1
Out[10]:
  company           ip actions        _ip
0   comp1    10.10.1.2    act1  168427778
1   comp2   10.10.2.20    act2  168428052
2   comp3   10.10.3.50    act3  168428338
3   comp4  10.10.4.100    act4  168428644

In [11]: df2
Out[11]:
  ip_range_start ip_range_end   country   state   city  _ip_range_start  _ip_range_end
0      10.10.2.1  10.10.2.254  country2  state2  city2        168428033      168428286
1      10.10.3.1  10.10.3.254  country3  state3  city3        168428289      168428542
2      10.10.4.1  10.10.4.254  country4  state4  city4        168428545      168428798

now let's add a new column to the df1 DF, which will contain an index of the first matching IP interval from the df2 DF:

In [12]: df1['x'] = (df1._ip.apply(lambda x: df2.query('_ip_range_start <= @x <= _ip_range_end')
   ....:                                       .index
   ....:                                       .values)
   ....:                   .apply(lambda x: x[0] if len(x) else -1))

In [14]: df1
Out[14]:
  company           ip actions        _ip  x
0   comp1    10.10.1.2    act1  168427778 -1
1   comp2   10.10.2.20    act2  168428052  0
2   comp3   10.10.3.50    act3  168428338  1
3   comp4  10.10.4.100    act4  168428644  2

finally we can merge both DFs:

In [15]: (pd.merge(df1.drop('_ip',1),
   ....:           df2.filter(regex=r'^((?!.?ip_range_).*)$'),
   ....:           left_on='x',
   ....:           right_index=True,
   ....:           how='left')
   ....:    .drop('x',1)
   ....: )
Out[15]:
  company           ip actions   country   state   city
0   comp1    10.10.1.2    act1       NaN     NaN    NaN
1   comp2   10.10.2.20    act2  country2  state2  city2
2   comp3   10.10.3.50    act3  country3  state3  city3
3   comp4  10.10.4.100    act4  country4  state4  city4

Let's compare the speed of the standard int(IPAddress) to our function (we will use 4M rows DF for the comparison):

In [21]: big = pd.concat([df1.ip] * 10**6, ignore_index=True)

In [22]: big.shape
Out[22]: (4000000,)

In [23]: big.head(10)
Out[23]:
0      10.10.1.2
1     10.10.2.20
2     10.10.3.50
3    10.10.4.100
4      10.10.1.2
5     10.10.2.20
6     10.10.3.50
7    10.10.4.100
8      10.10.1.2
9     10.10.2.20
Name: ip, dtype: object

In [24]: %timeit
%timeit  %%timeit

In [24]: %timeit big.apply(lambda x: int(IPAddress(x)))
1 loop, best of 3: 1min 3s per loop

In [25]: %timeit ip_to_int(big)
1 loop, best of 3: 25.4 s per loop

Conclusion: our function is approx. 2.5 times faster

Pandas merge dataframes on ip-address by range

Tags:

Pandas

Left Join

Python 3.X

Merge

Ip Address

Related

Recent Posts