Finding clusters of numbers in a list
import itertools
import numpy as np
numbers = np.array([123, 124, 128, 160, 167, 213, 215, 230, 245, 255, 257, 400, 401, 402, 430])
nd = [0] + list(np.where(np.diff(numbers) > 15)[0] + 1) + [len(numbers)]
a, b = itertools.tee(nd)
next(b, None)
res = {}
for j, (f, b) in enumerate(itertools.izip(a, b)):
res[j] = numbers[f:b]
If you can use itertools and numpy. Adapted pairwise
for the iterator tricks. The +1
is needed to shift the index, adding the 0
and len(numbers)
onto the list makes sure the first and last entries are included correctly.
You can obviously do this with out itertools
, but I like tee
.
Using the generator to separate the logic: (one function does one thing)
numbers = [123, 124, 128, 160, 167, 213, 215, 230, 245, 255, 257, 400, 401, 402, 430]
def cut_indices(numbers):
# this function iterate over the indices that need to be 'cut'
for i in xrange(len(numbers)-1):
if numbers[i+1] - numbers[i] > 15:
yield i+1
def splitter(numbers):
# this function split the original list into sublists.
px = 0
for x in cut_indices(numbers):
yield numbers[px:x]
px = x
yield numbers[px:]
def cluster(numbers):
# using the above result, to form a dict object.
cluster_ids = xrange(1,len(numbers))
return dict(zip(cluster_ids, splitter(numbers)))
print cluster(numbers)
The above codes give me
{1: [123, 124, 128], 2: [160, 167], 3: [213, 215, 230, 245, 255, 257], 4: [400, 401, 402], 5: [430]}
Not strictly necessary if your list is small, but I'd probably approach this in a "stream-processing" fashion: define a generator that takes your input iterable, and yields the elements grouped into runs of numbers differing by <= 15. Then you can use that to generate your dictionary easily.
def grouper(iterable):
prev = None
group = []
for item in iterable:
if prev is None or item - prev <= 15:
group.append(item)
else:
yield group
group = [item]
prev = item
if group:
yield group
numbers = [123, 124, 128, 160, 167, 213, 215, 230, 245, 255, 257, 400, 401, 402, 430]
dict(enumerate(grouper(numbers), 1))
prints:
{1: [123, 124, 128],
2: [160, 167],
3: [213, 215, 230, 245, 255, 257],
4: [400, 401, 402],
5: [430]}
As a bonus, this lets you even group your runs for potentially-infinite lists (as long as they're sorted, of course). You could also stick the index generation part into the generator itself (instead of using enumerate
) as a minor enhancement.
You can achieve that with no (explicit) loops using numpy / pandas:
import pandas as pd
import numpy as np
n = 15
numbers = [123, 124, 128, 160, 167, 213, 215, 230, 245, 255, 257, 400, 401, 402, 430]
nnumbers = np.array(numbers)
clusters = pd.DataFrame({
'numbers': numbers,
'segment': np.cumsum([0] + list(1*(nnumbers[1:] - nnumbers[0:-1] > n))) + 1
}).groupby('segment').agg({'numbers': set}).to_dict()['numbers']
The trick is to shift the list of numbers and compare the difference with your threshold (15) to find 'breaks' between segments. Of course, first element will not be a break. Then use cumsum function to get the segments and do the group by using a set function (in case there are duplicates). Hope this is helpful even though many years have passed since posting this question.