How to efficiently get counts for items in a list of lists in Python
I would suggest going with Apache Spark or Apache Hadoop if you have a word-count-like problem; these frameworks specialize in exactly that, and both can be used from Python.
But if you want to stick with pure Python, I would suggest parallelization:
Split my_list into n sublists my_sub_lists:
my_list = ["my cat", "little dog", "fish", "rat", "my cat", "little dog"]
# split my_list into n=2 sublists
my_sub_lists = [["my cat", "little dog", "fish"], ["rat", "my cat", "little dog"]]
Compute item counts for my_sub_lists in parallel:
Process 1: Counter(["my cat", "little dog", "fish"])
Process 2: Counter(["rat", "my cat", "little dog"])
You would get intermediate per-process aggregations, my_sub_counts:
my_sub_counts = [{"my cat": 1, "little dog": 1, "fish": 1}, {"rat": 1, "my cat": 1, "little dog": 1}]
Merge the intermediate results to get the final item counts:
result = {"my cat":2, "little dog":2, "fish":1, "rat":1}
Combining the intermediate aggregations is cheap, since they are much smaller than the original list.
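Here is a minimal sketch of this split/count/merge idea using the standard library's multiprocessing.Pool; the helper name count_items and the chunking logic are just illustrative, not part of any particular API:
from collections import Counter
from multiprocessing import Pool

def count_items(sub_list):
    # map step: count items in one sublist
    return Counter(sub_list)

if __name__ == "__main__":
    my_list = ["my cat", "little dog", "fish", "rat", "my cat", "little dog"]
    n = 2  # number of sublists / worker processes
    chunk = (len(my_list) + n - 1) // n
    my_sub_lists = [my_list[i:i + chunk] for i in range(0, len(my_list), chunk)]

    with Pool(n) as pool:
        my_sub_counts = pool.map(count_items, my_sub_lists)

    # reduce step: merge the per-process Counters
    result = sum(my_sub_counts, Counter())
    print(dict(result))  # {'my cat': 2, 'little dog': 2, 'fish': 1, 'rat': 1}
Counters support +, so merging the map results is a one-liner; with large inputs you would split into more chunks than processes to keep the workers busy.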
try this:
from collections import Counter

req = {}
for concept in myconcepts:
    # flatten the concept lists of every row that mentions this concept
    words = sum([row[1] for row in mylist if concept in row[1]], [])
    # drop hated concepts before counting
    words = [w for w in words if w not in hatedconcepts]
    req[concept] = dict(Counter(words))
print(req)
output:
{'my cat': {'my cat': 2, 'little dog': 2, 'fish': 1}, 'little dog': {'my cat': 2, 'little dog': 3, 'fish': 2, 'duck': 1}}
I have tried to make it fast by avoiding some repeated loops. Please check whether this speeds things up:
from itertools import chain
from collections import Counter, defaultdict

database = defaultdict(set)
output = {}

# map each concept to the row indices where it appears,
# so we only search the rows that actually contain a given concept
for index, (_, concepts) in enumerate(mylist):
    for concept in concepts:
        database[concept].add(index)

for concept in myconcepts:
    search_indices = database[concept]
    all_counts = Counter(chain.from_iterable(mylist[i][1] for i in search_indices))
    for hc in hatedconcepts:
        all_counts.pop(hc, None)  # discard hated concepts if present
    output[concept] = sorted(all_counts.items(), key=lambda x: x[1], reverse=True)
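For reference, the snippet above expects the question's input shape: a list of [id, concept_list] pairs. The sample data below is made up purely for illustration, not taken from the question:
# hypothetical sample data in the shape the question uses
mylist = [
    [101, ["my cat", "little dog", "fish"]],
    [102, ["rat", "my cat", "little dog"]],
    [103, ["little dog", "fish", "duck"]],
]
myconcepts = ["my cat", "little dog"]
hatedconcepts = ["rat"]
On this data, output["my cat"] would be [('my cat', 2), ('little dog', 2), ('fish', 1)]: only rows 101 and 102 mention "my cat", and "rat" is dropped before sorting by count.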
As other comments and answers have indicated, this operation is better handled by Spark or a database. That said, here's my take on it: I introduced some set operations and minimized repeated loops.
from collections import defaultdict

def get_counts(lst, concepts, hated_concepts):
    result = {concept: defaultdict(int) for concept in concepts}
    concepts_set = set(concepts)
    hated_concepts_set = set(hated_concepts)
    for _, inner_list in lst:
        # ignore hated concepts
        relevant = set(inner_list).difference(hated_concepts_set)
        # determine which concepts need to be updated
        to_update = relevant.intersection(concepts_set)
        for concept in to_update:
            for word in relevant:
                result[concept][word] += 1
    return result
Output is below. You mention the output "must be sorted", but it's unclear to me what the desired sorting is. Some timing tests indicate this is 9x faster than the code you provided on your sample data.
{
    'my cat': defaultdict(<class 'int'>, {'my cat': 2, 'fish': 1, 'little dog': 2}),
    'little dog': defaultdict(<class 'int'>, {'my cat': 2, 'fish': 2, 'little dog': 3, 'duck': 1})
}
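If "must be sorted" means descending by count (the ordering the earlier answer assumed), a small post-processing step would do it; here result is assumed to be the dict returned by get_counts:
result = get_counts(mylist, myconcepts, hatedconcepts)
# Python 3.7+ dicts keep insertion order, so dict(sorted(...)) stays sorted
sorted_output = {
    concept: dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))
    for concept, counts in result.items()
}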
Performance Improvement
emj_functn avg 0.9355s
get_counts avg 0.1141s
Performance testing script:
import random
import string
import time

# 1000 distinct random 5-letter words (the set comprehension removes duplicates)
words = list({
    ''.join(random.choice(string.ascii_lowercase) for _ in range(5))
    for _ in range(1000)
})
# random.randint requires integer bounds, so use 10**6 / 10**7 rather than 1e6 / 1e7
test_list = [[random.randint(10**6, 10**7), [random.choice(words) for _ in range(100)]] for _ in range(1000)]
test_concepts = [random.choice(words) for _ in range(100)]
test_hated_concepts = [random.choice(words) for _ in range(50)]

def emj_functn(lst, concepts, hated_concepts):
    ...

def get_counts(lst, concepts, hated_concepts):
    ...

TEST_CASES = 10

start_time = time.time()
for _ in range(TEST_CASES):
    emj_functn(test_list, test_concepts, test_hated_concepts)
end_time = time.time()
avg = (end_time - start_time) / TEST_CASES
print(f'emj_functn avg {avg:.4}s')

start_time = time.time()
for _ in range(TEST_CASES):
    get_counts(test_list, test_concepts, test_hated_concepts)
end_time = time.time()
avg = (end_time - start_time) / TEST_CASES
print(f'get_counts avg {avg:.4}s')