Python: Check if string and its substring are existing in the same list
If you have a large list of words, it might be a good idea to use a suffix tree.
Here's a package on PyPI.
Once you created the tree, you can call find_all(word)
to get the index of every occurence of word
. You simply need to keep the strings which only appear once:
from suffix_trees import STree
# https://pypi.org/project/suffix-trees/
# pip install suffix-trees
words = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure'] + ['sleep', 'anxiety', 'lack of sleep']
st = STree.STree(words)
st.find_all('blood')
# [0, 20, 26, 46]
st.find_all('high blood pressure')
# [41]
[word for word in words if len(st.find_all(word)) == 1]
# ['high blood pressure', 'anxiety', 'lack of sleep']
words
needs to be a unique list of strings, so you might need to call list(set(words))
before generating the suffix-tree.
As far as I can tell, the whole script should run in O(n)
, with n
being the total length of the strings.
You could use this one liner:
b = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']
result = [ i for i in b if not any( [ i in a for a in b if a != i] )]
I admit this is O(n2) and maybe will be slow in performance for large inputs.
This is basically a list comprehension of the following:
word_list = ['blood', 'pressure', 'high blood', 'blood pressure', 'high blood pressure']
result = []
for this_word in word_list:
words_without_this_word = [ other_word for other_word in word_list if other_word != this_word]
found = False
for other_word in words_without_this_word:
if this_word in other_word:
found = True
if not found:
result.append(this_word)
result