Python 3.5 - Get counter to report zero-frequency items
You can just pre-initialize the counter, something like this:
freq_iter = collections.Counter()
freq_iter.update({x:0 for x in bad})
freq_iter.update(pattern.findall(review_processed))
One nice thing about Counter is that you don't actually have to pre-initialize it: you can just do c = Counter(); c['key'] += 1. But nothing prevents you from pre-initializing some values to 0 if you want.
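Here is a minimal, self-contained sketch of the pre-initialization. The word list and review text are made-up stand-ins for the question's bad and review_processed, and the pattern is just a plain alternation of the words:

```python
import re
from collections import Counter

# Hypothetical stand-ins for the question's `bad` list and review text.
bad = ["debt", "late", "fraud"]
review_processed = "the payment was late and the fee for a late payment was high"

# Plain alternation of the words; re.escape guards against regex metacharacters.
pattern = re.compile("|".join(map(re.escape, bad)))

# Start every word at 0, then add the matches on top.
freq = Counter(dict.fromkeys(bad, 0))
freq.update(pattern.findall(review_processed))

for word in bad:
    print(word, freq[word])
```

Words that never appear keep their 0 entry, so they still show up when you iterate over the counter.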
As for the debt/debts issue: that is just an insufficiently specified problem. What do you want the code to do in that case? If you want the longest matching pattern to win, sort the list longest-first before building the regex, so that debts is tried before debt. If you want both reported, you will need to run a separate search per word and combine the results.
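A rough sketch of the longest-first approach, assuming a hypothetical two-word list:

```python
import re

# Hypothetical word list; "debt" is a prefix of "debts".
words = ["debt", "debts"]

# Sort longest first so the alternation tries "debts" before "debt".
ordered = sorted(words, key=len, reverse=True)
pattern = re.compile("|".join(map(re.escape, ordered)))

matches = pattern.findall("debtor debts my debt")
print(matches)  # ['debt', 'debts', 'debt']
```

Note that debtor still contributes a debt match here, because nothing in this pattern requires a whole word.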
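Updated to add some information on why it can't find debts: that has more to do with how the regex works than anything else. An alternation such as debt|debts tries its alternatives from left to right and stops at the first one that succeeds, so debt always wins over debts. On top of that, re.findall only returns non-overlapping matches, so text that has already been matched is not reported again: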
In [2]: re.findall('(debt|debts)', 'debtor debts my debt')
Out[2]: ['debt', 'debt', 'debt']
If you really want to find all instances of every word, you need to do them separately:
In [3]: re.findall('debt', 'debtor debts my debt')
Out[3]: ['debt', 'debt', 'debt']
In [4]: re.findall('debts', 'debtor debts my debt')
Out[4]: ['debts']
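If you go the separate-search route, something like the following would collect the per-word counts into one Counter. The word list is hypothetical, and these are substring counts, so the same stretch of text can be counted under more than one word:

```python
import re
from collections import Counter

# Hypothetical word list; "loan" does not occur in the text at all.
words = ["debt", "debts", "loan"]
text = "debtor debts my debt"

# One independent search per word, so "debts" also counts as a hit for "debt".
counts = Counter({w: len(re.findall(re.escape(w), text)) for w in words})

for word in words:
    print(word, counts[word])
```

Because the counter is built from a dict that has an entry for every word, words with no hits are kept with a count of 0.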
However, maybe what you are really looking for is whole words. In that case, use the \b anchor to require a word boundary:
In [13]: re.findall(r'\bdebt\b', 'debtor debts my debt')
Out[13]: ['debt']
In [14]: re.findall(r'(\b(?:debt|debts)\b)', 'debtor debts my debt')
Out[14]: ['debts', 'debt']
I don't know whether this is what you want or not. In this case, it was able to differentiate debt and debts correctly, but it skipped debtor, because debt is only a prefix of that word and we asked for whole-word matches.
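Putting the word-boundary pattern together with a pre-initialized counter might look like this. It is only a sketch: the word list is hypothetical, and the pattern is built the same way as the \b example above:

```python
import re
from collections import Counter

# Hypothetical word list; "loan" is included to show a zero count.
words = ["debt", "debts", "loan"]
text = "debtor debts my debt"

# \b on both sides means only whole words match, so "debtor" is ignored.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")

counts = Counter(dict.fromkeys(words, 0))  # every word starts at 0
counts.update(pattern.findall(text))

for word in words:
    print(word, counts[word])
```

With this text, debt and debts each end up with a count of 1 and loan stays at 0.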
Depending on your use case, you may want to look into stemming the text. I believe there is a stemmer in nltk that is pretty simple (I have used it only once, so I won't try to post an example; the question "Combining text stemming and removal of punctuation in NLTK and scikit-learn" may be useful). It should reduce debt and debts to the same root word debt, and do similar things for other words; whether debtor also ends up as debt depends on the stemmer. This may or may not be helpful; I don't know what you are doing with it.