real word count in NLTK
Removing Punctuation (with no regex)
Use the same solution as dhg, but test that a given token is alphanumeric instead of using a regex pattern.
>>> from collections import Counter
>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> filtered = [w for w in text if w.isalnum()]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'a': 1, 'is': 1, 'sentence': 1})
Advantages:
- Works better with non-English languages, since "À".isalnum() is True, while bool(nonPunct.match("à")) is False (an "à" is not a punctuation mark, at least in French).
- Does not need the re package.
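A quick check of the difference (a sketch using only built-ins and the re module; nonPunct is the regex from the other answer):

```python
import re

# The regex from the other answer: requires at least one ASCII letter or digit
nonPunct = re.compile('.*[A-Za-z0-9].*')

# str.isalnum() is Unicode-aware, so accented letters count as alphanumeric
print("À".isalnum())              # True
print(bool(nonPunct.match("à")))  # False: "à" is outside [A-Za-z0-9]
```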
Tokenization with nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
text = "This is my text. It includes commas, question marks? and other stuff. Also U.S.."
tokens = tokenizer.tokenize(text)
Returns
['This', 'is', 'my', 'text', 'It', 'includes', 'commas', 'question', 'marks', 'and', 'other', 'stuff', 'Also', 'U', 'S']
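RegexpTokenizer(r'\w+') yields the same tokens as re.findall(r'\w+', ...), so a stdlib-only sketch of the same tokenize-then-count pipeline (no nltk install needed):

```python
import re
from collections import Counter

text = "This is my text. It includes commas, question marks? and other stuff. Also U.S.."

# Equivalent to RegexpTokenizer(r'\w+').tokenize(text)
tokens = re.findall(r'\w+', text)
counts = Counter(tokens)

print(len(tokens))   # number of word tokens (15 for this sentence)
print(counts['is'])  # 1
```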
Removing Punctuation
Use a regular expression to filter out the punctuation
>>> import re
>>> from collections import Counter
>>> text = ['this', 'is', 'a', 'sentence', '.']
>>> nonPunct = re.compile('.*[A-Za-z0-9].*') # must contain a letter or digit
>>> filtered = [w for w in text if nonPunct.match(w)]
>>> counts = Counter(filtered)
>>> counts
Counter({'this': 1, 'a': 1, 'is': 1, 'sentence': 1})
Average Number of Characters
Sum the lengths of each word. Divide by the number of words.
>>> float(sum(map(len, filtered))) / len(filtered)
3.75
Or you could use the counts you already computed to avoid some re-computation. This multiplies each word's length by the number of times it was seen, then sums it all up.
>>> float(sum(len(w)*c for w, c in counts.items())) / len(filtered)
3.75
Removing punctuation
from string import punctuation
punctuations = list(punctuation)
# word_tokenize emits these multi-character tokens for quotes and dashes
punctuations.append("''")
punctuations.append("--")
punctuations.append("``")
text = [word for word in text if word not in punctuations]
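Putting the pieces together as a runnable sketch (the token list here is hypothetical, standing in for whatever word_tokenize would produce):

```python
from string import punctuation

# Single characters from string.punctuation, plus NLTK-style multi-character tokens
punctuations = list(punctuation) + ["''", "--", "``"]

# A hypothetical token list like the one word_tokenize would return
text = ['This', 'is', 'a', 'sentence', '--', "''", '.']
words = [word for word in text if word not in punctuations]
print(words)  # ['This', 'is', 'a', 'sentence']
```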
The average number of characters in a word in a text
from collections import Counter
from nltk import word_tokenize
word_count = Counter(word_tokenize(text))
sum(len(x) * y for x, y in word_count.items()) / sum(word_count.values())
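A stdlib-only sketch checking that the weighted sum over counts matches the direct per-token average (str.split stands in for nltk's word_tokenize here):

```python
from collections import Counter

# str.split stands in for word_tokenize in this sketch
tokens = "this is a short short sentence".split()
word_count = Counter(tokens)

# Weighted by counts: each word's length times how often it occurs
weighted = sum(len(w) * c for w, c in word_count.items()) / sum(word_count.values())
# Direct average over all tokens
direct = sum(len(w) for w in tokens) / len(tokens)

print(weighted == direct)  # True
```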