Removing non-English words from text using Python
On macOS this code can still raise an exception, because importing the nltk library does not download the words corpus automatically. Download the corpus manually first, otherwise you will keep hitting the exception:
import nltk
nltk.download('words')
words = set(nltk.corpus.words.words())
Now you can run the same filtering as in the other answer:
sent = "Io andiamo to the beach with my amico."
sent = " ".join(w for w in nltk.wordpunct_tokenize(sent) if w.lower() in words or not w.isalpha())
The NLTK documentation doesn't call this step out, but I ran into this problem in a GitHub issue and solved it this way, and it works. If you don't download the words corpus, the exception will keep recurring on macOS.
You can use the words corpus from NLTK:
import nltk
words = set(nltk.corpus.words.words())
sent = "Io andiamo to the beach with my amico."
" ".join(w for w in nltk.wordpunct_tokenize(sent) \
if w.lower() in words or not w.isalpha())
# 'Io to the beach with my'
Unfortunately, Io happens to be an English word. In general, it may be hard to decide whether a word is English or not.
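To see why "Io" slips through, here is a self-contained sketch of the same filtering logic using only the standard library (a tiny hand-made vocabulary stands in for `nltk.corpus.words.words()`, which has hundreds of thousands of entries and does contain "Io", the moon of Jupiter; the regex is a rough stand-in for `nltk.wordpunct_tokenize`):

```python
import re

# Illustrative stand-in vocabulary; the real NLTK words corpus
# includes "Io", which is why it survives the filter.
ENGLISH = {"io", "to", "the", "beach", "with", "my"}

def keep_english(sentence, vocab):
    # Tokenize into runs of word characters or runs of punctuation,
    # roughly mimicking wordpunct_tokenize.
    tokens = re.findall(r"\w+|[^\w\s]+", sentence)
    # Keep a token if its lowercase form is a known English word,
    # or if it is not purely alphabetic (numbers, punctuation).
    return " ".join(t for t in tokens if t.lower() in vocab or not t.isalpha())

print(keep_english("Io andiamo to the beach with my amico.", ENGLISH))
# 'Io to the beach with my .'
```

Note that the trailing period is kept too, because punctuation is not alphabetic; a dictionary lookup alone cannot tell a borrowed or shared word like "Io" from a genuinely foreign one.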