NLTK available languages for stopwords
When you import the stopwords using:
from nltk.corpus import stopwords
english_stopwords = stopwords.words('english')
you are retrieving the stopwords by fileid, which here is the language name. To see all available stopword languages, retrieve the list of fileids:
from nltk.corpus import stopwords
print(stopwords.fileids())
In the case of NLTK v3.4.5, this returns 23 languages:
['arabic',
'azerbaijani',
'danish',
'dutch',
'english',
'finnish',
'french',
'german',
'greek',
'hungarian',
'indonesian',
'italian',
'kazakh',
'nepali',
'norwegian',
'portuguese',
'romanian',
'russian',
'slovene',
'spanish',
'swedish',
'tajik',
'turkish']
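Since the valid names are just the fileids, one way to avoid a confusing error on a misspelled language is to check against that list first. The following is a minimal sketch, not part of NLTK: `load_stopwords` is a hypothetical helper name, and `reader` is assumed to be any object with `fileids()` and `words()` methods, such as `nltk.corpus.stopwords`.

```python
def load_stopwords(reader, language):
    """Return the stopword list for `language` from a corpus reader.

    `reader` is assumed to expose fileids() and words(), as
    nltk.corpus.stopwords does. This helper is illustrative only.
    """
    available = reader.fileids()
    if language not in available:
        # Fail early with the list of valid names instead of a file error.
        raise ValueError(
            f"{language!r} is not available; choose one of: "
            + ", ".join(sorted(available))
        )
    return reader.words(language)

# Usage, assuming NLTK and its stopwords corpus are installed:
#   from nltk.corpus import stopwords
#   german = load_stopwords(stopwords, 'german')
```

Passing the reader in as an argument keeps the helper independent of where the corpus data lives.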
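You can also list the corpus directory directly (the path depends on where nltk_data is installed on your machine):
import os
os.listdir('/root/nltk_data/corpora/stopwords/')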
['hungarian',
'swedish',
'kazakh',
'norwegian',
'finnish',
'arabic',
'indonesian',
'portuguese',
'turkish',
'azerbaijani',
'slovene',
'spanish',
'danish',
'nepali',
'romanian',
'greek',
'dutch',
'README',
'tajik',
'german',
'english',
'russian',
'french',
'italian']
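Note that the directory listing above contains a README entry, which is not a language, and that the order is arbitrary. A small sketch of how the listing could be cleaned up, using only the standard library; the function name `stopword_languages` and its `stopwords_dir` parameter are made up for illustration, and the directory is assumed to hold one plain file per language:

```python
from pathlib import Path

def stopword_languages(stopwords_dir):
    """List language names found in an NLTK stopwords directory.

    Assumes one plain file per language plus an optional README,
    as in the listing above.
    """
    return sorted(
        entry.name
        for entry in Path(stopwords_dir).iterdir()
        if entry.is_file() and entry.name != "README"
    )

# Usage (the path is the one from the listing above; yours may differ):
#   stopword_languages('/root/nltk_data/corpora/stopwords/')
```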
First, check whether you have downloaded the NLTK data packages.
If not, you can download them with:
import nltk
nltk.download()
After this, you can find the stopword language files in the following path (on Windows):
C:/Users/username/AppData/Roaming/nltk_data/corpora/stopwords
There are 21 languages supported (I installed nltk a few days ago, so this number should be up to date). You can pass a filename from that directory as the parameter to
nltk.corpus.stopwords.words('language')