User Warning: Your stop_words may be inconsistent with your preprocessing

I had the same problem and for me the following worked:

include stopwords into tokenize function and then
remove stopwords parameter from tfidfVectorizer

Like so:

stopwords = stopwords.words('english')
stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)

    #exclude stopwords from stemmed words
    stems = [stemmer.stem(t) for t in filtered_tokens if t not in stopwords]

    return stems

Delete stopwords parameter from vectorizer:

tfidf_vectorizer = TfidfVectorizer(
    max_df=0.8, max_features=200000, min_df=0.2,
    use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3)
)

The warning is trying to tell you that if your text contains "always" it will be normalised to "alway" before matching against your stop list which includes "always" but not "alway". So it won't be removed from your bag of words.

The solution is to make sure that you preprocess your stop list to make sure that it is normalised like your tokens will be, and pass the list of normalised words as stop_words to the vectoriser.

User Warning: Your stop_words may be inconsistent with your preprocessing

Tags:

Text Processing

Stop Words

Vectorization

Stemming

Tf Idf

Related

Recent Posts