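from sklearn.feature_extraction.text import CountVectorizer

# Build a vocabulary from the training text, ignoring English stop words.
vectorizer = CountVectorizer(stop_words='english')  # not named `dict`, which would shadow the built-in
vectorizer.fit(X_train)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
X_train_vocabs_dict = vectorizer.get_feature_names_out()
len(X_train_vocabs_dict)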
Example: CountVectorizer with a list of lists
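from sklearn.feature_extraction.text import CountVectorizer
corpus = [["this is spam, 'SPAM'"], ["this is ham, 'HAM'"], ["this is nothing, 'NOTHING'"]]
# Assumption: the original never defines splited_labels_from_corpus; here each document is split on whitespace.
splited_labels_from_corpus = [doc[0].split() for doc in corpus]
# The identity tokenizer lets CountVectorizer take token lists; lowercase=False is required because lists have no .lower().
bag_of_words = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False).fit_transform(splited_labels_from_corpus)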