Using scikit-learn vectorizers and vocabularies with gensim
I am also running some code experiments using these two libraries. Apparently there is now a way to construct the dictionary directly from the corpus:
from gensim.corpora.dictionary import Dictionary

# corpus_vect_gensim and vect come from the scikit-learn conversion shown further down
dictionary = Dictionary.from_corpus(corpus_vect_gensim,
    id2word=dict((id, word) for word, id in vect.vocabulary_.items()))
Then you can use this dictionary for tfidf, LSI or LDA models.
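For instance, a minimal sketch (assuming corpus_vect_gensim from the conversion shown further down; num_topics is chosen arbitrarily):

from gensim.models import TfidfModel, LsiModel

# tf-idf weighting on top of the converted bag-of-words corpus
tfidf = TfidfModel(corpus_vect_gensim, id2word=dictionary)
corpus_tfidf = tfidf[corpus_vect_gensim]

# LSI on the tf-idf-weighted corpus
lsi = LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)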
Gensim doesn't require Dictionary objects. You can use your plain dict as input to id2word directly, as long as it maps ids (integers) to words (strings). In fact anything dict-like will do (including dict, Dictionary, SqliteDict...).
(Btw gensim's Dictionary is a simple Python dict underneath. Not sure where your remarks on Dictionary performance come from; you can't get a mapping much faster than a plain dict in Python. Maybe you're confusing it with text preprocessing (not part of gensim), which can indeed be slow.)
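For example, a minimal sketch (assuming vect is the fitted scikit-learn vectorizer and corpus_vect_gensim the converted corpus from the snippet below; num_topics is arbitrary):

from gensim.models import LdaModel

# a plain dict mapping integer ids to words is enough; no gensim Dictionary needed
id2word = {id_: word for word, id_ in vect.vocabulary_.items()}
lda = LdaModel(corpus_vect_gensim, id2word=id2word, num_topics=5)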
Just to provide a final example: the output of scikit-learn's vectorizers can be transformed into gensim's corpus format with Sparse2Corpus, while the vocabulary dict can be recycled by simply swapping keys and values:
import gensim

# vect is the fitted scikit-learn vectorizer, corpus_vect its sparse document-term matrix

# transform sparse matrix into gensim corpus (documents are rows, not columns)
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)

# transform scikit vocabulary into a gensim id2word mapping (swap keys and values)
vocabulary_gensim = {}
for key, val in vect.vocabulary_.items():
    vocabulary_gensim[val] = key
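For completeness, here is a self-contained sketch of the full round trip; CountVectorizer and the toy documents are illustrative choices, not part of the original answer:

from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import LsiModel
import gensim

documents = ["human machine interface", "graph minors survey", "machine interface survey"]
vect = CountVectorizer()
corpus_vect = vect.fit_transform(documents)  # scipy sparse matrix, one row per document

corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)
vocabulary_gensim = {val: key for key, val in vect.vocabulary_.items()}

# the recycled vocabulary works directly as id2word for any gensim model
lsi = LsiModel(corpus_vect_gensim, id2word=vocabulary_gensim, num_topics=2)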