Generator is not an iterator?
Other answers have pointed out that gensim requires two passes to build the Word2Vec model: one to build the vocabulary (build_vocab), and a second to train the model (train). You can still pass a generator (e.g., if you're streaming data) by calling the build_vocab and train methods separately and creating a fresh generator for each pass.
from gensim.models import Word2Vec
model = Word2Vec()
sentences = my_generator() # first pass
model.build_vocab(sentences)
sentences = my_generator() # second pass over the same data
model.train(sentences,
            total_examples=model.corpus_count, # sentence count recorded by build_vocab
            epochs=model.epochs)
It seems gensim throws a misleading error message.
Gensim needs to iterate over your data multiple times. Most libraries simply build a list from the input, so the user doesn't have to care about supplying a sequence that can be iterated more than once. Of course, building an in-memory list can be very resource-intensive, while iterating over a file, for example, can be done without holding the whole file in memory.
In your case, simply converting the generator to a list should solve the problem.
A generator is exhausted after one loop over it. Word2vec simply needs to traverse the sentences multiple times (and possibly fetch an item at a given index, which is not possible with generators, since they can only yield values one at a time), thus requiring something more solid, like a list.
In particular, their code calls two different functions, and both iterate over sentences (so if you pass a generator, the second call would run on an empty sequence):
self.build_vocab(sentences, trim_rule=trim_rule)
self.train(sentences)
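The exhaustion behaviour is easy to reproduce in plain Python. This small sketch (the toy data is illustrative) shows a generator yielding nothing on the second pass, while a list can be traversed repeatedly:

```python
def my_generator():
    # yields three toy "sentences" (lists of tokens)
    for sentence in (["hello", "world"], ["foo"], ["bar", "baz"]):
        yield sentence

gen = my_generator()
first = list(gen)    # consumes the generator
second = list(gen)   # generator is now exhausted

print(len(first))    # 3
print(len(second))   # 0

data = [["hello", "world"], ["foo"], ["bar", "baz"]]
print(len(list(data)), len(list(data)))  # 3 3 -- a list survives multiple passes
```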
It should work with anything implementing __iter__ that is not a GeneratorType. So wrap your function in an iterable interface and make sure that you can traverse it multiple times, meaning that
sentences = your_code
for s in sentences:
    print(s)
for s in sentences:
    print(s)
prints your collection twice.
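A common way to get such a re-iterable object when streaming from disk is a small class whose __iter__ creates a fresh generator on every call. This is a sketch (the class name and the one-sentence-per-line file format are assumptions for illustration):

```python
import os
import tempfile

class SentenceStream:
    """Re-iterable wrapper: each for-loop gets a fresh pass over the file."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # a new generator is created every time __iter__ is called
        with open(self.path) as f:
            for line in f:
                yield line.split()

# demo: write a tiny corpus, then traverse it twice
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("hello world\nfoo bar\n")

sentences = SentenceStream(path)
for s in sentences:
    print(s)
for s in sentences:   # second pass works: __iter__ reopens the file
    print(s)

os.remove(path)
```

Because __iter__ restarts from the top of the file each time, an object like this can be handed directly to Word2Vec, which will iterate over it once per pass.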
As previous posters have mentioned, a generator acts similarly to an iterator, with two significant differences: generators get exhausted, and you can't index into one.
I quickly looked up the documentation on this page -- https://radimrehurek.com/gensim/models/word2vec.html
The documentation states that
gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0, seed=1, workers=1, min_alpha=0.0001, sg=1, hs=1, negative=0, cbow_mean=0, hashfxn=&lt;built-in function hash&gt;, iter=1, null_word=0, trim_rule=None, sorted_vocab=1) ...
Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training.
I venture to guess that the logic inside the function inherently requires one or more list properties, such as item indexing; there might be an explicit assert statement or if statement that raises the error.
A simple hack that can solve your problem is materializing your generator into a list with a list comprehension. Your program will pay a CPU and memory cost up front, but this should at least make the code work.
my_iterator = [x for x in generator_obj]
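Equivalently, the built-in list constructor performs the same materialization and is slightly more idiomatic (the generator here is a toy stand-in for your generator_obj):

```python
generator_obj = (x * x for x in range(5))  # example generator
my_iterator = list(generator_obj)          # same result as the comprehension
print(my_iterator)  # [0, 1, 4, 9, 16]
```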