Ensure that gensim generates the same Word2Vec model for different runs on the same data
Yes, the default random seed is fixed to 1, as described by the author in https://radimrehurek.com/gensim/models/word2vec.html. The vector for each word is initialised using a hash of the concatenation of word + str(seed).
The hashing function used, however, is Python's built-in hash function, which can produce different results if two machines differ in
- 32-bit vs 64-bit builds, reference
- Python versions, reference
- operating systems or interpreters, reference1, reference2
The above list is not exhaustive. Does it cover your question, though?
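To make the dependency on the hash concrete, here is a minimal sketch of the idea, not gensim's actual code. The function name seeded_vector, the use of the standard library's random module, and the scaling of the values are all assumptions made for illustration; the only part taken from the description above is that the generator is seeded with a hash of word + str(seed).

```python
import random

def seeded_vector(word, seed=1, size=10, hashfxn=hash):
    """Illustrative only: derive a word's initial vector from hashfxn(word + str(seed))."""
    # Mask to 32 bits so the value is a non-negative seed for the generator.
    rng = random.Random(hashfxn(word + str(seed)) & 0xFFFFFFFF)
    # Small values centred on zero; the exact scaling is an assumption.
    return [(rng.random() - 0.5) / size for _ in range(size)]

# Within one interpreter process the built-in hash is stable, so this prints True.
print(seeded_vector("apple") == seeded_vector("apple"))
```

The vector is a pure function of the hash value, so any difference in what hashfxn returns on another machine or in another run gives a different starting vector for the same word.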
EDIT
If you want to ensure consistency, you can provide your own hashing function through the hashfxn argument of Word2Vec.
A very simple (and bad) example would be:
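from gensim.models import Word2Vec
def first_char_hash(astring):
    # deterministic, but every word starting with the same character collides
    return ord(astring[0])
model = Word2Vec(sentences, vector_size=10, window=5, min_count=5, workers=4, hashfxn=first_char_hash)  # vector_size was named size before gensim 4.0
print(model.wv[sentences[0][0]])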
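A less collision-prone alternative, offered as a sketch rather than a recommendation from the gensim docs: compute the integer from a checksum of the encoded string, which does not depend on Python's hash randomisation. The name stable_hash is hypothetical, and CRC-32 is just one convenient standard-library choice.

```python
import zlib

def stable_hash(text):
    """Hypothetical hashfxn candidate: a 32-bit checksum of the UTF-8 bytes."""
    # crc32 depends only on the input bytes, so the value does not
    # change with PYTHONHASHSEED or between interpreter launches.
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF

print(stable_hash("apple1") == stable_hash("apple1"))
```

It would be passed the same way as in the example above, i.e. hashfxn=stable_hash.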
As per the Gensim docs, for a fully deterministically-reproducible run you must also limit the model to a single worker thread, to eliminate ordering jitter from OS thread scheduling.
A simple parameter edit to your code should do the trick.
model = Word2Vec(sentences, vector_size=10, window=5, min_count=5, workers=1)  # vector_size was named size before gensim 4.0
Just a remark on the randomness.
If you are working with gensim's W2V model on Python 3.3 or later, keep in mind that hash randomisation is turned on by default. If you want consistency between two executions, make sure to set the PYTHONHASHSEED environment variable. For example, run your code like so:
PYTHONHASHSEED=123 python3 mycode.py
The next time you generate a model with the same hash seed, it will be the same as the previously generated model (provided that all the other randomness controls mentioned above are followed: a fixed random state and a single worker).
See gensim's W2V source and Python docs for details.
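A quick way to see the effect of PYTHONHASHSEED without training anything is to compute the same string hash in two freshly started interpreters. This is a sketch using only the standard library; the helper name hash_in_fresh_interpreter is made up for illustration, and it needs Python 3.7+ for capture_output.

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(text, hash_seed):
    """Return hash(text) as computed by a new Python process with PYTHONHASHSEED set."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())

first = hash_in_fresh_interpreter("word1", 123)
second = hash_in_fresh_interpreter("word1", 123)
# Same hash seed in both launches, so the two values match.
print(first == second)
```

Without the environment variable, each launch picks its own random hash seed, so the two values would normally differ.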