gensim word2vec: Find number of words in vocabulary
One more way to get the vocabulary size is from the embedding matrix itself as in:
In [33]: from gensim.models import Word2Vec
# load the pretrained model
In [34]: model = Word2Vec.load(pretrained_model)
# get the shape of embedding matrix
In [35]: model.wv.vectors.shape
Out[35]: (662109, 300)
# `vocabulary_size` is just the number of rows (i.e. axis 0)
In [36]: model.wv.vectors.shape[0]
Out[36]: 662109
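As a quick sanity check, here's a minimal sketch (assuming Gensim 4.x, with a tiny throwaway corpus) showing that the row count of the embedding matrix matches the vocabulary length the model reports elsewhere:

from gensim.models import Word2Vec

# tiny throwaway corpus, just to have something to train on
sentences = [["hello", "world"], ["hello", "gensim"]]
model = Word2Vec(sentences, vector_size=50, min_count=1)

# rows of the embedding matrix == number of words in the vocabulary
vocab_size = model.wv.vectors.shape[0]
assert vocab_size == len(model.wv)
print(vocab_size)  # 3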
In recent versions, the model.wv property holds the words-and-vectors, and can itself report a length – the number of words it contains. So if w2v_model is your Word2Vec (or Doc2Vec or FastText) model, it's enough to just do:
vocab_len = len(w2v_model.wv)
If your model is just a raw set of word-vectors, like a KeyedVectors instance rather than a full Word2Vec/etc model, it's just:
vocab_len = len(kv_model)
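For example, a minimal sketch of loading a raw vector set in the classic word2vec format (the file name here is just a placeholder):

from gensim.models import KeyedVectors

# load a standalone vector file; "vectors.bin" is a hypothetical path
kv_model = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
vocab_len = len(kv_model)  # number of words in the vector set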
Other useful internals in Gensim 4.0+ include model.wv.index_to_key, a plain list of the key (word) in each index position, and model.wv.key_to_index, a plain dict mapping keys (words) to their index positions.
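The two structures are inverses of each other, so (reusing a trained w2v_model) a round trip looks like:

# index_to_key: position -> word; key_to_index: word -> position
first_word = w2v_model.wv.index_to_key[0]  # with the default sorted vocab, the most-frequent word
assert w2v_model.wv.key_to_index[first_word] == 0  # maps straight back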
In pre-4.0 versions, the vocabulary was in the vocab field of the Word2Vec model's wv property, as a dictionary whose keys are the tokens (words). So there it was just the usual Python for getting a dictionary's length:
len(w2v_model.wv.vocab)
In very old Gensim versions, before 0.13, vocab appeared directly on the model. So way back then you would use w2v_model.vocab instead of w2v_model.wv.vocab.
But if you're still using anything from before Gensim 4.0, you should definitely upgrade! There are big memory & performance improvements, and the changes required in calling code are relatively small – some renamings & moves, covered in the 4.0 Migration Notes.
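If your code has to support both old and new Gensim in the meantime, a hedged sketch of a version-tolerant helper (the function name is my own invention) could look like:

def vocab_len(model):
    # pre-0.13 models have no .wv; the model itself holds the vocab
    wv = getattr(model, "wv", model)
    if hasattr(wv, "key_to_index"):  # Gensim 4.0+
        return len(wv.key_to_index)
    return len(wv.vocab)  # Gensim 0.13 through 3.x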
Gojomo's pre-4.0 approach, len(w2v_model.wv.vocab), raises an AttributeError for Gensim 4.0.0+. For these versions, you can get the length of the vocabulary as follows:
len(w2v_model.wv.index_to_key)
(which is slightly faster than len(w2v_model.wv.key_to_index))
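If you want to check that timing claim on your own model, a quick sketch with timeit (assuming w2v_model is already loaded):

import timeit

# both are O(1) len() calls, so any difference is marginal
t_list = timeit.timeit(lambda: len(w2v_model.wv.index_to_key), number=1_000_000)
t_dict = timeit.timeit(lambda: len(w2v_model.wv.key_to_index), number=1_000_000)
print(t_list, t_dict)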