How does Pyspark Calculate Doc2Vec from word2vec word embeddings?
One simple way to go from word-vectors to a single vector for a range of text is to average the vectors together, and that often works well enough for some tasks.
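For example, a minimal sketch of that averaging approach, assuming you already have a set of pre-trained word-vectors loaded as gensim `KeyedVectors` (the file path here is just a placeholder):

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path: any word2vec-format vector file would do.
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def average_vector(words, wv):
    """Average the vectors of in-vocabulary words; zeros if none are known."""
    vecs = [wv[w] for w in words if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

doc_vec = average_vector("the quick brown fox".split(), wv)
```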
However, that's not how the `Doc2Vec` class in gensim does it. That class implements the 'Paragraph Vectors' technique, where separate document-vectors are trained in a manner analogous to word-vectors.
The doc-vectors participate in training a bit like a floating synthetic word, involved in every sliding-window/target-word prediction. They're not composed or concatenated from preexisting word-vectors, though in some modes they may be trained simultaneously alongside word-vectors. (However, the fast and often top-performing PV-DBOW mode, enabled in gensim with the parameter `dm=0`, doesn't train or use input word-vectors at all. It just trains doc-vectors that are good for predicting the words in each text example.)
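As a rough illustration of the PV-DBOW mode in gensim (the corpus and parameter values here are just illustrative, not recommendations):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [
    "the quick brown fox jumps over the lazy dog",
    "paragraph vectors learn a vector per document",
    "word embeddings capture distributional similarity",
]
# Each training text needs a unique tag; here, just its index.
docs = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]

# dm=0 selects PV-DBOW: only doc-vectors are trained to predict each
# text's words, with no input word-vectors involved.
model = Doc2Vec(documents=docs, dm=0, vector_size=50, min_count=1, epochs=40)

trained_vec = model.dv[0]  # doc-vector learned for the first training text
new_vec = model.infer_vector("a new unseen text".split())  # vector for new text
```

(If you did want word-vectors trained alongside PV-DBOW doc-vectors, gensim offers the `dbow_words=1` option for that.)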
Since you've mentioned multiple libraries (both Spark MLlib and gensim) but haven't shown your code, it's not certain exactly what your existing process is doing.