Get bigrams and trigrams in word2vec Gensim

from gensim.models import Phrases
from gensim.models.phrases import Phraser

documents = [
    "the mayor of new york was there",
    "machine learning can be useful sometimes",
    "new york mayor was present",
]

# Split each document into a list of tokens
sentence_stream = [doc.split(" ") for doc in documents]
print(sentence_stream)

# Note: gensim < 4.0 takes a bytes delimiter; gensim 4.x expects a str (delimiter=' ')
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)
print(bigram_phraser)

for sent in sentence_stream:
    tokens_ = bigram_phraser[sent]
    print(tokens_)

First of all, you should use gensim's Phrases class to get bigrams. It works as shown in the docs:

>>> bigram = Phraser(phrases)
>>> sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
>>> print(bigram[sent])
[u'the', u'mayor', u'of', u'new_york', u'was', u'there']

To get trigrams and beyond, apply the bigram model you already have to your sentences, then train a new Phrases model on that output, and so on. Example:

trigram_model = Phrases(bigram_sentences)

There is also a good notebook and a video that explain how to use it: the notebook, the video.

The most important part is how to apply it to real-life sentences, which goes as follows:

# train the bigram model
bigram_model = Phrases(unigram_sentences)

# apply the trained model to each sentence, keeping the result as token lists
# (use u' '.join(...) only if you need the sentence back as a single string)
bigram_sentences = []
for unigram_sentence in unigram_sentences:
    bigram_sentences.append(bigram_model[unigram_sentence])

# train a trigram model on the bigram output
trigram_model = Phrases(bigram_sentences)

Hope this helps, but next time please give us more information about what you are using.

P.S.: Now that you have edited the question: you are not doing anything to get bigrams, you are just splitting the text. You have to use Phrases to get words like New York as bigrams.
