Python NLTK: Bigrams trigrams fourgrams
I do it like this:
def words_to_ngrams(words, n, sep=" "):
    return [sep.join(words[i:i+n]) for i in range(len(words)-n+1)]
This takes a list of words and returns a list of n-grams for the given n, with the words of each n-gram joined by sep (a space by default).
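As a quick illustration of how this covers the bigram, trigram and fourgram cases from the title, here is a small usage sketch. The sample sentence is made up for the example; the function is the one above, repeated so the snippet runs on its own.

```python
def words_to_ngrams(words, n, sep=" "):
    """Join every run of n consecutive words with sep."""
    return [sep.join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Illustrative sample sentence (not from the original question).
words = "the quick brown fox jumps".split()

bigrams = words_to_ngrams(words, 2)
trigrams = words_to_ngrams(words, 3)
fourgrams = words_to_ngrams(words, 4)

print(bigrams)    # ['the quick', 'quick brown', 'brown fox', 'fox jumps']
print(trigrams)   # ['the quick brown', 'quick brown fox', 'brown fox jumps']
print(fourgrams)  # ['the quick brown fox', 'quick brown fox jumps']
```

If n is larger than the number of words, the range is empty and the function simply returns an empty list.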
Try everygrams
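>>> from nltk import everygrams
>>> list(everygrams('hello', 1, 5))
[('h',),
 ('e',),
 ('l',),
 ('l',),
 ('o',),
 ('h', 'e'),
 ('e', 'l'),
 ('l', 'l'),
 ('l', 'o'),
 ('h', 'e', 'l'),
 ('e', 'l', 'l'),
 ('l', 'l', 'o'),
 ('h', 'e', 'l', 'l'),
 ('e', 'l', 'l', 'o'),
 ('h', 'e', 'l', 'l', 'o')]
(The n-grams are shown grouped by length; depending on your NLTK version, the same n-grams may come back in a different order.)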
Word tokens:
>>> from nltk import everygrams
>>> list(everygrams('hello word is a fun program'.split(), 1, 5))
[('hello',),
 ('word',),
 ('is',),
 ('a',),
 ('fun',),
 ('program',),
 ('hello', 'word'),
 ('word', 'is'),
 ('is', 'a'),
 ('a', 'fun'),
 ('fun', 'program'),
 ('hello', 'word', 'is'),
 ('word', 'is', 'a'),
 ('is', 'a', 'fun'),
 ('a', 'fun', 'program'),
 ('hello', 'word', 'is', 'a'),
 ('word', 'is', 'a', 'fun'),
 ('is', 'a', 'fun', 'program'),
 ('hello', 'word', 'is', 'a', 'fun'),
 ('word', 'is', 'a', 'fun', 'program')]
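If you want to see what is going on without installing NLTK, here is a rough pure-Python sketch of the same idea. The name all_ngrams and its parameters are my own, not part of NLTK, and it always returns the n-grams shortest first, which may not match the order NLTK gives you.

```python
def all_ngrams(tokens, min_len=1, max_len=None):
    """Return every n-gram with min_len <= n <= max_len, shortest first."""
    tokens = list(tokens)  # accept a string, a list, or any other iterable
    if max_len is None:
        max_len = len(tokens)
    result = []
    for n in range(min_len, max_len + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            result.append(tuple(tokens[i:i + n]))
    return result

grams = all_ngrams("hello", 2, 3)
print(grams)
# [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o'),
#  ('h', 'e', 'l'), ('e', 'l', 'l'), ('l', 'l', 'o')]
```

For word tokens, pass a list instead of a string, e.g. all_ngrams('hello word is a fun program'.split(), 1, 5).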
If I'm interpreting your question correctly, the trigrams you want are simply the slices [2:5], [4:7], [6:9], etc. of the token list: start at index 2 and move forward two tokens each time.
You could generate them like this:
>>> new_trigrams = []
>>> c = 2
>>> while c < len(token) - 2:
...     new_trigrams.append((token[c], token[c+1], token[c+2]))
...     c += 2
...
>>> print(new_trigrams)
[('are', 'you', '?'), ('?', 'i', 'am'), ('am', 'fine', 'and')]
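The same loop can be written more compactly with range and a step. This is only a sketch: strided_ngrams is a name I made up, and the token list below is a hypothetical one chosen so that the result matches the output above (the real list comes from the question).

```python
def strided_ngrams(tokens, n=3, start=2, step=2):
    """Collect n-grams beginning at index start, advancing step tokens each time."""
    # The last valid start index is len(tokens) - n, hence the + 1 in the stop value.
    return [tuple(tokens[i:i + n])
            for i in range(start, len(tokens) - n + 1, step)]

# Hypothetical token list, assumed only for illustration.
token = ['hi', 'how', 'are', 'you', '?', 'i', 'am', 'fine', 'and', 'you']

new_trigrams = strided_ngrams(token)
print(new_trigrams)
# [('are', 'you', '?'), ('?', 'i', 'am'), ('am', 'fine', 'and')]
```

Changing n, start or step gives other overlapping windows, e.g. strided_ngrams(token, n=2, start=0, step=1) yields ordinary bigrams.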