How can I print the entire contents of Wordnet (preferably with NLTK)?

WordNet is a word-sense resource, so its entries are indexed by sense (a.k.a. synset).

To iterate through synsets:

>>> from nltk.corpus import wordnet as wn
>>> for ss in wn.all_synsets():
...     print(ss)
...     print(ss.definition())
...     break
... 
Synset('able.a.01')
(usually followed by `to') having the necessary means or skill or know-how or authority to do something

For each synset (sense/concept), there is a list of words attached to it, called lemmas. Lemmas are the canonical ("root") forms of words, the forms we look up when we check a dictionary.

To get a full list of lemmas in wordnet using a one-liner:

>>> from itertools import chain
>>> lemmas_in_wordnet = set(chain.from_iterable(ss.lemma_names() for ss in wn.all_synsets()))

Interestingly, wn.words() also returns all the lemma names:

>>> lemmas_in_words = set(wn.words())
>>> len(lemmas_in_wordnet)
148730
>>> len(lemmas_in_words)
147306

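Strangely, though, the two totals differ: wn.words() yields 1,424 fewer entries. A likely cause is casing: wn.words() appears to read lemma names from WordNet's lowercased index files, while ss.lemma_names() preserves the original capitalization, so case variants of the same word are counted separately in the first set.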
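One way to investigate the gap is to compare the two sets directly. The sketch below is only an illustration: explain_gap is a hypothetical helper (not part of NLTK), and it assumes the two sets were built as shown above. It reports which names appear on only one side, and which of the synset-side extras would match a wn.words() entry once lowercased.

```python
def explain_gap(from_synsets, from_words):
    """Compare two sets of lemma names.

    Returns (only_in_synsets, only_in_words, case_only), where case_only
    holds the names missing from from_words only because of capitalization.
    """
    only_in_synsets = from_synsets - from_words
    only_in_words = from_words - from_synsets
    # Names whose lowercased form does exist on the other side.
    case_only = {name for name in only_in_synsets if name.lower() in from_words}
    return only_in_synsets, only_in_words, case_only
```

Calling explain_gap(lemmas_in_wordnet, lemmas_in_words) and checking the size of each returned set should show how much of the difference, if any, comes from case alone.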

"Printing the full content" of WordNet as text is probably too ambitious, because WordNet is structured roughly like a hierarchical graph: synsets are interconnected, and each synset has its own properties/attributes. That's why the WordNet files are not kept as a single text file.
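That said, if a flat and lossy view is good enough, a few fields per synset can be written out one record per line. The following is a minimal sketch, not a complete export: the function names, the tab-separated layout, the choice of fields and the output file name wordnet_dump.tsv are all assumptions made for illustration. It only uses synset methods that appear elsewhere in this answer (name, pos, lemma_names, definition, hypernyms).

```python
def synset_to_line(ss):
    """Flatten one synset into a tab-separated record (illustrative layout)."""
    fields = [
        ss.name(),                                   # e.g. 'able.a.01'
        ss.pos(),                                    # part of speech
        ",".join(ss.lemma_names()),                  # words attached to the sense
        ss.definition(),                             # gloss
        ",".join(h.name() for h in ss.hypernyms()),  # direct hypernyms only
    ]
    return "\t".join(fields)


def dump_synsets(synsets, out):
    """Write one record per synset to a file-like object; return the count."""
    count = 0
    for ss in synsets:
        out.write(synset_to_line(ss) + "\n")
        count += 1
    return count


def dump_wordnet(path="wordnet_dump.tsv"):
    """Dump every synset to `path`. Requires NLTK and its WordNet data."""
    from nltk.corpus import wordnet as wn
    with open(path, "w", encoding="utf-8") as out:
        return dump_synsets(wn.all_synsets(), out)
```

Calling dump_wordnet() writes the file and returns the number of synsets written. Relations other than direct hypernyms (hyponyms, meronyms, and so on) are dropped here, which is exactly the information a flat text file struggles to hold.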

To see what a synset contains:

>>> first_synset = next(wn.all_synsets())
>>> dir(first_synset)
['__class__', '__delattr__', '__dict__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_all_hypernyms', '_definition', '_examples', '_frame_ids', '_hypernyms', '_instance_hypernyms', '_iter_hypernym_lists', '_lemma_names', '_lemma_pointers', '_lemmas', '_lexname', '_max_depth', '_min_depth', '_name', '_needs_root', '_offset', '_pointers', '_pos', '_related', '_shortest_hypernym_paths', '_wordnet_corpus_reader', 'also_sees', 'attributes', 'causes', 'closure', 'common_hypernyms', 'definition', 'entailments', 'examples', 'frame_ids', 'hypernym_distances', 'hypernym_paths', 'hypernyms', 'hyponyms', 'instance_hypernyms', 'instance_hyponyms', 'jcn_similarity', 'lch_similarity', 'lemma_names', 'lemmas', 'lexname', 'lin_similarity', 'lowest_common_hypernyms', 'max_depth', 'member_holonyms', 'member_meronyms', 'min_depth', 'name', 'offset', 'part_holonyms', 'part_meronyms', 'path_similarity', 'pos', 'region_domains', 'res_similarity', 'root_hypernyms', 'shortest_path_distance', 'similar_tos', 'substance_holonyms', 'substance_meronyms', 'topic_domains', 'tree', 'unicode_repr', 'usage_domains', 'verb_groups', 'wup_similarity']
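The listing is easier to scan with the private and special names filtered out. Here is a small sketch; public_names is a hypothetical helper, not an NLTK function.

```python
def public_names(obj):
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]
```

With it, public_names(first_synset) would leave only entries such as 'definition', 'hypernyms' and 'lemma_names', which are the ones meant to be called directly.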

The NLTK WordNet howto is a helpful guide to accessing the information you need in WordNet: http://www.nltk.org/howto/wordnet.html

Please try the following:

for word in wn.words():
    print(word)

This should work because wn.words() is actually an iterator that generates a sequence of strings, rather than a list of strings like b.words. The for loop causes the iterator to generate the words one at a time.
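Because the words are produced lazily, it is also possible to look at just the first few without walking the whole vocabulary. A minimal sketch, where preview is a hypothetical helper name:

```python
from itertools import islice


def preview(iterable, n=10):
    """Return the first n items of any iterable, pulling no more than n from it."""
    return list(islice(iterable, n))
```

For example, preview(wn.words(), 5) would return a list of five words and stop there, whereas the for loop above keeps going until the iterator is exhausted.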
