Hierarchical Dirichlet Process Gensim topic number independent of corpus size
@Aaron's code above is broken due to gensim API changes. I rewrote and simplified it as follows. It works as of June 2017 with gensim v2.1.0:
import pandas as pd

def topic_prob_extractor(gensim_hdp):
    # num_topics=-1 asks show_topics for every topic in the model
    shown_topics = gensim_hdp.show_topics(num_topics=-1, formatted=False)
    topics_nos = [x[0] for x in shown_topics]
    # approximate each topic's weight as the sum of its top words' probabilities
    weights = [sum([item[1] for item in shown_topics[topicN][1]]) for topicN in topics_nos]
    return pd.DataFrame({'topic_id': topics_nos, 'weight': weights})
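For reference, here is a minimal usage sketch; the toy docs, dictionary, and corpus below are illustrative placeholders, not part of the original answer:

from gensim.corpora import Dictionary
from gensim.models import HdpModel

# tiny toy corpus, for illustration only
docs = [['human', 'interface', 'computer'],
        ['survey', 'user', 'computer', 'system'],
        ['eps', 'user', 'interface', 'system']]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

hdp = HdpModel(corpus, dictionary)
print(topic_prob_extractor(hdp).head())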
@Aaron's and @Roko Mijic's approaches neglect the fact that the function show_topics returns, by default, only the top 20 words of each topic. If one instead returns all the words that compose a topic, all the approximated topic probabilities will be 1 (or 0.999999). I experimented with the following code, which is an adaptation of @Roko Mijic's (the effect is illustrated in the sketch after the function):
def topic_prob_extractor(gensim_hdp, t=-1, w=25, isSorted=True):
    """
    Take a trained gensim HDP model and return each topic's rough
    probability, approximated by summing the weights of its top w words.
    """
    shown_topics = gensim_hdp.show_topics(num_topics=t, num_words=w, formatted=False)
    topics_nos = [x[0] for x in shown_topics]
    weights = [sum([item[1] for item in shown_topics[topicN][1]]) for topicN in topics_nos]
    if isSorted:
        return pd.DataFrame({'topic_id': topics_nos, 'weight': weights}).sort_values(by="weight", ascending=False)
    else:
        return pd.DataFrame({'topic_id': topics_nos, 'weight': weights})
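To see the saturation effect described above, compare a small and a large num_words (a hedged sketch: hdp is the toy model from earlier, and the value 10000 simply exceeds the toy vocabulary size):

# with only the top 25 words, the weights are rough approximations
print(topic_prob_extractor(hdp, t=-1, w=25))
# with (nearly) all vocabulary words included, every weight approaches 1.0
print(topic_prob_extractor(hdp, t=-1, w=10000))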
A better approach, though I'm not sure it is 100% valid, is the one mentioned here. You can get the topics' true weights (the alpha vector) of the HDP model as:

alpha = hdpModel.hdp_to_lda()[0]
Examining the topics' equivalent alpha values is more logical than tallying up the weights of the first 20 words of each topic to approximate its probability of usage in the data.
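For example, a short sketch (assuming hdpModel is a trained HdpModel; the variable names are mine) that ranks topics by their alpha weights:

import numpy as np
import pandas as pd

alpha = hdpModel.hdp_to_lda()[0]
# topic indices sorted by weight, largest first
topic_order = np.argsort(alpha)[::-1]
print(pd.DataFrame({'topic_id': topic_order, 'alpha': alpha[topic_order]}).head())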
I think you misunderstood the operation performed by the called method. Directly from the documentation you can see:
Alias for show_topics() that prints the top n most probable words for topics number of topics to log. Set topics=-1 to print all topics.
You trained the model without specifying a truncation level for the number of topics, and the default is 150. Calling print_topics with topics=-1 gives you the top 20 words for each topic, which in your case means 150 topics.
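If the default truncation is larger than you need, you can lower it at training time via the T parameter (a sketch; corpus and dictionary stand for whatever you trained on):

from gensim.models import HdpModel

# T caps the number of topics the truncated HDP may use (default 150)
hdp = HdpModel(corpus, dictionary, T=50)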
I'm still new to the library, so I may be wrong.