LDA model generates different topics every time I train on the same corpus

Why do the same LDA parameters and corpus generate different topics every time?

Because LDA uses randomness in both the training and inference steps.

And how do I stabilize the topic generation?

By resetting the numpy.random seed to the same value before every training or inference call, using numpy.random.seed:

import numpy as np

SOME_FIXED_SEED = 42

# before training/inference:
np.random.seed(SOME_FIXED_SEED)

(This is ugly, and it makes Gensim results hard to reproduce; consider submitting a patch. I've already opened an issue.)
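
A minimal sketch of that workaround (it assumes an older Gensim version that draws from NumPy's global RNG, and that corpus, id2word, and a bag-of-words document doc_bow already exist):

import numpy as np
import gensim

SOME_FIXED_SEED = 42

np.random.seed(SOME_FIXED_SEED)  # reset before training
lda = gensim.models.ldamodel.LdaModel(corpus, id2word=id2word, num_topics=10)

np.random.seed(SOME_FIXED_SEED)  # reset again before inference
topic_dist = lda.get_document_topics(doc_bow)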


Set the random_state parameter when initializing LdaModel().

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=num_topics,
                                            random_state=1,
                                            passes=num_passes,
                                            alpha='auto')
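
As a quick sanity check (a sketch; the variable names here are mine), training twice with the same random_state should yield identical topics:

lda_a = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word,
                                        num_topics=num_topics, random_state=1)
lda_b = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word,
                                        num_topics=num_topics, random_state=1)
assert lda_a.print_topics() == lda_b.print_topics()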

I had the same problem, even with about 50,000 comments. But you can get much more consistent topics by increasing the number of iterations the LDA runs for. It defaults to 50, and when I raise it to 300 it usually gives me the same results, probably because the model is much closer to convergence.

Specifically, you just add the following option:

ldamodel.LdaModel(corpus, ..., iterations=<your desired iterations>)
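
Spelled out against the earlier snippet (a sketch reusing the same assumed variables):

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=num_topics,
                                            passes=num_passes,
                                            iterations=300)  # default is 50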

This is due to the probabilistic nature of LDA, as noted by others. However, I don't believe setting the random_state argument to a fixed number is the proper solution.

Definitely try increasing the number of iterations first to make sure your algorithm is converging. Even then, each starting point may land you on a different local minimum. So you can run LDA multiple times without setting random_state, and then compare the results using the coherence score of each model. This helps you avoid suboptimal local minima.

Gensim's CoherenceModel already has the most common coherence metrics implemented for you, such as c_v, u_mass, and c_npmi.
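
A sketch of that multiple-run comparison (it assumes texts holds the tokenized documents behind corpus, and id2word is the Gensim Dictionary):

from gensim.models import CoherenceModel, LdaModel

best_model, best_score = None, float('-inf')
for _ in range(5):  # several random restarts, no fixed random_state
    model = LdaModel(corpus=corpus, id2word=id2word,
                     num_topics=num_topics, iterations=300)
    score = CoherenceModel(model=model, texts=texts, dictionary=id2word,
                           coherence='c_v').get_coherence()
    if score > best_score:
        best_model, best_score = model, score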

You might find that these steps make the results more stable, but they won't actually guarantee the same results from run to run. However, IMO it's better to get as close to the global optimum as possible instead of being stuck on the same local minimum because of a fixed random_state.