Doesn't seem to work for Gensim Topic Models #32

avisekksarma · 2024-02-29T11:06:34Z

I have trained an LDA model using Gensim and now want to use topicwizard for visualization.
But even after following the README instructions for Gensim topic models, it doesn't seem to work.

Note: I am working in Nepali. The LDA model was also trained on Nepali text.
Code:

from gensim.corpora.dictionary import Dictionary
from topicwizard.compatibility import gensim_pipeline
import topicwizard

dictionary and lda_model are loaded from my earlier training with Gensim.

I have checked both and there is no problem with either of them.

dictionary_form_data = Dictionary(dictionary)
pipeline = gensim_pipeline(dictionary_form_data, model=lda_model)


corpus = [" ".join(tokenized_news) for tokenized_news in dictionary]

No problem so far; the corpus looks like this:

print(corpus[10:12])

Now fitting the pipeline on the corpus:

pipeline.fit(corpus)

Printing topic_names:

So everything seems to be fine, but not when doing the visualization:

So, what is the problem here? Is it that topicwizard doesn't work with Gensim topic models?


x-tabdeveloping · 2024-02-29T13:56:21Z

I think I wrote this down in the documentation, but only pretrained Gensim models are supported. To me it seems like you were trying to train the wrapper pipeline. Try fitting the model first and then pack it in a pipeline. I would also encourage using scikit-learn wherever possible, as it's easier to work with (in my humble opinion).

The way I would do it in Gensim goes something like this:

from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel
from gensim.utils import tokenize
from topicwizard.compatibility import gensim_pipeline

tokenized_corpus = [list(tokenize(text, lower=True)) for text in corpus]
dictionary = Dictionary(tokenized_corpus)
bow_corpus = [dictionary.doc2bow(text) for text in tokenized_corpus]
lda = LdaModel(bow_corpus, num_topics=10)
pipeline = gensim_pipeline(dictionary, model=lda)
topic_data = pipeline.prepare_topic_data(corpus)

I know for a fact that this works, because it's in the library's test suite.
I hope this helps. If you experience further issues, feel free to write again.
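
For readers unfamiliar with the bag-of-words step in the snippet above, here is a minimal plain-Python sketch of the idea behind building a dictionary and converting documents to (token id, count) pairs. It is not Gensim's implementation, and the id assignment order is not guaranteed to match Gensim's. The helpers `build_vocabulary` and `to_bow` and the toy corpus are hypothetical names invented for illustration.

```python
from collections import Counter


def build_vocabulary(tokenized_corpus):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token_to_id = {}
    for doc in tokenized_corpus:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def to_bow(tokens, token_to_id):
    """Convert one tokenized document to sorted (token_id, count) pairs.

    Tokens missing from the vocabulary are silently dropped.
    """
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())


# Hypothetical toy corpus, already tokenized.
tokenized_corpus = [["topic", "model", "topic"], ["model", "corpus"]]
token_to_id = build_vocabulary(tokenized_corpus)
bow_corpus = [to_bow(doc, token_to_id) for doc in tokenized_corpus]
print(token_to_id)
print(bow_corpus)
```

The point of the sketch is that the topic model only ever sees integer ids, so the vocabulary used at visualization time has to assign the same ids as the one used at training time.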

avisekksarma · 2024-03-08T08:14:22Z

I am sorry I couldn't get back to this sooner.
My previous code, as posted in this issue, is exactly the same as what you suggested, except for the following:
1. Point 1:
I have just loaded lda_model, which was trained and saved to disk, i.e. with
lda_model = models.ldamodel.LdaModel.load('./results/models/40_topics')

I don't think loading a saved model versus training on the go makes any difference to the model, except that retraining makes every topicwizard run slower.
You mentioned training on the go with this code:
lda = LdaModel(bow_corpus, num_topics=10)

2. Point 2:
Also, I loaded the tokenized_corpus, which is in the same format as yours; only the language is different. In my code tokenized_corpus is named dictionary (just a different name), i.e.
dictionary_form_data = Dictionary(dictionary)
Then I did:
pipeline = gensim_pipeline(dictionary_form_data, model=lda_model)
and then, as you said, I did:
topic_data = pipeline.prepare_topic_data(corpus)

but it throws the following error:

3. Point 3:
This time I also trained the model as you said, using the bow_corpus variable I had loaded (I checked it).

And I still got the same error as above.

Conclusion
So, in conclusion, I have the tokenized corpus, bow_corpus, and lda_model (I also trained lda_model on the go in point 3). I checked the format and shape of those variables, and they are similar to the English-language case. I couldn't use Gensim's tokenize() function because the text is in Nepali, where tokenization and stemming work differently.
Now, running pipeline.prepare_topic_data(corpus) throws an error, and so does topicwizard.visualize(corpus, model=pipeline).

Note: I wanted to rule out any misunderstanding, so I have posted this detailed comment. Let me know if you want to know anything more.
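
The point about Nepali tokenization can be illustrated with the standard library alone. The sketch below joins a few of the pre-tokenized Nepali words from this thread with spaces, then splits the text back in two ways: on whitespace, and with a `\w+` regular expression. The regex stands in for a generic word tokenizer and is an assumption made for illustration; the thread does not state how topicwizard's Gensim wrapper tokenizes text.

```python
import re

# Pre-tokenized Nepali words, as they might come out of a custom tokenizer.
tokens = ["सर्वोच्च", "अदालत", "प्रस्तावित", "न्यायाधीश"]

# Join into one string, the same way the corpus is built in this thread.
document = " ".join(tokens)

# Splitting on whitespace recovers the original tokens.
whitespace_tokens = document.split()

# Python's \w does not match Devanagari combining marks (vowel signs, virama),
# so a \w+ tokenizer breaks words apart at those characters.
regex_tokens = re.findall(r"\w+", document)

print(whitespace_tokens == tokens)
print(regex_tokens == tokens)
print(len(tokens), len(regex_tokens))
```

If any step re-tokenizes the joined text with a rule like this, it will produce fragments that were never in the training vocabulary, so it is worth checking that the text passed to the pipeline is split the same way as during training.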

x-tabdeveloping · 2024-03-08T11:58:14Z

Thanks for the effort and sorry for the trouble. I will mark this as a bug and try to fix it as soon as possible!

x-tabdeveloping · 2024-03-11T07:59:17Z

@avisekksarma version 1.0.2 should fix this issue. Can you please confirm that it works on your end?

avisekksarma · 2024-04-10T11:17:24Z

I apologize for not getting back to this sooner; I was going through exams.
I have now tested with the same code as in my first comment in this thread, and it is still not working.

What has changed: it no longer throws NotFittedError as before, but the following call now throws a different error:
topicwizard.visualize(corpus, model=pipeline)
Error: IndexError: index 18580 is out of bounds for axis 1 with size 18580

Just to be clear, my full code (as described above) is:

from load_variables import load_data
processed_data, bow_corpus, id2word, lda_model = load_data()
# The line above loads everything, since I have already trained with Gensim,
# i.e. the same as "lda = LdaModel(bow_corpus, num_topics=10)" for model training.
dictionary = processed_data['body'].to_numpy().tolist()
print(dictionary[0])
# prints ['सर्वोच्च', 'अदालत', 'प्रस्तावित', 'न्यायाधीश', 'अब्दुल', 'अजिज', ...], i.e. tokenized Nepali text (same form as for English)
corpus = [" ".join(tokenized_news) for tokenized_news in dictionary]

from gensim.corpora.dictionary import Dictionary
from topicwizard.compatibility import gensim_pipeline
import topicwizard

dictionary_form_data = Dictionary(dictionary)
# The two lines below are skipped, since I already loaded these objects.
# Note: even if I run them, the error is the same; the only difference is that training has to be redone.
# bow_corpus = [dictionary_form_data.doc2bow(text) for text in dictionary]
# current_lda_model = LdaModel(bow_corpus, num_topics=40)

pipeline = gensim_pipeline(dictionary_form_data, model=lda_model)

pipeline.fit(corpus)

# No error so far, but the line below raises the error
topicwizard.visualize(corpus, model=pipeline)

The last line above threw the following error:

And if I do what you suggested, I also get an error:

Code: topic_data = pipeline.prepare_topic_data(corpus)

Error:

Shouldn't I be able to load an LDA model that I already trained with Gensim, rather than training it again here as you suggested? Training an LDA model takes a lot of time. I feel that training here (i.e. lda = LdaModel(bow_corpus, num_topics=10), as you said) and loading a trained model should make no difference, since it is the same model underneath.

Can you provide a code snippet that you believe should work with Gensim LDA models, so that I can tweak it and see whether it works?

x-tabdeveloping · 2024-04-10T13:28:14Z

Hmm your code should be fine. I will look more into this. Very strange behaviour considering that I have a test case for just this and it passes. Loading the model from disk should in theory be fine. In the meantime, like I said above, you can try using sklearn instead of gensim.
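
An IndexError like the one reported above (index 18580 out of bounds for axis 1 with size 18580) is what appears when a token id is equal to or larger than the width of the model's term axis. The thread does not establish the cause, but one thing worth ruling out is a mismatch between the dictionary passed to gensim_pipeline and the vocabulary the model was trained with, since the code above builds a fresh Dictionary from the tokenized data rather than reusing the loaded id2word. The sketch below shows the check on toy data. In Gensim the two inputs would correspond to `Dictionary.token2id` and `LdaModel.num_terms`; the helper `vocabulary_mismatch` and the sample values are hypothetical.

```python
def vocabulary_mismatch(token_to_id, num_model_terms):
    """Return the tokens whose ids fall outside the model's term axis."""
    return sorted(
        token for token, idx in token_to_id.items() if idx >= num_model_terms
    )


# Hypothetical stand-ins: a vocabulary with 4 entries, a model trained on 3 terms.
token_to_id = {"court": 0, "judge": 1, "election": 2, "budget": 3}
num_model_terms = 3

out_of_range = vocabulary_mismatch(token_to_id, num_model_terms)
if out_of_range:
    print(f"{len(out_of_range)} token id(s) exceed the model's term axis:", out_of_range)
else:
    print("Vocabulary fits the model's term axis.")
```

An empty result does not prove the two vocabularies are identical (ids could still map to different words), but a non-empty one would explain an out-of-bounds index of exactly this shape.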

avisekksarma · 2024-04-10T15:30:25Z

By using sklearn, do you mean passing the lda_model trained with Gensim into some sklearn function? Can you provide some code showing how to use sklearn with the already trained lda_model? Or can you clarify, with some code, what you meant by using sklearn here?
Thank you.

x-tabdeveloping · 2024-04-16T09:04:47Z

If you don't mind training a new model, you can do so with sklearn like this:

import joblib
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import topicwizard
from topicwizard.pipeline import make_pipeline

pipeline = make_pipeline(CountVectorizer(), LatentDirichletAllocation())
topic_data = pipeline.prepare_topic_data(corpus)

# I recommend persisting this data to disk
joblib.dump(topic_data, "topic_data.joblib")

topicwizard.visualize(topic_data=topic_data)

Though I personally would NOT recommend using LDA unless you have very good reasons to do so.
My experience is that it's incredibly slow and gives subpar results most of the time. If you wanna keep it classic and don't want to mess with contextual models I would just recommend a good preprocessing pipeline and NMF, because it's waaaay faster and gives nicer results.
If you want the best results possible you can try KeyNMF.

x-tabdeveloping · 2024-04-16T09:05:42Z

Also big sorries for not fixing this issue for a while, I've just been working on a publication and have been ill for a little while, I will hopefully get to it quickly :D

x-tabdeveloping added the "bug" (Something isn't working) label on Mar 8, 2024
