Doesn't seem to work for Gensim Topic Models #32
I think I wrote this down in the documentation, but only pretrained Gensim models are supported. It looks like you were trying to train the wrapper pipeline. Try fitting the model first and then packing it into a pipeline. I would also encourage using scikit-learn wherever possible, as it's easier to work with (in my humble opinion). The way I would do it in Gensim goes something like this:
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel
from gensim.utils import tokenize
from topicwizard.compatibility import gensim_pipeline
# corpus is a list of raw text documents
tokenized_corpus = [list(tokenize(text, lower=True)) for text in corpus]
dictionary = Dictionary(tokenized_corpus)
bow_corpus = [dictionary.doc2bow(text) for text in tokenized_corpus]
# Train the LDA model first, on the bag-of-words corpus
lda = LdaModel(bow_corpus, num_topics=10)
# Then wrap the already trained model in a pipeline
pipeline = gensim_pipeline(dictionary, model=lda)
topic_data = pipeline.prepare_topic_data(corpus)
I know for a fact that this works, because it's in the library's test suite.
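For readers unfamiliar with the preprocessing steps above, here is a minimal standard-library sketch of what tokenizing, building a vocabulary, and converting to bag-of-words conceptually produce. This is an illustration only, not Gensim's implementation: the helper names (`tokenize`, `build_vocabulary`, `doc2bow`) and the example documents are hypothetical, and Gensim's `Dictionary` may assign token ids in a different order.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters (illustrative only)."""
    return [tok.lower() for tok in re.findall(r"[^\W\d_]+", text)]


def build_vocabulary(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id


def doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())


# Hypothetical two-document corpus
docs = ["Topic models find topics", "Models need a vocabulary"]
tokenized = [tokenize(d) for d in docs]
token2id = build_vocabulary(tokenized)
bow = [doc2bow(t, token2id) for t in tokenized]
```

The topic model is trained on `bow`-style data, which is why the model has to be fitted before it is wrapped: the pipeline only reuses the existing vocabulary and the trained model to transform raw text.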
Thanks for the effort, and sorry for the trouble. I will mark this as a bug and try to fix it as soon as possible!
@avisekksarma version 1.0.2 should fix this issue. Can you please confirm that it works on your end?
Hmm, your code should be fine. I will look into this more. This is very strange behaviour, considering that I have a test case for exactly this and it passes. Loading the model from disk should in theory be fine. In the meantime, like I said above, you can try using sklearn instead of Gensim.
By using sklearn, do you mean passing the lda_model trained with Gensim into some sklearn function? Can you provide some code showing how to use sklearn with the trained lda_model? Or can you clarify what you meant by using sklearn here, with a code example?
If you don't mind training a new model you can do so with sklearn like this:
import joblib
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import topicwizard
from topicwizard.pipeline import make_pipeline
pipeline = make_pipeline(CountVectorizer(), LatentDirichletAllocation())
topic_data = pipeline.prepare_topic_data(corpus)
# I recommend persisting this data to disk
joblib.dump(topic_data, "topic_data.joblib")
topicwizard.visualize(topic_data=topic_data)
Though I personally would NOT recommend using LDA unless you have very good reasons to do so.
Also, big apologies for not fixing this issue for a while. I've been working on a publication and have been ill for a little while. I will hopefully get to it quickly :D
I have trained an LDA model using Gensim, and now want to use topicwizard for visualization.
But even after following the README section on Gensim topic models, it doesn't seem to work.
Note: I am working with the Nepali language. The LDA model was also trained on Nepali text.
Code:
dictionary and lda_model are loaded from my Gensim training run.
I have checked them and there is no problem with either.
No problem so far, as the corpus is shown with:
print(corpus[10:12])
Now fitting the pipeline on the corpus:
pipeline.fit(corpus)
Printing topic_names:
So everything seems to be fine, but when doing the visualization:
So, what is the problem here? Is it that topicwizard does not work with Gensim topic models?