Performs badly on classification task #17
I notice that although the top words in each topic are clearer than in other topic models like NVDM, topics related to some classes are less likely to appear. For example, only a few of the 100 topics indicate music/movie. The classes I use are ['car', 'game', 'food', 'movie', 'music', 'news', 'show', 'sports', 'tech', 'travel'].
@worldchanger6666 I am not associated with the ETM paper, but here are my 2 cents on why you see poor classification performance. When performing topic modelling you are throwing away a lot of information. I personally wouldn't use LDA representations for downstream tasks; I see them more as a way of finding and visualizing a manageable set of themes/topics. In your case the average title length is just 10 words, so there are probably a lot of very subtle or rare words that you want to capture as part of classification, but LDA effectively smooths these over. Just from the classes I can see potentially a lot of overlap between "movie", "music" and "show". So you're better off using an SVM with a BoW feature representation, or if you want to use embeddings, you can try Deep Averaging Networks (DANs).
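The BoW + SVM baseline suggested here can be sketched with scikit-learn. The titles and labels below are made-up stand-ins for the actual YouTube-title dataset:

```python
# Sketch of the suggested baseline: bag-of-words features + linear SVM.
# The toy titles/labels here are hypothetical stand-ins for the real data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

titles = [
    "new sports car review",
    "top 10 travel destinations",
    "official movie trailer",
    "live music concert highlights",
]
labels = ["car", "travel", "movie", "music"]

# CountVectorizer builds the BoW matrix; LinearSVC is the linear SVM on top.
clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(titles, labels)

pred = clf.predict(["best car on the road"])
print(pred[0])
```

On the real data you would fit this on the full 100k titles and score with macro F1, exactly as done with the theta features, so the two representations are compared under the same classifier.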
You should never use representations from a generative model like LDA to perform a classification task.
I applied ETM to a dataset of YouTube titles; each title belongs to one of 10 classes. There are 10,000 titles per class, and the average title length is 10 words. I set num_topics = 200.
After training for 1000 epochs, I use theta (the topic distribution for each title, a 200×1 vector) obtained from ETM as the input to an SVM and try to do classification. But the result is bad: the F1 score is 0.62.
However, simply using the count vector as input (a 43020×1 vector) gives an F1 of around 0.75.
Does anyone know a potential reason that explains this? Thanks!