Performs badly on classification task #17
I notice that although the top words in each topic are clearer than in other topic models like NVDM, topics related to some classes are less likely to appear. For example, only a few of the 100 topics indicate music/movie. The classes I use are ['car', 'game', 'food', 'movie', 'music', 'news', 'show', 'sports', 'tech', 'travel'].
@worldchanger6666 I am not associated with the ETM paper, but here are my 2 cents on why you see poor classification performance. When performing topic modelling you are throwing away a lot of information. I personally wouldn't use LDA representations for downstream tasks; I see them more as a way of finding and visualizing a manageable set of themes/topics. In your case the average title length is just 10 words, so there are probably a lot of very subtle or rare words that you want to capture as part of classification, but LDA effectively smooths these over. Just from the classes I can see potentially a lot of overlap between "movie", "music" and "show". So you're better off using an SVM with a BoW feature representation, or if you want to use embeddings, you can try Deep Averaging Networks (DANs).
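The BoW + SVM baseline suggested here can be sketched with scikit-learn. The titles and labels below are made-up stand-ins for the actual YouTube-title dataset:

```python
# Sketch of the suggested baseline: bag-of-words features + linear SVM.
# The toy titles/labels here are hypothetical stand-ins for the real data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

titles = [
    "new sports car review",
    "top 10 travel destinations",
    "official movie trailer",
    "live music concert highlights",
]
labels = ["car", "travel", "movie", "music"]

# CountVectorizer builds the BoW matrix; LinearSVC is the linear SVM on top.
clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(titles, labels)

pred = clf.predict(["best car on the road"])
print(pred[0])
```

On the real data you would fit this on the full 100k titles and score with macro F1, exactly as done with the theta features, so the two representations are compared under the same classifier.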
You should never use representations from a generative model like LDA to perform a classification task.
I applied ETM to a dataset of YouTube titles; each title belongs to one of 10 classes. There are 10,000 titles per class, and the average title length is 10 words. I set num_topics = 200.
After training for 1000 epochs, I use theta (the topic distribution for each title, a 200×1 vector) obtained from ETM as the input to an SVM and try to do classification. But the result is bad: the F1 score is 0.62.
However, simply using the count vector as input (a 43020×1 vector) gives an F1 of around 0.75.
Does anyone know a potential reason that explains this? Thanks!