Marcos V. O. Assis ([email protected])
Develop an algorithm that classifies articles based on their titles.
For development and testing, Brazilian Portuguese (pt-BR) text was used.
- Download the pre-trained Word2Vec CBOW model from NILC (cbow_s300).
- Load the model using the Gensim library.
- Vectorize the article titles using the NLTK library and the CBOW model.
- Train a Logistic Regression classification model using the scikit-learn library.
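
The steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the file name `cbow_s300.txt`, the helper names, the whitespace tokenization (a stand-in for the NLTK tokenizer), and the choice to average word vectors into one title vector are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def load_cbow(path="cbow_s300.txt"):
    """Load the NILC CBOW vectors with Gensim (the path is an assumed file name)."""
    from gensim.models import KeyedVectors  # imported lazily; only needed here
    return KeyedVectors.load_word2vec_format(path)


def title_to_vector(title, vectors, dim=300):
    """Represent a title as the mean of its known word vectors (an assumption)."""
    # A whitespace split stands in for an NLTK tokenizer in this sketch.
    tokens = title.lower().split()
    known = [np.asarray(vectors[t], dtype=float) for t in tokens if t in vectors]
    if not known:
        # No token is in the vocabulary: fall back to a zero vector.
        return np.zeros(dim)
    return np.mean(known, axis=0)


def vectorize_titles(titles, vectors, dim=300):
    """Stack one vector per title into a (n_titles, dim) matrix."""
    return np.vstack([title_to_vector(t, vectors, dim) for t in titles])


def train_classifier(titles, labels, vectors, dim=300):
    """Fit a Logistic Regression model on the vectorized titles."""
    X = vectorize_titles(titles, vectors, dim)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    return clf
```

Any mapping from word to vector works for `vectors` here, so a small dictionary can stand in for the full model while experimenting.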
- Mundo (World)
- Cotidiano (Daily life)
- Mercado (Market)
- Esporte (Sports)
- Ilustrada (Illustrated)
- Colunas (Columns)
- Accuracy of 0.8 using CBOW (against 0.3 for a Dummy Classifier baseline).
- For the individual categories, the proposed model achieved the following F1-scores:
  - colunas: 0.78
  - cotidiano: 0.69
  - esporte: 0.90
  - ilustrada: 0.23
  - mercado: 0.81
  - mundo: 0.79
- CBOW performs slightly better than Skip-gram for this problem.
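
Metrics like the ones above (accuracy against a dummy baseline, and one F1-score per category) can be computed with scikit-learn along these lines. This is a hedged sketch: the helper names are hypothetical, and the `most_frequent` baseline strategy is an assumption, since the strategy used for the Dummy Classifier is not stated here.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score


def baseline_accuracy(X_train, y_train, X_test, y_test):
    """Accuracy of a baseline that always predicts the most frequent class."""
    # strategy="most_frequent" is an assumption, not taken from the project.
    dummy = DummyClassifier(strategy="most_frequent")
    dummy.fit(X_train, y_train)
    return accuracy_score(y_test, dummy.predict(X_test))


def per_class_f1(y_true, y_pred):
    """Return a {category: F1-score} dictionary, one entry per true category."""
    labels = sorted(set(y_true))
    # average=None yields one score per label, in the order given by `labels`.
    scores = f1_score(y_true, y_pred, labels=labels, average=None)
    return {label: float(score) for label, score in zip(labels, scores)}
```

Comparing `accuracy_score` for the trained model with `baseline_accuracy` on the same split shows how much the classifier gains over guessing, while `per_class_f1` exposes weak categories that an overall accuracy figure can hide.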