nlp-news-categorization

Assignment Option Four - News Categorization using PyTorch

Link to our GitHub repository: https://github.com/lennartmoritz/nlp-news-categorization

Authors

Carlotta Mahncke
Lennart Joshua Moritz
Timon Engelke
Christian Schuler

Setup

sudo apt install python3-venv
python3 -m venv venv
source venv/bin/activate
Automatic: ./script-caller.bh
Or Manual:
- pip install -r requirements.txt
- python -m spacy download "en_core_web_sm"

Use

Multiclass

python news-categorization.py -l b t e m

Binary classification

python news-categorization.py -l b t

Specify embedding

python news-categorization.py -l b t -e word2vec
Choose from: word, lemma, word2vec, glove
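
The flags above could be parsed along the following lines. This is a minimal sketch, not the parser in news-categorization.py: the function names, the default values, and the rule that two labels mean a binary task are assumptions made for illustration. The letter codes b, t, e, and m are the category codes of the News Aggregator Dataset.

```python
import argparse

# Category codes used by the News Aggregator Dataset.
CATEGORY_NAMES = {
    "b": "Business",
    "t": "Science and Technology",
    "e": "Entertainment",
    "m": "Health",
}
EMBEDDINGS = ["word", "lemma", "word2vec", "glove"]


def build_parser():
    """Build a parser for the -l and -e flags (hypothetical defaults)."""
    parser = argparse.ArgumentParser(description="News categorization sketch")
    parser.add_argument(
        "-l", dest="labels", nargs="+", choices=sorted(CATEGORY_NAMES),
        default=["b", "t", "e", "m"],  # assumption: all four categories
        help="category codes to classify",
    )
    parser.add_argument(
        "-e", dest="embedding", choices=EMBEDDINGS,
        default="word",  # assumption: the real default may differ
        help="embedding to use",
    )
    return parser


def describe_run(argv):
    """Return (task type, category names, embedding) for a command line."""
    args = build_parser().parse_args(argv)
    task = "binary" if len(args.labels) == 2 else "multiclass"
    return task, [CATEGORY_NAMES[code] for code in args.labels], args.embedding
```

For example, describe_run(["-l", "b", "t", "-e", "word2vec"]) reports a binary task on Business versus Science and Technology with the word2vec embedding.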

Task

Text categorization using PyTorch

The project follows a typical PyTorch workflow.
The task is to classify news articles into one of the following categories:
- Business
- Science and Technology
- Entertainment
- Health
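
Before training, the selected categories have to be turned into integer class indices that a classifier can use as targets. The sketch below shows one way to do this with the standard library. It assumes the dataset CSV has TITLE and CATEGORY columns; the function name load_examples and the placeholder rows are hypothetical and are not taken from the repository.

```python
import csv
import io


def load_examples(csv_file, labels):
    """Return (title, class index) pairs for rows whose category is selected.

    The class index is the position of the category code in `labels`.
    Assumes the columns are named TITLE and CATEGORY.
    """
    label_to_index = {label: i for i, label in enumerate(labels)}
    examples = []
    for row in csv.DictReader(csv_file):
        if row["CATEGORY"] in label_to_index:
            examples.append((row["TITLE"], label_to_index[row["CATEGORY"]]))
    return examples


# Placeholder rows for illustration only, not real dataset entries.
sample = io.StringIO(
    "TITLE,CATEGORY\n"
    "Placeholder business headline,b\n"
    "Placeholder technology headline,t\n"
    "Placeholder health headline,m\n"
)
examples = load_examples(sample, ["b", "t"])
```

With the labels b and t, the health row is dropped and the remaining titles are paired with the indices 0 and 1.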

Results

As shown in our presentation, the lemma embedding performed best, followed by the word embedding. Pretrained embeddings such as word2vec and GloVe performed worse than the embeddings we trained ourselves, probably because the self-trained embeddings are more specific to our dataset.

In the binary classification tasks, distinguishing Business from Science and Technology was the hardest, while distinguishing Health from Entertainment was the easiest. This is probably because Business and Science and Technology articles share much of their vocabulary, while Health and Entertainment articles use very different vocabularies.
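
One way to check the vocabulary explanation is to measure how much the word sets of two categories overlap, for example with the Jaccard index (shared words divided by all words). The sketch below is not part of the repository and we did not report such a measurement; the function names and the whitespace tokenization are assumptions made for illustration.

```python
def vocabulary(texts):
    """Collect the set of lowercased, whitespace-separated tokens."""
    return {token for text in texts for token in text.lower().split()}


def jaccard(vocab_a, vocab_b):
    """Jaccard index of two token sets: |A and B| / |A or B|."""
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


def category_overlap(titles_a, titles_b):
    """Vocabulary overlap between the titles of two categories."""
    return jaccard(vocabulary(titles_a), vocabulary(titles_b))
```

Under the explanation above, category_overlap should come out higher for the Business and Science and Technology titles than for the Health and Entertainment titles.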

Comparison to BERT

You can use the script bert_classification.py to fine-tune a pre-trained BERT classifier on our dataset. Its results on the multi-class classification task are slightly better than those of our own models (around 96% accuracy compared to around 94%).

Dataset

News Aggregator Dataset https://www.kaggle.com/datasets/uciml/news-aggregator-dataset

