This project explores text mining for classification: it builds a bag-of-words representation of the documents and applies machine learning models to it.
The daily growth of data available online creates a growing need to organize and regularize that data. Textual data is all around us, from web pages, e-books and media articles to emails and user comments. In many cases automatic text classification would shorten processing time (for example, detecting spam pages, sorting personal email, tagging products or filtering documents). Any organization (e.g. in academia, marketing or government) that deals with a lot of unstructured text could handle that data much more easily if it were standardized by categories/tags. This dataset is a collection of newsgroup documents. The 4 newsgroups collection can be used for experiments in text applications of machine learning techniques, such as text classification and text clustering.
Text classification, or text categorization, is the task of labelling natural language texts with relevant predefined categories. The idea is to automatically organize text into different classes. It can drastically simplify and speed up your search through documents or texts!
3 major steps in Text-Mining-in-R
code :
- While training and building a model, keep in mind that the first model is never the best one, so the best practice is the “trial and error” method. To make that process simpler, you should create a function for training and, in each attempt, save the results and accuracies.
- I decided to sort the EDA process into two categories: general pre-processing steps that were common across all vectorizers and models, and optional pre-processing steps that I switched on and off to measure model performance with and without them.
- Accuracy was chosen as the measure of comparison between models, since the greater the accuracy, the better the model performs on test data.
- First of all, I've created a Bag of Words file. This file, clean_data.R, contains all the methods to preprocess the text and generate the bag of words. We use the Corpus library to handle preprocessing and to generate the Bag of Words.
- The following general pre-processing steps were carried out since any document being input to a model would be required to be in a certain format:
- Converting to lowercase
- Removal of stop words
- Removing alphanumeric characters
- Removal of punctuation
- Vectorization: TfVectorizer was used. The resulting model accuracy was compared with that of models that used TfIDFVectorizer. In all cases TfVectorizer gave better results, so it was chosen as the default vectorizer.
- The following steps were added to the pre-processing steps as optional to see how model performance changed with and without these steps:
1. Stemming
2. Lemmatization
3. Using Unigrams/Bigrams
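The pre-processing steps above can be sketched in base R as follows. This is a minimal illustration and not the contents of clean_data.R: the helper names (`preprocess`, `make_ngrams`, `bag_of_words`), the tiny stop-word list and the two example documents are all made up for the example, and stripping digits is only one possible reading of the "alphanumeric characters" step. Stemming and lemmatization need an extra package and are not shown.

```r
# Illustrative stop-word list (an assumption; a real list would be much longer).
stop_words <- c("the", "a", "an", "is", "are", "of", "to", "and", "in", "on")

# Lowercase, strip punctuation and digits, tokenize, drop stop words.
preprocess <- function(text, stop_words) {
  text <- tolower(text)
  text <- gsub("[[:punct:]]", " ", text)
  text <- gsub("[[:digit:]]", " ", text)   # assumption: digits are removed
  tokens <- strsplit(text, "[[:space:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  tokens[!tokens %in% stop_words]
}

# Build n-grams from a token vector (n = 1 gives unigrams, n = 2 bigrams).
make_ngrams <- function(tokens, n = 1) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Term-frequency bag of words: one row per document, one column per term.
bag_of_words <- function(docs, stop_words, n = 1) {
  terms_per_doc <- lapply(docs, function(d) make_ngrams(preprocess(d, stop_words), n))
  vocab <- sort(unique(unlist(terms_per_doc)))
  counts <- matrix(0L, nrow = length(docs), ncol = length(vocab),
                   dimnames = list(NULL, vocab))
  for (i in seq_along(terms_per_doc)) {
    tab <- table(terms_per_doc[[i]])
    if (length(tab) > 0) counts[i, names(tab)] <- as.integer(tab)
  }
  counts
}

docs <- c("The cat sat on the mat.", "A cat and a dog!")
bow_unigrams <- bag_of_words(docs, stop_words, n = 1)
bow_bigrams  <- bag_of_words(docs, stop_words, n = 2)
```

Passing `n = 2` switches the same pipeline from unigrams to bigrams, which makes it easy to compare model performance with and without that option.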
Confusion Matrix for Support Vector Machine using the Bag of Words generated with clean_data.R
> confusionMatrix(table(predsvm,data.test$folder_class))
Confusion Matrix and Statistics

predsvm  1  2  3  4
      1 31  0  0  0
      2  0 29  6  0
      3  0  3 28  0
      4  0  0  0 23

Overall Statistics

               Accuracy : 0.925
                 95% CI : (0.8624, 0.9651)
    No Information Rate : 0.2833
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.8994

 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: 1 Class: 2 Class: 3 Class: 4
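To make the overall statistics easier to read, the sketch below recomputes the accuracy, the no-information rate and Cohen's kappa by hand from the confusion matrix shown above, using base R only. The variable names are illustrative; the counts are copied from the output, with rows as predictions (predsvm) and columns as the actual classes.

```r
# Confusion matrix copied from the output above: rows = predicted, cols = actual.
cm <- matrix(c(31,  0,  0,  0,
                0, 29,  6,  0,
                0,  3, 28,  0,
                0,  0,  0, 23),
             nrow = 4, byrow = TRUE,
             dimnames = list(predicted = as.character(1:4),
                             actual    = as.character(1:4)))

n <- sum(cm)                                  # 120 test documents

# Accuracy: share of documents on the diagonal.
accuracy <- sum(diag(cm)) / n

# No-information rate: accuracy of always predicting the largest actual class.
nir <- max(colSums(cm)) / n

# Cohen's kappa: agreement corrected for the agreement expected by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, nir = nir, kappa = cohen_kappa), 4)
```

The results agree with the reported Accuracy (0.925), No Information Rate (0.2833) and Kappa (0.8994). All of the errors come from classes 2 and 3 being confused with each other.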
- The most interesting deduction is that the more specific the newsgroup topic is, the more accurately the Naïve Bayes classifier can determine which newsgroup a document belongs to. The converse is also true: the less specific the newsgroup is, the more the accuracy rate plummets.
- We can see this in Accuracy, where every newsgroup that isn’t a misc group has an accuracy rate of at least 50%. The newsgroups with the lowest accuracy rates are all misc groups, including a 0.25% accuracy rate for talk.politics.misc.
- A reason for this is that posts written in misc newsgroups are rarely related to the actual root of the newsgroup. The misc section caters to topics of discussion other than the “root newsgroup”. This makes it much easier for the classifier to confuse a document from a misc newsgroup with another newsgroup, and much harder for it to even consider the misc newsgroup, since posts about the root topic itself are posted in the more specific newsgroups instead.
- For example, a post about guns that is posted in talk.religion.misc can easily be classified as belonging to talk.politics.guns, because it would use words similar to those found in the posts in talk.politics.guns. Likewise, posts about politics are less likely to appear in talk.politics.misc, because you are more likely to post in talk.politics.* or talk.politics.guns (where the wildcard is the relevant section for the type of politics to be discussed).
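The guns example above can be illustrated with a toy multinomial Naïve Bayes classifier in base R. This is only a sketch: the word counts are made up, only two newsgroups and four words are used, equal priors are assumed, and the `classify` helper is hypothetical and not the implementation used in this project.

```r
# Made-up training word counts per newsgroup (not taken from the real dataset).
train_counts <- rbind(
  "talk.politics.guns" = c(gun = 8, law = 3, faith = 0, church = 0),
  "talk.religion.misc" = c(gun = 1, law = 1, faith = 5, church = 4)
)

# Laplace-smoothed log P(word | newsgroup); each row is normalized separately.
log_likelihood <- log((train_counts + 1) /
                      (rowSums(train_counts) + ncol(train_counts)))

# Assumption: both newsgroups are equally likely a priori.
log_prior <- log(c(0.5, 0.5))

# Score each newsgroup as log prior + sum(count * log likelihood).
classify <- function(doc_counts, log_likelihood, log_prior) {
  aligned <- doc_counts[colnames(log_likelihood)]
  scores <- log_prior + as.vector(log_likelihood %*% aligned)
  names(scores) <- rownames(log_likelihood)
  names(which.max(scores))
}

# A post about guns that was actually posted in talk.religion.misc.
gun_post <- c(gun = 3, law = 1, faith = 0, church = 0)
predicted <- classify(gun_post, log_likelihood, log_prior)
```

Because the post shares its vocabulary with talk.politics.guns, `predicted` is "talk.politics.guns" even though the post came from talk.religion.misc, which is exactly the kind of confusion described above.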
- Install randomForest from the R console:
install.packages("randomForest")
- Install caret from the R console:
install.packages("caret")
- Install mlr from the R console:
install.packages("mlr")
- Install MASS from the R console:
install.packages("MASS")
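As an optional convenience, the four installs can be combined so that only missing packages are installed. The sketch below is one way to do that; `missing_packages` is a hypothetical helper name, and the `interactive()` guard is an assumption added here so that nothing is installed when the code runs unattended in a script.

```r
required <- c("randomForest", "caret", "mlr", "MASS")

# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)

# Only install from an interactive R session, and only what is missing.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```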
- Download for the report.
- Why Term Frequency is better than TF-IDF for text classification
- Naïve Bayes Classification for 20 News Group Dataset
- Analyzing word and document frequency: tf-idf
- Natural Language Processing
- K Nearest Neighbor in R
- MLR Package
Text Mining Analyzer - A Detailed Report on the Analysis
- Clone this repository:
git clone https://github.com/iamsivab/Text-Mining-in-R.git
- Check out any issue from here.
- Make changes and send a Pull Request.
📧 Feel free to contact me @ [email protected]
MIT © Sivasubramanian