
ML Methods

MOAAS edited this page Apr 7, 2022 · 4 revisions

Machine Learning Methods

The structure of this page is still a work in progress. Ideas for the paper: classification (FA, GA, Stub, etc.), regression (0-100%), or a mix of both (for example, using class confidences to estimate a value).
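The "mix of both" idea could be sketched as a confidence-weighted average of per-class scores. This is only an illustration: the evenly spaced 0-100 scores in `CLASS_SCORES` and the function name `expected_quality` are assumptions, not something taken from any of the papers below.

```python
# Hypothetical mapping of quality classes to a 0-100 scale.
# Even spacing is an assumption made for this sketch.
CLASS_SCORES = {"Stub": 0.0, "Start": 20.0, "C": 40.0, "B": 60.0, "GA": 80.0, "FA": 100.0}


def expected_quality(confidences):
    """Return the confidence-weighted average of the class scores.

    `confidences` maps class labels to classifier confidences; they are
    normalized here, so they do not need to sum to exactly 1.
    """
    total = sum(confidences.values())
    if total <= 0:
        raise ValueError("confidences must sum to a positive value")
    return sum(CLASS_SCORES[label] * p for label, p in confidences.items()) / total


# A classifier that mostly believes the article is GA, with some mass on B and FA.
example = {"C": 0.1, "B": 0.2, "GA": 0.5, "FA": 0.2}
score = expected_quality(example)
```

With this mapping, an article the classifier is unsure about lands between class scores instead of being forced into one class.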

RNN + LSTM (need to investigate more)

Uses the whole article as input, but takes a long time to train. Infeasible for real-time prediction.

Results: Achieved accuracy of 60-70%

Dataset: Wikimedia Foundation English/Russian/French

Support Vector Regression

Information gain measure (infogain) was used to evaluate the impact of the chosen features.
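For reference, information gain for a discrete feature can be sketched as the drop in label entropy after splitting on that feature. This is a minimal illustration, assuming the feature has already been discretized; how the paper binned its features is not stated here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the entropy remaining after grouping by the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data: the first feature separates the classes perfectly, the second not at all.
labels = ["good", "good", "bad", "bad"]
informative = ["long", "long", "short", "short"]
uninformative = ["x", "x", "x", "x"]
```

A feature with a gain close to the label entropy is highly informative; a gain near zero means the feature tells us almost nothing about the class.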

The performance of the method was evaluated using mean squared error (MSE) and Normalized Discounted Cumulative Gain at top k (NDCG@k).
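The two metrics can be sketched as follows. The notes do not record which gain function the paper used for NDCG, so this sketch assumes the linear-gain form (relevance divided by log2 of rank + 1); the other common variant uses 2^rel - 1 as the gain.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between true and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items (linear gain, an assumption)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(y_true, y_pred, k):
    """Rank items by predicted score, then compare their DCG with the ideal DCG."""
    order = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    ranked = [y_true[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(y_true, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

MSE penalizes the size of each prediction error, while NDCG@k only cares whether the best articles end up near the top of the ranking.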

Results: MSE of 0.82 when using all features

Dataset: English Wikipedia (2009, ~2683000 articles)

Decision Tree and Naive Bayes, with Naive Bayes achieving the best results. However, the authors used "Concept" features, which are a much more manual and subjective measure.

Dataset: Thai Wikipedia (2014, ~85000 articles)

SVM with RBF kernel

Results: F-score of 0.8568

Dataset: ~12000 articles, 30/70 featured/low-quality

Method: 5-fold cross validation
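As a reminder of how k-fold cross-validation splits the data, here is a minimal sketch. The function name `k_fold_indices` and the fixed seed are assumptions for illustration; the paper's actual tooling is not recorded here, and this version does not stratify by class.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Each sample appears in exactly one test fold; the remaining k - 1 folds
    form the training set for that round.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Twelve samples split into five folds (fold sizes 3, 3, 2, 2, 2).
splits = list(k_fold_indices(12, k=5))
```

The reported score is then the average of the metric over the k test folds. With a 30/70 class split like the one above, a stratified variant that keeps the class ratio in every fold would be the safer choice.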

Several ML approaches were experimented with. Accuracy and AUC were evaluated. Some hyperparameters are specified in the paper.

Method                           Accuracy
Linear Regression                25%
Multinomial Logistic Regression  60%
KNN                              55%
CART                             48%
SVM                              61%
Random Forest                    64% (58% w/o readability scores)

Some features are more important than others, with difficult_words, content_length, num_references and num_page_links being the most important. For the full list, see Fig. 3 of the paper.

Dataset: ~20000 Articles with Qualities of FA, GA, B, C, Start and Stub, close to evenly distributed.

Method                     Accuracy  MSE
Decision Tree              47.4%     1.883
K-NN                       42.4%     2.123
Logistic Regression        49.7%     1.359
Naive Bayes                30.4%     3.573
Random Forest              59.2%     1.167
Support Vector Classifier  50.6%     1.358
Neural Networks            50.3%     1.204
Gradient Boosting          61.8%     0.919

Dataset: 400 articles randomly chosen from each quality, total of 2800
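Reporting MSE next to accuracy makes sense if the quality classes are treated as ordered: two classifiers can be wrong equally often, but one can be wrong by a much larger margin. The sketch below assumes the seven classes are encoded as the integers 0 to 6 from lowest to highest quality; the notes do not record how the paper actually computed its MSE.

```python
def accuracy(y_true, y_pred):
    """Fraction of exact class matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def ordinal_mse(y_true, y_pred):
    """Mean squared distance between true and predicted class indices."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical ordinal encoding: 0 = lowest quality class, 6 = highest.
y_true = [0, 1, 2, 3, 4, 5, 6]
near_misses = [0, 1, 2, 3, 4, 6, 5]  # two errors, each off by one class
far_misses = [0, 1, 2, 3, 4, 0, 0]   # two errors, off by five and six classes
```

Both prediction lists have the same accuracy (5 of 7 correct), but the far misses give a much higher MSE. That is the distinction the MSE column captures.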

Deep Learning Methods

Method               Accuracy
Stacked LSTMs        71.9%
DNN                  68.7%
CNN                  63.4%
CNN + LSTM           67.3%
LSTM w/ Dropout      67.9%
Basic LSTM           69.0%
Bidirectional LSTM   69.7%

Non-Deep Learning Methods

Method         Accuracy
Decision Tree  71.1%
SVM            70.8%
K-NN           66.3%
Naive Bayes    59.9%

Note that there were only three classes (high/medium/low quality)

Feature Set Accuracy


Dataset: 3294 articles from English Wikipedia (Wikimedia Downloads)

Method: Ten-fold Cross Validation, with Hyper-parameter optimization

Algorithms Experimented With: "logistic regression, C5.0, Adaboost, Bayesian networks, etc., and found that C5.0 gives the best results in this case."

Accuracy: 84.7% (Note that there was a reduced number of classes - 4)

Dataset: 4.7 Million articles (Entire English Wikipedia)

No ML methods were experimented with.

Deep Learning approach with four hidden layers, where the authors feed the article text itself to the network.

"In this paper, we applied the unsupervised learning algorithm called Paragraph Vector, recently known as Doc2Vec that learns vector representations for variable-length pieces of texts and overcomes the disadvantages of bag-of-words by taking into account the order and semantics of words."

"In this approach every word and every paragraph are mapped to a unique vector."

Accuracy: 55.5%

Dataset: 30000 Wikipedia articles (English Wikipedia)

Method                         Accuracy
None (>2000 words = Featured)  96.94%
MLP                            97.15%
K-NN                           96.94%
Random Forest                  95.8%

Note that there were only two classes (featured/non-featured)
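The "None" baseline in the table is just a word-count threshold, which can be sketched in a few lines. Splitting on whitespace is an assumption here; the paper may count words differently.

```python
def is_featured_by_length(text, threshold=2000):
    """Baseline rule: call an article featured if it has more than `threshold` words.

    Words are counted by splitting on whitespace (an assumption for this sketch).
    """
    return len(text.split()) > threshold


short_article = "word " * 2000
long_article = "word " * 2001
```

That such a rule gets within a fraction of a percent of the learned models suggests article length dominates the featured/non-featured distinction on this dataset.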

Dataset: 11067 articles (1554 Featured / 9513 Random)

Ten-fold cross-validation was used. Domain transfer was also tried, but yielded worse results overall.

Method                   F-Score
SVM (character trigram)  0.964
SVM (POS trigram)        0.941
Naive-Bayes              0.904

Note that there were only two classes (featured/non-featured)
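Character trigram features, as used by the best SVM above, can be sketched as counts of every three-character window in the text. This is only the feature-extraction step under simple assumptions (no lowercasing or whitespace normalization); the paper's preprocessing is not recorded here.

```python
from collections import Counter


def char_trigrams(text):
    """Count every overlapping three-character substring of `text`."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


features = char_trigrams("abcab")
```

Each article becomes a sparse vector of trigram counts, which is what the SVM would then be trained on.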

Dataset: 760 articles of English Wikipedia (400 from History, 360 from Biology)

The experimentation was mostly user-centered, and we're not interested in that part. However, the stabilized and controversial models were tested with an SVM classifier, achieving accuracies of ~78% for Stabilized and ~92% for Controversial.

Dataset: 96 Wikipedia articles for each classifier

C4.5 Decision Tree with 10-fold cross validation achieved ~90% precision and recall, with two classes (featured/random).

Dataset: 1070 (236/834 Featured/Random) Wikipedia articles
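Since several of the studies above report precision, recall or F-score instead of accuracy, here is a minimal sketch of how the three relate for a binary featured/random task. The labels and toy predictions are illustrative only and are not results from any paper.

```python
def precision_recall_f1(y_true, y_pred, positive="featured"):
    """Return (precision, recall, F1) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: 2 true positives, 1 false positive, 1 false negative.
y_true = ["featured", "featured", "featured", "random", "random"]
y_pred = ["featured", "featured", "random", "featured", "random"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

On imbalanced datasets like the ones above, these metrics are more informative than accuracy, because always predicting the majority class already gives a high accuracy.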
