Intermediate Abstract

Real-time prediction of Wikipedia articles' quality

Wikipedia is the largest and best-known online encyclopedia, containing over 58 million articles and averaging 1.9 user contributions per second. However, the lack of a centralized authority over its maintenance makes it challenging to ensure article quality across the entire website, since vandalism and differing opinions may destabilize specific articles. Moreover, there is often a quality gap between the English Wikipedia and the other language editions, which is concerning, as many users opt for articles in their native language.

Despite its scale, Wikipedia still contains many reliable articles. Wikipedia defines a quality scale that distinguishes outstanding, well-written articles (Featured Articles and Good Articles) from incomplete and incoherent ones (Start- and Stub-class articles). Unfortunately, as the assessment process is primarily manual, most articles remain unrated.

In this work, we propose a browser extension that uses Machine Learning to predict the quality of Wikipedia articles in real time. Initially, we review existing metrics for measuring article quality and compare their effectiveness. Another crucial part of the study will be collecting datasets of Wikipedia articles for training and testing. Regarding the training data, there must be a reference point for what constitutes a good or bad article, so we will take advantage of Wikipedia's quality scale as the source of labels. Finally, we will compare several Machine Learning algorithms to maximize the model's accuracy.
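As a rough illustration of the training step, the sketch below fits a classifier on articles labeled with Wikipedia's quality scale. It is a minimal sketch only: the numeric features (word count, reference count, link count) and the choice of a random forest are illustrative assumptions, not the project's settled design.

```python
# Minimal sketch: predicting Wikipedia quality classes from article features.
# Feature values and the random-forest choice are illustrative assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical per-article features: [word count, references, internal links].
X = [
    [12000, 85, 140],  # typical Featured Article
    [6500, 40, 90],    # typical Good Article
    [800, 2, 10],      # typical Start
    [150, 0, 3],       # typical Stub
] * 25  # duplicated only so the sketch has enough samples to split
y = ["FA", "GA", "Start", "Stub"] * 25  # labels from Wikipedia's quality scale

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

In the envisioned extension, the same trained model would score the article currently open in the browser, with these toy features replaced by ones extracted from the live page.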

With this solution, we expect to create a tool that lets any Wikipedia reader quickly gauge how good an article is, assisting the millions of users who rely on the website for research.

Keywords: quality assessment, Wikipedia, machine learning