Literature Notes
This section lists the notes taken from all of the references.
Each entry on the list will have the following format:
[Assigned ID] [First Author's Surname], [Year]. [Title]
[Notes]
TODO: Index? Think about it
Abstract describes usage of machine learning to evaluate Wikipedia quality automatically. They propose a solution with Recurrent Neural Networks. Experiments on English, French, Russian. Claims to outperform state-of-the-art solutions.
Paper thoroughly analyses several proposed metrics for evaluating Wikipedia article quality. Discusses limitations and results of an RNN approach and compares it to existing approaches.
TODO: Read articles about RNNs. [21] is an example. Also Long Short-Term Memory. [22]
- "according to Wikipedia Statistics, Wikipedia is being modified at an impressive speed of ten edits per second on average". For more (and updated) statistics: https://en.wikipedia.org/wiki/Wikipedia:Statistics https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia https://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia%27s_growth
- "Currently, quality classes of Wikipedia articles are assigned by human reviewers" (https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Wikipedia/Assessment)
- "Several approaches on assessing quality of Wikipedia articles have been presented [29, 12, 14]. The ORES service for quality prediction is used since 2014." (https://www.mediawiki.org/wiki/ORES)
- "Existing approaches differ in terms of the classification features and learning algorithms they use, such as SVM and random forest"
- "Our proposed solution is not yet applicable for real-time suggestion". Unfortunately, this paper will not be as useful as expected, but some notes can still be taken.
- "The problem we consider is measuring the quality of a given Wikipedia article, i.e. how well an article is written. We do not aim to measure the correctness of the information".
- English article tier list: FA - Featured Article, GA - Good Article, B, C, Start, Stub. (For help: https://en.wikipedia.org/wiki/Template:Grading_scheme)
Editor-Based (User reputation) metrics:
- "Betancourt et al. studied the team characteristics, such as how many FA or GA articles the team members have worked on before, to predict the quality class of Wikipedia articles"
- "Another criteria used for assessing the quality of a text is the period of time the text remains stable or is modified by other authors/reviewers. If an article has not been modified significantly for a long time, this article can be considered as mature and of high quality"
- Other metrics suggested by other papers are then mentioned.
Article-Based (Content) metrics:
- "One of the simplest solutions predicted the Wikipedia articles quality based on their length: the longer an article is, the better its quality" "This solution achieved a very high accuracy in separating between FA and non-FA articles."
- "Warncke-Wang et al. presented and analyzed a feature set composed of 17 features such as article length and the number of article headings for describing the content of the Wikipedia articles." An accuracy of 58% was achieved.
- "Coming up with features is difficult, time-consuming, requires expert knowledge. Each Wikipedia language requires a new feature set which is difficult to design without a basic understanding of the language."
- "Study on feature selection for text categorization [4] showed that using all features consistently produces high and the nominally best AUC performance for the majority of classifiers. This study suggests using the entire document content for obtaining best performances for a text classification task"
- "Deep learning can avoid manual feature engineering by learning directly from raw data."
- "In this study, instead of using Doc2Vec, we used Recurrent Neural Network (RNN) with Long-Short Term Memory (LSTM) to assess the quality of Wikipedia articles in end-to-end manner."
- "The approach described in [14] can be directly applied to any language. However, this study has several shortcomings that we addressed in our proposed solution". The shortcomings are listed and include a training time of days.
This article contains an explanation of RNNs, but requires background on deep learning. Essentially: "RNNs are defined as neural networks whose connections form cycles."
"LSTM replaces the activation unit inside a RNN cell", "LSTM allows the network to forget some old information while learning new knowledge."
For the output layer activation, "We used rectifier, which is the most popular activation function used in deep learning"
"One of the major issues in classifying documents is the different length of these documents. Most machine learning algorithms are designed to work on data where all instances have the same length [54]. However, by design, RNN is perfectly fit with varied-length data because it can accumulate the learned information by its rolling back feature."
Models were tested on English, French and Russian balanced datasets. "The input of the model is the raw text of Wikipedia articles". Hyper-parameters were described well, but succinctly.
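A minimal sketch of the kind of end-to-end LSTM classifier described above, assuming word-level integer tokenization, an embedding layer and the six English quality classes; the layer sizes and hyper-parameters below are placeholders, not the paper's.

```python
# Rough sketch (not the paper's exact architecture): an LSTM reads the
# tokenized raw text of an article and outputs one of six quality classes.
import torch
import torch.nn as nn

class ArticleQualityLSTM(nn.Module):
    def __init__(self, vocab_size=50_000, embed_dim=128, hidden_dim=256, num_classes=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded article text
        x = self.embed(token_ids)
        _, (h_n, _) = self.lstm(x)   # final hidden state summarizes the article
        return self.out(h_n[-1])     # logits over Stub..FA

model = ArticleQualityLSTM()
toy_batch = torch.randint(1, 50_000, (2, 40))   # two short "articles"
print(model(toy_batch).shape)                   # torch.Size([2, 6])
```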
Accuracy (balanced dataset) and AUC were the main metrics used. In terms of accuracy, the RNN approach achieved 60-70%, around 10% higher than the others.
"In comparison with manual feature engineering approaches, it has a few disadvantages": Training model takes days, lack of interpretability, etc.
[2] Dalip, 2009. Automatic quality assessment of content created collaboratively by web communities: A case study of wikipedia
Abstract describes usage of quality indicators, some original, some not. They also explore machine learning techniques that make use of those indicators. It is mentioned that the most important indicators are the easiest to extract, and the less significant ones are more complex.
Very insightful discussion on several metrics, but little comparison of ML approaches. Experimentation and results are extremely detailed.
"Our contributions to the this topic include: 1) the proposition of a new method, based on regression analysis, to combine several quality indicators into a single value; 2) the proposal and adoption of new quality indicators for the problem; 3) and a detailed study about the impact of all these indicators on the task of automatic quality assessment"
"The solution we propose to this problem consists in: (1) determining the set of features {F1, F2, ..., Fm} used to represent the articles in A; and (2) applying a regression method to find the best combination of the features to predict the quality value qi for any given article ai. To accomplish this, we use the learning algorithm Support Vector Regression". For more information on the used algorithm, review the article.
The text features relate to the length and structure of the article; they are:
- Character Count (Includes spaces)
- Word and Phrase Count
- Section count
- Mean section size
- Images per section
- Among others...
Style features relate to how the authors write the articles, and complement readability scores. The Diction program was used.
- Size of the largest phrase (In number of words)
- Large phrase rate (Percentage of phrases whose length is ten words greater than the article average phrase length)
- Preposition rate
- "To be" verb rate
- Among others...
The features estimate "the age or US grade level necessary to comprehend a text. (...) good articles should be well written, understandable, and free of unnecessary complexity". The paper details several of these metrics, and their formulas are shown on the paper, taking into account the words, sentences, characters and syllables.
These take the review history into account, calculating metrics such as article age, age per review, review count, etc. "It can be expected that good quality articles have reached a maturity level in which no extensive corrections are necessary"
This paper also details a more complex measure, ProbReview that "tries to assess the quality of a Wikipedia article based on the quality of its reviewers. Recursively, the quality of the reviewers is based on the quality of the articles they reviewed."
"In this case, we see the collection as a graph, where nodes are articles and edges are the citations between them. The main motivation for using these features is that citations between articles can provide evidence about their importance". An article could be stable because noone cares about it, so we need to take its popularity into account.
Possible metrics (a small code sketch of computing them follows this list):
- PageRank
- In/Out degree
- Translation count
- Among others
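A minimal sketch, assuming a networkx graph of articles built from the page-to-page link dump; the toy graph and article names are illustrative only.

```python
# Sketch: network features over the article link graph (nodes = articles,
# edges = citations/links between them), computed with networkx.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([("Physics", "Energy"),
                  ("Energy", "Physics"),
                  ("Biology", "Energy"),
                  ("Stub_article", "Physics")])

pagerank = nx.pagerank(g)        # importance of each article in the link graph
in_deg = dict(g.in_degree())     # how often an article is cited
out_deg = dict(g.out_degree())   # how many articles it cites

for article in g.nodes:
    print(article, round(pagerank[article], 3), in_deg[article], out_deg[article])
```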
The complete data repository is available for download. The wikipedia quality scale is used (FA, A-Class, GA, etc.). Considering the fact that some features are very resource demanding, 874 articles were collected, as well as their entire review history (38 GB of text). "To compute the network features we obtained the complete page-to-page links from the Wikipedia. These links were imported using the page links dump file available at the Wikipedia database download site"
Experiments were made in order to assess the impact of each feature, as well as to compare the chosen approach with existing methods. As for the results, the classification performance was calculated using the Mean Squared Error (MSE) metric. Several combinations of metrics were tested, and their performances are listed.
For instance, using only the structure and style features results in MSE = 0.92, while using all the features results in MSE = 0.82.
"we first note that the Structure group of features, when used alone, yields the best performance,whereas the Readability group performs the worst. This suggests that indicators related to the organization of the article, sections, images, citations, and links are the most useful to distinguish the quality of the articles."
Length features appeared to have very little impact on results.
[5] Saengthongpattana, 2014. Assessing the quality of Thai Wikipedia articles using concept and statistical features
Considers two classes: "High-quality" and "Low-quality" articles. The chosen articles relate to the following categories: Biography, Animal, Places. Overall, the paper appears to compare metrics and machine learning approaches to assess article quality. The usage of the Thai language may not be ideal, but reviewing metrics in different languages may be helpful.
Proposes the usage of Decision Trees and Naive Bayes algorithms, and also discusses a few possible features related to article quality.
"Most researches focus on finding the suitable features that can effectively assess the quality of articles. There are three categories which are Review, Network, and Text features". These categories appear in many papers, and often have the same names.
The dataset is available here. "This means that these Thai Wikipedia articles are created and updated until July, 19th 2012."
The dataset was separated into 3 domains: Biography, Animal, Place. Over 99% of the ~18000 Thai articles were low-quality. This shows the class imbalance across different language editions.
The paper describes several simpler features, such as #headings, #links, etc., but also a more complex one: #Concepts, which consists of the number of concepts related to that domain that are present in the article.
"Naïve Bayes is a well known approach that performs very well on text categorization problem."
AUC, TP rate and FP rate were measured and compared. Naive Bayes appears to have better results across the board (~99 AUC vs ~80 AUC)
Paper mentions possible approaches to automatically assess quality in wikipedia articles, "by using a psycho-lexical resource, i.e., the Language Inquiry and Word Count (LIWC) dictionary".
Short article, but details results and experiments well. Discussion of ML approaches is scarce, but it does display some interesting quality metrics.
This section discusses several metrics used for quality assessment, proposed in existing studies, both content-based (e.g. length, words, sentences) and non-content based (e.g. edits, editors).
This study relies heavily on the LIWC dictionary (Linguistic Inquiry and Word Count), a "collection of words and word stems" "which automatically detects the links between words and their psychology-relevant categories". There are 80 categories, across 4 dimensions. The text analysis program assigns each word to its LIWC categories.
"A support vector machine classifier with a RBF kernel function was used for the experiments.". The authors experiment with several combinations of the categories always doing binary classification (high vs low quality), and the one that yields the best results consists on just using the dimensions "Standard Linguistic Dimension" and "Psychological Processes", with an f-score of 0.8568. Using the entire LIWC will yield around the same f-score.
Experiments were also made with some basic features (e.g word count, word per sentence), which improved the results.
"the percentage of unique words in Wikipedia featured articles (modeled as token/type ratio) shows a strong central tendency around 30%"
Proposes a method to analyse wikipedia article quality automatically, by "analyzing their content in terms of their format features and readability scores".
Discusses several quality metrics and possible ML approaches with detail, as well as all results.
"In this paper we address the challenge of automaticallyrating the quality of a Wikipedia articles. We use the following quality class labels defined by Wikipedia ordered from low to high quality: Stub,Start,C,B,GA,FA". They removed A and B+ because it's a very small number of articles.
Mentions usage of random forest approach in many articles.
"Structure-based features of our model refer to the structure of the document", such as article length, number of references, number of links, number of citation templates, etc.
They also use content based metrics, such as the Flesch reading score (206.835 − (1.015 × avgsentencelen) − (84.6 × avgsyllablesperword)). All formulas are very well described, so it's better just to refer to the paper if needed.
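A minimal sketch of the Flesch reading-ease formula quoted above; the regex-based syllable counter is a rough heuristic of my own, not what the paper used.

```python
# Sketch: Flesch reading ease = 206.835 - 1.015*avg_sentence_len - 84.6*avg_syllables_per_word
import re

def count_syllables(word):
    # crude approximation: count groups of consecutive vowels
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    avg_sentence_len = len(words) / max(len(sentences), 1)
    avg_syllables_per_word = sum(count_syllables(w) for w in words) / max(len(words), 1)
    return 206.835 - 1.015 * avg_sentence_len - 84.6 * avg_syllables_per_word

print(flesch_reading_ease("Wikipedia is a free online encyclopedia. Anyone can edit it."))
```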
"Several readability scores seem related but this is not a problem for the classification method we chose, i.e. the randomforest algorithm, as it can cope with multi-collinearity"
Several ML approaches were experimented with. Accuracy and AUC were evaluated. Some hyperparameters are specified on the paper.
Method | Accuracy |
---|---|
Linear Regression | 25% |
Multinomial Logistic Regression | 60% |
KNN | 55% |
Cart | 48% |
SVM | 61% |
Random Forest | 64% (58% w/o readability scores) |
Some features are more important than others, difficult_words, content_length, num_references, num_page_links being the most important. For the full list check the article (Fig. 3)
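A small sketch of how such a feature-importance ranking can be obtained from a random forest; the feature names echo the ones above, but the data and labels are synthetic placeholders, not the paper's dataset.

```python
# Sketch: ranking features by importance with a random forest
# (feature names are taken from the note above; data is made up).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["difficult_words", "content_length", "num_references", "num_page_links"]
rng = np.random.default_rng(0)
X = rng.random((200, len(feature_names)))
y = (X[:, 1] + 0.5 * X[:, 2] > 0.8).astype(int)   # toy label depending on two features

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, imp in sorted(zip(feature_names, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```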
[12] Bassani, 2019. Automatically assessing the quality of Wikipedia contents
Proposes a supervised learning approach to assess article quality, using features from preexisting literature, but also new "unconsidered aspects". The application of Gradient Boosting is also mentioned to have led to promising results.
Proposes a very large number of metrics (264), describing them in more detail in a different paper. Also provides results for several ML algorithms, comparing them with state-of-the-art approaches.
Mentions study 2 as the most complete in terms of metric discussion, but aims to improve that list. Paper makes a distinction between text, readability, review, and network features, but their description is detailed in another paper by the same authors (Feature Analysis for Assessing the Quality of Wikipedia Articles through Supervised Classification).
Several ML algorithms are experimented with, and the use of Gradient Boosting appears to bring the best results (61% accuracy).
[13] Wang, 2019. A deep learning-based quality assessment model of collaboratively edited documents: A case study of Wikipedia
Study reviews existing features to assess article quality, then compares several deep learning approaches.
Lists all features from previously studied papers, both content-based and non-content based. Appears to present the same solution as the one presented in Study 1 (Bi-directional RNN + LSTM), but also other approaches.
Usage of CNN did not appear to improve results, when experimenting with deep-learning. However, the experiment with decision trees yielded the highest accuracy. Tables 4 and 5 display the importance of each feature / feature category. Text Statistics and Structural Features appear to increase accuracy the most.
[14] Velichety, 2019. Quality assessment of peer-produced content in knowledge repositories using big data and social networks: The case of implicit collaboration in wikipedia
Abstract describes the paper as being more focused on collaboration, and its effects on article quality. It is mentioned that the study complements quality assessment approaches, but it is not entirely clear whether it also contributes to that field. Full text should be assessed.
Reading the full text shows that the article is closely related to our research goals. However it's very much focused on the concept of implicit collaboration. In fact, the results suggest that those new features improve the performance of the model significantly.
Several techniques were used: Regression, C5.0, Adaboost. Best results were achieved in C5.0.
Paper shows a lot of complex metrics based on implicit collaboration, but using article features plus implicit collaboration led to an accuracy of 90%. TODO: Investigate more on the implicit collaboration metrics
[16] Couto, 2021. Assessing the quality of health-related Wikipedia articles with generic and specific metrics
Study specialises on health-related articles, but is heavily focused on the use and evaluation of metrics that evaluate Wikipedia articles.
After assessing full-text, the impressions from the abstract were confirmed, even if the solution is applied to health articles. However, the paper does not appear to be focused on ML approaches that predict article quality.
This study is also based on wikipedia's quality scale (FA, GA, etc...).
Lists Stvilia et al.'s metrics (authority, completeness, complexity, informativeness, consistency, currency, and volatility), which use several features from Wikipedia articles.
- "Authors define authority as “the degree of the reputation of an information object in a given community”"
- "Completeness is defined as “the granularity or precision of an information object’s model or content values according to some general-purpose IS-A ontology such as WordNet”."
- "The authors define complexity as “the degree of cognitive complexity of an information object relative to a particular activity”"
- "The definition of “Informativeness” is linked to the amount ofinformation that an information object contains."
- "Consistency is defined as “the extent to which similar attributesor elements of an information object are consistently represented with the same structure, format and precision"
- "Currency corresponds to “the age of an information object” in days"
- "Finally, volatility is defined as “the amount of time the informa-tion remains valid”."
Formulas for all the metrics are also described in the paper.
There are also some measures that are more related to the health articles, but could still be useful in a general-purpose scenario.
- "Following the approach of Domingues and Teixeira Lopes [6], we used the MediaWiki API to collect the current state of the article’s contents and its metadata, revision history, language links, internal wiki links, and external links." (A rough sketch of such a MediaWiki API query follows the list below.)
- Features were tested for importance, and number of edits appeared to have the biggest correlation with quality, but most appear to have at least some impact.
[19] Dang, 2016. Quality assessment of Wikipedia articles without feature engineering
Proposes evaluation through analysing the article content, instead of the usual feature set approach. Explores NLP and Deep Learning approaches for evaluation.
Doesn't discuss/compare metrics and ML approaches with extreme detail, but is still evidently related to our research.
"There is no standard rule for selecting features, which is considered as one of the most difficult tasks in machine learning. Moreover, feature selection is language dependent."
This is a deep learning approach, so features are given less importance, we're feeding the NNs the articles themselves.
"In this paper, we applied the unsupervised learning algorithm called Paragraph Vector, recently known as Doc2Vec[13] that learns vector representations for variable-length pieces of texts and overcomes the disadvantages of bag-of-words by taking into account the order and semantics of words." "In this approach every word and every paragraph are mapped to a unique vector."
"In our approach, we used a DNN with four hidden layers to learn and classify the representation vectors of Wikipedia articles computed by Doc2Vec."
An accuracy of 55% was achieved (state-of-the-art had similar results).
[33] Blumenstock, 2008. Size matters: Word count as a measure of quality on Wikipedia
Study proposes simply using word count to measure quality, and claims it outperforms some approaches. Worth checking out.
Not the most complex paper: it proposes one metric and a few ML algorithms, and discusses results succinctly. The claim that it outperforms most solutions is dubious; however, it is indeed a simple metric to compute.
"By classifying articles with greater than 2,000 words as “featured” and those with fewer than 2,000 words as “random,” we achieved 96.31% accuracy in the binary classification task. The threshold was found by minimizing the error rate on the training set"
[45] Lipka, 2010. Identifying featured articles in Wikipedia: Writing style matters
Study tries to model a classification task: "Is this article featured or not?". It uses metrics more focused on the writing style, not so much on the content and edit history.
After assessing full text, paper describes a few ML approaches and discusses a couple of article quality metrics.
SVM and Naive bayes were the two approaches experimented with. As for metrics, it explains the analysis of character trigrams as a possible feature that could correlate with article quality.
For a classification task of distinguishing featured and non-featured articles, SVM yielded an accuracy of over 90%.
[52] De La Calzada, 2010. On measuring the quality of wikipedia articles
Evaluates Wikipedia articles, separating them into two categories: stabilized (haven't changed for a long time), and controversial (vandalism, reverts are common). Full text will be assessed, to verify if the paper gives enough emphasis to the evaluation metrics and approaches.
A few features are mentioned, but they are applied in a way different from the previous studies. Not much discussion regarding comparison of ML algorithms.
"Our study has used MediaWiki API to retrieve a variety of meta-data about Wikipedia articles. MediaWiki [21] is the open source wiki software platform used by Wikipedia. The MediaWiki API for Wikipedia is publicly available and is accessible through PHP via specially crafted URIs. The parameter list of such a URI determines the specifics of the query"
"First, we separate Wikipedia articles into a number of categories, based on their history and the nature of their topics". The authors don't go for the usual quality scale: "Overall, we have established six categories of Wikipedia articles: (1)stabilized articles, (2) controversial articles, (3) evolving articles, (4) list, (5) stub and (6) disambiguation page". More focus is given to categories (1) and (2).
Stabilized articles are ones that are not likely to change over time. Featured Articles are used as quality benchmarks for this category. The authors use a combination of 6 features, wrapped in a "mixture component" (uses mean and std. deviation) and each with their own "forgiveness factor" (default = 2).
Controversial articles are articles that could have a range of opinions on it. For modelling the quality, a similar approach to the stabilized model is used, but with different features.
The experimentation was mostly user-centered, and we're not interested in that part. However, the stabilized and controversial models were tested with an SVM classifier, achieving accuracies of ~78% for Stabilized and ~92% for Controversial.
[64] Stvilia, 2005. Information quality in a community-based encyclopedia
Paper proposes seven metrics to assess wikipedia article quality.
As assumed by the abstract, the study computes a few measures for a set of articles, also proposing a few metrics, and training a classifier to predict article quality.
- "The Wikipedia context is rich with different roles. Trying to understand those roles and the processes they play in can help us to understand some of the sources of IQ variance in Wikipedia, and consequently the sources of relevant information for IQ metrics.". These roles can go from editors to vandals.
- "Clearly, the main source of Wikipedia article IQ measurements is the article itself."
- Several metrics were proposed: Authority/Reputation, Completeness, Informativeness, Consistency, Currency, Volatility.
- Several measures were computed for a set of articles, and the authors compared the median values for features/non-featured articles
- With a decision tree algorithm, the trained classifier reached ~90% precision/recall values.
[4] Chevalier, 2010. WikipediaViz: Conveying article quality for casual wikipedia readers
Abstract describes a more visual way to describe wikipedia quality, but does not go into much detail. However, mentions the suggestion of 5 different metrics for quality assessment.
Article takes 5 indicators of quality and displays them to the user on Wikipedia. The suggested metrics relate to our research, but the paper puts a lot of emphasis on the visualization aspect, instead of ML approaches.
"Wilkinson et al. have demonstrated later that high-quality vs. non-featured articles have indeed substantially more contributors involved"
The 5 metrics are simple, previously proposed ones:
- Word count: Very effective but not enough
- Number of contributors/Rate of contributions
- Number and length of edits
- Number of references and internal links
- Length of discussion page
[7] De La Robertie, 2015. Measuring article quality in Wikipedia using the collaboration network
Abstract mentions the focus on automatically assessing article quality, based on the author-article interaction. Not sure if it will be of much use, but it is probably worth assessing the full text.
"This work gives a generic formulation of the Mutual Reinforcement principle held between articles quality and authors authority and take explicitly advantage of the co-edits graph generated by individuals".
Article relates to measuring article quality, but is completely based on modelling the collaboration network, not on using a feature set and quality metrics. Also did not discuss ML approaches.
TODO: Investigate Mutual Reinforcement Principle
Focuses on summarizing existing quality assessment features and comparing deep learning approaches.
After reviewing the full text, it was concluded that the study is the same as study 13.
[17] Teblunthuis, 2021. Measuring Wikipedia Article Quality in One Dimension by Extending ORES with Ordinal Regression
Extending the Wikimedia ORES quality model, the paper's goal is to find accurate ways to measure article quality, with continuous, one-dimensional metrics. Also focused on disproving the theory that «different levels of quality are "evenly spaced" from one another»
After assessing full text, paper appears to not be what was expected. Does not discuss article features, and seems to focus a lot more on disproving the theory above, regarding Wikipedia's quality scale.
[18] Hu, 2016. Automating assessment of collaborative writing quality in multiple stages: The case of wiki
Proposes the usage of quality indicators related to academic writing and cognitive thinking, but the abstract is unclear on how much they experiment with the automatic assessment of quality. Full text will need to be inspected.
Very short paper, doesn't explain the quality metrics used and doesn't discuss the algorithm used. Even after assessing the text, it is unclear what the main focus of the paper is.
[21] Lex, 2012. Measuring the quality of web content using factual information
There is some focus on evaluating Wikipedia article quality, although it seems to focus on a single measure, based on factual information, comparing it to word count.
After assessing full text, the paper appears to go for a more manual measurement of quality (number of facts), comparing it to a word count metric, so it's not very related to our research.
[25] Biancani, 2014. Measuring the quality of edits to Wikipedia
Proposes a scalable, intuitive and efficient measure of quality in wikipedia.
After reading the paper, it appears that the focus is on measuring the quality of edits, not the articles. Besides, the author does not propose any quality metrics or ML approaches, just uses Halfaker et al.'s devised Persistent Word Revisions (PWR), and validates it with human raters.
The authors propose three "article measurement models", focusing on interaction between wikipedia articles and contributors, based on their edit history.
Although the study discusses interesting models, it uses the metrics to directly output the article quality and not as a ML algorithm.
"Determining the quality of articles in Wikipedia is not an easy task to human users", because of large number of articles, wide range of topics, evolving content, and vandalism.
"In this section, we introduce our article quality measurement models, namely Basic,PeerReview and ProbReview."
"The users’ contribution to articles can be categorized into two types": Authorship (comes from the user) and Reviewership (comes from what the user reviews). Section 3.1 explains the notation used throughout the article.
- Basic Model: "is designed based on the principle that “the higher authority are the authors, the better quality is the article.”", this principle measures the quality of an article by the aggregation of authorities from all its authors.
- PeerReview: "If the content is approved by high authority reviewers, it is expected to be of better quality, even though the original authors of the reviewed content might be of low authority"
- ProbReview: "PeerReview's assumption that each user, who edits the article content, would review the entire article prior to his/her edit is not always true.". ProbReview attempts to fix this issue.
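A hedged sketch of the mutual-reinforcement idea behind this family of models: article quality aggregated from author authority, and author authority from the quality of the articles the author worked on, iterated until stable. The simple averaging, the length-based prior and the 0.5 damping are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: iterate article quality <-> author authority to a fixed point.
authors_of = {"A1": ["u1", "u2"], "A2": ["u2"], "A3": ["u1", "u3"]}
articles_of = {"u1": ["A1", "A3"], "u2": ["A1", "A2"], "u3": ["A3"]}

prior = {"A1": 1.0, "A2": 0.1, "A3": 0.6}      # e.g. a length-based prior quality
quality = dict(prior)

for _ in range(30):
    # author authority: average quality of the articles they contributed to
    authority = {u: sum(quality[a] for a in arts) / len(arts)
                 for u, arts in articles_of.items()}
    # article quality: damped mix of its prior and its authors' authority
    quality = {a: 0.5 * prior[a] + 0.5 * sum(authority[u] for u in us) / len(us)
               for a, us in authors_of.items()}

print({a: round(q, 3) for a, q in quality.items()})
```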
[28] Antunes, 2020. Proposal and Comparison of Health Specific Features for the Automatic Assessment of Readability
Study specialises on health-related articles, but still proposes features that evaluate article quality on the Web. The features were tested with a machine learning approach, using a Wikipedia dataset the authors built.
It is similar to study 16, but seems to be more focused on the health aspect. Besides, the goal is much more focused on article readability, not quality, so the features don't seem to be in line with our research questions.
[29] Banerjee, 2011. Exploring wiki: Measuring the quality of social media using ant colony metaphor
Abstract does not mention the use of metrics/measures, but tries an ant colony approach. It is not entirely clear what the paper proposes, so will have to scan the article for relevance.
After assessing the article text, it can be concluded that this publication is a bit different from what we're looking for. Although an interesting approach (ant colony optimization to identify the quality of Wikipedia content), it is still a rudimentary solution, and does not propose any quality metrics.
[36] Wilkinson, 2007. Cooperation and quality in Wikipedia
Appears to give some emphasis to possible criteria that define a quality Wikipedia article (e.g., number of edits, number of distinct authors). Will need to assess full text for a better understanding.
Little emphasis is given to predicting article quality in Wikipedia. The goal of the paper is more to evaluate the correlation between a few criteria (e.g., number of edits) and content quality.
[38] Halfaker, 2009. A jury of your peers: quality, experience and ownership in Wikipedia
Abstract seems to note that the study proposes a "versatile metric" that will automatically measure a contribution to Wikipedia. Not entirely sure what is meant by "contribution", so full text will need to be assessed.
As suspected, paper is focused on measuring quality of edits/reviews, not wikipedia articles. Even so, the study does not show much detail on how to measure the contribution quality.
[39] Stein, 2007. Does it matter who contributes - A study on featured articles in the german wikipedia
Paper focuses on featured articles (articles that were marked as excellent quality by the community) from the German Wikipedia. The abstract may be describing a possible measure of quality based on the number of contributors and the contributors themselves.
A couple of quality metrics are described, related to author contribution. An insightful paper, but no emphasis on machine learning approaches.
"The German Wikipedia has two types of featured articles: excellent (in German exzellent) and worth reading (in German lesenswert)"
[41] Wöhner, 2019. Assessing the quality of Wikipedia articles with lifecycle based metrics
The paper proposes new, efficient metrics based on the "lifecycles of low and high quality articles".
Paper describes unique concepts of "persistent" and "transient" contributions, and proposes a few metrics related to them. Not so much focus on the ML aspect of it, nor on the experimentation and results.
Mentions wikipedia's quality scale (FA, GA), but emphasizes the fact that most articles do not have an evaluation: "For example in January 2008 only about 3,500 of 650,000 articles altogether were evaluated in the German Wikipedia"
"Adler and de Alfaro calculate the reputation of the authors of the Wikipedia by using the survival time of their edits as the first step". They then use that reputation score in order to compute trustworthiness of each word.
Mentions Blumenstock's work: "Blumenstock demonstrates, with an accuracy of classification of 97%, that the number of words is the best current metric for distinguishing between Featured and Non-Featured Articles.", however "We believe that a particular portion of Non-Featured Articles is of high quality too", which is a reasonable hypothesis: some short articles can be of high quality and long articles can be of poor quality, so we need "new, efficient and robust metrics for quality measurement".
Defines formulas for the concepts of "persistent" and "transient" contributions, which take into account the edit distance (in words) between revisions. However, the computation of these metrics is very time-consuming, so the sample sizes are small (100 low-quality articles, 100 high-quality articles).
Persistent contribution of a period consists of the editing distance between the last revision of the previous period and the last revision of the current period.
Transient contribution consists of the sum of the editing distances between all the revisions of that period, minus the persistent contribution of that period.
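A minimal sketch of these two definitions, assuming a plain word-level Levenshtein distance as the editing distance; the paper's exact distance measure, and whether the first distance of a period spans from the previous period's last revision, are assumptions here.

```python
# Sketch: persistent and transient contributions for one period of revisions.
def word_edit_distance(a, b):
    # Levenshtein distance over word lists (insert/delete/substitute whole words)
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = cur
    return prev[-1]

def period_contributions(last_rev_prev_period, revisions_in_period):
    # persistent: distance between the previous period's last revision and this period's last
    persistent = word_edit_distance(last_rev_prev_period, revisions_in_period[-1])
    # transient: total distance over consecutive revisions, minus the persistent part
    chain = [last_rev_prev_period] + revisions_in_period
    total = sum(word_edit_distance(x, y) for x, y in zip(chain, chain[1:]))
    return persistent, total - persistent

revs = ["the cat sat", "the big cat sat on a mat", "the cat sat on the mat"]
print(period_contributions("the cat", revs))
```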
After that, several metrics are defined based on those concepts.
[42] Adler, 2008. Measuring author contributions to the Wikipedia
Publication reviews past work on estimating quality of author contribution, and compares different criteria that take into account not only the quantity, but also the quality of the contribution.
However, there is a lot of focus on the contribution quality, instead of the quality of the article itself. Could be an interesting read, but does not apply to the literature review.
[53] Adler, 2010. Detecting wikipedia vandalism using wikitrust
The focus appears to be on vandalism detection, but there is some mention of methods to assess article quality.
After reading the entire paper, it appears that the study does propose a few quality metrics, but they are intended for user contributions, not the articles themselves.
[56] Kittur, 2008. Can you ever trust a Wiki? Impacting perceived trustworthiness in Wikipedia
Abstract is not entirely clear on whether the paper explores metrics that evaluate article quality, but it seems that it is more about the visualizing trustworthiness in Wikipedia. Still, full text should be assessed for confirmation.
As suspected, paper does not appear to relate much to our research goals.
[57] Khairova, 2017. Estimating the quality of articles in Russian wikipedia using the logical-linguistic model of fact extraction
Emphasis on Russian articles, but proposes methods to estimate article quality. Full text needs to be assessed in order to determine whether the focus on the Russian language will be an impediment.
Paper appears to be based on fact extraction and not so much on quality metrics to assess article quality. Besides, there is also some emphasis on Russian grammar, which could hinder the research.
"Russian-language edition of Wikipedia is one of the major language versions of the online encyclopedia. For instance, the largest language version, which is English Wikipedia, contains five million articles, while Russian Wikipedia contains one million articles."
[63] Arazy, 2011. On the measurability of information quality
Abstract describes the study as being focused on measuring reliability of common quality indicators of wikipedia articles. Full paper will need to be assessed to clarify if the study discusses detailed metrics.
After reading the paper, it can be concluded that it does not really relate to quality features, and is more of a manual study of the existing wikipedia articles.
[65] Dondio, 2007. Computational trust in Web content quality: a comparative evalutation on the Wikipedia project
Paper presents method to evaluate trustworthiness of wikipedia articles.
Study gives some emphasis to domain-specific knowledge, and brings focus to subjective trustworthiness ideals, not quality measures.
"The difference in accuracy was not particularly great: the average science entry in Wikipedia contained around four inaccuracies; Britannica, about three. Reviewers also found many factual errors, omissions or misleading statements: 162 and 123 in Wikipedia and Britannica respectively"
Proposes different methods to automatically identify controversial articles, which is not necessarily our goal, but could definitely be of some assistance.
Paper actually slightly deviates from our end goal, as it aims to predict article controversy (revert wars, frequent deletes, etc.), and designs metrics specifically for that goal.
[69] Stvilia, 2007. A framework for information quality assessment
This paper does not propose metrics that evaluate quality, but an actual Information Quality (IQ) framework, which takes into account more complex metrics. Will need to assess full text for viability.
After inspection, the study does not propose any metric, instead uses some measures from existing literature. It seems to be based on study 64, so it is somewhat of a duplicate.