Quality Features

MOAAS edited this page Apr 7, 2022 · 5 revisions

All papers define article quality features that fall into one of the following categories [2]:

  • Text features, which relate to the length and structure of the article, taking into account factors such as character/word count, sections, images, etc.
  • Style features, which measure how the authors write the articles: how long their phrases are and what categories of words they use. (Maybe merge with text?)
  • Readability features, which estimate "the age or US grade level necessary to comprehend a text. (...) good articles should be well written, understandable, and free of unnecessary complexity" [2], by measuring aspects such as words, characters and syllables.
  • History features, which analyse the review history of an article and related factors such as the article's age and user contributions.
  • Network features, which are a bit more complex: they consider "the collection as a graph, where nodes are articles and edges are the citations between them" and measure each article's popularity.

Some authors group text, style and readability features under a single Text Features category and treat the three as subcategories.

[2] used the Diction software to compute the style and readability features.

Text Features

  • Character Count [1] [2] [5] [9] [12] [13] [52]
  • Word Count [2] [6] [12] [13] [14] [19] [33] [45]
  • Sentence Count [2] [12] [13]
  • Section Count [1] [2] [12] [13]
  • Mean Section Length (Characters) [2] [12] [52]
  • Mean Paragraph Length (Words) [2] [12] [13]
  • Largest Section Length [2] [12] [13]
  • Shortest Section Length [2] [12] [13]
  • Standard Deviation of Section Length [2] [12] [13]
  • Longest-Shortest Section Ratio [12]
  • Subsection Count [2] [12] [13]
  • Mean Subsection Count per Section [2] [12] [13]
  • Introduction Length (Characters) [2] [12] [13]
  • Introduction Length-Text Length ratio [12]
  • Number of references [5] [9] [14]
  • Citation Count [2] [12] [13]
  • Citation Count per Text Length [2] [12] [13] [52]
  • Citation Count per Section [2] [12] [13]
  • External Link Count [2] [5] [12] [13]
  • External Link Count per Section [2] [12] [13]
  • External Link Count per Text Length [2] [12] [13] [52]
  • Image Count [2] [5] [12] [14]
  • Images per Text Length [9] [12] [52]
  • Images per Section [2] [12] [13]
  • Internal Link Count [5] [9] [14]
  • Internal Link Count per Text Length [52]
  • Number of Headers [5]
  • Number of L2 Headings [9]
  • Number of L3 Headings [9]
  • Number of Information Boxes [5] [9]
  • Number of citation templates [9] (Not sure what this is)
  • Number of non-citation templates [9] (Not sure what this is)
  • Information noise score (Check [50])
  • Syllable Count [12]
  • Paragraph Count [12]

Style Features

  • Size of Largest Sentence (in words) [2] [12] [13]
  • Size of Shortest Sentence (in words) [12]
  • Mean Sentence Size [12]
  • Large Sentence Rate [2] [12] [13]: percentage of sentences that are at least ten words longer than the article's average sentence length.
  • Short Sentence Rate [2] [12] [13]: percentage of sentences that are at least five words shorter than the article's average sentence length.
  • Question count [2] [12] [13]
  • Question count per Sentence [12]
  • Exclamation Count [12]
  • Exclamation Count per Sentence [12]
  • Number of sentences starting with a pronoun [2] [12] [13]
  • Number of sentences starting with an article [2] [12] [13]
  • Number of sentences starting with a coordinate conjunction [2] [12] [13]
  • Number of sentences starting with a subordinate preposition or conjunction [2] [12] [13]
  • Number of sentences starting with an interrogative pronoun [2] [13]
  • Number of sentences starting with a preposition [2]
  • Number of sentences starting with a determiner [12]
  • Number of sentences starting with an adjective [12]
  • Number of sentences starting with a noun [12]
  • Number of sentences starting with an adverb [12]
  • Number of sentences starting with a pronoun per Sentence [12]
  • Number of sentences starting with an article per Sentence [12]
  • Number of sentences starting with a coordinate conjunction per Sentence [12]
  • Number of sentences starting with a subordinate preposition or conjunction per Sentence [12]
  • Number of sentences starting with a determiner per Sentence [12]
  • Number of sentences starting with an adjective per Sentence [12]
  • Number of sentences starting with a noun per Sentence [12]
  • Number of sentences starting with an adverb per Sentence [12]
  • Auxiliary verb count [2] [13]
  • Modal Verb Count [12]
  • Passive Voice Count [2] [12] [13]
  • "To be" Verb Count [12]
  • Unique Word Count [12]
  • Noun Count [12]
  • Unique Noun Count [12]
  • Verb Count [12]
  • Unique Verb Count [12]
  • Pronoun Count [2] [12]
  • Unique Pronoun Count [12]
  • Adjective Count [12]
  • Unique Adjective Count [12]
  • Adverb Count [12]
  • Unique Adverb Count [12]
  • Coordinating Conjunction Count [12]
  • Unique Coordinating Conjunction Count [12]
  • Subordinating Preposition or Conjunction Count [12]
  • Unique Subordinating Preposition or Conjunction Count [12]
  • Modal Verb Count per Word [12]
  • Passive Voice Count per Word [12]
  • "To be" Verb Count per Word [2] [12]
  • Unique Word Count per Word [12]
  • Noun Count per Word [12]
  • Unique Noun Count per Word [12]
  • Verb Count per Word [12]
  • Unique Verb Count per Word [12]
  • Pronoun Count per Word [12]
  • Adjective Count per Word [12]
  • Unique Adjective Count per Word [12]
  • Adverb Count per Word [12]
  • Unique Adverb Count per Word [12]
  • Coordinating Conjunction Count per Word [2] [12] [13]
  • Unique Coordinating Conjunction Count per Word [12]
  • Subordinating Preposition or Conjunction Count per Word [12]
  • Unique Subordinating Preposition or Conjunction Count per Word [12]
  • Preposition Count per Word [2] [13]
  • Nominalization Count per Word [2] [13]
  • "To be" Verb Count per Verb [12] [13]
  • Modal Verb Count per Passive Voice Count [12]
  • Unique Noun Count per Unique Word [12]
  • Unique Verb Count per Unique Word [12]
  • Unique Pronoun Count per Unique Word [12]
  • Unique Adjective Count per Unique Word [12]
  • Unique Adverb Count per Unique Word [12]
  • Unique Coordinating Conjunction Count per Unique Word [12]
  • Unique Subordinating Preposition or Conjunction Count per Unique Word [12]
  • Average Number of Syllables per Word [12]
  • Average Number of Characters per Word [12]
  • Words per Sentence [6]
  • Words Larger than 6 Letters [6]
  • Percentage of unique words [6]
  • Top-m most discriminant character trigrams [45] [12] (Review Paper if needed)
  • Top-n most discriminant POS trigrams [45] [12] (Review Paper if needed)

Readability Features

  • Automated Readability Index [2] [9] [12] [13] [14]

ARI

  • Coleman-Liau [2] [9] [12] [13]

CL

  • Flesch reading ease [2] [9] [12] [13]: Computes a value typically between 0 and 100, where lower values indicate text that is harder to understand.

FRE

  • Flesch-Kincaid [2] [9] [12] [13]: Similar to Flesch reading ease (FRE), but yields a US grade level instead of a value between 0 and 100.

FK

  • Gunning Fog Index [2] [9] [12] [13]

GFI

complexwords: number of words with three or more syllables

  • Läsbarhetsindex [2] [12] [13]: The higher its value, the more difficult the text is to read.

LBI

complexwords: number of words with more than six characters

  • SMOG Grading [2] [9] [12] [13]

SG

polysyllables: average of polysyllabic words, excluding proper names, taken from a sample of 30 sentences

  • Difficult Word Score [9]: "The difficult word score of a given English text is calculated based on how many difficult words appear in a text. A word is considered difficult if it does not appear in a list of 3000 common English words that groups of fourth-grade American students could reliably understand." Review [9] for more information on how to calculate it.

  • Dale-Chall [9] [12]

DC

  • Linsear Write Formula [9]

LWF
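
Most of the indices in this section are closed-form expressions over a handful of counts. The sketch below uses the commonly published coefficients for each index; the cited papers may use slightly different variants, so treat the constants as assumptions to check against [2], [9] and [12]. The function name `readability_scores` and its parameters are hypothetical, and the counts are assumed to be computed beforehand.

```python
def readability_scores(chars, words, sentences, syllables, complex_words, long_words):
    """Compute several readability indices from precomputed counts.

    complex_words: words with three or more syllables (Gunning Fog).
    long_words: words with more than six characters (Läsbarhetsindex).
    All coefficients are the commonly published ones (an assumption here).
    """
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return {
        "ARI": 4.71 * (chars / words) + 0.5 * words_per_sentence - 21.43,
        # Coleman-Liau uses letters and sentences per 100 words.
        "CL": 0.0588 * (100 * chars / words) - 0.296 * (100 * sentences / words) - 15.8,
        "FRE": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "FK": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        "GFI": 0.4 * (words_per_sentence + 100 * complex_words / words),
        "LBI": words_per_sentence + 100 * long_words / words,
    }
```

SMOG, Dale-Chall and the Linsear Write Formula are left out because they depend on sentence sampling or on a reference word list that this page does not reproduce.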

History Features

  • Age [2] [12] [13] [14]: Age of the article in days. Recent articles are usually of lower quality.
  • Age per review [2] [12] [13]: "Used to verify the average length of time an article remains without revision"
  • Reviews per day [2] [12] [13]
  • Reviews per contributor [2] [12] [13]
  • Reviews per contributor standard deviation [2] [12] [13]: Indicates how evenly the reviews are distributed among contributors.
  • Discussion count [2] [12] [13]
  • Review Count [2] [5] [12] [13] [14]
  • Contributor Count [12] [14]
  • Registered Contributor Count [12] [13]
  • Anonymous Contributor Count [12] [13] [14]
  • Registered Contributor Count Per Contributor [12] [52]
  • Anonymous Contributor Count Per Contributor [12] [52]
  • Registered-Anonymous Contributor Ratio [12] [52]
  • Anonymous review Count [2] [12]
  • Registered review Count [2] [12]
  • Anonymous review Count per Review [12]
  • Registered review Count per Review [12]
  • Registered-Anonymous Review Ratio [12]
  • Revert Count [12]
  • Revert Count per Review [12] [52]
  • Recent Modified Lines [2] [12]: Number of lines modified between the current version and the version from 3 months earlier. A good indicator of stability.
  • Occasional Contributor Review Rate [2]: Percentage of reviews made by users who edited the article fewer than 4 times.
  • Recent Review rate [2] [12] [13]: Percentage of reviews made in the last 3 months
  • Recent Review Count [12]: Number of reviews made in the last 3 months
  • Active contributor review rate [2] [12] [13]: Percentage of reviews made by the most active 5% of reviewers
  • Active contributor review Count [12]: Number of reviews made by the most active 5% of reviewers
  • ProbReview [2] [12] [52]: "the quality of the reviewers is based on the quality of the articles they reviewed". This metric has a complex formula and may be expensive to compute. Refer to the cited papers for a more detailed explanation.
  • Number of Quality Articles authored by Team (HQAT) [1]: How many FA or GA articles were authored by the contributors
  • Time since Last Edit (HTLE) [1]: Total amount of time (seconds) since last article edit
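
Many of the history features reduce to counting over the revision list. The following is a minimal sketch, assuming the revision history is available as `(timestamp, contributor)` pairs. The function name `history_features`, the input shape and the approximation of "3 months" as 90 days are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import pstdev


def history_features(revisions, now):
    """Compute history features from a non-empty list of (datetime, contributor) pairs."""
    timestamps = [ts for ts, _ in revisions]
    per_contributor = Counter(user for _, user in revisions)
    review_count = len(revisions)
    age_days = (now - min(timestamps)).days
    days = max(age_days, 1)  # avoid division by zero for same-day articles
    recent_cutoff = now - timedelta(days=90)  # "3 months" approximated as 90 days
    recent = sum(ts >= recent_cutoff for ts in timestamps)
    # Reviews made by contributors with fewer than 4 edits to this article.
    occasional = sum(n for n in per_contributor.values() if n < 4)
    return {
        "age": age_days,
        "age_per_review": age_days / review_count,
        "reviews_per_day": review_count / days,
        "review_count": review_count,
        "contributor_count": len(per_contributor),
        "reviews_per_contributor": review_count / len(per_contributor),
        "reviews_per_contributor_std": pstdev(per_contributor.values()),
        "recent_review_count": recent,
        "recent_review_rate": recent / review_count,
        "occasional_contributor_review_rate": occasional / review_count,
    }
```

Features such as revert counts or the registered/anonymous split would need extra fields per revision, which this sketch does not model.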

Network Features

  • PageRank [2] [12] [13]
  • In-degree [2] [12] [13]: Number of articles that cite the article
  • Out-degree [2] [12] [13]: Number of articles cited by the article
  • Assortativity (in-in, in-out, out-in and out-out) [2] [12] [13]: Ratio between (in/out)-degree of the node and the average (in/out)-degree of its neighbors
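  • Clustering coefficient [12] [13]: "Division of the edges of the current node and its nearest neighbors by the total number of possible edges. This metric is used to indicate if an article belongs to a group of articles related to each other." This reads like the standard local clustering coefficient (the fraction of possible links among a node's neighbors that actually exist), but the exact definition still needs to be checked against [12] and [13].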
  • Reciprocity [2] [12] [13]: Rate of articles that cite the article and are cited by it
  • Translation count [2] [12] [13]: Number of versions in other languages

Stvilia's Metrics

Some studies ([16], [64]) also use the following composite metrics, most of which are weighted combinations of simpler features.

Authority

Authority = 0.2 * Unique Editors + 0.2 * Contributions + 0.1 * Connectivity + 0.3 * Reverts + 0.2 * External Links + 0.1 * Registered Contributions + 0.2 * Anonymous Contributions

"Authors define authority as "the degree of the reputation of an information object in a given community""

"Connectivity corresponds to the number of articles linked to the article through joint editors."

Completeness

Completeness = 0.4 * Num. Internal Broken Links + 0.4 * Num. Internal Links + 0.2 * Article Length

"Completeness is defined as "the granularity or precision of an information object's model or content values according to some general-purpose IS-A ontology such as WordNet""

Complexity

Complexity = 0.5 * Flesch reading ease - 0.5 * Kincaid grade level

"The authors define complexity as "the degree of cognitive complexity of an information object relative to a particular activity""

Informativeness

Informativeness = 0.6 * InfoNoise - 0.6 * Diversity + 0.3 * Images

"The definition of "Informativeness" is linked to the amount of information that an information object contains."

"InfoNoise is based on previous work and refers to the ratio between the information present in an article and its total size, where the so-called noise exists. It refers to the ratio between the size of the information content, in words, after stemming and stopping, and the object's total size. Diversity corresponds to the ratio between the number of unique edits and the number of total edits of an article."

Consistency

Consistency = 0.6 * Administrators Edit Share + 0.5 * Age (days)

"Consistency is defined as "the extent to which similar attributes or elements of an information object are consistently represented with the same structure, format and precision""

Currency

Currency = Collection Date - Date of Last Article Edit

"Currency corresponds to "the age of an information object" in days"

Volatility

Volatility = Median Revert Time

"Finally, volatility is defined as "the amount of time the information remains valid"."
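
The weighted sums in this section translate directly into code. The sketch below covers four of the metrics using the weights listed on this page. The function names are hypothetical, and the inputs are used as given because the page does not say whether they should be normalized first.

```python
from datetime import date


def complexity(flesch_reading_ease, kincaid_grade_level):
    """Complexity = 0.5 * Flesch reading ease - 0.5 * Kincaid grade level."""
    return 0.5 * flesch_reading_ease - 0.5 * kincaid_grade_level


def informativeness(info_noise, diversity, images):
    """Informativeness = 0.6 * InfoNoise - 0.6 * Diversity + 0.3 * Images."""
    return 0.6 * info_noise - 0.6 * diversity + 0.3 * images


def consistency(admin_edit_share, age_days):
    """Consistency = 0.6 * Administrators Edit Share + 0.5 * Age (days)."""
    return 0.6 * admin_edit_share + 0.5 * age_days


def currency(collection_date, last_edit_date):
    """Currency in days: collection date minus date of the last edit."""
    return (collection_date - last_edit_date).days
```

For instance, `currency(date(2022, 4, 7), date(2022, 3, 28))` gives 10 days. Because Age is measured in days while the edit share is a fraction, Consistency as written is dominated by the age term unless the inputs are rescaled.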